From: ValdikSS <iam@valdikss.org.ru>
To: Alexey Avramov <hakavlad@inbox.lv>, linux-mm@kvack.org
Cc: linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, corbet@lwn.net,
	akpm@linux-foundation.org, mcgrof@kernel.org,
	keescook@chromium.org, yzaikin@google.com,
	oleksandr@natalenko.name, kernel@xanmod.org, aros@gmx.com,
	hakavlad@gmail.com, Yu Zhao <yuzhao@google.com>
Subject: Re: [PATCH] mm/vmscan: add sysctl knobs for protecting the working set
Date: Thu, 2 Dec 2021 21:05:01 +0300
Message-ID: <2dc51fc8-f14e-17ed-a8c6-0ec70423bf54@valdikss.org.ru>
In-Reply-To: <20211130201652.2218636d@mail.inbox.lv>


This patchset is surprisingly effective and very useful for low-end PCs
with slow HDDs, single-board ARM computers with slow storage, and cheap
Android smartphones with a limited amount of memory. It almost completely
prevents thrashing and helps the OOM killer to be invoked quickly.

A similar file-locking patch has been used in ChromeOS for nearly 10 years,
but not in stock Linux or Android. It would be very beneficial for
lower-performance Android phones, SBCs, old PCs and other devices.

With this patch, combined with zram, I'm able to run the following 
software on an old office PC from 2007 with __only 2GB of RAM__ 
simultaneously:

  * Firefox with 37 active tabs (all data in RAM, no tab unloading)
  * Discord
  * Skype
  * LibreOffice with a document open
  * Two PDF files (14 and 47 megabytes in size)

And the PC doesn't crawl like a snail, even with 2+ GB in zram!
Without the patch, this PC is barely usable.
Please watch the video:
https://notes.valdikss.org.ru/linux-for-old-pc-from-2007/en/
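
A quick way to check whether a given kernel carries the patch is to look
for the three knobs under /proc/sys/vm. The sketch below is only an
illustration in userspace C: parse_kbytes_knob() and read_kbytes_knob()
are hypothetical helper names, not part of the patch. On an unpatched
kernel the files do not exist, so the read fails.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: parse one unsigned value from an open stream. */
static int parse_kbytes_knob(FILE *fp, unsigned long *out)
{
	assert(fp != NULL && out != NULL);
	return fscanf(fp, "%lu", out) == 1 ? 0 : -1;
}

/*
 * Hypothetical helper: read a knob such as
 * /proc/sys/vm/clean_low_kbytes. Returns 0 on success and -1 when
 * the file is missing (unpatched kernel) or cannot be parsed.
 */
static int read_kbytes_knob(const char *path, unsigned long *out)
{
	FILE *fp = fopen(path, "r");
	int ret;

	if (!fp)
		return -1;
	ret = parse_kbytes_knob(fp, out);
	fclose(fp);
	return ret;
}
```

For example, read_kbytes_knob("/proc/sys/vm/clean_low_kbytes", &v) should
return 0 and fill in v on a kernel built with the patch.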



On 30.11.2021 14:16, Alexey Avramov wrote:
> The kernel does not provide a way to protect the working set under memory
> pressure. A certain amount of anonymous and clean file pages is required by
> the userspace for normal operation. First of all, the userspace needs a
> cache of shared libraries and executable binaries. If the amount of the
> clean file pages falls below a certain level, then thrashing and even
> livelock can take place.
> 
> The patch provides sysctl knobs for protecting the working set (anonymous
> and clean file pages) under memory pressure.
> 
> The vm.anon_min_kbytes sysctl knob provides *hard* protection of anonymous
> pages. The anonymous pages on the current node won't be reclaimed under any
> conditions when their amount is below vm.anon_min_kbytes. This knob may be
> used to prevent excessive swap thrashing when anonymous memory is low (for
> example, when memory is going to be overfilled by compressed data of zram
> module). The default value is defined by CONFIG_ANON_MIN_KBYTES (suggested
> 0 in Kconfig).
> 
> The vm.clean_low_kbytes sysctl knob provides *best-effort* protection of
> clean file pages. The file pages on the current node won't be reclaimed
> under memory pressure when the amount of clean file pages is below
> vm.clean_low_kbytes *unless* we threaten to OOM. Protection of clean file
> pages using this knob may be used when swapping is still possible to
>    - prevent disk I/O thrashing under memory pressure;
>    - improve performance in disk cache-bound tasks under memory pressure.
> The default value is defined by CONFIG_CLEAN_LOW_KBYTES (suggested 0 in
> Kconfig).
> 
> The vm.clean_min_kbytes sysctl knob provides *hard* protection of clean
> file pages. The file pages on the current node won't be reclaimed under
> memory pressure when the amount of clean file pages is below
> vm.clean_min_kbytes. Hard protection of clean file pages using this knob
> may be used to
>    - prevent disk I/O thrashing under memory pressure even with no free swap
>      space;
>    - improve performance in disk cache-bound tasks under memory pressure;
>    - avoid high latency and prevent livelock in near-OOM conditions.
> The default value is defined by CONFIG_CLEAN_MIN_KBYTES (suggested 0 in
> Kconfig).
> 
> Signed-off-by: Alexey Avramov <hakavlad@inbox.lv>
> Reported-by: Artem S. Tashkinov <aros@gmx.com>
> ---
>   Repo:
>   https://github.com/hakavlad/le9-patch
> 
>   Documentation/admin-guide/sysctl/vm.rst | 66 ++++++++++++++++++++++++
>   include/linux/mm.h                      |  4 ++
>   kernel/sysctl.c                         | 21 ++++++++
>   mm/Kconfig                              | 63 +++++++++++++++++++++++
>   mm/vmscan.c                             | 91 +++++++++++++++++++++++++++++++++
>   5 files changed, 245 insertions(+)
> 
> diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
> index 5e7952021..2f606e23b 100644
> --- a/Documentation/admin-guide/sysctl/vm.rst
> +++ b/Documentation/admin-guide/sysctl/vm.rst
> @@ -25,6 +25,9 @@ files can be found in mm/swap.c.
>   Currently, these files are in /proc/sys/vm:
> 
>   - admin_reserve_kbytes
> +- anon_min_kbytes
> +- clean_low_kbytes
> +- clean_min_kbytes
>   - compact_memory
>   - compaction_proactiveness
>   - compact_unevictable_allowed
> @@ -105,6 +108,61 @@ On x86_64 this is about 128MB.
>   Changing this takes effect whenever an application requests memory.
> 
> 
> +anon_min_kbytes
> +===============
> +
> +This knob provides *hard* protection of anonymous pages. The anonymous pages
> +on the current node won't be reclaimed under any conditions when their amount
> +is below vm.anon_min_kbytes.
> +
> +This knob may be used to prevent excessive swap thrashing when anonymous
> +memory is low (for example, when memory is going to be overfilled by
> +compressed data of zram module).
> +
> +Setting this value too high (close to MemTotal) can result in inability to
> +swap and can lead to early OOM under memory pressure.
> +
> +The default value is defined by CONFIG_ANON_MIN_KBYTES.
> +
> +
> +clean_low_kbytes
> +================
> +
> +This knob provides *best-effort* protection of clean file pages. The file pages
> +on the current node won't be reclaimed under memory pressure when the amount of
> +clean file pages is below vm.clean_low_kbytes *unless* we threaten to OOM.
> +
> +Protection of clean file pages using this knob may be used when swapping is
> +still possible to
> +  - prevent disk I/O thrashing under memory pressure;
> +  - improve performance in disk cache-bound tasks under memory pressure.
> +
> +Setting it to a high value may result in an early eviction of anonymous pages
> +into the swap space by attempting to hold the protected amount of clean file
> +pages in memory.
> +
> +The default value is defined by CONFIG_CLEAN_LOW_KBYTES.
> +
> +
> +clean_min_kbytes
> +================
> +
> +This knob provides *hard* protection of clean file pages. The file pages on the
> +current node won't be reclaimed under memory pressure when the amount of clean
> +file pages is below vm.clean_min_kbytes.
> +
> +Hard protection of clean file pages using this knob may be used to
> +  - prevent disk I/O thrashing under memory pressure even with no free swap space;
> +  - improve performance in disk cache-bound tasks under memory pressure;
> +  - avoid high latency and prevent livelock in near-OOM conditions.
> +
> +Setting it to a high value may result in an early out-of-memory condition due to
> +the inability to reclaim the protected amount of clean file pages when other
> +types of pages cannot be reclaimed.
> +
> +The default value is defined by CONFIG_CLEAN_MIN_KBYTES.
> +
> +
>   compact_memory
>   ==============
> 
> @@ -864,6 +922,14 @@ be 133 (x + 2x = 200, 2x = 133.33).
>   At 0, the kernel will not initiate swap until the amount of free and
>   file-backed pages is less than the high watermark in a zone.
> 
> +This knob has no effect if the amount of clean file pages on the current
> +node is below vm.clean_low_kbytes or vm.clean_min_kbytes. In this case,
> +only anonymous pages can be reclaimed.
> +
> +If the number of anonymous pages on the current node is below
> +vm.anon_min_kbytes, then only file pages can be reclaimed with
> +any vm.swappiness value.
> +
> 
>   unprivileged_userfaultfd
>   ========================
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a7e4a9e7d..bee9807d5 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -200,6 +200,10 @@ static inline void __mm_zero_struct_page(struct page *page)
> 
>   extern int sysctl_max_map_count;
> 
> +extern unsigned long sysctl_anon_min_kbytes;
> +extern unsigned long sysctl_clean_low_kbytes;
> +extern unsigned long sysctl_clean_min_kbytes;
> +
>   extern unsigned long sysctl_user_reserve_kbytes;
>   extern unsigned long sysctl_admin_reserve_kbytes;
> 
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 083be6af2..65fc38756 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -3132,6 +3132,27 @@ static struct ctl_table vm_table[] = {
>   	},
>   #endif
>   	{
> +		.procname	= "anon_min_kbytes",
> +		.data		= &sysctl_anon_min_kbytes,
> +		.maxlen		= sizeof(unsigned long),
> +		.mode		= 0644,
> +		.proc_handler	= proc_doulongvec_minmax,
> +	},
> +	{
> +		.procname	= "clean_low_kbytes",
> +		.data		= &sysctl_clean_low_kbytes,
> +		.maxlen		= sizeof(unsigned long),
> +		.mode		= 0644,
> +		.proc_handler	= proc_doulongvec_minmax,
> +	},
> +	{
> +		.procname	= "clean_min_kbytes",
> +		.data		= &sysctl_clean_min_kbytes,
> +		.maxlen		= sizeof(unsigned long),
> +		.mode		= 0644,
> +		.proc_handler	= proc_doulongvec_minmax,
> +	},
> +	{
>   		.procname	= "user_reserve_kbytes",
>   		.data		= &sysctl_user_reserve_kbytes,
>   		.maxlen		= sizeof(sysctl_user_reserve_kbytes),
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 28edafc82..dea0806d7 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -89,6 +89,69 @@ config SPARSEMEM_VMEMMAP
>   	  pfn_to_page and page_to_pfn operations.  This is the most
>   	  efficient option when sufficient kernel resources are available.
> 
> +config ANON_MIN_KBYTES
> +	int "Default value for vm.anon_min_kbytes"
> +	depends on SYSCTL
> +	range 0 4294967295
> +	default 0
> +	help
> +	  This option sets the default value for vm.anon_min_kbytes sysctl knob.
> +
> +	  The vm.anon_min_kbytes sysctl knob provides *hard* protection of
> +	  anonymous pages. The anonymous pages on the current node won't be
> +	  reclaimed under any conditions when their amount is below
> +	  vm.anon_min_kbytes. This knob may be used to prevent excessive swap
> +	  thrashing when anonymous memory is low (for example, when memory is
> +	  going to be overfilled by compressed data of zram module).
> +
> +	  Setting this value too high (close to MemTotal) can result in
> +	  inability to swap and can lead to early OOM under memory pressure.
> +
> +config CLEAN_LOW_KBYTES
> +	int "Default value for vm.clean_low_kbytes"
> +	depends on SYSCTL
> +	range 0 4294967295
> +	default 0
> +	help
> +	  This option sets the default value for vm.clean_low_kbytes sysctl knob.
> +
> +	  The vm.clean_low_kbytes sysctl knob provides *best-effort*
> +	  protection of clean file pages. The file pages on the current node
> +	  won't be reclaimed under memory pressure when the amount of clean file
> +	  pages is below vm.clean_low_kbytes *unless* we threaten to OOM.
> +	  Protection of clean file pages using this knob may be used when
> +	  swapping is still possible to
> +	    - prevent disk I/O thrashing under memory pressure;
> +	    - improve performance in disk cache-bound tasks under memory
> +	      pressure.
> +
> +	  Setting it to a high value may result in an early eviction of anonymous
> +	  pages into the swap space by attempting to hold the protected amount
> +	  of clean file pages in memory.
> +
> +config CLEAN_MIN_KBYTES
> +	int "Default value for vm.clean_min_kbytes"
> +	depends on SYSCTL
> +	range 0 4294967295
> +	default 0
> +	help
> +	  This option sets the default value for vm.clean_min_kbytes sysctl knob.
> +
> +	  The vm.clean_min_kbytes sysctl knob provides *hard* protection of
> +	  clean file pages. The file pages on the current node won't be
> +	  reclaimed under memory pressure when the amount of clean file pages is
> +	  below vm.clean_min_kbytes. Hard protection of clean file pages using
> +	  this knob may be used to
> +	    - prevent disk I/O thrashing under memory pressure even with no free
> +	      swap space;
> +	    - improve performance in disk cache-bound tasks under memory
> +	      pressure;
> +	    - avoid high latency and prevent livelock in near-OOM conditions.
> +
> +	  Setting it to a high value may result in an early out-of-memory condition
> +	  due to the inability to reclaim the protected amount of clean file pages
> +	  when other types of pages cannot be reclaimed.
> +
>   config HAVE_MEMBLOCK_PHYS_MAP
>   	bool
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index fb9584641..928f3371d 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -122,6 +122,15 @@ struct scan_control {
>   	/* The file pages on the current node are dangerously low */
>   	unsigned int file_is_tiny:1;
> 
> +	/* The anonymous pages on the current node are below vm.anon_min_kbytes */
> +	unsigned int anon_below_min:1;
> +
> +	/* The clean file pages on the current node are below vm.clean_low_kbytes */
> +	unsigned int clean_below_low:1;
> +
> +	/* The clean file pages on the current node are below vm.clean_min_kbytes */
> +	unsigned int clean_below_min:1;
> +
>   	/* Always discard instead of demoting to lower tier memory */
>   	unsigned int no_demotion:1;
> 
> @@ -171,6 +180,10 @@ struct scan_control {
>   #define prefetchw_prev_lru_page(_page, _base, _field) do { } while (0)
>   #endif
> 
> +unsigned long sysctl_anon_min_kbytes __read_mostly = CONFIG_ANON_MIN_KBYTES;
> +unsigned long sysctl_clean_low_kbytes __read_mostly = CONFIG_CLEAN_LOW_KBYTES;
> +unsigned long sysctl_clean_min_kbytes __read_mostly = CONFIG_CLEAN_MIN_KBYTES;
> +
>   /*
>    * From 0 .. 200.  Higher means more swappy.
>    */
> @@ -2734,6 +2747,15 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>   	}
> 
>   	/*
> +	 * Force-scan anon if clean file pages is under vm.clean_low_kbytes
> +	 * or vm.clean_min_kbytes.
> +	 */
> +	if (sc->clean_below_low || sc->clean_below_min) {
> +		scan_balance = SCAN_ANON;
> +		goto out;
> +	}
> +
> +	/*
>   	 * If there is enough inactive page cache, we do not reclaim
>   	 * anything from the anonymous working right now.
>   	 */
> @@ -2877,6 +2899,25 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>   			BUG();
>   		}
> 
> +		/*
> +		 * Hard protection of the working set.
> +		 */
> +		if (file) {
> +			/*
> +			 * Don't reclaim file pages when the amount of
> +			 * clean file pages is below vm.clean_min_kbytes.
> +			 */
> +			if (sc->clean_below_min)
> +				scan = 0;
> +		} else {
> +			/*
> +			 * Don't reclaim anonymous pages when their
> +			 * amount is below vm.anon_min_kbytes.
> +			 */
> +			if (sc->anon_below_min)
> +				scan = 0;
> +		}
> +
>   		nr[lru] = scan;
>   	}
>   }
> @@ -3082,6 +3123,54 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
>   	return inactive_lru_pages > pages_for_compaction;
>   }
> 
> +static void prepare_workingset_protection(pg_data_t *pgdat, struct scan_control *sc)
> +{
> +	/*
> +	 * Check the number of anonymous pages to protect them from
> +	 * reclaiming if their amount is below the specified.
> +	 */
> +	if (sysctl_anon_min_kbytes) {
> +		unsigned long reclaimable_anon;
> +
> +		reclaimable_anon =
> +			node_page_state(pgdat, NR_ACTIVE_ANON) +
> +			node_page_state(pgdat, NR_INACTIVE_ANON) +
> +			node_page_state(pgdat, NR_ISOLATED_ANON);
> +		reclaimable_anon <<= (PAGE_SHIFT - 10);
> +
> +		sc->anon_below_min = reclaimable_anon < sysctl_anon_min_kbytes;
> +	} else
> +		sc->anon_below_min = 0;
> +
> +	/*
> +	 * Check the number of clean file pages to protect them from
> +	 * reclaiming if their amount is below the specified.
> +	 */
> +	if (sysctl_clean_low_kbytes || sysctl_clean_min_kbytes) {
> +		unsigned long reclaimable_file, dirty, clean;
> +
> +		reclaimable_file =
> +			node_page_state(pgdat, NR_ACTIVE_FILE) +
> +			node_page_state(pgdat, NR_INACTIVE_FILE) +
> +			node_page_state(pgdat, NR_ISOLATED_FILE);
> +		dirty = node_page_state(pgdat, NR_FILE_DIRTY);
> +		/*
> +		 * node_page_state() sum can go out of sync since
> +		 * all the values are not read at once.
> +		 */
> +		if (likely(reclaimable_file > dirty))
> +			clean = (reclaimable_file - dirty) << (PAGE_SHIFT - 10);
> +		else
> +			clean = 0;
> +
> +		sc->clean_below_low = clean < sysctl_clean_low_kbytes;
> +		sc->clean_below_min = clean < sysctl_clean_min_kbytes;
> +	} else {
> +		sc->clean_below_low = 0;
> +		sc->clean_below_min = 0;
> +	}
> +}
> +
>   static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>   {
>   	struct mem_cgroup *target_memcg = sc->target_mem_cgroup;
> @@ -3249,6 +3338,8 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>   			anon >> sc->priority;
>   	}
> 
> +	prepare_workingset_protection(pgdat, sc);
> +
>   	shrink_node_memcgs(pgdat, sc);
> 
>   	if (reclaim_state) {
> 
> base-commit: d58071a8a76d779eedab38033ae4c821c30295a5
> --
> 2.11.0
> 
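
For readers who want to see the decision logic without the surrounding
reclaim code, here is a rough userspace model of what the quoted
prepare_workingset_protection() and the two get_scan_count() hunks do.
The struct and function names (node_counters, ws_knobs, ws_flags,
pages_to_kib, compute_flags, protected_scan, force_scan_anon) are made up
for this sketch, and page_shift stands in for PAGE_SHIFT. It leaves out
the rest of get_scan_count(), including any checks that choose a scan
balance before the clean_low test is reached, so treat it as an
approximation and not as the kernel's behaviour.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the per-node counters the patch reads. */
struct node_counters {
	unsigned long active_anon, inactive_anon, isolated_anon;
	unsigned long active_file, inactive_file, isolated_file;
	unsigned long file_dirty;
};

/* The three thresholds in KiB, as set through the sysctl knobs. */
struct ws_knobs {
	unsigned long anon_min_kbytes;
	unsigned long clean_low_kbytes;
	unsigned long clean_min_kbytes;
};

struct ws_flags {
	bool anon_below_min;
	bool clean_below_low;
	bool clean_below_min;
};

/* Convert a page count to KiB; page_shift is 12 for 4 KiB pages. */
static unsigned long pages_to_kib(unsigned long pages, unsigned int page_shift)
{
	assert(page_shift >= 10);
	return pages << (page_shift - 10);
}

/* Model of prepare_workingset_protection(): compare counters to knobs. */
static struct ws_flags compute_flags(const struct node_counters *c,
				     const struct ws_knobs *k,
				     unsigned int page_shift)
{
	struct ws_flags f = { false, false, false };
	unsigned long anon, file, clean = 0;

	anon = c->active_anon + c->inactive_anon + c->isolated_anon;
	if (k->anon_min_kbytes)
		f.anon_below_min =
			pages_to_kib(anon, page_shift) < k->anon_min_kbytes;

	file = c->active_file + c->inactive_file + c->isolated_file;
	/* Counters are sampled separately, so dirty may exceed file. */
	if (file > c->file_dirty)
		clean = pages_to_kib(file - c->file_dirty, page_shift);

	/* A knob set to 0 never triggers: clean < 0 is false for unsigned. */
	f.clean_below_low = clean < k->clean_low_kbytes;
	f.clean_below_min = clean < k->clean_min_kbytes;
	return f;
}

/* Hard protection: the scan count is zeroed for a protected LRU type. */
static unsigned long protected_scan(unsigned long scan, bool is_file,
				    const struct ws_flags *f)
{
	if (is_file)
		return f->clean_below_min ? 0 : scan;
	return f->anon_below_min ? 0 : scan;
}

/* True when the patch sets scan_balance to SCAN_ANON. */
static bool force_scan_anon(const struct ws_flags *f)
{
	return f->clean_below_low || f->clean_below_min;
}
```

With 4 KiB pages, 1400 clean file pages are 5600 KiB. In this model, with
clean_low_kbytes=6000 and clean_min_kbytes=5000, reclaim is steered to
anonymous pages while the hard protection stays off. Raising
clean_min_kbytes to 6000 also zeroes the file scan count.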
