* [PATCH V2 0/1] Add swappiness argument to memory.reclaim
@ 2023-12-06 16:28 Dan Schatzberg
  2023-12-06 16:28 ` [PATCH 1/1] mm: add swappiness= arg " Dan Schatzberg
  0 siblings, 1 reply; 11+ messages in thread
From: Dan Schatzberg @ 2023-12-06 16:28 UTC
To: Johannes Weiner, Roman Gushchin, Yosry Ahmed, Huan Yang
Cc: linux-kernel, cgroups, linux-mm, Tejun Heo, Zefan Li,
    Jonathan Corbet, Michal Hocko, Shakeel Butt, Muchun Song,
    Andrew Morton, David Hildenbrand, Matthew Wilcox, Kefeng Wang,
    Vishal Moola (Oracle), Yue Zhao, Hugh Dickins

Changes since V1:
 * Added documentation

This patch proposes augmenting the memory.reclaim interface with a
swappiness=<val> argument that overrides the swappiness value for that
instance of proactive reclaim.

Userspace proactive reclaimers use the memory.reclaim interface to
trigger reclaim. The memory.reclaim interface does not provide any way
to affect the balance of file vs anon during proactive reclaim. The
only approach is to adjust the vm.swappiness setting. However, there
are a few reasons to control the balance of file vs anon during
proactive reclaim separately from reactive reclaim:

* Swapout should be limited to manage SSD write endurance. In near-OOM
situations we are fine with lots of swap-out to avoid OOMs. As these
are typically rare events, they have relatively little impact on write
endurance. However, proactive reclaim runs continuously, so its impact
on SSD write endurance is more significant. Therefore it is desirable
to control swap-out for proactive reclaim separately from reactive
reclaim.

* Some userspace OOM killers like systemd-oomd[1] support OOM killing
on swap exhaustion. This makes sense if the swap exhaustion is
triggered by reactive reclaim, but less so if it is triggered by
proactive reclaim (e.g. one could see OOMs when free memory is ample
but anon is just particularly cold). Therefore, it's desirable to have
proactive reclaim reduce or stop swap-out before the threshold at which
OOM killing occurs.

In the case of Meta's Senpai proactive reclaimer, we adjust
vm.swappiness before writes to memory.reclaim[2]. This has been in
production for nearly two years and has addressed our need to control
proactive vs reactive reclaim behavior, but it is still not ideal for a
number of reasons:

* vm.swappiness is a global setting; adjusting it can race with or
interfere with other system administration that wishes to control
vm.swappiness. In our case, we need to disable Senpai before adjusting
vm.swappiness.

* vm.swappiness is stateful - a crash or restart of Senpai can leave a
misconfigured setting behind. This requires additional management to
record the "desired" setting and ensure Senpai always adjusts to it.

With this patch, we avoid these downsides of adjusting vm.swappiness
globally.

Previously, this exact interface addition was proposed by Yosry[3]. In
response, Roman proposed instead an interface to specify precise
file/anon/slab reclaim amounts[4]. More recently Huan proposed it as
well[5], and others similarly questioned whether this was the proper
interface. Previous proposals sought to use this to let proactive
reclaimers effectively perform a custom reclaim algorithm by issuing
proactive reclaim with different settings to control file vs anon
reclaim (e.g. to only reclaim anon from some applications). Responses
argued that adjusting swappiness is a poor interface for custom
reclaim.
In contrast, I argue in favor of a swappiness setting not as a way to
implement custom reclaim algorithms, but rather as a way to bias the
balance of anon vs file due to differences between proactive and
reactive reclaim. In this context, swappiness is the existing interface
for controlling this balance, and this patch simply allows it to be
configured differently for proactive vs reactive reclaim.

Specifying explicit amounts of anon vs file pages to reclaim feels
inappropriate for this purpose. Proactive reclaimers are unaware of the
relative age of file vs anon for a cgroup, which makes it difficult to
manage proactive reclaim of different memory pools. A proactive
reclaimer would need to issue some number of anon reclaim attempts
separately from file reclaim attempts, which seems brittle given that
the impact is difficult to observe.

[1] https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html
[2] https://github.com/facebookincubator/oomd/blob/main/src/oomd/plugins/Senpai.cpp#L585-L598
[3] https://lore.kernel.org/linux-mm/CAJD7tkbDpyoODveCsnaqBBMZEkDvshXJmNdbk51yKSNgD7aGdg@mail.gmail.com/
[4] https://lore.kernel.org/linux-mm/YoPHtHXzpK51F%2F1Z@carbon/
[5] https://lore.kernel.org/lkml/20231108065818.19932-1-link@vivo.com/

Dan Schatzberg (1):
  mm: add swappiness= arg to memory.reclaim

 Documentation/admin-guide/cgroup-v2.rst | 15 ++++++-
 include/linux/swap.h                    |  3 +-
 mm/memcontrol.c                         | 55 ++++++++++++++++++++-----
 mm/vmscan.c                             | 13 +++++-
 4 files changed, 70 insertions(+), 16 deletions(-)

-- 
2.34.1
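For concreteness, here is a minimal userspace sketch of the workflow
argued for above. The cgroup path, step size, and swappiness value are
hypothetical; only the "<bytes> swappiness=<val>" write format comes
from the patch itself:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Issue one proactive reclaim step against a cgroup with an explicit
 * swappiness, leaving the global vm.swappiness setting untouched. */
static int proactive_reclaim(const char *cgroup, size_t bytes, int swappiness)
{
        char path[256], buf[64];
        int fd, ret = 0;

        snprintf(path, sizeof(path), "%s/memory.reclaim", cgroup);
        fd = open(path, O_WRONLY);
        if (fd < 0)
                return -1;

        /* e.g. "67108864 swappiness=10" */
        snprintf(buf, sizeof(buf), "%zu swappiness=%d", bytes, swappiness);
        if (write(fd, buf, strlen(buf)) < 0)
                ret = -1;       /* e.g. EAGAIN: target not met in retries */

        close(fd);
        return ret;
}

int main(void)
{
        /* Reclaim 64M at swappiness 10 from a hypothetical slice. */
        return proactive_reclaim("/sys/fs/cgroup/workload.slice",
                                 64 << 20, 10);
}

Because nothing global is mutated, there is no race against other users
of vm.swappiness and no stateful setting to restore if the reclaimer
crashes, which addresses both Senpai downsides listed above.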
* [PATCH 1/1] mm: add swappiness= arg to memory.reclaim
  2023-12-06 16:28 [PATCH V2 0/1] Add swappiness argument to memory.reclaim Dan Schatzberg
@ 2023-12-06 16:28 ` Dan Schatzberg
  2023-12-07 19:00   ` Michal Koutný
  0 siblings, 1 reply; 11+ messages in thread
From: Dan Schatzberg @ 2023-12-06 16:28 UTC
To: Johannes Weiner, Roman Gushchin, Yosry Ahmed, Huan Yang
Cc: linux-kernel, cgroups, linux-mm, Tejun Heo, Zefan Li,
    Jonathan Corbet, Michal Hocko, Shakeel Butt, Muchun Song,
    Andrew Morton, David Hildenbrand, Matthew Wilcox, Kefeng Wang,
    Vishal Moola (Oracle), Yue Zhao, Hugh Dickins

Allow proactive reclaimers to submit an additional swappiness=<val>
argument to memory.reclaim. This overrides the global or per-memcg
swappiness setting for that reclaim attempt. For example:

echo "2M swappiness=0" > /sys/fs/cgroup/memory.reclaim

will perform reclaim on the rootcg with a swappiness setting of 0 (no
swap) regardless of the vm.swappiness sysctl setting.

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 15 ++++++-
 include/linux/swap.h                    |  3 +-
 mm/memcontrol.c                         | 55 ++++++++++++++++++++-----
 mm/vmscan.c                             | 13 +++++-
 4 files changed, 70 insertions(+), 16 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 3f85254f3cef..fc2b379dbd0f 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1282,8 +1282,8 @@ PAGE_SIZE multiple when read back.
 	This is a simple interface to trigger memory reclaim in the
 	target cgroup.
 
-	This file accepts a single key, the number of bytes to reclaim.
-	No nested keys are currently supported.
+	This file accepts a string which contains the number of bytes
+	to reclaim.
 
 	Example::
 
@@ -1304,6 +1304,17 @@ PAGE_SIZE multiple when read back.
 	This means that the networking layer will not adapt based on
 	reclaim induced by memory.reclaim.
 
+	This file also allows the user to specify the swappiness value
+	to be used for the reclaim. For example:
+
+	echo "1G swappiness=60" > memory.reclaim
+
+	The above instructs the kernel to perform the reclaim with
+	a swappiness value of 60. Note that this has the same semantics
+	as the vm.swappiness sysctl - it sets the relative IO cost of
+	reclaiming anon vs file memory but does not allow for reclaiming
+	specific amounts of anon or file memory.
+
 memory.peak
 	A read-only single value file which exists on non-root
 	cgroups.
diff --git a/include/linux/swap.h b/include/linux/swap.h
index f6dd6575b905..c6e309199f10 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -410,7 +410,8 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 						  unsigned long nr_pages,
 						  gfp_t gfp_mask,
-						  unsigned int reclaim_options);
+						  unsigned int reclaim_options,
+						  int *swappiness);
 extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
 						gfp_t gfp_mask, bool noswap,
 						pg_data_t *pgdat,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1c1061df9cd1..ba1c89455ab0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -63,6 +63,7 @@
 #include <linux/resume_user_mode.h>
 #include <linux/psi.h>
 #include <linux/seq_buf.h>
+#include <linux/parser.h>
 #include <linux/sched/isolation.h>
 #include "internal.h"
 #include <net/sock.h>
@@ -2449,7 +2450,7 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
 		psi_memstall_enter(&pflags);
 		nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages,
 							gfp_mask,
-							MEMCG_RECLAIM_MAY_SWAP);
+							MEMCG_RECLAIM_MAY_SWAP, NULL);
 		psi_memstall_leave(&pflags);
 	} while ((memcg = parent_mem_cgroup(memcg)) &&
 		 !mem_cgroup_is_root(memcg));
@@ -2740,7 +2741,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 
 	psi_memstall_enter(&pflags);
 	nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
-						    gfp_mask, reclaim_options);
+						    gfp_mask, reclaim_options, NULL);
 	psi_memstall_leave(&pflags);
 
 	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
@@ -3660,7 +3661,7 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
 		}
 
 		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
-					memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP)) {
+					memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP, NULL)) {
 			ret = -EBUSY;
 			break;
 		}
@@ -3774,7 +3775,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
 			return -EINTR;
 
 		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
-						  MEMCG_RECLAIM_MAY_SWAP))
+						  MEMCG_RECLAIM_MAY_SWAP, NULL))
 			nr_retries--;
 	}
 
@@ -6720,7 +6721,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
 		}
 
 		reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
-					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP);
+					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, NULL);
 
 		if (!reclaimed && !nr_retries--)
 			break;
@@ -6769,7 +6770,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 
 		if (nr_reclaims) {
 			if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
-					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP))
+					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, NULL))
 				nr_reclaims--;
 			continue;
 		}
@@ -6895,6 +6896,16 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+enum {
+	MEMORY_RECLAIM_SWAPPINESS = 0,
+	MEMORY_RECLAIM_NULL,
+};
+
+static const match_table_t if_tokens = {
+	{ MEMORY_RECLAIM_SWAPPINESS, "swappiness=%d"},
+	{ MEMORY_RECLAIM_NULL, NULL },
+};
+
 static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 			      size_t nbytes, loff_t off)
 {
@@ -6902,12 +6913,33 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
 	unsigned long nr_to_reclaim, nr_reclaimed = 0;
 	unsigned int reclaim_options;
-	int err;
+	char *old_buf, *start;
+	substring_t args[MAX_OPT_ARGS];
+	int swappiness = -1;
 
 	buf = strstrip(buf);
-	err = page_counter_memparse(buf, "", &nr_to_reclaim);
-	if (err)
-		return err;
+
+	old_buf = buf;
+	nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE;
+	if (buf == old_buf)
+		return -EINVAL;
+
+	buf = strstrip(buf);
+
+	while ((start = strsep(&buf, " ")) != NULL) {
+		if (!strlen(start))
+			continue;
+		switch (match_token(start, if_tokens, args)) {
+		case MEMORY_RECLAIM_SWAPPINESS:
+			if (match_int(&args[0], &swappiness))
+				return -EINVAL;
+			if (swappiness < 0 || swappiness > 200)
+				return -EINVAL;
+			break;
+		default:
+			return -EINVAL;
+		}
+	}
 
 	reclaim_options	= MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE;
 	while (nr_reclaimed < nr_to_reclaim) {
@@ -6926,7 +6958,8 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 
 		reclaimed = try_to_free_mem_cgroup_pages(memcg,
 				min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
-				GFP_KERNEL, reclaim_options);
+				GFP_KERNEL, reclaim_options,
+				swappiness == -1 ? NULL : &swappiness);
 
 		if (!reclaimed && !nr_retries--)
 			return -EAGAIN;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 506f8220c5fe..546704ea01e1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -136,6 +136,9 @@ struct scan_control {
 	/* Always discard instead of demoting to lower tier memory */
 	unsigned int no_demotion:1;
 
+	/* Swappiness value for reclaim, if NULL use memcg/global value */
+	int *swappiness;
+
 	/* Allocation order */
 	s8 order;
 
@@ -2327,7 +2330,8 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 	unsigned long anon_cost, file_cost, total_cost;
-	int swappiness = mem_cgroup_swappiness(memcg);
+	int swappiness = sc->swappiness ?
+		*sc->swappiness : mem_cgroup_swappiness(memcg);
 	u64 fraction[ANON_AND_FILE];
 	u64 denominator = 0;	/* gcc */
 	enum scan_balance scan_balance;
@@ -2608,6 +2612,9 @@ static int get_swappiness(struct lruvec *lruvec, struct scan_control *sc)
 	    mem_cgroup_get_nr_swap_pages(memcg) < MIN_LRU_BATCH)
 		return 0;
 
+	if (sc->swappiness)
+		return *sc->swappiness;
+
 	return mem_cgroup_swappiness(memcg);
 }
 
@@ -6433,7 +6440,8 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
 unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 					   unsigned long nr_pages,
 					   gfp_t gfp_mask,
-					   unsigned int reclaim_options)
+					   unsigned int reclaim_options,
+					   int *swappiness)
 {
 	unsigned long nr_reclaimed;
 	unsigned int noreclaim_flag;
@@ -6448,6 +6456,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 		.may_unmap = 1,
 		.may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP),
 		.proactive = !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE),
+		.swappiness = swappiness,
 	};
 	/*
 	 * Traverse the ZONELIST_FALLBACK zonelist of the current node to put

-- 
2.34.1
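As a companion to the parsing code above, here is a userspace model of
the grammar memory_reclaim() now accepts, "<size>[KMG]
[swappiness=<0..200>]". It mirrors the memparse() plus match_token()
flow but shares no code with the kernel (lowercase and further memparse
suffixes are omitted); a sketch for sanity-checking strings before
writing them:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Returns 0 on success; fills nr_bytes and swappiness (-1 == unset),
 * the same sentinel convention memory_reclaim() uses internally. */
static int parse_reclaim_arg(char *buf, unsigned long long *nr_bytes,
                             int *swappiness)
{
        char *end, *tok;

        *swappiness = -1;
        *nr_bytes = strtoull(buf, &end, 0);
        if (end == buf)
                return -1;              /* kernel: -EINVAL */
        switch (*end) {                 /* memparse()-style suffixes */
        case 'G': *nr_bytes <<= 10;     /* fall through */
        case 'M': *nr_bytes <<= 10;     /* fall through */
        case 'K': *nr_bytes <<= 10;
                  end++;
        }

        for (tok = strtok(end, " "); tok; tok = strtok(NULL, " ")) {
                if (sscanf(tok, "swappiness=%d", swappiness) != 1 ||
                    *swappiness < 0 || *swappiness > 200)
                        return -1;      /* kernel: -EINVAL */
        }
        return 0;
}

int main(void)
{
        char arg[] = "2M swappiness=60";
        unsigned long long bytes;
        int swappiness;

        if (!parse_reclaim_arg(arg, &bytes, &swappiness))
                printf("%llu bytes, swappiness %d\n", bytes, swappiness);
        return 0;
}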
* Re: [PATCH 1/1] mm: add swappiness= arg to memory.reclaim
  2023-12-06 16:28 ` [PATCH 1/1] mm: add swappiness= arg " Dan Schatzberg
@ 2023-12-07 19:00   ` Michal Koutný
  0 siblings, 0 replies; 11+ messages in thread
From: Michal Koutný @ 2023-12-07 19:00 UTC
To: Dan Schatzberg
Cc: Johannes Weiner, Roman Gushchin, Yosry Ahmed, Huan Yang,
    linux-kernel, cgroups, linux-mm, Tejun Heo, Zefan Li,
    Jonathan Corbet, Michal Hocko, Shakeel Butt, Muchun Song,
    Andrew Morton, David Hildenbrand, Matthew Wilcox, Kefeng Wang,
    Vishal Moola (Oracle), Yue Zhao, Hugh Dickins

On Wed, Dec 06, 2023 at 08:28:56AM -0800, Dan Schatzberg <schatzberg.dan@gmail.com> wrote:
> @@ -6902,12 +6913,33 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
...
> +	int swappiness = -1;

Here you use a negative number...

> @@ -136,6 +136,9 @@ struct scan_control {
...
> +	/* Swappiness value for reclaim, if NULL use memcg/global value */
> +	int *swappiness;

... and here a NULL to denote the unset value. I'd suggest unifying
those. Perhaps the negative, to avoid unnecessary dereferences.

Thanks,
Michal
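One way to act on Michal's suggestion (kernel-context pseudocode; the
sc_swappiness() helper name is invented here, and later revisions of
the series may resolve this differently) is to carry the value directly
in scan_control with a negative sentinel, so no caller passes a pointer
and no reader dereferences one:

#define SC_SWAPPINESS_UNSET	-1

struct scan_control {
	/* ... existing fields ... */

	/* Effective swappiness; SC_SWAPPINESS_UNSET == use memcg/global */
	int swappiness;
};

/* Single accessor for get_scan_count() and MGLRU's get_swappiness() */
static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
{
	return sc->swappiness >= 0 ? sc->swappiness :
				     mem_cgroup_swappiness(memcg);
}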
* [PATCH 0/1] Add swappiness argument to memory.reclaim
@ 2023-11-30 15:36 Dan Schatzberg
  2023-11-30 15:36 ` [PATCH 1/1] mm: add swappiness= arg " Dan Schatzberg
  0 siblings, 1 reply; 11+ messages in thread
From: Dan Schatzberg @ 2023-11-30 15:36 UTC
To: Johannes Weiner, Roman Gushchin, Yosry Ahmed, Huan Yang
Cc: linux-kernel, cgroups, linux-mm, Michal Hocko, Shakeel Butt,
    Muchun Song, Andrew Morton, David Hildenbrand, Matthew Wilcox,
    Huang Ying, Kefeng Wang, Peter Xu, Vishal Moola (Oracle),
    Yue Zhao, Hugh Dickins

(Sorry for the resend - forgot to cc the mailing lists)

This patch proposes augmenting the memory.reclaim interface with a
swappiness=<val> argument that overrides the swappiness value for that
instance of proactive reclaim.

[... the rest of the cover letter is identical to the V2 cover letter
above ...]

Dan Schatzberg (1):
  mm: add swappiness= arg to memory.reclaim

 include/linux/swap.h |  3 ++-
 mm/memcontrol.c      | 55 +++++++++++++++++++++++++++++++++++---------
 mm/vmscan.c          | 13 +++++++++--
 3 files changed, 57 insertions(+), 14 deletions(-)

-- 
2.34.1
* [PATCH 1/1] mm: add swappiness= arg to memory.reclaim
  2023-11-30 15:36 [PATCH 0/1] Add swappiness argument " Dan Schatzberg
@ 2023-11-30 15:36 ` Dan Schatzberg
  2023-11-30 21:33   ` Andrew Morton
  2023-12-01  1:56   ` Huan Yang
  0 siblings, 2 replies; 11+ messages in thread
From: Dan Schatzberg @ 2023-11-30 15:36 UTC
To: Johannes Weiner, Roman Gushchin, Yosry Ahmed, Huan Yang
Cc: linux-kernel, cgroups, linux-mm, Michal Hocko, Shakeel Butt,
    Muchun Song, Andrew Morton, David Hildenbrand, Matthew Wilcox,
    Huang Ying, Kefeng Wang, Peter Xu, Vishal Moola (Oracle),
    Yue Zhao, Hugh Dickins

Allow proactive reclaimers to submit an additional swappiness=<val>
argument to memory.reclaim. This overrides the global or per-memcg
swappiness setting for that reclaim attempt. For example:

echo "2M swappiness=0" > /sys/fs/cgroup/memory.reclaim

will perform reclaim on the rootcg with a swappiness setting of 0 (no
swap) regardless of the vm.swappiness sysctl setting.

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
---
 include/linux/swap.h |  3 ++-
 mm/memcontrol.c      | 55 +++++++++++++++++++++++++++++++++++---------
 mm/vmscan.c          | 13 +++++++++--
 3 files changed, 57 insertions(+), 14 deletions(-)

[... the diff is byte-identical to the one posted with the V2 patch
above, minus the Documentation changes ...]

-- 
2.34.1
* Re: [PATCH 1/1] mm: add swappiness= arg to memory.reclaim
  2023-11-30 15:36 ` [PATCH 1/1] mm: add swappiness= arg " Dan Schatzberg
@ 2023-11-30 21:33   ` Andrew Morton
  2023-11-30 21:46     ` Dan Schatzberg
  0 siblings, 1 reply; 11+ messages in thread
From: Andrew Morton @ 2023-11-30 21:33 UTC
To: Dan Schatzberg
Cc: Johannes Weiner, Roman Gushchin, Yosry Ahmed, Huan Yang,
    linux-kernel, cgroups, linux-mm, Michal Hocko, Shakeel Butt,
    Muchun Song, David Hildenbrand, Matthew Wilcox, Huang Ying,
    Kefeng Wang, Peter Xu, Vishal Moola (Oracle), Yue Zhao,
    Hugh Dickins

On Thu, 30 Nov 2023 07:36:54 -0800 Dan Schatzberg <schatzberg.dan@gmail.com> wrote:

> Allow proactive reclaimers to submit an additional swappiness=<val>
> argument to memory.reclaim. This overrides the global or per-memcg
> swappiness setting for that reclaim attempt.
>
> For example:
>
> echo "2M swappiness=0" > /sys/fs/cgroup/memory.reclaim
>
> will perform reclaim on the rootcg with a swappiness setting of 0 (no
> swap) regardless of the vm.swappiness sysctl setting.
>
> Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
> ---
>  include/linux/swap.h |  3 ++-
>  mm/memcontrol.c      | 55 +++++++++++++++++++++++++++++++++++---------
>  mm/vmscan.c          | 13 +++++++++--

Documentation/admin-guide/cgroup-v2.rst is feeling unloved!

Please check whether this interface change can lead to
non-backward-compatible userspace. If someone's script does the above
echo command, will it break on older kernels? If so, does it matter?
* Re: [PATCH 1/1] mm: add swappiness= arg to memory.reclaim
  2023-11-30 21:33 ` Andrew Morton
@ 2023-11-30 21:46   ` Dan Schatzberg
  0 siblings, 0 replies; 11+ messages in thread
From: Dan Schatzberg @ 2023-11-30 21:46 UTC
To: Andrew Morton
Cc: Johannes Weiner, Roman Gushchin, Yosry Ahmed, Huan Yang,
    linux-kernel, cgroups, linux-mm, Michal Hocko, Shakeel Butt,
    Muchun Song, David Hildenbrand, Matthew Wilcox, Huang Ying,
    Kefeng Wang, Peter Xu, Vishal Moola (Oracle), Yue Zhao,
    Hugh Dickins

On Thu, Nov 30, 2023 at 01:33:40PM -0800, Andrew Morton wrote:
> On Thu, 30 Nov 2023 07:36:54 -0800 Dan Schatzberg <schatzberg.dan@gmail.com> wrote:
> [...]
>
> Documentation/admin-guide/cgroup-v2.rst is feeling unloved!

Oops - total oversight on my part. I'll add docs in a V2 if we can come
to consensus on this interface change in general.

> Please check whether this interface change can lead to
> non-backward-compatible userspace. If someone's script does the above
> echo command, will it break on older kernels? If so, does it matter?

Older kernels will return -EINVAL with such a command - that seems
appropriate, indicating that the argument is not supported.
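The compatibility contract described here suggests a simple userspace
probe (a hypothetical helper, not something posted in this thread; the
cgroup path and size are illustrative): attempt the write with
swappiness= and retry without the argument when the kernel rejects it:

#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

static ssize_t reclaim(int fd, const char *arg)
{
        return write(fd, arg, strlen(arg));
}

int main(void)
{
        int fd = open("/sys/fs/cgroup/test.slice/memory.reclaim", O_WRONLY);

        if (fd < 0)
                return 1;
        if (reclaim(fd, "16M swappiness=0") < 0 && errno == EINVAL) {
                /* Kernel without swappiness= support: plain reclaim. */
                reclaim(fd, "16M");
        }
        close(fd);
        return 0;
}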
* Re: [PATCH 1/1] mm: add swappiness= arg to memory.reclaim
  2023-11-30 15:36 ` [PATCH 1/1] mm: add swappiness= arg " Dan Schatzberg
  2023-11-30 21:33   ` Andrew Morton
@ 2023-12-01  1:56   ` Huan Yang
  2023-12-01  2:05     ` Yosry Ahmed
  1 sibling, 1 reply; 11+ messages in thread
From: Huan Yang @ 2023-12-01 1:56 UTC
To: Dan Schatzberg, Johannes Weiner, Roman Gushchin, Yosry Ahmed,
    Huan Yang
Cc: linux-kernel, cgroups, linux-mm, Michal Hocko, Shakeel Butt,
    Muchun Song, Andrew Morton, David Hildenbrand, Matthew Wilcox,
    Huang Ying, Kefeng Wang, Peter Xu, Vishal Moola (Oracle),
    Yue Zhao, Hugh Dickins

On 2023/11/30 23:36, Dan Schatzberg wrote:
> Allow proactive reclaimers to submit an additional swappiness=<val>
> argument to memory.reclaim. This overrides the global or per-memcg
> swappiness setting for that reclaim attempt.
[...]
> @@ -2327,7 +2330,8 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>  	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>  	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>  	unsigned long anon_cost, file_cost, total_cost;
> -	int swappiness = mem_cgroup_swappiness(memcg);
> +	int swappiness = sc->swappiness ?
> +		*sc->swappiness : mem_cgroup_swappiness(memcg);

Should we use "unlikely" here to indicate that sc->swappiness is an
unexpected behavior? The current use case only applies to proactive
reclaim.

> @@ -2608,6 +2612,9 @@ static int get_swappiness(struct lruvec *lruvec, struct scan_control *sc)
>  	    mem_cgroup_get_nr_swap_pages(memcg) < MIN_LRU_BATCH)
>  		return 0;
>  
> +	if (sc->swappiness)
> +		return *sc->swappiness;

Also there.

[...]

My previous patch attempted to ensure fully deterministic semantics
under extreme swappiness. For example, when swappiness is set to 200,
only anonymous pages would be reclaimed. But due to the code in MGLRU's
isolate_folios(), if no folios of the requested type were scanned, it
will try the other type (we do not want it to attempt this behavior).
What do you think about extreme swappiness scenarios?
* Re: [PATCH 1/1] mm: add swappiness= arg to memory.reclaim
  2023-12-01  1:56 ` Huan Yang
@ 2023-12-01  2:05   ` Yosry Ahmed
  2023-12-01  2:13     ` Huan Yang
  0 siblings, 1 reply; 11+ messages in thread
From: Yosry Ahmed @ 2023-12-01 2:05 UTC
To: Huan Yang
Cc: Dan Schatzberg, Johannes Weiner, Roman Gushchin, Huan Yang,
    linux-kernel, cgroups, linux-mm, Michal Hocko, Shakeel Butt,
    Muchun Song, Andrew Morton, David Hildenbrand, Matthew Wilcox,
    Huang Ying, Kefeng Wang, Peter Xu, Vishal Moola (Oracle),
    Yue Zhao, Hugh Dickins

> > -	int swappiness = mem_cgroup_swappiness(memcg);
> > +	int swappiness = sc->swappiness ?
> > +		*sc->swappiness : mem_cgroup_swappiness(memcg);
>
> Should we use "unlikely" here to indicate that sc->swappiness is an
> unexpected behavior? The current use case only applies to proactive
> reclaim.

On a system that is not under memory pressure, the rate of proactive
reclaim could be higher than that of reactive reclaim. We should only
use likely/unlikely when it's obvious a scenario will happen most of
the time. I don't believe that's the case here.

[...]

> My previous patch attempted to ensure fully deterministic semantics
> under extreme swappiness. For example, when swappiness is set to 200,
> only anonymous pages would be reclaimed. But due to the code in MGLRU's
> isolate_folios(), if no folios of the requested type were scanned, it
> will try the other type (we do not want it to attempt this behavior).
> What do you think about extreme swappiness scenarios?

I think having different semantics between swappiness passed to
proactive reclaim and global swappiness can be confusing. If it's
needed to have a swappiness value that says "anon only no matter
what", perhaps we should introduce such a new value and make it
supported by both global and proactive reclaim swappiness? We could
support writing "max" or something similar instead of a special value
to mean that.

Writing such a value to global swappiness may cause problems and
premature OOMs IIUC, but that would be a misconfiguration. If we think
that's dangerous, we can introduce this new value but make it valid
only for proactive reclaim for now.
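For illustration, Yosry's "max" idea could sit on top of this patch's
token table roughly as follows. This is a sketch only: no such token
exists in the patch under review, and SWAPPINESS_ANON_ONLY is an
invented sentinel outside the valid 0..200 range:

#define SWAPPINESS_ANON_ONLY	201	/* invented, not in the patch */

enum {
	MEMORY_RECLAIM_SWAPPINESS = 0,
	MEMORY_RECLAIM_SWAPPINESS_MAX,
	MEMORY_RECLAIM_NULL,
};

static const match_table_t if_tokens = {
	{ MEMORY_RECLAIM_SWAPPINESS,     "swappiness=%d"},
	{ MEMORY_RECLAIM_SWAPPINESS_MAX, "swappiness=max"},
	{ MEMORY_RECLAIM_NULL,           NULL },
};

/* ... and in memory_reclaim()'s parse loop: */
	case MEMORY_RECLAIM_SWAPPINESS_MAX:
		swappiness = SWAPPINESS_ANON_ONLY;
		break;

vmscan would then have to treat SWAPPINESS_ANON_ONLY as "never fall
back to file", which is exactly the deterministic behavior Huan asks
about for MGLRU.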
* Re: [PATCH 1/1] mm: add swappiness= arg to memory.reclaim
  2023-12-01  2:05 ` Yosry Ahmed
@ 2023-12-01  2:13   ` Huan Yang
  2023-12-01  2:17     ` Yosry Ahmed
  0 siblings, 1 reply; 11+ messages in thread
From: Huan Yang @ 2023-12-01 2:13 UTC
To: Yosry Ahmed
Cc: Dan Schatzberg, Johannes Weiner, Roman Gushchin, Huan Yang,
    linux-kernel, cgroups, linux-mm, Michal Hocko, Shakeel Butt,
    Muchun Song, Andrew Morton, David Hildenbrand, Matthew Wilcox,
    Huang Ying, Kefeng Wang, Peter Xu, Vishal Moola (Oracle),
    Yue Zhao, Hugh Dickins

On 2023/12/1 10:05, Yosry Ahmed wrote:
> > Should we use "unlikely" here to indicate that sc->swappiness is an
> > unexpected behavior? The current use case only applies to proactive
> > reclaim.
> On a system that is not under memory pressure, the rate of proactive
> reclaim could be higher than that of reactive reclaim. We should only
> use likely/unlikely when it's obvious a scenario will happen most of
> the time. I don't believe that's the case here.

Not all vendors will use the proactive interface, and reactive reclaim
is normal system behavior. In that regard, I think it is appropriate to
add "unlikely".

[...]

> I think having different semantics between swappiness passed to
> proactive reclaim and global swappiness can be confusing. If it's
> needed to have a swappiness value that says "anon only no matter
> what", perhaps we should introduce such a new value and make it
> supported by both global and proactive reclaim swappiness? We could
> support writing "max" or something similar instead of a special value
> to mean that.

Yes, another hint would be more suitable for this scenario. However,
this patch does not seem to support that feature. Do you have a demand
for this scenario?

> Writing such a value to global swappiness may cause problems and
> premature OOMs IIUC, but that would be a misconfiguration. If we think
> that's dangerous, we can introduce this new value but make it valid
> only for proactive reclaim for now.
* Re: [PATCH 1/1] mm: add swappiness= arg to memory.reclaim
  2023-12-01  2:13 ` Huan Yang
@ 2023-12-01  2:17   ` Yosry Ahmed
  2023-12-01  2:24     ` Huan Yang
  0 siblings, 1 reply; 11+ messages in thread
From: Yosry Ahmed @ 2023-12-01 2:17 UTC
To: Huan Yang
Cc: Dan Schatzberg, Johannes Weiner, Roman Gushchin, Huan Yang,
    linux-kernel, cgroups, linux-mm, Michal Hocko, Shakeel Butt,
    Muchun Song, Andrew Morton, David Hildenbrand, Matthew Wilcox,
    Huang Ying, Kefeng Wang, Peter Xu, Vishal Moola (Oracle),
    Yue Zhao, Hugh Dickins

On Thu, Nov 30, 2023 at 6:14 PM Huan Yang <11133793@vivo.com> wrote:
>
> On 2023/12/1 10:05, Yosry Ahmed wrote:
> > On a system that is not under memory pressure, the rate of proactive
> > reclaim could be higher than that of reactive reclaim. We should only
> > use likely/unlikely when it's obvious a scenario will happen most of
> > the time. I don't believe that's the case here.
>
> Not all vendors will use the proactive interface, and reactive reclaim
> is normal system behavior. In that regard, I think it is appropriate
> to add "unlikely".

The general guidance is not to use likely/unlikely when it's not
certain, which I believe is the case here. I think the CPU will make
better decisions on its own than if we give it hints that are wrong in
some situations. Others please correct me if I am wrong.

[...]

> > I think having different semantics between swappiness passed to
> > proactive reclaim and global swappiness can be confusing. If it's
> > needed to have a swappiness value that says "anon only no matter
> > what", perhaps we should introduce such a new value and make it
> > supported by both global and proactive reclaim swappiness? We could
> > support writing "max" or something similar instead of a special value
> > to mean that.
>
> Yes, another hint would be more suitable for this scenario. However,
> this patch does not seem to support that feature. Do you have a demand
> for this scenario?

We do anonymous-only proactive reclaim in some setups, so it would be
nice to have. I am not sure if it's absolutely needed vs. just using
swappiness=200 and living with the possibility of reclaiming some file
pages.
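For reference, the branch annotation being debated would have read as
below in get_scan_count(); per the guidance above it was not adopted:

	/* Proposed, not applied: hint that sc->swappiness is rarely set */
	int swappiness = unlikely(sc->swappiness) ?
		*sc->swappiness : mem_cgroup_swappiness(memcg);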
* Re: [PATCH 1/1] mm: add swappiness= arg to memory.reclaim
  2023-12-01  2:17 ` Yosry Ahmed
@ 2023-12-01  2:24   ` Huan Yang
  0 siblings, 0 replies; 11+ messages in thread
From: Huan Yang @ 2023-12-01 2:24 UTC
To: Yosry Ahmed
Cc: Dan Schatzberg, Johannes Weiner, Roman Gushchin, Huan Yang,
    linux-kernel, cgroups, linux-mm, Michal Hocko, Shakeel Butt,
    Muchun Song, Andrew Morton, David Hildenbrand, Matthew Wilcox,
    Huang Ying, Kefeng Wang, Peter Xu, Vishal Moola (Oracle),
    Yue Zhao, Hugh Dickins

On 2023/12/1 10:17, Yosry Ahmed wrote:
> The general guidance is not to use likely/unlikely when it's not
> certain, which I believe is the case here.

OK, I will remember this part.

> I think the CPU will make better decisions on its own than if we give
> it hints that are wrong in some situations. Others please correct me
> if I am wrong.

No, you're right. The CPU is good at this.

[...]

> We do anonymous-only proactive reclaim in some setups, so it would be
> nice to have. I am not sure if it's absolutely needed vs. just using
> swappiness=200 and living with the possibility of reclaiming some file
> pages.

Right now swappiness=200 is sufficient for our scenario, but a hint
that only reclaims anonymous pages would have clear semantics suitable
for upper-level policy, rather than relying solely on the behavior of
swappiness.
end of thread, other threads:[~2023-12-07 19:00 UTC | newest]

Thread overview: 11+ messages
2023-12-06 16:28 [PATCH V2 0/1] Add swappiness argument to memory.reclaim Dan Schatzberg
2023-12-06 16:28 ` [PATCH 1/1] mm: add swappiness= arg " Dan Schatzberg
2023-12-07 19:00   ` Michal Koutný
 -- strict thread matches above, loose matches on Subject: below --
2023-11-30 15:36 [PATCH 0/1] Add swappiness argument " Dan Schatzberg
2023-11-30 15:36 ` [PATCH 1/1] mm: add swappiness= arg " Dan Schatzberg
2023-11-30 21:33   ` Andrew Morton
2023-11-30 21:46     ` Dan Schatzberg
2023-12-01  1:56   ` Huan Yang
2023-12-01  2:05     ` Yosry Ahmed
2023-12-01  2:13       ` Huan Yang
2023-12-01  2:17         ` Yosry Ahmed
2023-12-01  2:24           ` Huan Yang