* [PATCH 0/1] Add swappiness argument to memory.reclaim
@ 2023-11-30 15:36 Dan Schatzberg
From: Dan Schatzberg @ 2023-11-30 15:36 UTC (permalink / raw)
To: Johannes Weiner, Roman Gushchin, Yosry Ahmed, Huan Yang
Cc: linux-kernel, cgroups, linux-mm, Michal Hocko, Shakeel Butt,
Muchun Song, Andrew Morton, David Hildenbrand, Matthew Wilcox,
Huang Ying, Kefeng Wang, Peter Xu, Vishal Moola (Oracle),
Yue Zhao, Hugh Dickins
(Sorry for the resend - forgot to cc the mailing lists)
This patch proposes augmenting the memory.reclaim interface with a
swappiness=<val> argument that overrides the swappiness value for that instance
of proactive reclaim.
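For example (a sketch, assuming a cgroup v2 hierarchy mounted at /sys/fs/cgroup
and a hypothetical "workload" cgroup):

  # reclaim 1G from the cgroup, biasing this request heavily toward file pages
  echo "1G swappiness=10" > /sys/fs/cgroup/workload/memory.reclaim

  # the global and per-memcg swappiness settings are left untouched
  sysctl -n vm.swappiness

The override applies only to that single write; reclaim triggered without the
argument continues to use the existing global or per-memcg swappiness setting.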
Userspace proactive reclaimers use the memory.reclaim interface to trigger
reclaim. The memory.reclaim interface does not provide any way to affect the
balance of file vs anon during proactive reclaim. The only approach is to adjust
the vm.swappiness setting. However, there are a few reasons to control the
balance of file vs anon during proactive reclaim separately from reactive
reclaim:
* Swapout should be limited to manage SSD write endurance. In near-OOM
situations we are fine with lots of swap-out to avoid OOMs. As these are
typically rare events, they have relatively little impact on write endurance.
However, proactive reclaim runs continuously and so its impact on SSD write
endurance is more significant. Therefore it is desirable to control swap-out
for proactive reclaim separately from reactive reclaim.
* Some userspace OOM killers like systemd-oomd[1] support OOM killing on swap
exhaustion. This makes sense if the swap exhaustion is triggered due to
reactive reclaim but less so if it is triggered due to proactive reclaim (e.g.
one could see OOMs when free memory is ample but anon is just particularly
cold). Therefore, it's desirable to have proactive reclaim reduce or stop
swap-out before the threshold at which OOM killing occurs.
In the case of Meta's Senpai proactive reclaimer, we adjust vm.swappiness before
writing to memory.reclaim[2]; a shell sketch of this flow follows the list below.
This has been in production for nearly two years and has addressed our need to
control proactive vs reactive reclaim behavior, but it is still not ideal for a
number of reasons:
* vm.swappiness is a global setting, adjusting it can race/interfere with other
system administration that wishes to control vm.swappiness. In our case, we
need to disable Senpai before adjusting vm.swappiness.
* vm.swappiness is stateful - so a crash or restart of Senpai can leave a
misconfigured setting. This requires some additional management to record the
"desired" setting and ensure Senpai always adjusts to it.
With this patch, we avoid these downsides of adjusting vm.swappiness globally.
Previously, this exact interface addition was proposed by Yosry[3]. In response,
Roman proposed instead an interface to specify precise file/anon/slab reclaim
amounts[4]. More recently, Huan proposed this as well[5], and others similarly
questioned whether this was the proper interface.
Previous proposals sought to use this to allow proactive reclaimers to
effectively perform a custom reclaim algorithm by issuing proactive reclaim with
different settings to control file vs anon reclaim (e.g. to only reclaim anon
from some applications). Responses argued that adjusting swappiness is a poor
interface for custom reclaim.
In contrast, I argue in favor of a swappiness setting not as a way to implement
custom reclaim algorithms but rather to bias the balance of anon vs file due to
differences of proactive vs reactive reclaim. In this context, swappiness is the
existing interface for controlling this balance and this patch simply allows for
it to be configured differently for proactive vs reactive reclaim.
Specifying explicit amounts of anon vs file pages to reclaim feels inappropriate
for this purpose. Proactive reclaimers are unaware of the relative age of file
vs anon for a cgroup, which makes it difficult to manage proactive reclaim of
different memory pools. A proactive reclaimer would need to issue some number of
anon reclaim attempts separately from file reclaim attempts, which seems brittle
given that it's difficult to observe the impact.
[1]https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html
[2]https://github.com/facebookincubator/oomd/blob/main/src/oomd/plugins/Senpai.cpp#L585-L598
[3]https://lore.kernel.org/linux-mm/CAJD7tkbDpyoODveCsnaqBBMZEkDvshXJmNdbk51yKSNgD7aGdg@mail.gmail.com/
[4]https://lore.kernel.org/linux-mm/YoPHtHXzpK51F%2F1Z@carbon/
[5]https://lore.kernel.org/lkml/20231108065818.19932-1-link@vivo.com/
Dan Schatzberg (1):
mm: add swapiness= arg to memory.reclaim
include/linux/swap.h | 3 ++-
mm/memcontrol.c | 55 +++++++++++++++++++++++++++++++++++---------
mm/vmscan.c | 13 +++++++++--
3 files changed, 57 insertions(+), 14 deletions(-)
--
2.34.1
^ permalink raw reply [flat|nested] 27+ messages in thread* [PATCH 1/1] mm: add swapiness= arg to memory.reclaim 2023-11-30 15:36 [PATCH 0/1] Add swappiness argument to memory.reclaim Dan Schatzberg @ 2023-11-30 15:36 ` Dan Schatzberg 2023-11-30 21:33 ` Andrew Morton 2023-12-01 1:56 ` Huan Yang 2023-11-30 15:57 ` [PATCH 0/1] Add swappiness argument " Michal Hocko 2023-11-30 18:44 ` Shakeel Butt 2 siblings, 2 replies; 27+ messages in thread From: Dan Schatzberg @ 2023-11-30 15:36 UTC (permalink / raw) To: Johannes Weiner, Roman Gushchin, Yosry Ahmed, Huan Yang Cc: linux-kernel, cgroups, linux-mm, Michal Hocko, Shakeel Butt, Muchun Song, Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying, Kefeng Wang, Peter Xu, Vishal Moola (Oracle), Yue Zhao, Hugh Dickins Allow proactive reclaimers to submit an additional swappiness=<val> argument to memory.reclaim. This overrides the global or per-memcg swappiness setting for that reclaim attempt. For example: echo "2M swappiness=0" > /sys/fs/cgroup/memory.reclaim will perform reclaim on the rootcg with a swappiness setting of 0 (no swap) regardless of the vm.swappiness sysctl setting. Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com> --- include/linux/swap.h | 3 ++- mm/memcontrol.c | 55 +++++++++++++++++++++++++++++++++++--------- mm/vmscan.c | 13 +++++++++-- 3 files changed, 57 insertions(+), 14 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index f6dd6575b905..c6e309199f10 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -410,7 +410,8 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order, extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, unsigned long nr_pages, gfp_t gfp_mask, - unsigned int reclaim_options); + unsigned int reclaim_options, + int *swappiness); extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem, gfp_t gfp_mask, bool noswap, pg_data_t *pgdat, diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 1c1061df9cd1..ba1c89455ab0 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -63,6 +63,7 @@ #include <linux/resume_user_mode.h> #include <linux/psi.h> #include <linux/seq_buf.h> +#include <linux/parser.h> #include <linux/sched/isolation.h> #include "internal.h" #include <net/sock.h> @@ -2449,7 +2450,7 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg, psi_memstall_enter(&pflags); nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, - MEMCG_RECLAIM_MAY_SWAP); + MEMCG_RECLAIM_MAY_SWAP, NULL); psi_memstall_leave(&pflags); } while ((memcg = parent_mem_cgroup(memcg)) && !mem_cgroup_is_root(memcg)); @@ -2740,7 +2741,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask, psi_memstall_enter(&pflags); nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages, - gfp_mask, reclaim_options); + gfp_mask, reclaim_options, NULL); psi_memstall_leave(&pflags); if (mem_cgroup_margin(mem_over_limit) >= nr_pages) @@ -3660,7 +3661,7 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg, } if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, - memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP)) { + memsw ? 
0 : MEMCG_RECLAIM_MAY_SWAP, NULL)) { ret = -EBUSY; break; } @@ -3774,7 +3775,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg) return -EINTR; if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, - MEMCG_RECLAIM_MAY_SWAP)) + MEMCG_RECLAIM_MAY_SWAP, NULL)) nr_retries--; } @@ -6720,7 +6721,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of, } reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high, - GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP); + GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, NULL); if (!reclaimed && !nr_retries--) break; @@ -6769,7 +6770,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of, if (nr_reclaims) { if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max, - GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP)) + GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, NULL)) nr_reclaims--; continue; } @@ -6895,6 +6896,16 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of, return nbytes; } +enum { + MEMORY_RECLAIM_SWAPPINESS = 0, + MEMORY_RECLAIM_NULL, +}; + +static const match_table_t if_tokens = { + { MEMORY_RECLAIM_SWAPPINESS, "swappiness=%d"}, + { MEMORY_RECLAIM_NULL, NULL }, +}; + static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off) { @@ -6902,12 +6913,33 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, unsigned int nr_retries = MAX_RECLAIM_RETRIES; unsigned long nr_to_reclaim, nr_reclaimed = 0; unsigned int reclaim_options; - int err; + char *old_buf, *start; + substring_t args[MAX_OPT_ARGS]; + int swappiness = -1; buf = strstrip(buf); - err = page_counter_memparse(buf, "", &nr_to_reclaim); - if (err) - return err; + + old_buf = buf; + nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE; + if (buf == old_buf) + return -EINVAL; + + buf = strstrip(buf); + + while ((start = strsep(&buf, " ")) != NULL) { + if (!strlen(start)) + continue; + switch (match_token(start, if_tokens, args)) { + case MEMORY_RECLAIM_SWAPPINESS: + if (match_int(&args[0], &swappiness)) + return -EINVAL; + if (swappiness < 0 || swappiness > 200) + return -EINVAL; + break; + default: + return -EINVAL; + } + } reclaim_options = MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE; while (nr_reclaimed < nr_to_reclaim) { @@ -6926,7 +6958,8 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, reclaimed = try_to_free_mem_cgroup_pages(memcg, min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX), - GFP_KERNEL, reclaim_options); + GFP_KERNEL, reclaim_options, + swappiness == -1 ? NULL : &swappiness); if (!reclaimed && !nr_retries--) return -EAGAIN; diff --git a/mm/vmscan.c b/mm/vmscan.c index 506f8220c5fe..546704ea01e1 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -136,6 +136,9 @@ struct scan_control { /* Always discard instead of demoting to lower tier memory */ unsigned int no_demotion:1; + /* Swappiness value for reclaim, if NULL use memcg/global value */ + int *swappiness; + /* Allocation order */ s8 order; @@ -2327,7 +2330,8 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc, struct pglist_data *pgdat = lruvec_pgdat(lruvec); struct mem_cgroup *memcg = lruvec_memcg(lruvec); unsigned long anon_cost, file_cost, total_cost; - int swappiness = mem_cgroup_swappiness(memcg); + int swappiness = sc->swappiness ? 
+ *sc->swappiness : mem_cgroup_swappiness(memcg); u64 fraction[ANON_AND_FILE]; u64 denominator = 0; /* gcc */ enum scan_balance scan_balance; @@ -2608,6 +2612,9 @@ static int get_swappiness(struct lruvec *lruvec, struct scan_control *sc) mem_cgroup_get_nr_swap_pages(memcg) < MIN_LRU_BATCH) return 0; + if (sc->swappiness) + return *sc->swappiness; + return mem_cgroup_swappiness(memcg); } @@ -6433,7 +6440,8 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg, unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, unsigned long nr_pages, gfp_t gfp_mask, - unsigned int reclaim_options) + unsigned int reclaim_options, + int *swappiness) { unsigned long nr_reclaimed; unsigned int noreclaim_flag; @@ -6448,6 +6456,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, .may_unmap = 1, .may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP), .proactive = !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE), + .swappiness = swappiness, }; /* * Traverse the ZONELIST_FALLBACK zonelist of the current node to put -- 2.34.1 ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/1] mm: add swapiness= arg to memory.reclaim 2023-11-30 15:36 ` [PATCH 1/1] mm: add swapiness= arg " Dan Schatzberg @ 2023-11-30 21:33 ` Andrew Morton 2023-11-30 21:46 ` Dan Schatzberg 2023-12-01 1:56 ` Huan Yang 1 sibling, 1 reply; 27+ messages in thread From: Andrew Morton @ 2023-11-30 21:33 UTC (permalink / raw) To: Dan Schatzberg Cc: Johannes Weiner, Roman Gushchin, Yosry Ahmed, Huan Yang, linux-kernel, cgroups, linux-mm, Michal Hocko, Shakeel Butt, Muchun Song, David Hildenbrand, Matthew Wilcox, Huang Ying, Kefeng Wang, Peter Xu, Vishal Moola (Oracle), Yue Zhao, Hugh Dickins On Thu, 30 Nov 2023 07:36:54 -0800 Dan Schatzberg <schatzberg.dan@gmail.com> wrote: > Allow proactive reclaimers to submit an additional swappiness=<val> > argument to memory.reclaim. This overrides the global or per-memcg > swappiness setting for that reclaim attempt. > > For example: > > echo "2M swappiness=0" > /sys/fs/cgroup/memory.reclaim > > will perform reclaim on the rootcg with a swappiness setting of 0 (no > swap) regardless of the vm.swappiness sysctl setting. > > Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com> > --- > include/linux/swap.h | 3 ++- > mm/memcontrol.c | 55 +++++++++++++++++++++++++++++++++++--------- > mm/vmscan.c | 13 +++++++++-- Documentation/admin-guide/cgroup-v2.rst is feeling unloved! Please check whether this interface change can lead to non-backward-compatible userspace. If someone's script does the above echo command, will it break on older kernels? If so, does it matter? ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/1] mm: add swapiness= arg to memory.reclaim 2023-11-30 21:33 ` Andrew Morton @ 2023-11-30 21:46 ` Dan Schatzberg 0 siblings, 0 replies; 27+ messages in thread From: Dan Schatzberg @ 2023-11-30 21:46 UTC (permalink / raw) To: Andrew Morton Cc: Johannes Weiner, Roman Gushchin, Yosry Ahmed, Huan Yang, linux-kernel, cgroups, linux-mm, Michal Hocko, Shakeel Butt, Muchun Song, David Hildenbrand, Matthew Wilcox, Huang Ying, Kefeng Wang, Peter Xu, Vishal Moola (Oracle), Yue Zhao, Hugh Dickins On Thu, Nov 30, 2023 at 01:33:40PM -0800, Andrew Morton wrote: > On Thu, 30 Nov 2023 07:36:54 -0800 Dan Schatzberg <schatzberg.dan@gmail.com> wrote: > > > Allow proactive reclaimers to submit an additional swappiness=<val> > > argument to memory.reclaim. This overrides the global or per-memcg > > swappiness setting for that reclaim attempt. > > > > For example: > > > > echo "2M swappiness=0" > /sys/fs/cgroup/memory.reclaim > > > > will perform reclaim on the rootcg with a swappiness setting of 0 (no > > swap) regardless of the vm.swappiness sysctl setting. > > > > Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com> > > --- > > include/linux/swap.h | 3 ++- > > mm/memcontrol.c | 55 +++++++++++++++++++++++++++++++++++--------- > > mm/vmscan.c | 13 +++++++++-- > > Documentation/admin-guide/cgroup-v2.rst is feeling unloved! Oops - total oversight on my part. I'll add docs in a V2 if we can come to consensus on this interface change in general. > > Please check whether this interface change can lead to > non-backward-compatible userspace. If someone's script does the above > echo command, will it break on older kernels? If so, does it matter? Older kernels will return -EINVAL with such a command - that seems appropriate, indicating that the argument is not supported. ^ permalink raw reply [flat|nested] 27+ messages in thread
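A userspace fallback can key off that -EINVAL, for example (a minimal sketch,
assuming a cgroup v2 mount at /sys/fs/cgroup; the zero-size request reclaims
nothing on kernels that do accept the argument):

  if echo "0 swappiness=60" > /sys/fs/cgroup/memory.reclaim 2>/dev/null; then
    echo "swappiness= argument supported"
  else
    echo "swappiness= argument not supported; fall back to a plain size-only write"
  fi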
* Re: [PATCH 1/1] mm: add swapiness= arg to memory.reclaim 2023-11-30 15:36 ` [PATCH 1/1] mm: add swapiness= arg " Dan Schatzberg 2023-11-30 21:33 ` Andrew Morton @ 2023-12-01 1:56 ` Huan Yang 2023-12-01 2:05 ` Yosry Ahmed 1 sibling, 1 reply; 27+ messages in thread From: Huan Yang @ 2023-12-01 1:56 UTC (permalink / raw) To: Dan Schatzberg, Johannes Weiner, Roman Gushchin, Yosry Ahmed, Huan Yang Cc: linux-kernel, cgroups, linux-mm, Michal Hocko, Shakeel Butt, Muchun Song, Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying, Kefeng Wang, Peter Xu, Vishal Moola (Oracle), Yue Zhao, Hugh Dickins [-- Attachment #1: Type: text/plain, Size: 10492 bytes --] 在 2023/11/30 23:36, Dan Schatzberg 写道: > [?????????schatzberg.dan@gmail.com ?????????https://aka.ms/LearnAboutSenderIdentification,????????????] > > Allow proactive reclaimers to submit an additional swappiness=<val> > argument to memory.reclaim. This overrides the global or per-memcg > swappiness setting for that reclaim attempt. > > For example: > > echo "2M swappiness=0" > /sys/fs/cgroup/memory.reclaim > > will perform reclaim on the rootcg with a swappiness setting of 0 (no > swap) regardless of the vm.swappiness sysctl setting. > > Signed-off-by: Dan Schatzberg<schatzberg.dan@gmail.com> > --- > include/linux/swap.h | 3 ++- > mm/memcontrol.c | 55 +++++++++++++++++++++++++++++++++++--------- > mm/vmscan.c | 13 +++++++++-- > 3 files changed, 57 insertions(+), 14 deletions(-) > > diff --git a/include/linux/swap.h b/include/linux/swap.h > index f6dd6575b905..c6e309199f10 100644 > --- a/include/linux/swap.h > +++ b/include/linux/swap.h > @@ -410,7 +410,8 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order, > extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, > unsigned long nr_pages, > gfp_t gfp_mask, > - unsigned int reclaim_options); > + unsigned int reclaim_options, > + int *swappiness); > extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem, > gfp_t gfp_mask, bool noswap, > pg_data_t *pgdat, > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 1c1061df9cd1..ba1c89455ab0 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -63,6 +63,7 @@ > #include <linux/resume_user_mode.h> > #include <linux/psi.h> > #include <linux/seq_buf.h> > +#include <linux/parser.h> > #include <linux/sched/isolation.h> > #include "internal.h" > #include <net/sock.h> > @@ -2449,7 +2450,7 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg, > psi_memstall_enter(&pflags); > nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages, > gfp_mask, > - MEMCG_RECLAIM_MAY_SWAP); > + MEMCG_RECLAIM_MAY_SWAP, NULL); > psi_memstall_leave(&pflags); > } while ((memcg = parent_mem_cgroup(memcg)) && > !mem_cgroup_is_root(memcg)); > @@ -2740,7 +2741,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask, > > psi_memstall_enter(&pflags); > nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages, > - gfp_mask, reclaim_options); > + gfp_mask, reclaim_options, NULL); > psi_memstall_leave(&pflags); > > if (mem_cgroup_margin(mem_over_limit) >= nr_pages) > @@ -3660,7 +3661,7 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg, > } > > if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, > - memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP)) { > + memsw ? 
0 : MEMCG_RECLAIM_MAY_SWAP, NULL)) { > ret = -EBUSY; > break; > } > @@ -3774,7 +3775,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg) > return -EINTR; > > if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, > - MEMCG_RECLAIM_MAY_SWAP)) > + MEMCG_RECLAIM_MAY_SWAP, NULL)) > nr_retries--; > } > > @@ -6720,7 +6721,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of, > } > > reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high, > - GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP); > + GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, NULL); > > if (!reclaimed && !nr_retries--) > break; > @@ -6769,7 +6770,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of, > > if (nr_reclaims) { > if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max, > - GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP)) > + GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, NULL)) > nr_reclaims--; > continue; > } > @@ -6895,6 +6896,16 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of, > return nbytes; > } > > +enum { > + MEMORY_RECLAIM_SWAPPINESS = 0, > + MEMORY_RECLAIM_NULL, > +}; > + > +static const match_table_t if_tokens = { > + { MEMORY_RECLAIM_SWAPPINESS, "swappiness=%d"}, > + { MEMORY_RECLAIM_NULL, NULL }, > +}; > + > static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, > size_t nbytes, loff_t off) > { > @@ -6902,12 +6913,33 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, > unsigned int nr_retries = MAX_RECLAIM_RETRIES; > unsigned long nr_to_reclaim, nr_reclaimed = 0; > unsigned int reclaim_options; > - int err; > + char *old_buf, *start; > + substring_t args[MAX_OPT_ARGS]; > + int swappiness = -1; > > buf = strstrip(buf); > - err = page_counter_memparse(buf, "", &nr_to_reclaim); > - if (err) > - return err; > + > + old_buf = buf; > + nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE; > + if (buf == old_buf) > + return -EINVAL; > + > + buf = strstrip(buf); > + > + while ((start = strsep(&buf, " ")) != NULL) { > + if (!strlen(start)) > + continue; > + switch (match_token(start, if_tokens, args)) { > + case MEMORY_RECLAIM_SWAPPINESS: > + if (match_int(&args[0], &swappiness)) > + return -EINVAL; > + if (swappiness < 0 || swappiness > 200) > + return -EINVAL; > + break; > + default: > + return -EINVAL; > + } > + } > > reclaim_options = MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE; > while (nr_reclaimed < nr_to_reclaim) { > @@ -6926,7 +6958,8 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, > > reclaimed = try_to_free_mem_cgroup_pages(memcg, > min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX), > - GFP_KERNEL, reclaim_options); > + GFP_KERNEL, reclaim_options, > + swappiness == -1 ? NULL : &swappiness); > > if (!reclaimed && !nr_retries--) > return -EAGAIN; > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 506f8220c5fe..546704ea01e1 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -136,6 +136,9 @@ struct scan_control { > /* Always discard instead of demoting to lower tier memory */ > unsigned int no_demotion:1; > > + /* Swappiness value for reclaim, if NULL use memcg/global value */ > + int *swappiness; > + > /* Allocation order */ > s8 order; > > @@ -2327,7 +2330,8 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc, > struct pglist_data *pgdat = lruvec_pgdat(lruvec); > struct mem_cgroup *memcg = lruvec_memcg(lruvec); > unsigned long anon_cost, file_cost, total_cost; > - int swappiness = mem_cgroup_swappiness(memcg); > + int swappiness = sc->swappiness ? 
> + *sc->swappiness : mem_cgroup_swappiness(memcg); Should we use "unlikely" here to indicate that sc->swappiness is an unexpected behavior? Due to current use case only apply in proactive reclaim. > u64 fraction[ANON_AND_FILE]; > u64 denominator = 0; /* gcc */ > enum scan_balance scan_balance; > @@ -2608,6 +2612,9 @@ static int get_swappiness(struct lruvec *lruvec, struct scan_control *sc) > mem_cgroup_get_nr_swap_pages(memcg) < MIN_LRU_BATCH) > return 0; > > + if (sc->swappiness) > + return *sc->swappiness; Also there. > + > return mem_cgroup_swappiness(memcg); > } > > @@ -6433,7 +6440,8 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg, > unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, > unsigned long nr_pages, > gfp_t gfp_mask, > - unsigned int reclaim_options) > + unsigned int reclaim_options, > + int *swappiness) > { > unsigned long nr_reclaimed; > unsigned int noreclaim_flag; > @@ -6448,6 +6456,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, > .may_unmap = 1, > .may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP), > .proactive = !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE), > + .swappiness = swappiness, > }; > /* > * Traverse the ZONELIST_FALLBACK zonelist of the current node to put > -- > 2.34.1 My previous patch attempted to ensure fully deterministic semantics under extreme swappiness. For example, when swappiness is set to 200, only anonymous pages will be reclaimed. Due to code in MGLRU isolate_folios will try scan anon if no scanned, will try other type.(We do not want it to attempt this behavior.) How do you think about extreme swappiness scenarios? [-- Attachment #2: Type: text/html, Size: 12064 bytes --] ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/1] mm: add swapiness= arg to memory.reclaim 2023-12-01 1:56 ` Huan Yang @ 2023-12-01 2:05 ` Yosry Ahmed 2023-12-01 2:13 ` Huan Yang 0 siblings, 1 reply; 27+ messages in thread From: Yosry Ahmed @ 2023-12-01 2:05 UTC (permalink / raw) To: Huan Yang Cc: Dan Schatzberg, Johannes Weiner, Roman Gushchin, Huan Yang, linux-kernel, cgroups, linux-mm, Michal Hocko, Shakeel Butt, Muchun Song, Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying, Kefeng Wang, Peter Xu, Vishal Moola (Oracle), Yue Zhao, Hugh Dickins > @@ -2327,7 +2330,8 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc, > struct pglist_data *pgdat = lruvec_pgdat(lruvec); > struct mem_cgroup *memcg = lruvec_memcg(lruvec); > unsigned long anon_cost, file_cost, total_cost; > - int swappiness = mem_cgroup_swappiness(memcg); > + int swappiness = sc->swappiness ? > + *sc->swappiness : mem_cgroup_swappiness(memcg); > > Should we use "unlikely" here to indicate that sc->swappiness is an unexpected behavior? > Due to current use case only apply in proactive reclaim. On a system that is not under memory pressure, the rate of proactive reclaim could be higher than reactive reclaim. We should only use likely/unlikely when it's obvious a scenario will happen most of the time. I don't believe that's the case here. > > u64 fraction[ANON_AND_FILE]; > u64 denominator = 0; /* gcc */ > enum scan_balance scan_balance; > @@ -2608,6 +2612,9 @@ static int get_swappiness(struct lruvec *lruvec, struct scan_control *sc) > mem_cgroup_get_nr_swap_pages(memcg) < MIN_LRU_BATCH) > return 0; > > + if (sc->swappiness) > + return *sc->swappiness; > > Also there. > > + > return mem_cgroup_swappiness(memcg); > } > > @@ -6433,7 +6440,8 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg, > unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, > unsigned long nr_pages, > gfp_t gfp_mask, > - unsigned int reclaim_options) > + unsigned int reclaim_options, > + int *swappiness) > { > unsigned long nr_reclaimed; > unsigned int noreclaim_flag; > @@ -6448,6 +6456,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, > .may_unmap = 1, > .may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP), > .proactive = !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE), > + .swappiness = swappiness, > }; > /* > * Traverse the ZONELIST_FALLBACK zonelist of the current node to put > -- > 2.34.1 > > My previous patch attempted to ensure fully deterministic semantics under extreme swappiness. > For example, when swappiness is set to 200, only anonymous pages will be reclaimed. > Due to code in MGLRU isolate_folios will try scan anon if no scanned, will try other type.(We do not want > it to attempt this behavior.) > How do you think about extreme swappiness scenarios? I think having different semantics between swappiness passed to proactive reclaim and global swappiness can be confusing. If it's needed to have a swappiness value that says "anon only no matter what", perhaps we should introduce such a new value and make it supported by both global and proactive reclaim swappiness? We could support writing "max" or something similar instead of a special value to mean that. Writing such value to global swappiness may cause problems and premature OOMs IIUC, but that would be misconfiguration. If we think that's dangerous, we can introduce this new value but make it valid only for proactive reclaim for now. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/1] mm: add swapiness= arg to memory.reclaim 2023-12-01 2:05 ` Yosry Ahmed @ 2023-12-01 2:13 ` Huan Yang 2023-12-01 2:17 ` Yosry Ahmed 0 siblings, 1 reply; 27+ messages in thread From: Huan Yang @ 2023-12-01 2:13 UTC (permalink / raw) To: Yosry Ahmed Cc: Dan Schatzberg, Johannes Weiner, Roman Gushchin, Huan Yang, linux-kernel, cgroups, linux-mm, Michal Hocko, Shakeel Butt, Muchun Song, Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying, Kefeng Wang, Peter Xu, Vishal Moola (Oracle), Yue Zhao, Hugh Dickins 在 2023/12/1 10:05, Yosry Ahmed 写道: >> @@ -2327,7 +2330,8 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc, >> struct pglist_data *pgdat = lruvec_pgdat(lruvec); >> struct mem_cgroup *memcg = lruvec_memcg(lruvec); >> unsigned long anon_cost, file_cost, total_cost; >> - int swappiness = mem_cgroup_swappiness(memcg); >> + int swappiness = sc->swappiness ? >> + *sc->swappiness : mem_cgroup_swappiness(memcg); >> >> Should we use "unlikely" here to indicate that sc->swappiness is an unexpected behavior? >> Due to current use case only apply in proactive reclaim. > On a system that is not under memory pressure, the rate of proactive > reclaim could be higher than reactive reclaim. We should only use > likely/unlikely when it's obvious a scenario will happen most of the > time. I don't believe that's the case here. Not all vendors will use proactive interfaces, and reactive reclaim are a normal system behavior. In this regard, I think it is appropriate to add "unlikely". > >> u64 fraction[ANON_AND_FILE]; >> u64 denominator = 0; /* gcc */ >> enum scan_balance scan_balance; >> @@ -2608,6 +2612,9 @@ static int get_swappiness(struct lruvec *lruvec, struct scan_control *sc) >> mem_cgroup_get_nr_swap_pages(memcg) < MIN_LRU_BATCH) >> return 0; >> >> + if (sc->swappiness) >> + return *sc->swappiness; >> >> Also there. >> >> + >> return mem_cgroup_swappiness(memcg); >> } >> >> @@ -6433,7 +6440,8 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg, >> unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, >> unsigned long nr_pages, >> gfp_t gfp_mask, >> - unsigned int reclaim_options) >> + unsigned int reclaim_options, >> + int *swappiness) >> { >> unsigned long nr_reclaimed; >> unsigned int noreclaim_flag; >> @@ -6448,6 +6456,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, >> .may_unmap = 1, >> .may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP), >> .proactive = !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE), >> + .swappiness = swappiness, >> }; >> /* >> * Traverse the ZONELIST_FALLBACK zonelist of the current node to put >> -- >> 2.34.1 >> >> My previous patch attempted to ensure fully deterministic semantics under extreme swappiness. >> For example, when swappiness is set to 200, only anonymous pages will be reclaimed. >> Due to code in MGLRU isolate_folios will try scan anon if no scanned, will try other type.(We do not want >> it to attempt this behavior.) >> How do you think about extreme swappiness scenarios? > I think having different semantics between swappiness passed to > proactive reclaim and global swappiness can be confusing. If it's > needed to have a swappiness value that says "anon only no matter > what", perhaps we should introduce such a new value and make it > supported by both global and proactive reclaim swappiness? We could > support writing "max" or something similar instead of a special value > to mean that. Yes, use other hint more suitable for this scenario. 
However, from this patch, it seems that this feature is not supported. Do you have a demand for this scenario? > > Writing such value to global swappiness may cause problems and > premature OOMs IIUC, but that would be misconfiguration. If we think > that's dangerous, we can introduce this new value but make it valid > only for proactive reclaim for now. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/1] mm: add swapiness= arg to memory.reclaim 2023-12-01 2:13 ` Huan Yang @ 2023-12-01 2:17 ` Yosry Ahmed 2023-12-01 2:24 ` Huan Yang 0 siblings, 1 reply; 27+ messages in thread From: Yosry Ahmed @ 2023-12-01 2:17 UTC (permalink / raw) To: Huan Yang Cc: Dan Schatzberg, Johannes Weiner, Roman Gushchin, Huan Yang, linux-kernel, cgroups, linux-mm, Michal Hocko, Shakeel Butt, Muchun Song, Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying, Kefeng Wang, Peter Xu, Vishal Moola (Oracle), Yue Zhao, Hugh Dickins On Thu, Nov 30, 2023 at 6:14 PM Huan Yang <11133793@vivo.com> wrote: > > > 在 2023/12/1 10:05, Yosry Ahmed 写道: > >> @@ -2327,7 +2330,8 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc, > >> struct pglist_data *pgdat = lruvec_pgdat(lruvec); > >> struct mem_cgroup *memcg = lruvec_memcg(lruvec); > >> unsigned long anon_cost, file_cost, total_cost; > >> - int swappiness = mem_cgroup_swappiness(memcg); > >> + int swappiness = sc->swappiness ? > >> + *sc->swappiness : mem_cgroup_swappiness(memcg); > >> > >> Should we use "unlikely" here to indicate that sc->swappiness is an unexpected behavior? > >> Due to current use case only apply in proactive reclaim. > > On a system that is not under memory pressure, the rate of proactive > > reclaim could be higher than reactive reclaim. We should only use > > likely/unlikely when it's obvious a scenario will happen most of the > > time. I don't believe that's the case here. > Not all vendors will use proactive interfaces, and reactive reclaim are > a normal > system behavior. In this regard, I think it is appropriate to add > "unlikely". The general guidance is not to use likely/unlikely when it's not certain, which I believe is the case here. I think the CPU will make better decisions on its own than if we give it hints that's wrong in some situations. Others please correct me if I am wrong. > > > >> u64 fraction[ANON_AND_FILE]; > >> u64 denominator = 0; /* gcc */ > >> enum scan_balance scan_balance; > >> @@ -2608,6 +2612,9 @@ static int get_swappiness(struct lruvec *lruvec, struct scan_control *sc) > >> mem_cgroup_get_nr_swap_pages(memcg) < MIN_LRU_BATCH) > >> return 0; > >> > >> + if (sc->swappiness) > >> + return *sc->swappiness; > >> > >> Also there. > >> > >> + > >> return mem_cgroup_swappiness(memcg); > >> } > >> > >> @@ -6433,7 +6440,8 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg, > >> unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, > >> unsigned long nr_pages, > >> gfp_t gfp_mask, > >> - unsigned int reclaim_options) > >> + unsigned int reclaim_options, > >> + int *swappiness) > >> { > >> unsigned long nr_reclaimed; > >> unsigned int noreclaim_flag; > >> @@ -6448,6 +6456,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, > >> .may_unmap = 1, > >> .may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP), > >> .proactive = !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE), > >> + .swappiness = swappiness, > >> }; > >> /* > >> * Traverse the ZONELIST_FALLBACK zonelist of the current node to put > >> -- > >> 2.34.1 > >> > >> My previous patch attempted to ensure fully deterministic semantics under extreme swappiness. > >> For example, when swappiness is set to 200, only anonymous pages will be reclaimed. > >> Due to code in MGLRU isolate_folios will try scan anon if no scanned, will try other type.(We do not want > >> it to attempt this behavior.) > >> How do you think about extreme swappiness scenarios? 
> > I think having different semantics between swappiness passed to > > proactive reclaim and global swappiness can be confusing. If it's > > needed to have a swappiness value that says "anon only no matter > > what", perhaps we should introduce such a new value and make it > > supported by both global and proactive reclaim swappiness? We could > > support writing "max" or something similar instead of a special value > > to mean that. > > Yes, use other hint more suitable for this scenario. > > However, from this patch, it seems that this feature is not supported. > Do you have a demand for this scenario? We do anonymous-only proactive reclaim in some setups, so it would be nice to have. I am not sure if it's absolutely needed vs. just using swappiness=200 and living with the possibility of reclaiming some file pages. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/1] mm: add swapiness= arg to memory.reclaim 2023-12-01 2:17 ` Yosry Ahmed @ 2023-12-01 2:24 ` Huan Yang 0 siblings, 0 replies; 27+ messages in thread From: Huan Yang @ 2023-12-01 2:24 UTC (permalink / raw) To: Yosry Ahmed Cc: Dan Schatzberg, Johannes Weiner, Roman Gushchin, Huan Yang, linux-kernel, cgroups, linux-mm, Michal Hocko, Shakeel Butt, Muchun Song, Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying, Kefeng Wang, Peter Xu, Vishal Moola (Oracle), Yue Zhao, Hugh Dickins 在 2023/12/1 10:17, Yosry Ahmed 写道: > On Thu, Nov 30, 2023 at 6:14 PM Huan Yang <11133793@vivo.com> wrote: >> >> 在 2023/12/1 10:05, Yosry Ahmed 写道: >>>> @@ -2327,7 +2330,8 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc, >>>> struct pglist_data *pgdat = lruvec_pgdat(lruvec); >>>> struct mem_cgroup *memcg = lruvec_memcg(lruvec); >>>> unsigned long anon_cost, file_cost, total_cost; >>>> - int swappiness = mem_cgroup_swappiness(memcg); >>>> + int swappiness = sc->swappiness ? >>>> + *sc->swappiness : mem_cgroup_swappiness(memcg); >>>> >>>> Should we use "unlikely" here to indicate that sc->swappiness is an unexpected behavior? >>>> Due to current use case only apply in proactive reclaim. >>> On a system that is not under memory pressure, the rate of proactive >>> reclaim could be higher than reactive reclaim. We should only use >>> likely/unlikely when it's obvious a scenario will happen most of the >>> time. I don't believe that's the case here. >> Not all vendors will use proactive interfaces, and reactive reclaim are >> a normal >> system behavior. In this regard, I think it is appropriate to add >> "unlikely". > The general guidance is not to use likely/unlikely when it's not > certain, which I believe is the case here. I think the CPU will make OK, I will remember this part. > better decisions on its own than if we give it hints that's wrong in > some situations. Others please correct me if I am wrong. No, you're right. CPU is good to do this. > >>>> u64 fraction[ANON_AND_FILE]; >>>> u64 denominator = 0; /* gcc */ >>>> enum scan_balance scan_balance; >>>> @@ -2608,6 +2612,9 @@ static int get_swappiness(struct lruvec *lruvec, struct scan_control *sc) >>>> mem_cgroup_get_nr_swap_pages(memcg) < MIN_LRU_BATCH) >>>> return 0; >>>> >>>> + if (sc->swappiness) >>>> + return *sc->swappiness; >>>> >>>> Also there. >>>> >>>> + >>>> return mem_cgroup_swappiness(memcg); >>>> } >>>> >>>> @@ -6433,7 +6440,8 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg, >>>> unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, >>>> unsigned long nr_pages, >>>> gfp_t gfp_mask, >>>> - unsigned int reclaim_options) >>>> + unsigned int reclaim_options, >>>> + int *swappiness) >>>> { >>>> unsigned long nr_reclaimed; >>>> unsigned int noreclaim_flag; >>>> @@ -6448,6 +6456,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, >>>> .may_unmap = 1, >>>> .may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP), >>>> .proactive = !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE), >>>> + .swappiness = swappiness, >>>> }; >>>> /* >>>> * Traverse the ZONELIST_FALLBACK zonelist of the current node to put >>>> -- >>>> 2.34.1 >>>> >>>> My previous patch attempted to ensure fully deterministic semantics under extreme swappiness. >>>> For example, when swappiness is set to 200, only anonymous pages will be reclaimed. >>>> Due to code in MGLRU isolate_folios will try scan anon if no scanned, will try other type.(We do not want >>>> it to attempt this behavior.) 
>>>> How do you think about extreme swappiness scenarios? >>> I think having different semantics between swappiness passed to >>> proactive reclaim and global swappiness can be confusing. If it's >>> needed to have a swappiness value that says "anon only no matter >>> what", perhaps we should introduce such a new value and make it >>> supported by both global and proactive reclaim swappiness? We could >>> support writing "max" or something similar instead of a special value >>> to mean that. >> Yes, use other hint more suitable for this scenario. >> >> However, from this patch, it seems that this feature is not supported. >> Do you have a demand for this scenario? > We do anonymous-only proactive reclaim in some setups, so it would be > nice to have. I am not sure if it's absolutely needed vs. just using > swappiness=200 and living with the possibility of reclaiming some file > pages. Right now, the scenario where swappiness=200 is sufficient for us, but having the tendency to only reclaim anonymous pages has a clear semantics that is suitable for upper-level strategy scenarios, rather than relying solely on the functionality of swappiness. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/1] Add swappiness argument to memory.reclaim 2023-11-30 15:36 [PATCH 0/1] Add swappiness argument to memory.reclaim Dan Schatzberg 2023-11-30 15:36 ` [PATCH 1/1] mm: add swapiness= arg " Dan Schatzberg @ 2023-11-30 15:57 ` Michal Hocko 2023-11-30 16:56 ` Johannes Weiner 2023-11-30 18:44 ` Shakeel Butt 2 siblings, 1 reply; 27+ messages in thread From: Michal Hocko @ 2023-11-30 15:57 UTC (permalink / raw) To: Dan Schatzberg Cc: Johannes Weiner, Roman Gushchin, Yosry Ahmed, Huan Yang, linux-kernel, cgroups, linux-mm, Shakeel Butt, Muchun Song, Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying, Kefeng Wang, Peter Xu, Vishal Moola (Oracle), Yue Zhao, Hugh Dickins On Thu 30-11-23 07:36:53, Dan Schatzberg wrote: [...] > In contrast, I argue in favor of a swappiness setting not as a way to implement > custom reclaim algorithms but rather to bias the balance of anon vs file due to > differences of proactive vs reactive reclaim. In this context, swappiness is the > existing interface for controlling this balance and this patch simply allows for > it to be configured differently for proactive vs reactive reclaim. I do agree that swappiness is a better interface than explicit anon/file but the problem with swappiness is that it is more of a hint for the reclaim rather than a real control. Just look at get_scan_count and its history. Not only its range has been extended also the extent when it is actually used has been changing all the time and I think it is not a stretch to assume that trend to continue. Now if we extend the user interface to trigger the reclaim do we expect that we always do SCAN_EQUAL if a user specifies swappiness or are we OK that the implementation is free to ignore that "hint"? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/1] Add swappiness argument to memory.reclaim 2023-11-30 15:57 ` [PATCH 0/1] Add swappiness argument " Michal Hocko @ 2023-11-30 16:56 ` Johannes Weiner 2023-11-30 18:49 ` Shakeel Butt ` (2 more replies) 0 siblings, 3 replies; 27+ messages in thread From: Johannes Weiner @ 2023-11-30 16:56 UTC (permalink / raw) To: Michal Hocko Cc: Dan Schatzberg, Roman Gushchin, Yosry Ahmed, Huan Yang, linux-kernel, cgroups, linux-mm, Shakeel Butt, Muchun Song, Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying, Kefeng Wang, Peter Xu, Vishal Moola (Oracle), Yue Zhao, Hugh Dickins On Thu, Nov 30, 2023 at 04:57:41PM +0100, Michal Hocko wrote: > On Thu 30-11-23 07:36:53, Dan Schatzberg wrote: > [...] > > In contrast, I argue in favor of a swappiness setting not as a way to implement > > custom reclaim algorithms but rather to bias the balance of anon vs file due to > > differences of proactive vs reactive reclaim. In this context, swappiness is the > > existing interface for controlling this balance and this patch simply allows for > > it to be configured differently for proactive vs reactive reclaim. > > I do agree that swappiness is a better interface than explicit anon/file > but the problem with swappiness is that it is more of a hint for the reclaim > rather than a real control. Just look at get_scan_count and its history. > Not only its range has been extended also the extent when it is actually > used has been changing all the time and I think it is not a stretch to > assume that trend to continue. Right, we did tweak the edge behavior of e.g. swappiness=0. And we extended the range to express "anon is cheaper than file", which wasn't possible before, to support the compressed memory case. However, its meaning and impact has been remarkably stable over the years: it allows userspace to specify the relative cost of paging IO between file and anon pages. This comment is from 2.6.28: /* * With swappiness at 100, anonymous and file have the same priority. * This scanning priority is essentially the inverse of IO cost. */ anon_prio = sc->swappiness; file_prio = 200 - sc->swappiness; And this is it today: /* * Calculate the pressure balance between anon and file pages. * * The amount of pressure we put on each LRU is inversely * proportional to the cost of reclaiming each list, as * determined by the share of pages that are refaulting, times * the relative IO cost of bringing back a swapped out * anonymous page vs reloading a filesystem page (swappiness). * * Although we limit that influence to ensure no list gets * left behind completely: at least a third of the pressure is * applied, before swappiness. * * With swappiness at 100, anon and file have equal IO cost. */ total_cost = sc->anon_cost + sc->file_cost; anon_cost = total_cost + sc->anon_cost; file_cost = total_cost + sc->file_cost; total_cost = anon_cost + file_cost; ap = swappiness * (total_cost + 1); ap /= anon_cost + 1; fp = (200 - swappiness) * (total_cost + 1); fp /= file_cost + 1; So swappiness still means the same it did 15 years ago. We haven't changed the default swappiness setting, and we haven't broken any existing swappiness configurations through VM changes in that time. There are a few scenarios where swappiness doesn't apply: - No swap. Oh well, that seems reasonable. - Priority=0. This applies to near-OOM situations where the MM system tries to save itself. This isn't a range in which proactive reclaimers (should) operate. - sc->file_is_tiny. This doesn't apply to cgroup reclaim and thus proactive reclaim. 
- sc->cache_trim_mode. This implements clean cache dropbehind, and applies in the presence of large, non-refaulting inactive cache. The assumption there is that this data is reclaimable without involving IO to evict, and without the expectation of refault IO in the future. Without IO involvement, the relative IO cost isn't a factor. This will back off when refaults are observed, and the IO cost setting is then taken into account again as expected. If you consider swappiness to mean "reclaim what I ask you to", then this would override that, yes. But in the definition of relative IO cost, this decision making is permissible. Note that this applies to the global swappiness setting as well, and nobody has complained about it. So I wouldn't say it's merely a reclaim hint. It controls a very concrete and influential factor in VM decision making. And since the global swappiness is long-established ABI, I don't expect its meaning to change significantly any time soon. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/1] Add swappiness argument to memory.reclaim 2023-11-30 16:56 ` Johannes Weiner @ 2023-11-30 18:49 ` Shakeel Butt 2023-11-30 19:47 ` Dan Schatzberg 2023-12-01 9:33 ` Michal Hocko 2 siblings, 0 replies; 27+ messages in thread From: Shakeel Butt @ 2023-11-30 18:49 UTC (permalink / raw) To: Johannes Weiner Cc: Michal Hocko, Dan Schatzberg, Roman Gushchin, Yosry Ahmed, Huan Yang, linux-kernel, cgroups, linux-mm, Muchun Song, Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying, Kefeng Wang, Peter Xu, Vishal Moola (Oracle), Yue Zhao, Hugh Dickins On Thu, Nov 30, 2023 at 11:56:42AM -0500, Johannes Weiner wrote: > On Thu, Nov 30, 2023 at 04:57:41PM +0100, Michal Hocko wrote: > > On Thu 30-11-23 07:36:53, Dan Schatzberg wrote: > > [...] > > > In contrast, I argue in favor of a swappiness setting not as a way to implement > > > custom reclaim algorithms but rather to bias the balance of anon vs file due to > > > differences of proactive vs reactive reclaim. In this context, swappiness is the > > > existing interface for controlling this balance and this patch simply allows for > > > it to be configured differently for proactive vs reactive reclaim. > > > > I do agree that swappiness is a better interface than explicit anon/file > > but the problem with swappiness is that it is more of a hint for the reclaim > > rather than a real control. Just look at get_scan_count and its history. > > Not only its range has been extended also the extent when it is actually > > used has been changing all the time and I think it is not a stretch to > > assume that trend to continue. > > Right, we did tweak the edge behavior of e.g. swappiness=0. And we > extended the range to express "anon is cheaper than file", which > wasn't possible before, to support the compressed memory case. > > However, its meaning and impact has been remarkably stable over the > years: it allows userspace to specify the relative cost of paging IO > between file and anon pages. This comment is from 2.6.28: > > /* > * With swappiness at 100, anonymous and file have the same priority. > * This scanning priority is essentially the inverse of IO cost. > */ > anon_prio = sc->swappiness; > file_prio = 200 - sc->swappiness; > > And this is it today: > > /* > * Calculate the pressure balance between anon and file pages. > * > * The amount of pressure we put on each LRU is inversely > * proportional to the cost of reclaiming each list, as > * determined by the share of pages that are refaulting, times > * the relative IO cost of bringing back a swapped out > * anonymous page vs reloading a filesystem page (swappiness). > * > * Although we limit that influence to ensure no list gets > * left behind completely: at least a third of the pressure is > * applied, before swappiness. > * > * With swappiness at 100, anon and file have equal IO cost. > */ > total_cost = sc->anon_cost + sc->file_cost; > anon_cost = total_cost + sc->anon_cost; > file_cost = total_cost + sc->file_cost; > total_cost = anon_cost + file_cost; > > ap = swappiness * (total_cost + 1); > ap /= anon_cost + 1; > > fp = (200 - swappiness) * (total_cost + 1); > fp /= file_cost + 1; > > So swappiness still means the same it did 15 years ago. We haven't > changed the default swappiness setting, and we haven't broken any > existing swappiness configurations through VM changes in that time. > > There are a few scenarios where swappiness doesn't apply: > > - No swap. Oh well, that seems reasonable. > > - Priority=0. 
This applies to near-OOM situations where the MM system > tries to save itself. This isn't a range in which proactive > reclaimers (should) operate. > > - sc->file_is_tiny. This doesn't apply to cgroup reclaim and thus > proactive reclaim. > > - sc->cache_trim_mode. This implements clean cache dropbehind, and > applies in the presence of large, non-refaulting inactive cache. The > assumption there is that this data is reclaimable without involving > IO to evict, and without the expectation of refault IO in the > future. Without IO involvement, the relative IO cost isn't a > factor. This will back off when refaults are observed, and the IO > cost setting is then taken into account again as expected. > > If you consider swappiness to mean "reclaim what I ask you to", then > this would override that, yes. But in the definition of relative IO > cost, this decision making is permissible. > > Note that this applies to the global swappiness setting as well, and > nobody has complained about it. > > So I wouldn't say it's merely a reclaim hint. It controls a very > concrete and influential factor in VM decision making. And since the > global swappiness is long-established ABI, I don't expect its meaning > to change significantly any time soon. Are you saying the edge case behavior of global swappiness and the user provided swappiness through memory.reclaim should remain same? ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/1] Add swappiness argument to memory.reclaim 2023-11-30 16:56 ` Johannes Weiner 2023-11-30 18:49 ` Shakeel Butt @ 2023-11-30 19:47 ` Dan Schatzberg 2023-11-30 20:30 ` Shakeel Butt 2023-12-01 9:33 ` Michal Hocko 2 siblings, 1 reply; 27+ messages in thread From: Dan Schatzberg @ 2023-11-30 19:47 UTC (permalink / raw) To: Johannes Weiner Cc: Michal Hocko, Roman Gushchin, Yosry Ahmed, Huan Yang, linux-kernel, cgroups, linux-mm, Shakeel Butt, Muchun Song, Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying, Kefeng Wang, Peter Xu, Vishal Moola (Oracle), Yue Zhao, Hugh Dickins On Thu, Nov 30, 2023 at 11:56:42AM -0500, Johannes Weiner wrote: > [...] > So I wouldn't say it's merely a reclaim hint. It controls a very > concrete and influential factor in VM decision making. And since the > global swappiness is long-established ABI, I don't expect its meaning > to change significantly any time soon. I want to add to this last point. While swappiness does not have terribly well-defined semantics - it is the (only?) existing mechanism to control balance between anon and file reclaim. I'm merely advocating for the ability to adjust swappiness during proactive reclaim separately from reactive reclaim. To what degree the behavior and semantics of swappiness change is a bit orthogonal here. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/1] Add swappiness argument to memory.reclaim 2023-11-30 19:47 ` Dan Schatzberg @ 2023-11-30 20:30 ` Shakeel Butt 2023-11-30 21:37 ` Dan Schatzberg 0 siblings, 1 reply; 27+ messages in thread From: Shakeel Butt @ 2023-11-30 20:30 UTC (permalink / raw) To: Dan Schatzberg Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Yosry Ahmed, Huan Yang, linux-kernel, cgroups, linux-mm, Muchun Song, Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying, Kefeng Wang, Peter Xu, Vishal Moola (Oracle), Yue Zhao, Hugh Dickins On Thu, Nov 30, 2023 at 11:47 AM Dan Schatzberg <schatzberg.dan@gmail.com> wrote: > > On Thu, Nov 30, 2023 at 11:56:42AM -0500, Johannes Weiner wrote: > > [...] > > So I wouldn't say it's merely a reclaim hint. It controls a very > > concrete and influential factor in VM decision making. And since the > > global swappiness is long-established ABI, I don't expect its meaning > > to change significantly any time soon. > > I want to add to this last point. While swappiness does not have > terribly well-defined semantics - it is the (only?) existing mechanism > to control balance between anon and file reclaim. I'm merely > advocating for the ability to adjust swappiness during proactive > reclaim separately from reactive reclaim. To what degree the behavior > and semantics of swappiness change is a bit orthogonal here. Let me ask my question in this chain as it might have been missed: Whatever the semantics of swappiness are (including the edge cases like no swap, file_is_tiny, trim cache), should the reclaim code treat the global swappiness and user-provided swappiness differently? ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/1] Add swappiness argument to memory.reclaim 2023-11-30 20:30 ` Shakeel Butt @ 2023-11-30 21:37 ` Dan Schatzberg 2023-11-30 21:52 ` Shakeel Butt 0 siblings, 1 reply; 27+ messages in thread From: Dan Schatzberg @ 2023-11-30 21:37 UTC (permalink / raw) To: Shakeel Butt Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Yosry Ahmed, Huan Yang, linux-kernel, cgroups, linux-mm, Muchun Song, Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying, Kefeng Wang, Peter Xu, Vishal Moola (Oracle), Yue Zhao, Hugh Dickins On Thu, Nov 30, 2023 at 12:30:27PM -0800, Shakeel Butt wrote: > On Thu, Nov 30, 2023 at 11:47 AM Dan Schatzberg > <schatzberg.dan@gmail.com> wrote: > > > > On Thu, Nov 30, 2023 at 11:56:42AM -0500, Johannes Weiner wrote: > > > [...] > > > So I wouldn't say it's merely a reclaim hint. It controls a very > > > concrete and influential factor in VM decision making. And since the > > > global swappiness is long-established ABI, I don't expect its meaning > > > to change significantly any time soon. > > > > I want to add to this last point. While swappiness does not have > > terribly well-defined semantics - it is the (only?) existing mechanism > > to control balance between anon and file reclaim. I'm merely > > advocating for the ability to adjust swappiness during proactive > > reclaim separately from reactive reclaim. To what degree the behavior > > and semantics of swappiness change is a bit orthogonal here. > > Let me ask my question in this chain as it might have been missed: > > Whatever the semantics of swappiness are (including the edge cases > like no swap, file_is_tiny, trim cache), should the reclaim code treat > the global swappiness and user-provided swappiness differently? I can't think of any reason why we would want swappiness interpreted differently if it's provided at proactive reclaim time vs globally. Did you have something in mind here? ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/1] Add swappiness argument to memory.reclaim
2023-11-30 21:37 ` Dan Schatzberg
@ 2023-11-30 21:52 ` Shakeel Butt
0 siblings, 0 replies; 27+ messages in thread
From: Shakeel Butt @ 2023-11-30 21:52 UTC (permalink / raw)
To: Dan Schatzberg
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Yosry Ahmed,
Huan Yang, linux-kernel, cgroups, linux-mm, Muchun Song,
Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
Kefeng Wang, Peter Xu, Vishal Moola (Oracle), Yue Zhao, Hugh Dickins

On Thu, Nov 30, 2023 at 1:37 PM Dan Schatzberg
<schatzberg.dan@gmail.com> wrote:
>
> On Thu, Nov 30, 2023 at 12:30:27PM -0800, Shakeel Butt wrote:
> > On Thu, Nov 30, 2023 at 11:47 AM Dan Schatzberg
> > <schatzberg.dan@gmail.com> wrote:
> > >
> > > On Thu, Nov 30, 2023 at 11:56:42AM -0500, Johannes Weiner wrote:
> > > > [...]
> > > > So I wouldn't say it's merely a reclaim hint. It controls a very
> > > > concrete and influential factor in VM decision making. And since the
> > > > global swappiness is long-established ABI, I don't expect its meaning
> > > > to change significantly any time soon.
> > >
> > > I want to add to this last point. While swappiness does not have
> > > terribly well-defined semantics - it is the (only?) existing mechanism
> > > to control balance between anon and file reclaim. I'm merely
> > > advocating for the ability to adjust swappiness during proactive
> > > reclaim separately from reactive reclaim. To what degree the behavior
> > > and semantics of swappiness change is a bit orthogonal here.
> >
> > Let me ask my question in this chain as it might have been missed:
> >
> > Whatever the semantics of swappiness are (including the edge cases
> > like no swap, file_is_tiny, trim cache), should the reclaim code treat
> > the global swappiness and user-provided swappiness differently?
>
> I can't think of any reason why we would want swappiness interpreted
> differently if it's provided at proactive reclaim time vs
> globally. Did you have something in mind here?

Nah just wanted to know what you are aiming for.
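The agreement reached in this sub-thread - that a swappiness= value passed to
memory.reclaim should mean exactly what the existing setting means, only scoped
to that one request - can be pictured with a small sketch. This is an
illustration, not the patch itself; the helper name and the -1 "not provided"
convention are assumptions made purely for this sketch.

/*
 * Illustrative sketch, not the actual patch: a per-request swappiness,
 * when supplied, simply substitutes for the value reclaim would have
 * used anyway; -1 is an assumed "no value provided" marker.
 */
#include <linux/memcontrol.h>
#include <linux/swap.h>

static int effective_swappiness(int request_swappiness, struct mem_cgroup *memcg)
{
	if (request_swappiness >= 0)
		return request_swappiness;
	/* Fall back to the same setting reactive reclaim uses today. */
	return mem_cgroup_swappiness(memcg);
}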
* Re: [PATCH 0/1] Add swappiness argument to memory.reclaim
2023-11-30 16:56 ` Johannes Weiner
2023-11-30 18:49 ` Shakeel Butt
2023-11-30 19:47 ` Dan Schatzberg
@ 2023-12-01 9:33 ` Michal Hocko
2023-12-01 15:49 ` Dan Schatzberg
2023-12-01 17:09 ` Johannes Weiner
2 siblings, 2 replies; 27+ messages in thread
From: Michal Hocko @ 2023-12-01 9:33 UTC (permalink / raw)
To: Johannes Weiner
Cc: Dan Schatzberg, Roman Gushchin, Yosry Ahmed, Huan Yang,
linux-kernel, cgroups, linux-mm, Shakeel Butt, Muchun Song,
Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
Kefeng Wang, Peter Xu, Vishal Moola (Oracle), Yue Zhao, Hugh Dickins

On Thu 30-11-23 11:56:42, Johannes Weiner wrote:
[...]
> So I wouldn't say it's merely a reclaim hint. It controls a very
> concrete and influential factor in VM decision making. And since the
> global swappiness is long-established ABI, I don't expect its meaning
> to change significantly any time soon.

As I've said I am more worried about potential future changes which
would modify existing, reduce or add more corner cases which would be
seen as a change of behavior from the user space POV. That means that we
would have to be really explicit about the fact that the reclaim is free
to override the swappiness provided by user. So essentially a best
effort interface without any actual guarantees. That surely makes it
harder to use. Is it still useable?

Btw. IIRC these concerns were part of the reason why memcg v2 doesn't
have swappiness interface. If we decide to export swappiness via
memory.reclaim interface does it mean we will do so on per-memcg level
as well?
--
Michal Hocko
SUSE Labs
* Re: [PATCH 0/1] Add swappiness argument to memory.reclaim
2023-12-01 9:33 ` Michal Hocko
@ 2023-12-01 15:49 ` Dan Schatzberg
2023-12-01 17:09 ` Johannes Weiner
1 sibling, 0 replies; 27+ messages in thread
From: Dan Schatzberg @ 2023-12-01 15:49 UTC (permalink / raw)
To: Michal Hocko
Cc: Johannes Weiner, Roman Gushchin, Yosry Ahmed, Huan Yang,
linux-kernel, cgroups, linux-mm, Shakeel Butt, Muchun Song,
Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
Kefeng Wang, Peter Xu, Vishal Moola (Oracle), Yue Zhao, Hugh Dickins

On Fri, Dec 01, 2023 at 10:33:01AM +0100, Michal Hocko wrote:
> On Thu 30-11-23 11:56:42, Johannes Weiner wrote:
> [...]
> > So I wouldn't say it's merely a reclaim hint. It controls a very
> > concrete and influential factor in VM decision making. And since the
> > global swappiness is long-established ABI, I don't expect its meaning
> > to change significantly any time soon.
>
> As I've said I am more worried about potential future changes which
> would modify existing, reduce or add more corner cases which would be
> seen as a change of behavior from the user space POV. That means that we
> would have to be really explicit about the fact that the reclaim is free
> to override the swappiness provided by user. So essentially a best
> effort interface without any actual guarantees. That surely makes it
> harder to use. Is it still useable?

For our needs (limiting swapout and avoiding swap-depletion) we rely
on two semantics of vm.swappiness:

1) Lower swappiness results in less swap-out, higher swappiness
results in more swap-out - for the same workload. Our proactive
reclaimer monitors swap-out and lowers swappiness in response if we
exceed our target swap-out rate.

2) swappiness = 0 results in no or very little swap-out. We rely on
this to avoid exhausting swap due to proactive reclaim and triggering
OOMs.

We already depend on these semantics of vm.swappiness *today*. I
think changing either of these would be seen as a behavior change
from the user space POV irrespective of this patch. The proposal in
this patch only allows for vm.swappiness (whatever its semantics) to
be configured separately for proactive reclaim.
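To make the control loop described above concrete, here is a hedged sketch of
one proactive-reclaim step against the proposed interface. The
"<size> swappiness=<val>" request format is what this series proposes; the
cgroup path, the 16M per-iteration amount, and the reclaim() helper are
illustrative assumptions, not part of the patch.

/*
 * Illustrative sketch only: issue proactive reclaim requests with an
 * explicit swappiness, per the interface proposed in this series.
 * The cgroup path and sizes below are made-up example values.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int reclaim(const char *cgroup, const char *request)
{
	char path[256];
	int fd;
	ssize_t ret;

	snprintf(path, sizeof(path), "%s/memory.reclaim", cgroup);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	/* The kernel may report an error if it reclaims less than requested. */
	ret = write(fd, request, strlen(request));
	close(fd);
	return ret < 0 ? -1 : 0;
}

int main(void)
{
	const char *cg = "/sys/fs/cgroup/workload.slice"; /* assumed path */

	/* Within the SSD write budget: let reclaim swap as it normally would. */
	reclaim(cg, "16M swappiness=60");

	/* Budget exceeded: keep trimming cold file pages but stop swap-out. */
	reclaim(cg, "16M swappiness=0");

	return 0;
}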
* Re: [PATCH 0/1] Add swappiness argument to memory.reclaim
2023-12-01 9:33 ` Michal Hocko
2023-12-01 15:49 ` Dan Schatzberg
@ 2023-12-01 17:09 ` Johannes Weiner
2023-12-04 15:23 ` Michal Hocko
2023-12-07 18:57 ` Michal Koutný
1 sibling, 2 replies; 27+ messages in thread
From: Johannes Weiner @ 2023-12-01 17:09 UTC (permalink / raw)
To: Michal Hocko
Cc: Dan Schatzberg, Roman Gushchin, Yosry Ahmed, Huan Yang,
linux-kernel, cgroups, linux-mm, Shakeel Butt, Muchun Song,
Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
Kefeng Wang, Peter Xu, Vishal Moola (Oracle), Yue Zhao, Hugh Dickins

On Fri, Dec 01, 2023 at 10:33:01AM +0100, Michal Hocko wrote:
> On Thu 30-11-23 11:56:42, Johannes Weiner wrote:
> [...]
> > So I wouldn't say it's merely a reclaim hint. It controls a very
> > concrete and influential factor in VM decision making. And since the
> > global swappiness is long-established ABI, I don't expect its meaning
> > to change significantly any time soon.
>
> As I've said I am more worried about potential future changes which
> would modify existing, reduce or add more corner cases which would be
> seen as a change of behavior from the user space POV. That means that we
> would have to be really explicit about the fact that the reclaim is free
> to override the swappiness provided by user. So essentially a best
> effort interface without any actual guarantees. That surely makes it
> harder to use. Is it still useable?

But it's not free to override the setting as it pleases. I wrote a
detailed list of the current exceptions, and why the user wouldn't
have strong expectations of swappiness being respected in those
cases. Having reasonable limitations is not the same as everything
being up for grabs.

Again, the swappiness setting is ABI, and people would definitely
complain if we ignored their request in an unexpected situation and
regressed their workloads.

I'm not against documenting the exceptions and limitations. Not just
for proactive reclaim, but for swappiness in general. But I don't
think it's fair to say that there are NO rules and NO userspace
contract around this parameter (and I'm the one who wrote most of the
balancing code that implements the swappiness control).

So considering what swappiness DOES provide, and the definition and
behavior to which we're tied by ABI rules, yes I do think it's useful
to control this from the proactive reclaim context. In fact, we know
it's useful, because we've been doing it for a while in production
now - just in a hacky way, and this patch is merely making it less
hacky.

> Btw. IIRC these concerns were part of the reason why memcg v2 doesn't
> have swappiness interface. If we decide to export swappiness via
> memory.reclaim interface does it mean we will do so on per-memcg level
> as well?

Well I'm the person who wrote the initial cgroup2 memory interface,
and I left it out because there was no clear usecase for why you'd
want to tweak it on a per-container basis.

But Dan did bring up a new and very concrete usecase: controlling for
write endurance. And it's not just a theoretical one, but a proven
real world application.

As far as adding a static memory.swappiness goes, I wouldn't add it
just because, but wait for a concrete usecase for that specifically.
I don't think Dan's rationale extends to it. But if a usecase comes
up and is convincing, I wouldn't be opposed to it.
* Re: [PATCH 0/1] Add swappiness argument to memory.reclaim
2023-12-01 17:09 ` Johannes Weiner
@ 2023-12-04 15:23 ` Michal Hocko
2023-12-05 16:19 ` Johannes Weiner
0 siblings, 1 reply; 27+ messages in thread
From: Michal Hocko @ 2023-12-04 15:23 UTC (permalink / raw)
To: Johannes Weiner
Cc: Dan Schatzberg, Roman Gushchin, Yosry Ahmed, Huan Yang,
linux-kernel, cgroups, linux-mm, Shakeel Butt, Muchun Song,
Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
Kefeng Wang, Peter Xu, Vishal Moola (Oracle), Yue Zhao, Hugh Dickins

On Fri 01-12-23 12:09:55, Johannes Weiner wrote:
> On Fri, Dec 01, 2023 at 10:33:01AM +0100, Michal Hocko wrote:
> > On Thu 30-11-23 11:56:42, Johannes Weiner wrote:
> > [...]
> > > So I wouldn't say it's merely a reclaim hint. It controls a very
> > > concrete and influential factor in VM decision making. And since the
> > > global swappiness is long-established ABI, I don't expect its meaning
> > > to change significantly any time soon.
> >
> > As I've said I am more worried about potential future changes which
> > would modify existing, reduce or add more corner cases which would be
> > seen as a change of behavior from the user space POV. That means that we
> > would have to be really explicit about the fact that the reclaim is free
> > to override the swappiness provided by user. So essentially a best
> > effort interface without any actual guarantees. That surely makes it
> > harder to use. Is it still useable?
>
> But it's not free to override the setting as it pleases. I wrote a
> detailed list of the current exceptions, and why the user wouldn't
> have strong expectations of swappiness being respected in those
> cases. Having reasonable limitations is not the same as everything
> being up for grabs.

Well, I was not suggesting that future changes would be intentionally
breaking swappiness. But look at the history, we've had times when
swappiness was ignored most of the time due to heavy page cache bias.
Now it is really hard to predict future reclaim changes but I can easily
imagine that IO refault cost to balance file vs. anon lrus would be part
of future reclaim improvements and extensions.

> Again, the swappiness setting is ABI, and people would definitely
> complain if we ignored their request in an unexpected situation and
> regressed their workloads.
>
> I'm not against documenting the exceptions and limitations. Not just
> for proactive reclaim, but for swappiness in general. But I don't
> think it's fair to say that there are NO rules and NO userspace
> contract around this parameter (and I'm the one who wrote most of the
> balancing code that implements the swappiness control).

Right, but the behavior might change considerably between different
kernel versions and that is something to be really careful about. One
thing I would really like to avoid is to provide any guarantee that
swappiness X and nr_to_reclaim produce an exact split of anon vs. file
pages reclaimed, or that this is a regression because $VER-1 behaved
that way. There might be very legitimate reasons to use different
heuristics in the memory reclaim.

Another option would be to drop any heuristics when swappiness is
provided for the memory.reclaim interface, which would be much more
predictable, but it would also diverge from the normal reclaim and
that is quite bad IMHO.
--
Michal Hocko
SUSE Labs
* Re: [PATCH 0/1] Add swappiness argument to memory.reclaim
2023-12-04 15:23 ` Michal Hocko
@ 2023-12-05 16:19 ` Johannes Weiner
0 siblings, 0 replies; 27+ messages in thread
From: Johannes Weiner @ 2023-12-05 16:19 UTC (permalink / raw)
To: Michal Hocko
Cc: Dan Schatzberg, Roman Gushchin, Yosry Ahmed, Huan Yang,
linux-kernel, cgroups, linux-mm, Shakeel Butt, Muchun Song,
Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
Kefeng Wang, Peter Xu, Vishal Moola (Oracle), Yue Zhao, Hugh Dickins

On Mon, Dec 04, 2023 at 04:23:47PM +0100, Michal Hocko wrote:
> On Fri 01-12-23 12:09:55, Johannes Weiner wrote:
> > On Fri, Dec 01, 2023 at 10:33:01AM +0100, Michal Hocko wrote:
> > > On Thu 30-11-23 11:56:42, Johannes Weiner wrote:
> > > [...]
> > > > So I wouldn't say it's merely a reclaim hint. It controls a very
> > > > concrete and influential factor in VM decision making. And since the
> > > > global swappiness is long-established ABI, I don't expect its meaning
> > > > to change significantly any time soon.
> > >
> > > As I've said I am more worried about potential future changes which
> > > would modify existing, reduce or add more corner cases which would be
> > > seen as a change of behavior from the user space POV. That means that we
> > > would have to be really explicit about the fact that the reclaim is free
> > > to override the swappiness provided by user. So essentially a best
> > > effort interface without any actual guarantees. That surely makes it
> > > harder to use. Is it still useable?
> >
> > But it's not free to override the setting as it pleases. I wrote a
> > detailed list of the current exceptions, and why the user wouldn't
> > have strong expectations of swappiness being respected in those
> > cases. Having reasonable limitations is not the same as everything
> > being up for grabs.
>
> Well, I was not suggesting that future changes would be intentionally
> breaking swappiness. But look at the history, we've had times when
> swappiness was ignored most of the time due to heavy page cache bias.
> Now it is really hard to predict future reclaim changes but I can easily
> imagine that IO refault cost to balance file vs. anon lrus would be part
> of future reclaim improvements and extensions.

That's a possibility, but it would be an explicit *replacement* of the
swappiness setting. Since swappiness is already exposed in more than
one way (global, cgroup1), truly overriding it would need to be
opt-in.

We could only remove the heavy page cache bias without a switch
because it didn't actually break user expectations. In fact it
followed them more closely, since previously with a swappiness=60 the
kernel might not swap at all even with a heavily thrashing cache.

The thing Linus keeps stressing is not that we cannot change behavior,
but that we cannot break (reasonable) existing setups. So we can make
improvements as long as we don't violate general user expectations or
generate IO patterns that are completely contradictory to the setting.

> > Again, the swappiness setting is ABI, and people would definitely
> > complain if we ignored their request in an unexpected situation and
> > regressed their workloads.
> >
> > I'm not against documenting the exceptions and limitations. Not just
> > for proactive reclaim, but for swappiness in general. But I don't
> > think it's fair to say that there are NO rules and NO userspace
> > contract around this parameter (and I'm the one who wrote most of the
> > balancing code that implements the swappiness control).
>
> Right, but the behavior might change considerably between different
> kernel versions and that is something to be really careful about. One
> thing I would really like to avoid is to provide any guarantee that
> swappiness X and nr_to_reclaim produce an exact split of anon vs. file
> pages reclaimed, or that this is a regression because $VER-1 behaved
> that way. There might be very legitimate reasons to use different
> heuristics in the memory reclaim.

Yes, it shouldn't mean any more than the global swappiness means.

> Another option would be to drop any heuristics when swappiness is
> provided for the memory.reclaim interface, which would be much more
> predictable, but it would also diverge from the normal reclaim and
> that is quite bad IMHO.

I would prefer to keep the semantics for global/reactive and proactive
reclaim the same. Making an existing tunable available in cgroup2 is
much lower risk than providing something new and different under the
same name.
* Re: [PATCH 0/1] Add swappiness argument to memory.reclaim
2023-12-01 17:09 ` Johannes Weiner
2023-12-04 15:23 ` Michal Hocko
@ 2023-12-07 18:57 ` Michal Koutný
1 sibling, 0 replies; 27+ messages in thread
From: Michal Koutný @ 2023-12-07 18:57 UTC (permalink / raw)
To: Johannes Weiner
Cc: Michal Hocko, Dan Schatzberg, Roman Gushchin, Yosry Ahmed,
Huan Yang, linux-kernel, cgroups, linux-mm, Shakeel Butt,
Muchun Song, Andrew Morton, David Hildenbrand, Matthew Wilcox,
Huang Ying, Kefeng Wang, Peter Xu, Vishal Moola (Oracle),
Yue Zhao, Hugh Dickins

On Fri, Dec 01, 2023 at 12:09:55PM -0500, Johannes Weiner <hannes@cmpxchg.org> wrote:
> Well I'm the person who wrote the initial cgroup2 memory interface,
> and I left it out because there was no clear usecase for why you'd
> want to tweak it on a per-container basis.

There is also the issue of the default value and its relation to the
global vm.swappiness. Which disappeared with v2. The "stateless"
passing of per-reclaim swappiness looks cleaner in this way.

0.02€,
Michal
* Re: [PATCH 0/1] Add swappiness argument to memory.reclaim
2023-11-30 15:36 [PATCH 0/1] Add swappiness argument to memory.reclaim Dan Schatzberg
2023-11-30 15:36 ` [PATCH 1/1] mm: add swapiness= arg " Dan Schatzberg
2023-11-30 15:57 ` [PATCH 0/1] Add swappiness argument " Michal Hocko
@ 2023-11-30 18:44 ` Shakeel Butt
2023-11-30 18:54 ` Matthew Wilcox
` (2 more replies)
2 siblings, 3 replies; 27+ messages in thread
From: Shakeel Butt @ 2023-11-30 18:44 UTC (permalink / raw)
To: Dan Schatzberg
Cc: Johannes Weiner, Roman Gushchin, Yosry Ahmed, Huan Yang,
linux-kernel, cgroups, linux-mm, Michal Hocko, Muchun Song,
Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
Kefeng Wang, Peter Xu, Vishal Moola (Oracle), Yue Zhao, Hugh Dickins

On Thu, Nov 30, 2023 at 07:36:53AM -0800, Dan Schatzberg wrote:
> (Sorry for the resend - forgot to cc the mailing lists)
>
> This patch proposes augmenting the memory.reclaim interface with a
> swappiness=<val> argument that overrides the swappiness value for that instance
> of proactive reclaim.
>
> Userspace proactive reclaimers use the memory.reclaim interface to trigger
> reclaim. The memory.reclaim interface does not allow for any way to effect the
> balance of file vs anon during proactive reclaim. The only approach is to adjust
> the vm.swappiness setting. However, there are a few reasons we look to control
> the balance of file vs anon during proactive reclaim, separately from reactive
> reclaim:
>
> * Swapout should be limited to manage SSD write endurance. In near-OOM

Is this about swapout to SSD only?

> situations we are fine with lots of swap-out to avoid OOMs. As these are
> typically rare events, they have relatively little impact on write endurance.
> However, proactive reclaim runs continuously and so its impact on SSD write
> endurance is more significant. Therefore it is desireable to control swap-out
> for proactive reclaim separately from reactive reclaim

This is understandable but swapout to zswap should be fine, right?
(Sorry I am not following the discussion on zswap patches from Nhat. Is
the answer there?)
* Re: [PATCH 0/1] Add swappiness argument to memory.reclaim
2023-11-30 18:44 ` Shakeel Butt
@ 2023-11-30 18:54 ` Matthew Wilcox
2023-11-30 19:39 ` Johannes Weiner
0 siblings, 1 reply; 27+ messages in thread
From: Matthew Wilcox @ 2023-11-30 18:54 UTC (permalink / raw)
To: Shakeel Butt
Cc: Dan Schatzberg, Johannes Weiner, Roman Gushchin, Yosry Ahmed,
Huan Yang, linux-kernel, cgroups, linux-mm, Michal Hocko,
Muchun Song, Andrew Morton, David Hildenbrand, Huang Ying,
Kefeng Wang, Peter Xu, Vishal Moola (Oracle), Yue Zhao, Hugh Dickins

On Thu, Nov 30, 2023 at 06:44:24PM +0000, Shakeel Butt wrote:
> On Thu, Nov 30, 2023 at 07:36:53AM -0800, Dan Schatzberg wrote:
> > * Swapout should be limited to manage SSD write endurance. In near-OOM
>
> Is this about swapout to SSD only?

If you're using spinning rust with its limit of, what, 200 seeks per
second, I'd suggest that swapout should also be limited but for
different reasons. We can set you up with a laptop with insufficient
RAM and a 5400rpm drive if you'd like to be convinced ...
* Re: [PATCH 0/1] Add swappiness argument to memory.reclaim
2023-11-30 18:54 ` Matthew Wilcox
@ 2023-11-30 19:39 ` Johannes Weiner
0 siblings, 0 replies; 27+ messages in thread
From: Johannes Weiner @ 2023-11-30 19:39 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Shakeel Butt, Dan Schatzberg, Roman Gushchin, Yosry Ahmed,
Huan Yang, linux-kernel, cgroups, linux-mm, Michal Hocko,
Muchun Song, Andrew Morton, David Hildenbrand, Huang Ying,
Kefeng Wang, Peter Xu, Vishal Moola (Oracle), Yue Zhao, Hugh Dickins

On Thu, Nov 30, 2023 at 06:54:37PM +0000, Matthew Wilcox wrote:
> On Thu, Nov 30, 2023 at 06:44:24PM +0000, Shakeel Butt wrote:
> > On Thu, Nov 30, 2023 at 07:36:53AM -0800, Dan Schatzberg wrote:
> > > * Swapout should be limited to manage SSD write endurance. In near-OOM
> >
> > Is this about swapout to SSD only?
>
> If you're using spinning rust with its limit of, what, 200 seeks per
> second, I'd suggest that swapout should also be limited but for
> different reasons. We can set you up with a laptop with insufficient
> RAM and a 5400rpm drive if you'd like to be convinced ...

^_^

Yeah, we don't enable disk swap on the hard drive hosts that are
left. This aspect is only about write endurance management on SSD
during proactive reclaim.
* Re: [PATCH 0/1] Add swappiness argument to memory.reclaim
2023-11-30 18:44 ` Shakeel Butt
2023-11-30 18:54 ` Matthew Wilcox
@ 2023-11-30 19:49 ` Johannes Weiner
2023-11-30 19:50 ` Dan Schatzberg
2 siblings, 0 replies; 27+ messages in thread
From: Johannes Weiner @ 2023-11-30 19:49 UTC (permalink / raw)
To: Shakeel Butt
Cc: Dan Schatzberg, Roman Gushchin, Yosry Ahmed, Huan Yang,
linux-kernel, cgroups, linux-mm, Michal Hocko, Muchun Song,
Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
Kefeng Wang, Peter Xu, Vishal Moola (Oracle), Yue Zhao, Hugh Dickins

On Thu, Nov 30, 2023 at 06:44:24PM +0000, Shakeel Butt wrote:
> On Thu, Nov 30, 2023 at 07:36:53AM -0800, Dan Schatzberg wrote:
> > (Sorry for the resend - forgot to cc the mailing lists)
> >
> > This patch proposes augmenting the memory.reclaim interface with a
> > swappiness=<val> argument that overrides the swappiness value for that instance
> > of proactive reclaim.
> >
> > Userspace proactive reclaimers use the memory.reclaim interface to trigger
> > reclaim. The memory.reclaim interface does not allow for any way to effect the
> > balance of file vs anon during proactive reclaim. The only approach is to adjust
> > the vm.swappiness setting. However, there are a few reasons we look to control
> > the balance of file vs anon during proactive reclaim, separately from reactive
> > reclaim:
> >
> > * Swapout should be limited to manage SSD write endurance. In near-OOM
>
> Is this about swapout to SSD only?
>
> > situations we are fine with lots of swap-out to avoid OOMs. As these are
> > typically rare events, they have relatively little impact on write endurance.
> > However, proactive reclaim runs continuously and so its impact on SSD write
> > endurance is more significant. Therefore it is desireable to control swap-out
> > for proactive reclaim separately from reactive reclaim
>
> This is understandable but swapout to zswap should be fine, right?
> (Sorry I am not following the discussion on zswap patches from Nhat. Is
> the answer there?)

Memory compression alone would be fine, yes. However, we don't use
zswap in all cgroups. Lower priority things are forced directly to
disk. Some workloads compress poorly and also go directly to disk for
better memory efficiency. On such cgroups, it's important for
proactive reclaim to manage swap rates to avoid burning out the
flash.

Note that zswap also does SSD writes during writeback. I know this
doesn't apply to Google because of the ghost files, but we have SSD
swapfiles behind zswap. And this part will become more relevant with
Nhat's enhanced writeback patches.
* Re: [PATCH 0/1] Add swappiness argument to memory.reclaim
2023-11-30 18:44 ` Shakeel Butt
2023-11-30 18:54 ` Matthew Wilcox
2023-11-30 19:49 ` Johannes Weiner
@ 2023-11-30 19:50 ` Dan Schatzberg
2 siblings, 0 replies; 27+ messages in thread
From: Dan Schatzberg @ 2023-11-30 19:50 UTC (permalink / raw)
To: Shakeel Butt
Cc: Johannes Weiner, Roman Gushchin, Yosry Ahmed, Huan Yang,
linux-kernel, cgroups, linux-mm, Michal Hocko, Muchun Song,
Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
Kefeng Wang, Peter Xu, Vishal Moola (Oracle), Yue Zhao, Hugh Dickins

On Thu, Nov 30, 2023 at 06:44:24PM +0000, Shakeel Butt wrote:
> [...]
> > * Swapout should be limited to manage SSD write endurance. In near-OOM
>
> Is this about swapout to SSD only?

Correct

> > situations we are fine with lots of swap-out to avoid OOMs. As these are
> > typically rare events, they have relatively little impact on write endurance.
> > However, proactive reclaim runs continuously and so its impact on SSD write
> > endurance is more significant. Therefore it is desireable to control swap-out
> > for proactive reclaim separately from reactive reclaim
>
> This is understandable but swapout to zswap should be fine, right?
> (Sorry I am not following the discussion on zswap patches from Nhat. Is
> the answer there?)

You're correct here as well - we're not concerned about swapout to
zswap as that does not impact SSD write endurance. This is not related
to Nhat's patches.
Thread overview: 27+ messages
2023-11-30 15:36 [PATCH 0/1] Add swappiness argument to memory.reclaim Dan Schatzberg
2023-11-30 15:36 ` [PATCH 1/1] mm: add swapiness= arg " Dan Schatzberg
2023-11-30 21:33 ` Andrew Morton
2023-11-30 21:46 ` Dan Schatzberg
2023-12-01 1:56 ` Huan Yang
2023-12-01 2:05 ` Yosry Ahmed
2023-12-01 2:13 ` Huan Yang
2023-12-01 2:17 ` Yosry Ahmed
2023-12-01 2:24 ` Huan Yang
2023-11-30 15:57 ` [PATCH 0/1] Add swappiness argument " Michal Hocko
2023-11-30 16:56 ` Johannes Weiner
2023-11-30 18:49 ` Shakeel Butt
2023-11-30 19:47 ` Dan Schatzberg
2023-11-30 20:30 ` Shakeel Butt
2023-11-30 21:37 ` Dan Schatzberg
2023-11-30 21:52 ` Shakeel Butt
2023-12-01 9:33 ` Michal Hocko
2023-12-01 15:49 ` Dan Schatzberg
2023-12-01 17:09 ` Johannes Weiner
2023-12-04 15:23 ` Michal Hocko
2023-12-05 16:19 ` Johannes Weiner
2023-12-07 18:57 ` Michal Koutný
2023-11-30 18:44 ` Shakeel Butt
2023-11-30 18:54 ` Matthew Wilcox
2023-11-30 19:39 ` Johannes Weiner
2023-11-30 19:49 ` Johannes Weiner
2023-11-30 19:50 ` Dan Schatzberg