* [PATCH] mm: memcg: Use larger chunks for proactive reclaim
@ 2024-01-31 16:24 T.J. Mercier
2024-01-31 17:50 ` Johannes Weiner
2024-02-01 13:57 ` Michal Koutný
0 siblings, 2 replies; 12+ messages in thread
From: T.J. Mercier @ 2024-01-31 16:24 UTC (permalink / raw)
To: tjmercier, Johannes Weiner, Michal Hocko, Roman Gushchin,
Shakeel Butt, Muchun Song, Andrew Morton, Efly Young
Cc: android-mm, yuzhao, cgroups, linux-mm, linux-kernel
Before 0388536ac291 ("mm:vmscan: fix inaccurate reclaim during proactive
reclaim") we passed the number of pages for the reclaim request directly
to try_to_free_mem_cgroup_pages, which could lead to significant
overreclaim in order to achieve fairness. After 0388536ac291 the number
of pages was limited to a maximum of 32 (SWAP_CLUSTER_MAX) to reduce
the amount of overreclaim. However, such a small chunk size caused a
regression in reclaim performance due to many more reclaim start/stop
cycles inside memory_reclaim.
Instead of limiting reclaim chunk size to the SWAP_CLUSTER_MAX constant,
adjust the chunk size proportionally with the number of pages left to
reclaim. This allows for higher reclaim efficiency with large chunk
sizes during the beginning of memory_reclaim, and reduces the amount of
potential overreclaim by using small chunk sizes as the total reclaim
amount is approached. Using 1/4 of the amount left to reclaim as the
chunk size gives a good compromise between reclaim performance and
overreclaim:
root - full reclaim       pages/sec  time (sec)
pre-0388536ac291      :   68047      10.46
post-0388536ac291     :   13742      inf
(reclaim-reclaimed)/4 :   67352      10.51

/uid_0 - 1G reclaim       pages/sec  time (sec)  overreclaim (MiB)
pre-0388536ac291      :   258822     1.12        107.8
post-0388536ac291     :   105174     2.49        3.5
(reclaim-reclaimed)/4 :   233396     1.12        -7.4

/uid_0 - full reclaim     pages/sec  time (sec)
pre-0388536ac291      :   72334      7.09
post-0388536ac291     :   38105      14.45
(reclaim-reclaimed)/4 :   72914      6.96
Fixes: 0388536ac291 ("mm:vmscan: fix inaccurate reclaim during proactive reclaim")
Signed-off-by: T.J. Mercier <tjmercier@google.com>
---
mm/memcontrol.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 46d8d02114cf..d68fb89eadd2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6977,7 +6977,8 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 			lru_add_drain_all();
 
 		reclaimed = try_to_free_mem_cgroup_pages(memcg,
-					min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
+					max((nr_to_reclaim - nr_reclaimed) / 4,
+					    (nr_to_reclaim - nr_reclaimed) % 4),
 					GFP_KERNEL, reclaim_options);
 
 		if (!reclaimed && !nr_retries--)
--
2.43.0.594.gd9cf4e227d-goog
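
For readers following along, here is a minimal userspace sketch of how the
proposed (nr_to_reclaim - nr_reclaimed) / 4 batching decays. It is not the
kernel code: SWAP_CLUSTER_MAX is copied by hand, a 4 KiB page size is
assumed, and each call is assumed to reclaim exactly what was requested,
which real reclaim does not guarantee.

#include <stdio.h>

#define SWAP_CLUSTER_MAX 32UL

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

int main(void)
{
	unsigned long nr_to_reclaim = 1UL << 18;	/* 1G in 4 KiB pages (assumed) */
	unsigned long nr_reclaimed = 0;
	int calls = 0;

	while (nr_reclaimed < nr_to_reclaim) {
		unsigned long delta = nr_to_reclaim - nr_reclaimed;
		/* the proposed request: a quarter of what is still missing */
		unsigned long batch = max_ul(delta / 4, SWAP_CLUSTER_MAX);

		printf("call %2d: request %6lu pages (%lu left)\n", ++calls, batch, delta);
		/* idealization: assume the call reclaims exactly what was asked */
		nr_reclaimed += batch;
	}
	return 0;
}

Under that idealization the request size shrinks by roughly a quarter each
pass until the SWAP_CLUSTER_MAX floor takes over, so a large request is
covered in a few dozen calls rather than thousands.
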
^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] mm: memcg: Use larger chunks for proactive reclaim
  2024-01-31 16:24 [PATCH] mm: memcg: Use larger chunks for proactive reclaim T.J. Mercier
@ 2024-01-31 17:50 ` Johannes Weiner
  2024-01-31 18:01 ` T.J. Mercier
  2024-02-01 13:57 ` Michal Koutný
  1 sibling, 1 reply; 12+ messages in thread
From: Johannes Weiner @ 2024-01-31 17:50 UTC (permalink / raw)
  To: T.J. Mercier
  Cc: Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
      Andrew Morton, Efly Young, android-mm, yuzhao, cgroups, linux-mm,
      linux-kernel

On Wed, Jan 31, 2024 at 04:24:41PM +0000, T.J. Mercier wrote:
<snip>
> @@ -6977,7 +6977,8 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
>  			lru_add_drain_all();
>
>  		reclaimed = try_to_free_mem_cgroup_pages(memcg,
> -				min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
> +				max((nr_to_reclaim - nr_reclaimed) / 4,
> +				    (nr_to_reclaim - nr_reclaimed) % 4),
>  				GFP_KERNEL, reclaim_options);
>
>  		if (!reclaimed && !nr_retries--)

I don't see why the % 4 is needed. It only kicks in when the delta
drops below 4, but try_to_free_mem_cgroup_pages() already has

	.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),

so it looks like dead code.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] mm: memcg: Use larger chunks for proactive reclaim
  2024-01-31 17:50 ` Johannes Weiner
@ 2024-01-31 18:01 ` T.J. Mercier
  2024-01-31 20:12 ` Johannes Weiner
  0 siblings, 1 reply; 12+ messages in thread
From: T.J. Mercier @ 2024-01-31 18:01 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
      Andrew Morton, Efly Young, android-mm, yuzhao, cgroups, linux-mm,
      linux-kernel

On Wed, Jan 31, 2024 at 9:51 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Jan 31, 2024 at 04:24:41PM +0000, T.J. Mercier wrote:
<snip>
> >  		reclaimed = try_to_free_mem_cgroup_pages(memcg,
> > -				min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
> > +				max((nr_to_reclaim - nr_reclaimed) / 4,
> > +				    (nr_to_reclaim - nr_reclaimed) % 4),
> >  				GFP_KERNEL, reclaim_options);
>
> I don't see why the % 4 is needed. It only kicks in when the delta
> drops below 4, but try_to_free_mem_cgroup_pages() already has
>
> 	.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
>
> so it looks like dead code.

That's right, it's only there for when the integer division reaches
zero. I didn't want to assume anything about the implementation of
try_to_free_mem_cgroup_pages, but I can just remove it entirely if
you'd like.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] mm: memcg: Use larger chunks for proactive reclaim
  2024-01-31 18:01 ` T.J. Mercier
@ 2024-01-31 20:12 ` Johannes Weiner
  0 siblings, 0 replies; 12+ messages in thread
From: Johannes Weiner @ 2024-01-31 20:12 UTC (permalink / raw)
  To: T.J. Mercier
  Cc: Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
      Andrew Morton, Efly Young, android-mm, yuzhao, cgroups, linux-mm,
      linux-kernel

On Wed, Jan 31, 2024 at 10:01:27AM -0800, T.J. Mercier wrote:
> On Wed, Jan 31, 2024 at 9:51 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
<snip>
> > I don't see why the % 4 is needed. It only kicks in when the delta
> > drops below 4, but try_to_free_mem_cgroup_pages() already has
> >
> > 	.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
> >
> > so it looks like dead code.
>
> That's right, it's only there for when the integer division reaches
> zero. I didn't want to assume anything about the implementation of
> try_to_free_mem_cgroup_pages, but I can just remove it entirely if
> you'd like.

What do others think?

We rely on the rounding up in a few other places and it's been doing
that for a decade. Maybe lampshade it for the benefit of the reader:

	/* Will converge on zero, but reclaim enforces a minimum */

but otherwise there is IMO no need to have defensive extra code.

^ permalink raw reply	[flat|nested] 12+ messages in thread

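To make the fate of the % 4 term concrete, here is a small throwaway check
(illustrative only; the max(nr_pages, SWAP_CLUSTER_MAX) clamp from
try_to_free_mem_cgroup_pages() is reproduced by hand): once that clamp is
applied, max(delta / 4, delta % 4) and a plain delta / 4 request the same
amount for every delta.

#include <stdio.h>

#define SWAP_CLUSTER_MAX 32UL

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

int main(void)
{
	/* a few small deltas plus a large one */
	unsigned long deltas[] = { 1, 2, 3, 5, 7, 11, 33, 100, 262144 };
	size_t i;

	for (i = 0; i < sizeof(deltas) / sizeof(deltas[0]); i++) {
		unsigned long d = deltas[i];
		unsigned long with_mod = max_ul(d / 4, d % 4);
		unsigned long plain = d / 4;

		/* last two columns: what reclaim would act on after its clamp */
		printf("delta %6lu: max(d/4, d%%4)=%5lu, d/4=%5lu, after clamp: %5lu vs %5lu\n",
		       d, with_mod, plain,
		       max_ul(with_mod, SWAP_CLUSTER_MAX),
		       max_ul(plain, SWAP_CLUSTER_MAX));
	}
	return 0;
}

The % 4 term can only win when delta is below 12, and in that range both
expressions are lifted to SWAP_CLUSTER_MAX anyway.
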
* Re: [PATCH] mm: memcg: Use larger chunks for proactive reclaim
  2024-01-31 16:24 [PATCH] mm: memcg: Use larger chunks for proactive reclaim T.J. Mercier
  2024-01-31 17:50 ` Johannes Weiner
@ 2024-02-01 13:57 ` Michal Koutný
  2024-02-01 15:34 ` Johannes Weiner
  1 sibling, 1 reply; 12+ messages in thread
From: Michal Koutný @ 2024-02-01 13:57 UTC (permalink / raw)
  To: T.J. Mercier
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
      Muchun Song, Andrew Morton, Efly Young, android-mm, yuzhao,
      cgroups, linux-mm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 697 bytes --]

Hello.

On Wed, Jan 31, 2024 at 04:24:41PM +0000, "T.J. Mercier" <tjmercier@google.com> wrote:
>  		reclaimed = try_to_free_mem_cgroup_pages(memcg,
> -				min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
> +				max((nr_to_reclaim - nr_reclaimed) / 4,
> +				    (nr_to_reclaim - nr_reclaimed) % 4),

The 1/4 factor looks like magic.

Commit 0388536ac291 says:
| In theory, the amount of reclaimed would be in [request, 2 * request).

Doesn't this suggest 1/2 as a better option? (I didn't pursue the
theory.)

Also IMO importantly, when nr_to_reclaim - nr_reclaimed is less than 8,
the formula gives arbitrary (unrelated to delta's magnitude) values.

Regards,
Michal

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] mm: memcg: Use larger chunks for proactive reclaim
  2024-02-01 13:57 ` Michal Koutný
@ 2024-02-01 15:34 ` Johannes Weiner
  2024-02-01 18:10 ` T.J. Mercier
  2024-02-02  5:02 ` Efly Young
  0 siblings, 2 replies; 12+ messages in thread
From: Johannes Weiner @ 2024-02-01 15:34 UTC (permalink / raw)
  To: Michal Koutný
  Cc: T.J. Mercier, Michal Hocko, Roman Gushchin, Shakeel Butt,
      Muchun Song, Andrew Morton, Efly Young, android-mm, yuzhao,
      cgroups, linux-mm, linux-kernel

On Thu, Feb 01, 2024 at 02:57:22PM +0100, Michal Koutný wrote:
> Hello.
>
> On Wed, Jan 31, 2024 at 04:24:41PM +0000, "T.J. Mercier" <tjmercier@google.com> wrote:
> >  		reclaimed = try_to_free_mem_cgroup_pages(memcg,
> > -				min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
> > +				max((nr_to_reclaim - nr_reclaimed) / 4,
> > +				    (nr_to_reclaim - nr_reclaimed) % 4),
>
> The 1/4 factor looks like magic.

It's just cutting the work into quarters to balance throughput with
goal accuracy. It's no more or less magic than DEF_PRIORITY being 12,
or SWAP_CLUSTER_MAX being 32.

> Commit 0388536ac291 says:
> | In theory, the amount of reclaimed would be in [request, 2 * request).

Looking at the code, I'm not quite sure if this can be read this
literally. Efly might be able to elaborate, but we do a full loop of
all nodes and cgroups in the tree before checking nr_to_reclaimed, and
rely on priority level for granularity. So request size and complexity
of the cgroup tree play a role. I don't know where the exact factor
two would come from.

IMO it's more accurate to phrase it like this:

Reclaim tries to balance nr_to_reclaim fidelity with fairness across
nodes and cgroups over which the pages are spread. As such, the bigger
the request, the bigger the absolute overreclaim error. Historic
in-kernel users of reclaim have used fixed, small request batches to
approach an appropriate reclaim rate over time. When we reclaim a user
request of arbitrary size, use decaying batches to manage error while
maintaining reasonable throughput.

> Doesn't this suggest 1/2 as a better option? (I didn't pursue the
> theory.)

That was TJ's first suggestion as well, but as per above I suggested
quartering as a safer option.

> Also IMO importantly, when nr_to_reclaim - nr_reclaimed is less than 8,
> the formula gives arbitrary (unrelated to delta's magnitude) values.

try_to_free_mem_cgroup_pages() rounds up to SWAP_CLUSTER_MAX. So the
error margin is much higher at the smaller end of requests anyway.
But practically speaking, users care much less if you reclaim 32 pages
when 16 were requested than if you reclaim 2G when 1G was requested.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] mm: memcg: Use larger chunks for proactive reclaim
  2024-02-01 15:34 ` Johannes Weiner
@ 2024-02-01 18:10 ` T.J. Mercier
  0 siblings, 0 replies; 12+ messages in thread
From: T.J. Mercier @ 2024-02-01 18:10 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Koutný, Michal Hocko, Roman Gushchin, Shakeel Butt,
      Muchun Song, Andrew Morton, Efly Young, android-mm, yuzhao,
      cgroups, linux-mm, linux-kernel

On Thu, Feb 1, 2024 at 7:34 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Feb 01, 2024 at 02:57:22PM +0100, Michal Koutný wrote:
> > Hello.
> >
> > On Wed, Jan 31, 2024 at 04:24:41PM +0000, "T.J. Mercier" <tjmercier@google.com> wrote:
> > >  		reclaimed = try_to_free_mem_cgroup_pages(memcg,
> > > -				min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
> > > +				max((nr_to_reclaim - nr_reclaimed) / 4,
> > > +				    (nr_to_reclaim - nr_reclaimed) % 4),
> >
> > The 1/4 factor looks like magic.
>
> It's just cutting the work into quarters to balance throughput with
> goal accuracy. It's no more or less magic than DEF_PRIORITY being 12,
> or SWAP_CLUSTER_MAX being 32.

Using SWAP_CLUSTER_MAX is sort of like having a really large divisor
instead of 4 (or 1 like before). I recorded the average number of
iterations required to complete the 1G reclaim for the measurements I
took and it looks like this:

pre-0388536ac291      : 1
post-0388536ac291     : 1814
(reclaim-reclaimed)/4 : 17

Given the results with /4, I don't think the perf we get here is
particularly sensitive to the number we choose, but it's definitely a
tradeoff.

<snip>

> > Also IMO importantly, when nr_to_reclaim - nr_reclaimed is less than 8,
> > the formula gives arbitrary (unrelated to delta's magnitude) values.
>
> try_to_free_mem_cgroup_pages() rounds up to SWAP_CLUSTER_MAX. So the
> error margin is much higher at the smaller end of requests anyway.
> But practically speaking, users care much less if you reclaim 32 pages
> when 16 were requested than if you reclaim 2G when 1G was requested.

I like Johannes's suggestion of just a comment instead of the mod op.
It's easier to read, slightly less generated code, and even if we
didn't have the .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX) in
try_to_free_mem_cgroup_pages, memory_reclaim would still get very
close to the target before running out of nr_retries.

^ permalink raw reply	[flat|nested] 12+ messages in thread

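The iteration counts above can be sanity-checked with a simple model.
Assuming each call reclaims exactly the requested amount (which it does
not — reclaim usually frees more per call, which is why the measured
numbers are lower), the three strategies need roughly the following
numbers of try_to_free_mem_cgroup_pages() calls for a 1G request
(262144 pages at an assumed 4 KiB page size):

#include <stdio.h>

#define SWAP_CLUSTER_MAX 32UL

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

/* calls needed if every call reclaimed exactly the amount requested */
static int calls_for(unsigned long target, unsigned long divisor)
{
	unsigned long reclaimed = 0;
	int calls = 0;

	while (reclaimed < target) {
		unsigned long delta = target - reclaimed;

		reclaimed += max_ul(delta / divisor, SWAP_CLUSTER_MAX);
		calls++;
	}
	return calls;
}

int main(void)
{
	unsigned long one_gig = 1UL << 18;	/* 1G in 4 KiB pages (assumed) */

	printf("whole delta per call (pre-0388536ac291)      : %d\n", calls_for(one_gig, 1));
	printf("SWAP_CLUSTER_MAX per call (post-0388536ac291): %lu\n",
	       one_gig / SWAP_CLUSTER_MAX);
	printf("delta / 4 per call (this patch)              : %d\n", calls_for(one_gig, 4));
	return 0;
}
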
* Re: [PATCH] mm: memcg: Use larger chunks for proactive reclaim
  2024-02-01 15:34 ` Johannes Weiner
  2024-02-01 18:10 ` T.J. Mercier
@ 2024-02-02  5:02 ` Efly Young
  2024-02-02 10:15 ` Michal Koutný
  1 sibling, 1 reply; 12+ messages in thread
From: Efly Young @ 2024-02-02 5:02 UTC (permalink / raw)
  To: hannes
  Cc: akpm, android-mm, cgroups, linux-kernel, linux-mm, mhocko,
      mkoutny, muchun.song, roman.gushchin, shakeelb, tjmercier,
      yangyifei03, yuzhao

[-- Attachment #1: Type: text/plain, Size: 2958 bytes --]

> On Thu, Feb 01, 2024 at 02:57:22PM +0100, Michal Koutný wrote:
<snip>
> > Commit 0388536ac291 says:
> > | In theory, the amount of reclaimed would be in [request, 2 * request).
>
> Looking at the code, I'm not quite sure if this can be read this
> literally. Efly might be able to elaborate, but we do a full loop of
> all nodes and cgroups in the tree before checking nr_to_reclaimed, and
> rely on priority level for granularity. So request size and complexity
> of the cgroup tree play a role. I don't know where the exact factor
> two would come from.

I'm sorry that this conclusion may be arbitrary. It might only suit
my case. In my case, I traced it looping twice every time before checking
nr_reclaimed, and it reclaimed less than my request size (1G) every time.
So I think the upper bound is 2 * request. But now it seems that this is
related to the cgroup tree I constructed, my system status, and my request
size (a relatively large chunk). So there are many influencing factors,
and a specific upper bound is not accurate.

> IMO it's more accurate to phrase it like this:
>
> Reclaim tries to balance nr_to_reclaim fidelity with fairness across
> nodes and cgroups over which the pages are spread. As such, the bigger
> the request, the bigger the absolute overreclaim error. Historic
> in-kernel users of reclaim have used fixed, small request batches to
> approach an appropriate reclaim rate over time. When we reclaim a user
> request of arbitrary size, use decaying batches to manage error while
> maintaining reasonable throughput.
<snip>
> try_to_free_mem_cgroup_pages() rounds up to SWAP_CLUSTER_MAX. So the
> error margin is much higher at the smaller end of requests anyway.
> But practically speaking, users care much less if you reclaim 32 pages
> when 16 were requested than if you reclaim 2G when 1G was requested.

Yes, I agree completely that the bigger the request, the bigger the
absolute overreclaim error. The focus now is the tradeoff between
accurate reclaim and efficient reclaim. I think TJ's test is suggestive.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Re: [PATCH] mm: memcg: Use larger chunks for proactive reclaim
  2024-02-02  5:02 ` Efly Young
@ 2024-02-02 10:15 ` Michal Koutný
  2024-02-02 18:22 ` T.J. Mercier
  0 siblings, 1 reply; 12+ messages in thread
From: Michal Koutný @ 2024-02-02 10:15 UTC (permalink / raw)
  To: Efly Young
  Cc: hannes, akpm, android-mm, cgroups, linux-kernel, linux-mm,
      mhocko, muchun.song, roman.gushchin, shakeelb, tjmercier, yuzhao

[-- Attachment #1: Type: text/plain, Size: 1930 bytes --]

On Fri, Feb 02, 2024 at 01:02:47PM +0800, Efly Young <yangyifei03@kuaishou.com> wrote:
> > Looking at the code, I'm not quite sure if this can be read this
> > literally. Efly might be able to elaborate, but we do a full loop of
> > all nodes and cgroups in the tree before checking nr_to_reclaimed, and
> > rely on priority level for granularity. So request size and complexity
> > of the cgroup tree play a role. I don't know where the exact factor
> > two would come from.
>
> I'm sorry that this conclusion may be arbitrary. It might just only suit
> for my case. In my case, I traced it loop twice every time before checking
> nr_reclaimed, and it reclaimed less than my request size(1G) every time.
> So I think the upper bound is 2 * request. But now it seems that this is
> related to cgroup tree I constucted and my system status and my request
> size(a relatively large chunk). So there are many influencing factors,
> a specific upper bound is not accurate.

Alright, thanks for the background.

> > IMO it's more accurate to phrase it like this:
> >
> > Reclaim tries to balance nr_to_reclaim fidelity with fairness across
> > nodes and cgroups over which the pages are spread. As such, the bigger
> > the request, the bigger the absolute overreclaim error. Historic
> > in-kernel users of reclaim have used fixed, small request batches to
> > approach an appropriate reclaim rate over time. When we reclaim a user
> > request of arbitrary size, use decaying batches to manage error while
> > maintaining reasonable throughput.

Hm, decay... So shouldn't the formula be

	nr_pages = delta <= SWAP_CLUSTER_MAX ? delta : (delta + 3*SWAP_CLUSTER_MAX) / 4

where

	delta = nr_to_reclaim - nr_reclaimed

?

(So that convergence for smaller deltas is same like original- and other
reclaims while conservative factor is applied for effectivity of higher
user requests.)

Thanks,
Michal

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Re: [PATCH] mm: memcg: Use larger chunks for proactive reclaim
  2024-02-02 10:15 ` Michal Koutný
@ 2024-02-02 18:22 ` T.J. Mercier
  2024-02-02 19:46 ` Michal Koutný
  0 siblings, 1 reply; 12+ messages in thread
From: T.J. Mercier @ 2024-02-02 18:22 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Efly Young, hannes, akpm, android-mm, cgroups, linux-kernel,
      linux-mm, mhocko, muchun.song, roman.gushchin, shakeelb, yuzhao

On Fri, Feb 2, 2024 at 2:15 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Fri, Feb 02, 2024 at 01:02:47PM +0800, Efly Young <yangyifei03@kuaishou.com> wrote:
<snip>
> Hm, decay... So shouldn't the formula be
>
> 	nr_pages = delta <= SWAP_CLUSTER_MAX ? delta : (delta + 3*SWAP_CLUSTER_MAX) / 4
>
> where
>
> 	delta = nr_to_reclaim - nr_reclaimed
>
> ?
>
> (So that convergence for smaller deltas is same like original- and other
> reclaims while conservative factor is applied for effectivity of higher
> user requests.)

Tapering out at 32 instead of 4 doesn't make much difference in
practice because of how far off the actually reclaimed amount can be
from the request size. We're talking thousands of pages of error for a
request size of a few megs, and hundreds of pages of error for
requests less than 100 pages.

So all of these should be more or less equivalent:

	delta <= SWAP_CLUSTER_MAX ? delta : (delta + 3*SWAP_CLUSTER_MAX) / 4
	max((nr_to_reclaim - nr_reclaimed) / 4, (nr_to_reclaim - nr_reclaimed) % 4)
	(nr_to_reclaim - nr_reclaimed) / 4 + 4
	(nr_to_reclaim - nr_reclaimed) / 4

I was just trying to avoid putting in a 0 for the request size with
the mod.

^ permalink raw reply	[flat|nested] 12+ messages in thread

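A quick throwaway comparison of the four expressions (with the
SWAP_CLUSTER_MAX clamp from try_to_free_mem_cgroup_pages() applied by
hand; values are illustrative only) shows they agree at the small end,
differ a little in the mid range, and are of the same magnitude
everywhere, which is the sense in which they are "more or less
equivalent":

#include <stdio.h>

#define SWAP_CLUSTER_MAX 32UL

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

int main(void)
{
	unsigned long deltas[] = { 1, 7, 33, 200, 4096, 262144 };
	size_t i;

	printf("%8s %10s %10s %10s %10s\n",
	       "delta", "ternary", "max-mod", "quarter+4", "quarter");
	for (i = 0; i < sizeof(deltas) / sizeof(deltas[0]); i++) {
		unsigned long d = deltas[i];
		unsigned long a = d <= SWAP_CLUSTER_MAX ? d : (d + 3 * SWAP_CLUSTER_MAX) / 4;
		unsigned long b = max_ul(d / 4, d % 4);
		unsigned long c = d / 4 + 4;
		unsigned long e = d / 4;

		/* what reclaim would act on after its own max(nr_pages, SWAP_CLUSTER_MAX) */
		printf("%8lu %10lu %10lu %10lu %10lu\n", d,
		       max_ul(a, SWAP_CLUSTER_MAX), max_ul(b, SWAP_CLUSTER_MAX),
		       max_ul(c, SWAP_CLUSTER_MAX), max_ul(e, SWAP_CLUSTER_MAX));
	}
	return 0;
}
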
* Re: Re: Re: [PATCH] mm: memcg: Use larger chunks for proactive reclaim
  2024-02-02 18:22 ` T.J. Mercier
@ 2024-02-02 19:46 ` Michal Koutný
  2024-02-02 21:42 ` T.J. Mercier
  0 siblings, 1 reply; 12+ messages in thread
From: Michal Koutný @ 2024-02-02 19:46 UTC (permalink / raw)
  To: T.J. Mercier
  Cc: Efly Young, hannes, akpm, android-mm, cgroups, linux-kernel,
      linux-mm, mhocko, muchun.song, roman.gushchin, shakeelb, yuzhao

[-- Attachment #1: Type: text/plain, Size: 751 bytes --]

On Fri, Feb 02, 2024 at 10:22:34AM -0800, "T.J. Mercier" <tjmercier@google.com> wrote:
> So all of these should be more or less equivalent:
>
> 	delta <= SWAP_CLUSTER_MAX ? delta : (delta + 3*SWAP_CLUSTER_MAX) / 4
> 	max((nr_to_reclaim - nr_reclaimed) / 4, (nr_to_reclaim - nr_reclaimed) % 4)
> 	(nr_to_reclaim - nr_reclaimed) / 4 + 4
> 	(nr_to_reclaim - nr_reclaimed) / 4
>
> I was just trying to avoid putting in a 0 for the request size with the mod.

The third variant would be simpler then. Modulo looks weird.

Oh, and I just realized that try_to_free_mem_cgroup_pages() does
max(nr_pages, SWAP_CLUSTER_MAX). Then I'd vote for the fourth variant +
possible comment about harmless 0.
(I'm sorry if this was discussed before.)

Michal

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Re: Re: [PATCH] mm: memcg: Use larger chunks for proactive reclaim
  2024-02-02 19:46 ` Michal Koutný
@ 2024-02-02 21:42 ` T.J. Mercier
  0 siblings, 0 replies; 12+ messages in thread
From: T.J. Mercier @ 2024-02-02 21:42 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Efly Young, hannes, akpm, android-mm, cgroups, linux-kernel,
      linux-mm, mhocko, muchun.song, roman.gushchin, shakeelb, yuzhao

On Fri, Feb 2, 2024 at 11:46 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Fri, Feb 02, 2024 at 10:22:34AM -0800, "T.J. Mercier" <tjmercier@google.com> wrote:
> > So all of these should be more or less equivalent:
> >
> > 	delta <= SWAP_CLUSTER_MAX ? delta : (delta + 3*SWAP_CLUSTER_MAX) / 4
> > 	max((nr_to_reclaim - nr_reclaimed) / 4, (nr_to_reclaim - nr_reclaimed) % 4)
> > 	(nr_to_reclaim - nr_reclaimed) / 4 + 4
> > 	(nr_to_reclaim - nr_reclaimed) / 4
> >
> > I was just trying to avoid putting in a 0 for the request size with the mod.
>
> The third variant would be simpler then. Modulo looks weird.
>
> Oh, and I just realized that try_to_free_mem_cgroup_pages() does
> max(nr_pages, SWAP_CLUSTER_MAX). Then I'd vote for the fourth variant +
> possible comment about harmless 0.
> (I'm sorry if this was discussed before.)
>
> Michal

Ok great, let's do that. Thanks for your input.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2024-02-02 21:43 UTC | newest]

Thread overview: 12+ messages
2024-01-31 16:24 [PATCH] mm: memcg: Use larger chunks for proactive reclaim T.J. Mercier
2024-01-31 17:50 ` Johannes Weiner
2024-01-31 18:01 ` T.J. Mercier
2024-01-31 20:12 ` Johannes Weiner
2024-02-01 13:57 ` Michal Koutný
2024-02-01 15:34 ` Johannes Weiner
2024-02-01 18:10 ` T.J. Mercier
2024-02-02  5:02 ` Efly Young
2024-02-02 10:15 ` Michal Koutný
2024-02-02 18:22 ` T.J. Mercier
2024-02-02 19:46 ` Michal Koutný
2024-02-02 21:42 ` T.J. Mercier