* [RFC PATCH 0/3] mm: memcontrol: delayed force empty
@ 2019-01-02 20:05 Yang Shi
2019-01-02 20:05 ` [PATCH 1/3] doc: memcontrol: fix the obsolete content about " Yang Shi
` (3 more replies)
0 siblings, 4 replies; 30+ messages in thread
From: Yang Shi @ 2019-01-02 20:05 UTC (permalink / raw)
To: mhocko, hannes, akpm; +Cc: yang.shi, linux-mm, linux-kernel
Currently, force empty reclaims memory synchronously when writing to
memory.force_empty. It may take some time to return, and subsequent
operations are blocked until it completes. Although it can be interrupted
by a signal, this still seems suboptimal.
Now css offline is handled by a worker, and the typical use case of force
empty is right before memcg offline. So, handling force empty in the css
offline worker sounds reasonable.
The user may write any value to memory.force_empty, but I suppose the most
commonly used values are 0 and 1. To not break existing applications,
writing 0 or 1 still does force empty synchronously; any other value tells
the kernel to do force empty in the css offline worker.
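For illustration, a rough sketch of how the two modes would be used under
the proposed semantics (the cgroup path and the value 2 are just examples;
any value other than 0 or 1 selects the delayed mode):

  # synchronous reclaim, same behavior as today
  echo 0 > /sys/fs/cgroup/memory/test/memory.force_empty

  # defer the reclaim to the css offline worker; rmdir should return
  # while the reclaim proceeds in the worker
  echo 2 > /sys/fs/cgroup/memory/test/memory.force_empty
  rmdir /sys/fs/cgroup/memory/test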
Patch #1: Fix some obsolete information about force_empty in the document
Patch #2: A minor improvement to skip swap for force_empty
Patch #3: Implement delayed force_empty
Yang Shi (3):
doc: memcontrol: fix the obsolete content about force empty
mm: memcontrol: do not try to do swap when force empty
mm: memcontrol: delay force empty to css offline
Documentation/cgroup-v1/memory.txt | 15 ++++++++++-----
include/linux/memcontrol.h | 2 ++
mm/memcontrol.c | 20 +++++++++++++++++++-
3 files changed, 31 insertions(+), 6 deletions(-)
^ permalink raw reply [flat|nested] 30+ messages in thread* [PATCH 1/3] doc: memcontrol: fix the obsolete content about force empty 2019-01-02 20:05 [RFC PATCH 0/3] mm: memcontrol: delayed force empty Yang Shi @ 2019-01-02 20:05 ` Yang Shi 2019-01-02 21:18 ` Shakeel Butt 2019-01-03 10:13 ` Michal Hocko 2019-01-02 20:05 ` [PATCH 2/3] mm: memcontrol: do not try to do swap when " Yang Shi ` (2 subsequent siblings) 3 siblings, 2 replies; 30+ messages in thread From: Yang Shi @ 2019-01-02 20:05 UTC (permalink / raw) To: mhocko, hannes, akpm; +Cc: yang.shi, linux-mm, linux-kernel We don't do page cache reparent anymore when offlining memcg, so update force empty related content accordingly. Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> --- Documentation/cgroup-v1/memory.txt | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/Documentation/cgroup-v1/memory.txt b/Documentation/cgroup-v1/memory.txt index 3682e99..8e2cb1d 100644 --- a/Documentation/cgroup-v1/memory.txt +++ b/Documentation/cgroup-v1/memory.txt @@ -70,7 +70,7 @@ Brief summary of control files. memory.soft_limit_in_bytes # set/show soft limit of memory usage memory.stat # show various statistics memory.use_hierarchy # set/show hierarchical account enabled - memory.force_empty # trigger forced move charge to parent + memory.force_empty # trigger forced page reclaim memory.pressure_level # set memory pressure notifications memory.swappiness # set/show swappiness parameter of vmscan (See sysctl's vm.swappiness) @@ -459,8 +459,9 @@ About use_hierarchy, see Section 6. the cgroup will be reclaimed and as many pages reclaimed as possible. The typical use case for this interface is before calling rmdir(). - Because rmdir() moves all pages to parent, some out-of-use page caches can be - moved to the parent. If you want to avoid that, force_empty will be useful. + Though rmdir() offlines memcg, but the memcg may still stay there due to + charged file caches. Some out-of-use page caches may keep charged until + memory pressure happens. If you want to avoid that, force_empty will be useful. Also, note that when memory.kmem.limit_in_bytes is set the charges due to kernel pages will still be seen. This is not considered a failure and the -- 1.8.3.1 ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 1/3] doc: memcontrol: fix the obsolete content about force empty 2019-01-02 20:05 ` [PATCH 1/3] doc: memcontrol: fix the obsolete content about " Yang Shi @ 2019-01-02 21:18 ` Shakeel Butt 2019-01-02 21:18 ` Shakeel Butt 2019-01-03 10:13 ` Michal Hocko 1 sibling, 1 reply; 30+ messages in thread From: Shakeel Butt @ 2019-01-02 21:18 UTC (permalink / raw) To: Yang Shi; +Cc: Michal Hocko, Johannes Weiner, Andrew Morton, Linux MM, LKML On Wed, Jan 2, 2019 at 12:07 PM Yang Shi <yang.shi@linux.alibaba.com> wrote: > > We don't do page cache reparent anymore when offlining memcg, so update > force empty related content accordingly. > > Cc: Michal Hocko <mhocko@suse.com> > Cc: Johannes Weiner <hannes@cmpxchg.org> > Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> > --- > Documentation/cgroup-v1/memory.txt | 7 ++++--- > 1 file changed, 4 insertions(+), 3 deletions(-) > > diff --git a/Documentation/cgroup-v1/memory.txt b/Documentation/cgroup-v1/memory.txt > index 3682e99..8e2cb1d 100644 > --- a/Documentation/cgroup-v1/memory.txt > +++ b/Documentation/cgroup-v1/memory.txt > @@ -70,7 +70,7 @@ Brief summary of control files. > memory.soft_limit_in_bytes # set/show soft limit of memory usage > memory.stat # show various statistics > memory.use_hierarchy # set/show hierarchical account enabled > - memory.force_empty # trigger forced move charge to parent > + memory.force_empty # trigger forced page reclaim > memory.pressure_level # set memory pressure notifications > memory.swappiness # set/show swappiness parameter of vmscan > (See sysctl's vm.swappiness) > @@ -459,8 +459,9 @@ About use_hierarchy, see Section 6. > the cgroup will be reclaimed and as many pages reclaimed as possible. > > The typical use case for this interface is before calling rmdir(). > - Because rmdir() moves all pages to parent, some out-of-use page caches can be > - moved to the parent. If you want to avoid that, force_empty will be useful. > + Though rmdir() offlines memcg, but the memcg may still stay there due to > + charged file caches. Some out-of-use page caches may keep charged until > + memory pressure happens. If you want to avoid that, force_empty will be useful. > > Also, note that when memory.kmem.limit_in_bytes is set the charges due to > kernel pages will still be seen. This is not considered a failure and the > -- > 1.8.3.1 > ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 1/3] doc: memcontrol: fix the obsolete content about force empty 2019-01-02 20:05 ` [PATCH 1/3] doc: memcontrol: fix the obsolete content about " Yang Shi 2019-01-02 21:18 ` Shakeel Butt @ 2019-01-03 10:13 ` Michal Hocko 1 sibling, 0 replies; 30+ messages in thread From: Michal Hocko @ 2019-01-03 10:13 UTC (permalink / raw) To: Yang Shi; +Cc: hannes, akpm, linux-mm, linux-kernel On Thu 03-01-19 04:05:31, Yang Shi wrote: > We don't do page cache reparent anymore when offlining memcg, so update > force empty related content accordingly. > > Cc: Michal Hocko <mhocko@suse.com> > Cc: Johannes Weiner <hannes@cmpxchg.org> > Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Thanks for the clean up. Acked-by: Michal Hocko <mhocko@suse.com> > --- > Documentation/cgroup-v1/memory.txt | 7 ++++--- > 1 file changed, 4 insertions(+), 3 deletions(-) > > diff --git a/Documentation/cgroup-v1/memory.txt b/Documentation/cgroup-v1/memory.txt > index 3682e99..8e2cb1d 100644 > --- a/Documentation/cgroup-v1/memory.txt > +++ b/Documentation/cgroup-v1/memory.txt > @@ -70,7 +70,7 @@ Brief summary of control files. > memory.soft_limit_in_bytes # set/show soft limit of memory usage > memory.stat # show various statistics > memory.use_hierarchy # set/show hierarchical account enabled > - memory.force_empty # trigger forced move charge to parent > + memory.force_empty # trigger forced page reclaim > memory.pressure_level # set memory pressure notifications > memory.swappiness # set/show swappiness parameter of vmscan > (See sysctl's vm.swappiness) > @@ -459,8 +459,9 @@ About use_hierarchy, see Section 6. > the cgroup will be reclaimed and as many pages reclaimed as possible. > > The typical use case for this interface is before calling rmdir(). > - Because rmdir() moves all pages to parent, some out-of-use page caches can be > - moved to the parent. If you want to avoid that, force_empty will be useful. > + Though rmdir() offlines memcg, but the memcg may still stay there due to > + charged file caches. Some out-of-use page caches may keep charged until > + memory pressure happens. If you want to avoid that, force_empty will be useful. > > Also, note that when memory.kmem.limit_in_bytes is set the charges due to > kernel pages will still be seen. This is not considered a failure and the > -- > 1.8.3.1 > -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH 2/3] mm: memcontrol: do not try to do swap when force empty 2019-01-02 20:05 [RFC PATCH 0/3] mm: memcontrol: delayed force empty Yang Shi 2019-01-02 20:05 ` [PATCH 1/3] doc: memcontrol: fix the obsolete content about " Yang Shi @ 2019-01-02 20:05 ` Yang Shi 2019-01-02 21:45 ` Shakeel Butt 2019-01-02 20:05 ` [PATCH 3/3] mm: memcontrol: delay force empty to css offline Yang Shi 2019-01-03 10:12 ` [RFC PATCH 0/3] mm: memcontrol: delayed force empty Michal Hocko 3 siblings, 1 reply; 30+ messages in thread From: Yang Shi @ 2019-01-02 20:05 UTC (permalink / raw) To: mhocko, hannes, akpm; +Cc: yang.shi, linux-mm, linux-kernel The typical usecase of force empty is to try to reclaim as much as possible memory before offlining a memcg. Since there should be no attached tasks to offlining memcg, the tasks anonymous pages would have already been freed or uncharged. Even though anonymous pages get swapped out, but they still get charged to swap space. So, it sounds pointless to do swap for force empty. I tried to dig into the history of this, it was introduced by commit 8c7c6e34a125 ("memcg: mem+swap controller core"), but there is not any clue about why it was done so at the first place. The below simple test script shows slight file cache reclaim improvement when swap is on. echo 3 > /proc/sys/vm/drop_caches mkdir /sys/fs/cgroup/memory/test echo 30 > /sys/fs/cgroup/memory/test/memory.swappiness echo $$ >/sys/fs/cgroup/memory/test/cgroup.procs cat /proc/meminfo | grep ^Cached|awk -F" " '{print $2}' dd if=/dev/zero of=/mnt/test bs=1M count=1024 ping localhost > /dev/null & echo 1 > /sys/fs/cgroup/memory/test/memory.force_empty killall ping echo $$ >/sys/fs/cgroup/memory/cgroup.procs cat /proc/meminfo | grep ^Cached|awk -F" " '{print $2}' rmdir /sys/fs/cgroup/memory/test cat /proc/meminfo | grep ^Cached|awk -F" " '{print $2}' The number of page cache is: w/o w/ before force empty 1088792 1088784 after force empty 41492 39428 reclaimed 1047300 1049356 Without doing swap, force empty can reclaim 2MB more memory in 1GB page cache. Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> --- mm/memcontrol.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 6e1469b..bbf39b5 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2872,7 +2872,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg) return -EINTR; progress = try_to_free_mem_cgroup_pages(memcg, 1, - GFP_KERNEL, true); + GFP_KERNEL, false); if (!progress) { nr_retries--; /* maybe some writeback is necessary */ -- 1.8.3.1 ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 2/3] mm: memcontrol: do not try to do swap when force empty 2019-01-02 20:05 ` [PATCH 2/3] mm: memcontrol: do not try to do swap when " Yang Shi @ 2019-01-02 21:45 ` Shakeel Butt 2019-01-02 21:45 ` Shakeel Butt 2019-01-03 16:56 ` Yang Shi 0 siblings, 2 replies; 30+ messages in thread From: Shakeel Butt @ 2019-01-02 21:45 UTC (permalink / raw) To: Yang Shi; +Cc: Michal Hocko, Johannes Weiner, Andrew Morton, Linux MM, LKML On Wed, Jan 2, 2019 at 12:06 PM Yang Shi <yang.shi@linux.alibaba.com> wrote: > > The typical usecase of force empty is to try to reclaim as much as > possible memory before offlining a memcg. Since there should be no > attached tasks to offlining memcg, the tasks anonymous pages would have > already been freed or uncharged. Anon pages can come from tmpfs files as well. > Even though anonymous pages get > swapped out, but they still get charged to swap space. So, it sounds > pointless to do swap for force empty. > I understand that force_empty is typically used before rmdir'ing a memcg but it might be used differently by some users. We use this interface to test memory reclaim behavior (anon and file). Anyways, I am not against changing the behavior, we can adapt internally but there might be other users using this interface differently. thanks, Shakeel ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 2/3] mm: memcontrol: do not try to do swap when force empty 2019-01-02 21:45 ` Shakeel Butt 2019-01-02 21:45 ` Shakeel Butt @ 2019-01-03 16:56 ` Yang Shi 2019-01-03 17:03 ` Shakeel Butt 1 sibling, 1 reply; 30+ messages in thread From: Yang Shi @ 2019-01-03 16:56 UTC (permalink / raw) To: Shakeel Butt; +Cc: Michal Hocko, Johannes Weiner, Andrew Morton, Linux MM, LKML On 1/2/19 1:45 PM, Shakeel Butt wrote: > On Wed, Jan 2, 2019 at 12:06 PM Yang Shi <yang.shi@linux.alibaba.com> wrote: >> The typical usecase of force empty is to try to reclaim as much as >> possible memory before offlining a memcg. Since there should be no >> attached tasks to offlining memcg, the tasks anonymous pages would have >> already been freed or uncharged. > Anon pages can come from tmpfs files as well. Yes, but they are charged to swap space as regular anon pages. > >> Even though anonymous pages get >> swapped out, but they still get charged to swap space. So, it sounds >> pointless to do swap for force empty. >> > I understand that force_empty is typically used before rmdir'ing a > memcg but it might be used differently by some users. We use this > interface to test memory reclaim behavior (anon and file). Thanks for sharing your usecase. So, you uses this for test only? > > Anyways, I am not against changing the behavior, we can adapt > internally but there might be other users using this interface > differently. Thanks. Yang > > thanks, > Shakeel ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 2/3] mm: memcontrol: do not try to do swap when force empty 2019-01-03 16:56 ` Yang Shi @ 2019-01-03 17:03 ` Shakeel Butt 2019-01-03 17:03 ` Shakeel Butt 2019-01-03 18:19 ` Yang Shi 0 siblings, 2 replies; 30+ messages in thread From: Shakeel Butt @ 2019-01-03 17:03 UTC (permalink / raw) To: Yang Shi; +Cc: Michal Hocko, Johannes Weiner, Andrew Morton, Linux MM, LKML On Thu, Jan 3, 2019 at 8:57 AM Yang Shi <yang.shi@linux.alibaba.com> wrote: > > > > On 1/2/19 1:45 PM, Shakeel Butt wrote: > > On Wed, Jan 2, 2019 at 12:06 PM Yang Shi <yang.shi@linux.alibaba.com> wrote: > >> The typical usecase of force empty is to try to reclaim as much as > >> possible memory before offlining a memcg. Since there should be no > >> attached tasks to offlining memcg, the tasks anonymous pages would have > >> already been freed or uncharged. > > Anon pages can come from tmpfs files as well. > > Yes, but they are charged to swap space as regular anon pages. > The point was the lifetime of tmpfs anon pages are not tied to any task. Even though there aren't any task attached to a memcg, the tmpfs anon pages will remain charged. Other than that, the old anon pages of a task which have migrated away might still be charged to the old memcg (if move_charge_at_immigrate is not set). > > > >> Even though anonymous pages get > >> swapped out, but they still get charged to swap space. So, it sounds > >> pointless to do swap for force empty. > >> > > I understand that force_empty is typically used before rmdir'ing a > > memcg but it might be used differently by some users. We use this > > interface to test memory reclaim behavior (anon and file). > > Thanks for sharing your usecase. So, you uses this for test only? > Yes. Shakeel ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 2/3] mm: memcontrol: do not try to do swap when force empty 2019-01-03 17:03 ` Shakeel Butt 2019-01-03 17:03 ` Shakeel Butt @ 2019-01-03 18:19 ` Yang Shi 1 sibling, 0 replies; 30+ messages in thread From: Yang Shi @ 2019-01-03 18:19 UTC (permalink / raw) To: Shakeel Butt; +Cc: Michal Hocko, Johannes Weiner, Andrew Morton, Linux MM, LKML On 1/3/19 9:03 AM, Shakeel Butt wrote: > On Thu, Jan 3, 2019 at 8:57 AM Yang Shi <yang.shi@linux.alibaba.com> wrote: >> >> >> On 1/2/19 1:45 PM, Shakeel Butt wrote: >>> On Wed, Jan 2, 2019 at 12:06 PM Yang Shi <yang.shi@linux.alibaba.com> wrote: >>>> The typical usecase of force empty is to try to reclaim as much as >>>> possible memory before offlining a memcg. Since there should be no >>>> attached tasks to offlining memcg, the tasks anonymous pages would have >>>> already been freed or uncharged. >>> Anon pages can come from tmpfs files as well. >> Yes, but they are charged to swap space as regular anon pages. >> > The point was the lifetime of tmpfs anon pages are not tied to any > task. Even though there aren't any task attached to a memcg, the tmpfs > anon pages will remain charged. Other than that, the old anon pages of > a task which have migrated away might still be charged to the old > memcg (if move_charge_at_immigrate is not set). Yes, my understanding is even though they are swapped out but they are still charged to memsw for cgroupv1. force_empty is supposed to reclaim as much as possible memory, here I'm supposed "reclaim" also means "uncharge". Even though the anon pages are still charged to the old memcg, it will be moved the new memcg when the old one is deleted, or the pages will be just released if the task is killed. So, IMHO, I don't see the point why swapping anon pages when doing force_empty. Thanks, Yang >>>> Even though anonymous pages get >>>> swapped out, but they still get charged to swap space. So, it sounds >>>> pointless to do swap for force empty. >>>> >>> I understand that force_empty is typically used before rmdir'ing a >>> memcg but it might be used differently by some users. We use this >>> interface to test memory reclaim behavior (anon and file). >> Thanks for sharing your usecase. So, you uses this for test only? >> > Yes. > > Shakeel ^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH 3/3] mm: memcontrol: delay force empty to css offline 2019-01-02 20:05 [RFC PATCH 0/3] mm: memcontrol: delayed force empty Yang Shi 2019-01-02 20:05 ` [PATCH 1/3] doc: memcontrol: fix the obsolete content about " Yang Shi 2019-01-02 20:05 ` [PATCH 2/3] mm: memcontrol: do not try to do swap when " Yang Shi @ 2019-01-02 20:05 ` Yang Shi 2019-01-03 10:12 ` [RFC PATCH 0/3] mm: memcontrol: delayed force empty Michal Hocko 3 siblings, 0 replies; 30+ messages in thread From: Yang Shi @ 2019-01-02 20:05 UTC (permalink / raw) To: mhocko, hannes, akpm; +Cc: yang.shi, linux-mm, linux-kernel Currently, force empty reclaims memory synchronously when writing to memory.force_empty. It may take some time to return and the afterwards operations are blocked by it. Although it can be interrupted by signal, it still seems suboptimal. Now css offline is handled by worker, and the typical usecase of force empty is before memcg offline. So, handling force empty in css offline sounds reasonable. The user may write into any value to memory.force_empty, but I'm supposed the most used value should be 0 and 1. To not break existing applications, writing 0 or 1 still do force empty synchronously, any other value will tell kernel to do force empty in css offline worker. Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> --- Documentation/cgroup-v1/memory.txt | 8 ++++++-- include/linux/memcontrol.h | 2 ++ mm/memcontrol.c | 18 ++++++++++++++++++ 3 files changed, 26 insertions(+), 2 deletions(-) diff --git a/Documentation/cgroup-v1/memory.txt b/Documentation/cgroup-v1/memory.txt index 8e2cb1d..313d45f 100644 --- a/Documentation/cgroup-v1/memory.txt +++ b/Documentation/cgroup-v1/memory.txt @@ -452,11 +452,15 @@ About use_hierarchy, see Section 6. 5.1 force_empty memory.force_empty interface is provided to make cgroup's memory usage empty. - When writing anything to this + When writing 0 or 1 to this # echo 0 > memory.force_empty - the cgroup will be reclaimed and as many pages reclaimed as possible. + the cgroup will be reclaimed and as many pages reclaimed as possible + synchronously. + + Writing any other value to this, the cgroup will delay the memory reclaim + to css offline. The typical use case for this interface is before calling rmdir(). 
Though rmdir() offlines memcg, but the memcg may still stay there due to diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 7ab2120..48a5cf2 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -311,6 +311,8 @@ struct mem_cgroup { struct list_head event_list; spinlock_t event_list_lock; + bool delayed_force_empty; + struct mem_cgroup_per_node *nodeinfo[0]; /* WARNING: nodeinfo must be the last member here */ }; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index bbf39b5..620b6c5 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2888,10 +2888,25 @@ static ssize_t mem_cgroup_force_empty_write(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off) { + unsigned long val; + ssize_t ret; struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); if (mem_cgroup_is_root(memcg)) return -EINVAL; + + buf = strstrip(buf); + + ret = kstrtoul(buf, 10, &val); + if (ret < 0) + return ret; + + if (val != 0 && val != 1) { + memcg->delayed_force_empty = true; + return nbytes; + } + + memcg->delayed_force_empty = false; return mem_cgroup_force_empty(memcg) ?: nbytes; } @@ -4531,6 +4546,9 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css) struct mem_cgroup *memcg = mem_cgroup_from_css(css); struct mem_cgroup_event *event, *tmp; + if (memcg->delayed_force_empty) + mem_cgroup_force_empty(memcg); + /* * Unregister events and notify userspace. * Notify userspace about cgroup removing only after rmdir of cgroup -- 1.8.3.1 ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty 2019-01-02 20:05 [RFC PATCH 0/3] mm: memcontrol: delayed force empty Yang Shi ` (2 preceding siblings ...) 2019-01-02 20:05 ` [PATCH 3/3] mm: memcontrol: delay force empty to css offline Yang Shi @ 2019-01-03 10:12 ` Michal Hocko 2019-01-03 17:33 ` Yang Shi 3 siblings, 1 reply; 30+ messages in thread From: Michal Hocko @ 2019-01-03 10:12 UTC (permalink / raw) To: Yang Shi; +Cc: hannes, akpm, linux-mm, linux-kernel On Thu 03-01-19 04:05:30, Yang Shi wrote: > > Currently, force empty reclaims memory synchronously when writing to > memory.force_empty. It may take some time to return and the afterwards > operations are blocked by it. Although it can be interrupted by signal, > it still seems suboptimal. Why it is suboptimal? We are doing that operation on behalf of the process requesting it. What should anybody else pay for it? In other words why should we hide the overhead? > Now css offline is handled by worker, and the typical usecase of force > empty is before memcg offline. So, handling force empty in css offline > sounds reasonable. Hmm, so I guess you are talking about echo 1 > $MEMCG/force_empty rmdir $MEMCG and you are complaining that the operation takes too long. Right? Why do you care actually? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty 2019-01-03 10:12 ` [RFC PATCH 0/3] mm: memcontrol: delayed force empty Michal Hocko @ 2019-01-03 17:33 ` Yang Shi 2019-01-03 18:13 ` Michal Hocko 0 siblings, 1 reply; 30+ messages in thread From: Yang Shi @ 2019-01-03 17:33 UTC (permalink / raw) To: Michal Hocko; +Cc: hannes, akpm, linux-mm, linux-kernel On 1/3/19 2:12 AM, Michal Hocko wrote: > On Thu 03-01-19 04:05:30, Yang Shi wrote: >> Currently, force empty reclaims memory synchronously when writing to >> memory.force_empty. It may take some time to return and the afterwards >> operations are blocked by it. Although it can be interrupted by signal, >> it still seems suboptimal. > Why it is suboptimal? We are doing that operation on behalf of the > process requesting it. What should anybody else pay for it? In other > words why should we hide the overhead? Please see the below explanation. > >> Now css offline is handled by worker, and the typical usecase of force >> empty is before memcg offline. So, handling force empty in css offline >> sounds reasonable. > Hmm, so I guess you are talking about > echo 1 > $MEMCG/force_empty > rmdir $MEMCG > > and you are complaining that the operation takes too long. Right? Why do > you care actually? We have some usecases which create and remove memcgs very frequently, and the tasks in the memcg may just access the files which are unlikely accessed by anyone else. So, we prefer force_empty the memcg before rmdir'ing it to reclaim the page cache so that they don't get accumulated to incur unnecessary memory pressure. Since the memory pressure may incur direct reclaim to harm some latency sensitive applications. And, the create/remove might be run in a script sequentially (there might be a lot scripts or applications are run in parallel to do this), i.e. mkdir cg1 do something echo 0 > cg1/memory.force_empty rmdir cg1 mkdir cg2 ... The creation of the afterwards memcg might be blocked by the force_empty for long time if there are a lot page caches, so the overall throughput of the system may get hurt. And, it is not that urgent to reclaim the page cache right away and it is not that important who pays the cost, we just need a mechanism to reclaim the pages soon in a short while. The overhead could be smoothed by background workqueue. And, the patch still keeps the old behavior, just in case someone else still depends on it. Thanks, Yang ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty 2019-01-03 17:33 ` Yang Shi @ 2019-01-03 18:13 ` Michal Hocko 2019-01-03 18:40 ` Yang Shi 0 siblings, 1 reply; 30+ messages in thread From: Michal Hocko @ 2019-01-03 18:13 UTC (permalink / raw) To: Yang Shi; +Cc: hannes, akpm, linux-mm, linux-kernel On Thu 03-01-19 09:33:14, Yang Shi wrote: > > > On 1/3/19 2:12 AM, Michal Hocko wrote: > > On Thu 03-01-19 04:05:30, Yang Shi wrote: > > > Currently, force empty reclaims memory synchronously when writing to > > > memory.force_empty. It may take some time to return and the afterwards > > > operations are blocked by it. Although it can be interrupted by signal, > > > it still seems suboptimal. > > Why it is suboptimal? We are doing that operation on behalf of the > > process requesting it. What should anybody else pay for it? In other > > words why should we hide the overhead? > > Please see the below explanation. > > > > > > Now css offline is handled by worker, and the typical usecase of force > > > empty is before memcg offline. So, handling force empty in css offline > > > sounds reasonable. > > Hmm, so I guess you are talking about > > echo 1 > $MEMCG/force_empty > > rmdir $MEMCG > > > > and you are complaining that the operation takes too long. Right? Why do > > you care actually? > > We have some usecases which create and remove memcgs very frequently, and > the tasks in the memcg may just access the files which are unlikely accessed > by anyone else. So, we prefer force_empty the memcg before rmdir'ing it to > reclaim the page cache so that they don't get accumulated to incur > unnecessary memory pressure. Since the memory pressure may incur direct > reclaim to harm some latency sensitive applications. Yes, this makes sense to me. > And, the create/remove might be run in a script sequentially (there might be > a lot scripts or applications are run in parallel to do this), i.e. > mkdir cg1 > do something > echo 0 > cg1/memory.force_empty > rmdir cg1 > > mkdir cg2 > ... > > The creation of the afterwards memcg might be blocked by the force_empty for > long time if there are a lot page caches, so the overall throughput of the > system may get hurt. Is there any reason for your scripts to be strictly sequential here? In other words why cannot you offload those expensive operations to a detached context in _userspace_? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty 2019-01-03 18:13 ` Michal Hocko @ 2019-01-03 18:40 ` Yang Shi 2019-01-03 18:53 ` Michal Hocko 0 siblings, 1 reply; 30+ messages in thread From: Yang Shi @ 2019-01-03 18:40 UTC (permalink / raw) To: Michal Hocko; +Cc: hannes, akpm, linux-mm, linux-kernel On 1/3/19 10:13 AM, Michal Hocko wrote: > On Thu 03-01-19 09:33:14, Yang Shi wrote: >> >> On 1/3/19 2:12 AM, Michal Hocko wrote: >>> On Thu 03-01-19 04:05:30, Yang Shi wrote: >>>> Currently, force empty reclaims memory synchronously when writing to >>>> memory.force_empty. It may take some time to return and the afterwards >>>> operations are blocked by it. Although it can be interrupted by signal, >>>> it still seems suboptimal. >>> Why it is suboptimal? We are doing that operation on behalf of the >>> process requesting it. What should anybody else pay for it? In other >>> words why should we hide the overhead? >> Please see the below explanation. >> >>>> Now css offline is handled by worker, and the typical usecase of force >>>> empty is before memcg offline. So, handling force empty in css offline >>>> sounds reasonable. >>> Hmm, so I guess you are talking about >>> echo 1 > $MEMCG/force_empty >>> rmdir $MEMCG >>> >>> and you are complaining that the operation takes too long. Right? Why do >>> you care actually? >> We have some usecases which create and remove memcgs very frequently, and >> the tasks in the memcg may just access the files which are unlikely accessed >> by anyone else. So, we prefer force_empty the memcg before rmdir'ing it to >> reclaim the page cache so that they don't get accumulated to incur >> unnecessary memory pressure. Since the memory pressure may incur direct >> reclaim to harm some latency sensitive applications. > Yes, this makes sense to me. > >> And, the create/remove might be run in a script sequentially (there might be >> a lot scripts or applications are run in parallel to do this), i.e. >> mkdir cg1 >> do something >> echo 0 > cg1/memory.force_empty >> rmdir cg1 >> >> mkdir cg2 >> ... >> >> The creation of the afterwards memcg might be blocked by the force_empty for >> long time if there are a lot page caches, so the overall throughput of the >> system may get hurt. > Is there any reason for your scripts to be strictly sequential here? In > other words why cannot you offload those expensive operations to a > detached context in _userspace_? I would say it has not to be strictly sequential. The above script is just an example to illustrate the pattern. But, sometimes it may hit such pattern due to the complicated cluster scheduling and container scheduling in the production environment, for example the creation process might be scheduled to the same CPU which is doing force_empty. I have to say I don't know too much about the internals of the container scheduling. Thanks, Yang ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty 2019-01-03 18:40 ` Yang Shi @ 2019-01-03 18:53 ` Michal Hocko 2019-01-03 19:10 ` Yang Shi 0 siblings, 1 reply; 30+ messages in thread From: Michal Hocko @ 2019-01-03 18:53 UTC (permalink / raw) To: Yang Shi; +Cc: hannes, akpm, linux-mm, linux-kernel On Thu 03-01-19 10:40:54, Yang Shi wrote: > > > On 1/3/19 10:13 AM, Michal Hocko wrote: > > On Thu 03-01-19 09:33:14, Yang Shi wrote: > > > > > > On 1/3/19 2:12 AM, Michal Hocko wrote: > > > > On Thu 03-01-19 04:05:30, Yang Shi wrote: > > > > > Currently, force empty reclaims memory synchronously when writing to > > > > > memory.force_empty. It may take some time to return and the afterwards > > > > > operations are blocked by it. Although it can be interrupted by signal, > > > > > it still seems suboptimal. > > > > Why it is suboptimal? We are doing that operation on behalf of the > > > > process requesting it. What should anybody else pay for it? In other > > > > words why should we hide the overhead? > > > Please see the below explanation. > > > > > > > > Now css offline is handled by worker, and the typical usecase of force > > > > > empty is before memcg offline. So, handling force empty in css offline > > > > > sounds reasonable. > > > > Hmm, so I guess you are talking about > > > > echo 1 > $MEMCG/force_empty > > > > rmdir $MEMCG > > > > > > > > and you are complaining that the operation takes too long. Right? Why do > > > > you care actually? > > > We have some usecases which create and remove memcgs very frequently, and > > > the tasks in the memcg may just access the files which are unlikely accessed > > > by anyone else. So, we prefer force_empty the memcg before rmdir'ing it to > > > reclaim the page cache so that they don't get accumulated to incur > > > unnecessary memory pressure. Since the memory pressure may incur direct > > > reclaim to harm some latency sensitive applications. > > Yes, this makes sense to me. > > > > > And, the create/remove might be run in a script sequentially (there might be > > > a lot scripts or applications are run in parallel to do this), i.e. > > > mkdir cg1 > > > do something > > > echo 0 > cg1/memory.force_empty > > > rmdir cg1 > > > > > > mkdir cg2 > > > ... > > > > > > The creation of the afterwards memcg might be blocked by the force_empty for > > > long time if there are a lot page caches, so the overall throughput of the > > > system may get hurt. > > Is there any reason for your scripts to be strictly sequential here? In > > other words why cannot you offload those expensive operations to a > > detached context in _userspace_? > > I would say it has not to be strictly sequential. The above script is just > an example to illustrate the pattern. But, sometimes it may hit such pattern > due to the complicated cluster scheduling and container scheduling in the > production environment, for example the creation process might be scheduled > to the same CPU which is doing force_empty. I have to say I don't know too > much about the internals of the container scheduling. In that case I do not see a strong reason to implement the offloding into the kernel. It is an additional code and semantic to maintain. I think it is more important to discuss whether we want to introduce force_empty in cgroup v2. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty 2019-01-03 18:53 ` Michal Hocko @ 2019-01-03 19:10 ` Yang Shi 2019-01-03 19:23 ` Michal Hocko 0 siblings, 1 reply; 30+ messages in thread From: Yang Shi @ 2019-01-03 19:10 UTC (permalink / raw) To: Michal Hocko; +Cc: hannes, akpm, linux-mm, linux-kernel On 1/3/19 10:53 AM, Michal Hocko wrote: > On Thu 03-01-19 10:40:54, Yang Shi wrote: >> >> On 1/3/19 10:13 AM, Michal Hocko wrote: >>> On Thu 03-01-19 09:33:14, Yang Shi wrote: >>>> On 1/3/19 2:12 AM, Michal Hocko wrote: >>>>> On Thu 03-01-19 04:05:30, Yang Shi wrote: >>>>>> Currently, force empty reclaims memory synchronously when writing to >>>>>> memory.force_empty. It may take some time to return and the afterwards >>>>>> operations are blocked by it. Although it can be interrupted by signal, >>>>>> it still seems suboptimal. >>>>> Why it is suboptimal? We are doing that operation on behalf of the >>>>> process requesting it. What should anybody else pay for it? In other >>>>> words why should we hide the overhead? >>>> Please see the below explanation. >>>> >>>>>> Now css offline is handled by worker, and the typical usecase of force >>>>>> empty is before memcg offline. So, handling force empty in css offline >>>>>> sounds reasonable. >>>>> Hmm, so I guess you are talking about >>>>> echo 1 > $MEMCG/force_empty >>>>> rmdir $MEMCG >>>>> >>>>> and you are complaining that the operation takes too long. Right? Why do >>>>> you care actually? >>>> We have some usecases which create and remove memcgs very frequently, and >>>> the tasks in the memcg may just access the files which are unlikely accessed >>>> by anyone else. So, we prefer force_empty the memcg before rmdir'ing it to >>>> reclaim the page cache so that they don't get accumulated to incur >>>> unnecessary memory pressure. Since the memory pressure may incur direct >>>> reclaim to harm some latency sensitive applications. >>> Yes, this makes sense to me. >>> >>>> And, the create/remove might be run in a script sequentially (there might be >>>> a lot scripts or applications are run in parallel to do this), i.e. >>>> mkdir cg1 >>>> do something >>>> echo 0 > cg1/memory.force_empty >>>> rmdir cg1 >>>> >>>> mkdir cg2 >>>> ... >>>> >>>> The creation of the afterwards memcg might be blocked by the force_empty for >>>> long time if there are a lot page caches, so the overall throughput of the >>>> system may get hurt. >>> Is there any reason for your scripts to be strictly sequential here? In >>> other words why cannot you offload those expensive operations to a >>> detached context in _userspace_? >> I would say it has not to be strictly sequential. The above script is just >> an example to illustrate the pattern. But, sometimes it may hit such pattern >> due to the complicated cluster scheduling and container scheduling in the >> production environment, for example the creation process might be scheduled >> to the same CPU which is doing force_empty. I have to say I don't know too >> much about the internals of the container scheduling. > In that case I do not see a strong reason to implement the offloding > into the kernel. It is an additional code and semantic to maintain. Yes, it does introduce some additional code and semantic, but IMHO, it is quite simple and very straight forward, isn't it? Just utilize the existing css offline worker. And, that a couple of lines of code do improve some throughput issues for some real usecases. 
> > I think it is more important to discuss whether we want to introduce > force_empty in cgroup v2. We would prefer have it in v2 as well. Thanks, Yang ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty 2019-01-03 19:10 ` Yang Shi @ 2019-01-03 19:23 ` Michal Hocko 2019-01-03 19:49 ` Yang Shi 0 siblings, 1 reply; 30+ messages in thread From: Michal Hocko @ 2019-01-03 19:23 UTC (permalink / raw) To: Yang Shi; +Cc: hannes, akpm, linux-mm, linux-kernel On Thu 03-01-19 11:10:00, Yang Shi wrote: > > > On 1/3/19 10:53 AM, Michal Hocko wrote: > > On Thu 03-01-19 10:40:54, Yang Shi wrote: > > > > > > On 1/3/19 10:13 AM, Michal Hocko wrote: [...] > > > > Is there any reason for your scripts to be strictly sequential here? In > > > > other words why cannot you offload those expensive operations to a > > > > detached context in _userspace_? > > > I would say it has not to be strictly sequential. The above script is just > > > an example to illustrate the pattern. But, sometimes it may hit such pattern > > > due to the complicated cluster scheduling and container scheduling in the > > > production environment, for example the creation process might be scheduled > > > to the same CPU which is doing force_empty. I have to say I don't know too > > > much about the internals of the container scheduling. > > In that case I do not see a strong reason to implement the offloding > > into the kernel. It is an additional code and semantic to maintain. > > Yes, it does introduce some additional code and semantic, but IMHO, it is > quite simple and very straight forward, isn't it? Just utilize the existing > css offline worker. And, that a couple of lines of code do improve some > throughput issues for some real usecases. I do not really care it is few LOC. It is more important that it is conflating force_empty into offlining logic. There was a good reason to remove reparenting/emptying the memcg during the offline. Considering that you can offload force_empty from userspace trivially then I do not see any reason to implement it in the kernel. > > I think it is more important to discuss whether we want to introduce > > force_empty in cgroup v2. > > We would prefer have it in v2 as well. Then bring this up in a separate email thread please. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty 2019-01-03 19:23 ` Michal Hocko @ 2019-01-03 19:49 ` Yang Shi 2019-01-03 20:01 ` Michal Hocko 2019-01-04 20:03 ` Greg Thelen 0 siblings, 2 replies; 30+ messages in thread From: Yang Shi @ 2019-01-03 19:49 UTC (permalink / raw) To: Michal Hocko; +Cc: hannes, akpm, linux-mm, linux-kernel On 1/3/19 11:23 AM, Michal Hocko wrote: > On Thu 03-01-19 11:10:00, Yang Shi wrote: >> >> On 1/3/19 10:53 AM, Michal Hocko wrote: >>> On Thu 03-01-19 10:40:54, Yang Shi wrote: >>>> On 1/3/19 10:13 AM, Michal Hocko wrote: > [...] >>>>> Is there any reason for your scripts to be strictly sequential here? In >>>>> other words why cannot you offload those expensive operations to a >>>>> detached context in _userspace_? >>>> I would say it has not to be strictly sequential. The above script is just >>>> an example to illustrate the pattern. But, sometimes it may hit such pattern >>>> due to the complicated cluster scheduling and container scheduling in the >>>> production environment, for example the creation process might be scheduled >>>> to the same CPU which is doing force_empty. I have to say I don't know too >>>> much about the internals of the container scheduling. >>> In that case I do not see a strong reason to implement the offloding >>> into the kernel. It is an additional code and semantic to maintain. >> Yes, it does introduce some additional code and semantic, but IMHO, it is >> quite simple and very straight forward, isn't it? Just utilize the existing >> css offline worker. And, that a couple of lines of code do improve some >> throughput issues for some real usecases. > I do not really care it is few LOC. It is more important that it is > conflating force_empty into offlining logic. There was a good reason to > remove reparenting/emptying the memcg during the offline. Considering > that you can offload force_empty from userspace trivially then I do not > see any reason to implement it in the kernel. Er, I may not articulate in the earlier email, force_empty can not be offloaded from userspace *trivially*. IOWs the container scheduler may unexpectedly overcommit something due to the stall of synchronous force empty, which can't be figured out by userspace before it actually happens. The scheduler doesn't know how long force_empty would take. If the force_empty could be offloaded by kernel, it would make scheduler's life much easier. This is not something userspace could do. > >>> I think it is more important to discuss whether we want to introduce >>> force_empty in cgroup v2. >> We would prefer have it in v2 as well. > Then bring this up in a separate email thread please. Sure. Will prepare the patches later. Thanks, Yang ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty 2019-01-03 19:49 ` Yang Shi @ 2019-01-03 20:01 ` Michal Hocko 2019-01-04 4:15 ` Yang Shi 2019-01-04 20:03 ` Greg Thelen 1 sibling, 1 reply; 30+ messages in thread From: Michal Hocko @ 2019-01-03 20:01 UTC (permalink / raw) To: Yang Shi; +Cc: hannes, akpm, linux-mm, linux-kernel On Thu 03-01-19 11:49:32, Yang Shi wrote: > > > On 1/3/19 11:23 AM, Michal Hocko wrote: > > On Thu 03-01-19 11:10:00, Yang Shi wrote: > > > > > > On 1/3/19 10:53 AM, Michal Hocko wrote: > > > > On Thu 03-01-19 10:40:54, Yang Shi wrote: > > > > > On 1/3/19 10:13 AM, Michal Hocko wrote: > > [...] > > > > > > Is there any reason for your scripts to be strictly sequential here? In > > > > > > other words why cannot you offload those expensive operations to a > > > > > > detached context in _userspace_? > > > > > I would say it has not to be strictly sequential. The above script is just > > > > > an example to illustrate the pattern. But, sometimes it may hit such pattern > > > > > due to the complicated cluster scheduling and container scheduling in the > > > > > production environment, for example the creation process might be scheduled > > > > > to the same CPU which is doing force_empty. I have to say I don't know too > > > > > much about the internals of the container scheduling. > > > > In that case I do not see a strong reason to implement the offloding > > > > into the kernel. It is an additional code and semantic to maintain. > > > Yes, it does introduce some additional code and semantic, but IMHO, it is > > > quite simple and very straight forward, isn't it? Just utilize the existing > > > css offline worker. And, that a couple of lines of code do improve some > > > throughput issues for some real usecases. > > I do not really care it is few LOC. It is more important that it is > > conflating force_empty into offlining logic. There was a good reason to > > remove reparenting/emptying the memcg during the offline. Considering > > that you can offload force_empty from userspace trivially then I do not > > see any reason to implement it in the kernel. > > Er, I may not articulate in the earlier email, force_empty can not be > offloaded from userspace *trivially*. IOWs the container scheduler may > unexpectedly overcommit something due to the stall of synchronous force > empty, which can't be figured out by userspace before it actually happens. > The scheduler doesn't know how long force_empty would take. If the > force_empty could be offloaded by kernel, it would make scheduler's life > much easier. This is not something userspace could do. What exactly prevents ( echo 1 > $memecg/force_empty rmdir $memcg ) & so that this sequence doesn't really block anything? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty 2019-01-03 20:01 ` Michal Hocko @ 2019-01-04 4:15 ` Yang Shi 2019-01-04 8:55 ` Michal Hocko 0 siblings, 1 reply; 30+ messages in thread From: Yang Shi @ 2019-01-04 4:15 UTC (permalink / raw) To: Michal Hocko; +Cc: hannes, akpm, linux-mm, linux-kernel On 1/3/19 12:01 PM, Michal Hocko wrote: > On Thu 03-01-19 11:49:32, Yang Shi wrote: >> >> On 1/3/19 11:23 AM, Michal Hocko wrote: >>> On Thu 03-01-19 11:10:00, Yang Shi wrote: >>>> On 1/3/19 10:53 AM, Michal Hocko wrote: >>>>> On Thu 03-01-19 10:40:54, Yang Shi wrote: >>>>>> On 1/3/19 10:13 AM, Michal Hocko wrote: >>> [...] >>>>>>> Is there any reason for your scripts to be strictly sequential here? In >>>>>>> other words why cannot you offload those expensive operations to a >>>>>>> detached context in _userspace_? >>>>>> I would say it has not to be strictly sequential. The above script is just >>>>>> an example to illustrate the pattern. But, sometimes it may hit such pattern >>>>>> due to the complicated cluster scheduling and container scheduling in the >>>>>> production environment, for example the creation process might be scheduled >>>>>> to the same CPU which is doing force_empty. I have to say I don't know too >>>>>> much about the internals of the container scheduling. >>>>> In that case I do not see a strong reason to implement the offloding >>>>> into the kernel. It is an additional code and semantic to maintain. >>>> Yes, it does introduce some additional code and semantic, but IMHO, it is >>>> quite simple and very straight forward, isn't it? Just utilize the existing >>>> css offline worker. And, that a couple of lines of code do improve some >>>> throughput issues for some real usecases. >>> I do not really care it is few LOC. It is more important that it is >>> conflating force_empty into offlining logic. There was a good reason to >>> remove reparenting/emptying the memcg during the offline. Considering >>> that you can offload force_empty from userspace trivially then I do not >>> see any reason to implement it in the kernel. >> Er, I may not articulate in the earlier email, force_empty can not be >> offloaded from userspace *trivially*. IOWs the container scheduler may >> unexpectedly overcommit something due to the stall of synchronous force >> empty, which can't be figured out by userspace before it actually happens. >> The scheduler doesn't know how long force_empty would take. If the >> force_empty could be offloaded by kernel, it would make scheduler's life >> much easier. This is not something userspace could do. > What exactly prevents > ( > echo 1 > $memecg/force_empty > rmdir $memcg > ) & > > so that this sequence doesn't really block anything? We have "restarting the same name job" logic in our usecase (I'm not quite sure why they do so). Basically, it means to create memcg with the exact same name right after the old one is deleted, but may have different limit or other settings. The creation has to wait for rmdir is done. Even though rmdir is done in background like the above, the stall still exists since rmdir simply is waiting for force_empty. Thanks, Yang ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty 2019-01-04 4:15 ` Yang Shi @ 2019-01-04 8:55 ` Michal Hocko 2019-01-04 16:46 ` Yang Shi 0 siblings, 1 reply; 30+ messages in thread From: Michal Hocko @ 2019-01-04 8:55 UTC (permalink / raw) To: Yang Shi; +Cc: hannes, akpm, linux-mm, linux-kernel On Thu 03-01-19 20:15:30, Yang Shi wrote: > > > On 1/3/19 12:01 PM, Michal Hocko wrote: > > On Thu 03-01-19 11:49:32, Yang Shi wrote: > > > > > > On 1/3/19 11:23 AM, Michal Hocko wrote: > > > > On Thu 03-01-19 11:10:00, Yang Shi wrote: > > > > > On 1/3/19 10:53 AM, Michal Hocko wrote: > > > > > > On Thu 03-01-19 10:40:54, Yang Shi wrote: > > > > > > > On 1/3/19 10:13 AM, Michal Hocko wrote: > > > > [...] > > > > > > > > Is there any reason for your scripts to be strictly sequential here? In > > > > > > > > other words why cannot you offload those expensive operations to a > > > > > > > > detached context in _userspace_? > > > > > > > I would say it has not to be strictly sequential. The above script is just > > > > > > > an example to illustrate the pattern. But, sometimes it may hit such pattern > > > > > > > due to the complicated cluster scheduling and container scheduling in the > > > > > > > production environment, for example the creation process might be scheduled > > > > > > > to the same CPU which is doing force_empty. I have to say I don't know too > > > > > > > much about the internals of the container scheduling. > > > > > > In that case I do not see a strong reason to implement the offloding > > > > > > into the kernel. It is an additional code and semantic to maintain. > > > > > Yes, it does introduce some additional code and semantic, but IMHO, it is > > > > > quite simple and very straight forward, isn't it? Just utilize the existing > > > > > css offline worker. And, that a couple of lines of code do improve some > > > > > throughput issues for some real usecases. > > > > I do not really care it is few LOC. It is more important that it is > > > > conflating force_empty into offlining logic. There was a good reason to > > > > remove reparenting/emptying the memcg during the offline. Considering > > > > that you can offload force_empty from userspace trivially then I do not > > > > see any reason to implement it in the kernel. > > > Er, I may not articulate in the earlier email, force_empty can not be > > > offloaded from userspace *trivially*. IOWs the container scheduler may > > > unexpectedly overcommit something due to the stall of synchronous force > > > empty, which can't be figured out by userspace before it actually happens. > > > The scheduler doesn't know how long force_empty would take. If the > > > force_empty could be offloaded by kernel, it would make scheduler's life > > > much easier. This is not something userspace could do. > > What exactly prevents > > ( > > echo 1 > $memecg/force_empty > > rmdir $memcg > > ) & > > > > so that this sequence doesn't really block anything? > > We have "restarting the same name job" logic in our usecase (I'm not quite > sure why they do so). Basically, it means to create memcg with the exact > same name right after the old one is deleted, but may have different limit > or other settings. The creation has to wait for rmdir is done. Even though > rmdir is done in background like the above, the stall still exists since > rmdir simply is waiting for force_empty. OK, I see. This is an important detail you didn't mention previously (or at least I didn't understand it). One thing is still not clear to me. 
"Restarting the same job" sounds as if the memcg itself could be recycled as well. You are saying that the setting might change but if that is about limits then we should handle that just fine. Or what other kind of setting changes that wouldn't work properly? If the recycling is not possible then I would suggest to not reuse force_empty interface but add wipe_on_destruction or similar new knob which would enforce reclaim on offlining. It seems we have several people asking for something like that already. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty 2019-01-04 8:55 ` Michal Hocko @ 2019-01-04 16:46 ` Yang Shi 0 siblings, 0 replies; 30+ messages in thread From: Yang Shi @ 2019-01-04 16:46 UTC (permalink / raw) To: Michal Hocko; +Cc: hannes, akpm, linux-mm, linux-kernel On 1/4/19 12:55 AM, Michal Hocko wrote: > On Thu 03-01-19 20:15:30, Yang Shi wrote: >> >> On 1/3/19 12:01 PM, Michal Hocko wrote: >>> On Thu 03-01-19 11:49:32, Yang Shi wrote: >>>> On 1/3/19 11:23 AM, Michal Hocko wrote: >>>>> On Thu 03-01-19 11:10:00, Yang Shi wrote: >>>>>> On 1/3/19 10:53 AM, Michal Hocko wrote: >>>>>>> On Thu 03-01-19 10:40:54, Yang Shi wrote: >>>>>>>> On 1/3/19 10:13 AM, Michal Hocko wrote: >>>>> [...] >>>>>>>>> Is there any reason for your scripts to be strictly sequential here? In >>>>>>>>> other words why cannot you offload those expensive operations to a >>>>>>>>> detached context in _userspace_? >>>>>>>> I would say it has not to be strictly sequential. The above script is just >>>>>>>> an example to illustrate the pattern. But, sometimes it may hit such pattern >>>>>>>> due to the complicated cluster scheduling and container scheduling in the >>>>>>>> production environment, for example the creation process might be scheduled >>>>>>>> to the same CPU which is doing force_empty. I have to say I don't know too >>>>>>>> much about the internals of the container scheduling. >>>>>>> In that case I do not see a strong reason to implement the offloding >>>>>>> into the kernel. It is an additional code and semantic to maintain. >>>>>> Yes, it does introduce some additional code and semantic, but IMHO, it is >>>>>> quite simple and very straight forward, isn't it? Just utilize the existing >>>>>> css offline worker. And, that a couple of lines of code do improve some >>>>>> throughput issues for some real usecases. >>>>> I do not really care it is few LOC. It is more important that it is >>>>> conflating force_empty into offlining logic. There was a good reason to >>>>> remove reparenting/emptying the memcg during the offline. Considering >>>>> that you can offload force_empty from userspace trivially then I do not >>>>> see any reason to implement it in the kernel. >>>> Er, I may not articulate in the earlier email, force_empty can not be >>>> offloaded from userspace *trivially*. IOWs the container scheduler may >>>> unexpectedly overcommit something due to the stall of synchronous force >>>> empty, which can't be figured out by userspace before it actually happens. >>>> The scheduler doesn't know how long force_empty would take. If the >>>> force_empty could be offloaded by kernel, it would make scheduler's life >>>> much easier. This is not something userspace could do. >>> What exactly prevents >>> ( >>> echo 1 > $memecg/force_empty >>> rmdir $memcg >>> ) & >>> >>> so that this sequence doesn't really block anything? >> We have "restarting the same name job" logic in our usecase (I'm not quite >> sure why they do so). Basically, it means to create memcg with the exact >> same name right after the old one is deleted, but may have different limit >> or other settings. The creation has to wait for rmdir is done. Even though >> rmdir is done in background like the above, the stall still exists since >> rmdir simply is waiting for force_empty. > OK, I see. This is an important detail you didn't mention previously (or > at least I didn't understand it). One thing is still not clear to me. Sorry, I should articulated at the first place. 
> "Restarting the same job" sounds as if the memcg itself could be > recycled as well. You are saying that the setting might change but if > that is about limits then we should handle that just fine. Or what other > kind of setting changes that wouldn't work properly? We did try resize limit, but it may also incur costly direct reclaim to block something. Other than this we also want to reset all the counters/stats to get clearer and cleaner resource isolation since the container may run different jobs although they use the same name. > > If the recycling is not possible then I would suggest to not reuse > force_empty interface but add wipe_on_destruction or similar new knob > which would enforce reclaim on offlining. It seems we have several > people asking for something like that already. We did have a new knob in our in-house implementation, it just did force_empty on offlining. So, you mean to have a new knob to just do force empty offlining, and keep force_empty's behavior, right? Thanks, Yang ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty 2019-01-03 19:49 ` Yang Shi 2019-01-03 20:01 ` Michal Hocko @ 2019-01-04 20:03 ` Greg Thelen 2019-01-04 20:03 ` Greg Thelen ` (2 more replies) 1 sibling, 3 replies; 30+ messages in thread From: Greg Thelen @ 2019-01-04 20:03 UTC (permalink / raw) To: Yang Shi, Michal Hocko; +Cc: hannes, akpm, linux-mm, linux-kernel Yang Shi <yang.shi@linux.alibaba.com> wrote: > On 1/3/19 11:23 AM, Michal Hocko wrote: >> On Thu 03-01-19 11:10:00, Yang Shi wrote: >>> >>> On 1/3/19 10:53 AM, Michal Hocko wrote: >>>> On Thu 03-01-19 10:40:54, Yang Shi wrote: >>>>> On 1/3/19 10:13 AM, Michal Hocko wrote: >> [...] >>>>>> Is there any reason for your scripts to be strictly sequential here? In >>>>>> other words why cannot you offload those expensive operations to a >>>>>> detached context in _userspace_? >>>>> I would say it has not to be strictly sequential. The above script is just >>>>> an example to illustrate the pattern. But, sometimes it may hit such pattern >>>>> due to the complicated cluster scheduling and container scheduling in the >>>>> production environment, for example the creation process might be scheduled >>>>> to the same CPU which is doing force_empty. I have to say I don't know too >>>>> much about the internals of the container scheduling. >>>> In that case I do not see a strong reason to implement the offloding >>>> into the kernel. It is an additional code and semantic to maintain. >>> Yes, it does introduce some additional code and semantic, but IMHO, it is >>> quite simple and very straight forward, isn't it? Just utilize the existing >>> css offline worker. And, that a couple of lines of code do improve some >>> throughput issues for some real usecases. >> I do not really care it is few LOC. It is more important that it is >> conflating force_empty into offlining logic. There was a good reason to >> remove reparenting/emptying the memcg during the offline. Considering >> that you can offload force_empty from userspace trivially then I do not >> see any reason to implement it in the kernel. > > Er, I may not articulate in the earlier email, force_empty can not be > offloaded from userspace *trivially*. IOWs the container scheduler may > unexpectedly overcommit something due to the stall of synchronous force > empty, which can't be figured out by userspace before it actually > happens. The scheduler doesn't know how long force_empty would take. If > the force_empty could be offloaded by kernel, it would make scheduler's > life much easier. This is not something userspace could do. If kernel workqueues are doing more work (i.e. force_empty processing), then it seem like the time to offline could grow. I'm not sure if that's important. I assume that if we make force_empty an async side effect of rmdir then user space scheduler would not be unable to immediately assume the rmdir'd container memory is available without subjecting a new container to direct reclaim. So it seems like user space would use a mechanism to wait for reclaim: either the existing sync force_empty or polling meminfo/etc waiting for free memory to appear. >>>> I think it is more important to discuss whether we want to introduce >>>> force_empty in cgroup v2. >>> We would prefer have it in v2 as well. >> Then bring this up in a separate email thread please. > > Sure. Will prepare the patches later. > > Thanks, > Yang ^ permalink raw reply [flat|nested] 30+ messages in thread
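A rough sketch of the polling variant described above, assuming the scheduler waits on global MemAvailable before placing the next container (paths and the threshold are made up):

    (
        echo 1 > /sys/fs/cgroup/memory/job-foo/memory.force_empty
        rmdir /sys/fs/cgroup/memory/job-foo
    ) &

    # don't start the next container until enough memory is actually free
    while [ "$(awk '/MemAvailable/ {print $2}' /proc/meminfo)" -lt 4194304 ]; do
        sleep 0.1                            # wait for ~4G (value is in kB) to become available
    done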
* Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty 2019-01-04 20:03 ` Greg Thelen 2019-01-04 20:03 ` Greg Thelen @ 2019-01-04 21:41 ` Yang Shi 2019-01-04 22:57 ` Yang Shi 2 siblings, 0 replies; 30+ messages in thread From: Yang Shi @ 2019-01-04 21:41 UTC (permalink / raw) To: Greg Thelen, Michal Hocko; +Cc: hannes, akpm, linux-mm, linux-kernel On 1/4/19 12:03 PM, Greg Thelen wrote: > Yang Shi <yang.shi@linux.alibaba.com> wrote: > >> On 1/3/19 11:23 AM, Michal Hocko wrote: >>> On Thu 03-01-19 11:10:00, Yang Shi wrote: >>>> On 1/3/19 10:53 AM, Michal Hocko wrote: >>>>> On Thu 03-01-19 10:40:54, Yang Shi wrote: >>>>>> On 1/3/19 10:13 AM, Michal Hocko wrote: >>> [...] >>>>>>> Is there any reason for your scripts to be strictly sequential here? In >>>>>>> other words why cannot you offload those expensive operations to a >>>>>>> detached context in _userspace_? >>>>>> I would say it has not to be strictly sequential. The above script is just >>>>>> an example to illustrate the pattern. But, sometimes it may hit such pattern >>>>>> due to the complicated cluster scheduling and container scheduling in the >>>>>> production environment, for example the creation process might be scheduled >>>>>> to the same CPU which is doing force_empty. I have to say I don't know too >>>>>> much about the internals of the container scheduling. >>>>> In that case I do not see a strong reason to implement the offloding >>>>> into the kernel. It is an additional code and semantic to maintain. >>>> Yes, it does introduce some additional code and semantic, but IMHO, it is >>>> quite simple and very straight forward, isn't it? Just utilize the existing >>>> css offline worker. And, that a couple of lines of code do improve some >>>> throughput issues for some real usecases. >>> I do not really care it is few LOC. It is more important that it is >>> conflating force_empty into offlining logic. There was a good reason to >>> remove reparenting/emptying the memcg during the offline. Considering >>> that you can offload force_empty from userspace trivially then I do not >>> see any reason to implement it in the kernel. >> Er, I may not articulate in the earlier email, force_empty can not be >> offloaded from userspace *trivially*. IOWs the container scheduler may >> unexpectedly overcommit something due to the stall of synchronous force >> empty, which can't be figured out by userspace before it actually >> happens. The scheduler doesn't know how long force_empty would take. If >> the force_empty could be offloaded by kernel, it would make scheduler's >> life much easier. This is not something userspace could do. > If kernel workqueues are doing more work (i.e. force_empty processing), > then it seem like the time to offline could grow. I'm not sure if > that's important. Yes, it would grow. I'm not sure, but it seems fine with our workloads. The reclaim can be placed at the last step of offline, and it can be interrupted by some signals, i.e. fatal signal in current code. > > I assume that if we make force_empty an async side effect of rmdir then > user space scheduler would not be unable to immediately assume the > rmdir'd container memory is available without subjecting a new container > to direct reclaim. So it seems like user space would use a mechanism to > wait for reclaim: either the existing sync force_empty or polling > meminfo/etc waiting for free memory to appear. Yes, it is expected side effect, the memory reclaim would happen in a short while. 
In this series I keep the sync reclaim behavior of force_empty by checking the written value. Michal suggested a new knob to do the offline reclaim and keep force_empty intact. I think which knob to use is at the user's discretion. Thanks, Yang > >>>>> I think it is more important to discuss whether we want to introduce >>>>> force_empty in cgroup v2. >>>> We would prefer have it in v2 as well. >>> Then bring this up in a separate email thread please. >> Sure. Will prepare the patches later. >> >> Thanks, >> Yang ^ permalink raw reply [flat|nested] 30+ messages in thread
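To make the difference concrete, the two modes being discussed would look roughly like this from userspace (treating any value other than 0 or 1 as the deferred trigger is an assumption here, not finalized semantics; paths are made up):

    # current behaviour (kept): the write blocks until reclaim finishes
    echo 1 > /sys/fs/cgroup/memory/job-foo/memory.force_empty

    # proposed deferred mode: the write returns immediately and the actual
    # reclaim is done by the css offline worker once the group is rmdir'ed
    echo 2 > /sys/fs/cgroup/memory/job-foo/memory.force_empty
    rmdir /sys/fs/cgroup/memory/job-foo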
* Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty 2019-01-04 20:03 ` Greg Thelen 2019-01-04 20:03 ` Greg Thelen 2019-01-04 21:41 ` Yang Shi @ 2019-01-04 22:57 ` Yang Shi 2019-01-04 23:04 ` Yang Shi 2 siblings, 1 reply; 30+ messages in thread From: Yang Shi @ 2019-01-04 22:57 UTC (permalink / raw) To: Greg Thelen, Michal Hocko; +Cc: hannes, akpm, linux-mm, linux-kernel On 1/4/19 12:03 PM, Greg Thelen wrote: > Yang Shi <yang.shi@linux.alibaba.com> wrote: > >> On 1/3/19 11:23 AM, Michal Hocko wrote: >>> On Thu 03-01-19 11:10:00, Yang Shi wrote: >>>> On 1/3/19 10:53 AM, Michal Hocko wrote: >>>>> On Thu 03-01-19 10:40:54, Yang Shi wrote: >>>>>> On 1/3/19 10:13 AM, Michal Hocko wrote: >>> [...] >>>>>>> Is there any reason for your scripts to be strictly sequential here? In >>>>>>> other words why cannot you offload those expensive operations to a >>>>>>> detached context in _userspace_? >>>>>> I would say it has not to be strictly sequential. The above script is just >>>>>> an example to illustrate the pattern. But, sometimes it may hit such pattern >>>>>> due to the complicated cluster scheduling and container scheduling in the >>>>>> production environment, for example the creation process might be scheduled >>>>>> to the same CPU which is doing force_empty. I have to say I don't know too >>>>>> much about the internals of the container scheduling. >>>>> In that case I do not see a strong reason to implement the offloding >>>>> into the kernel. It is an additional code and semantic to maintain. >>>> Yes, it does introduce some additional code and semantic, but IMHO, it is >>>> quite simple and very straight forward, isn't it? Just utilize the existing >>>> css offline worker. And, that a couple of lines of code do improve some >>>> throughput issues for some real usecases. >>> I do not really care it is few LOC. It is more important that it is >>> conflating force_empty into offlining logic. There was a good reason to >>> remove reparenting/emptying the memcg during the offline. Considering >>> that you can offload force_empty from userspace trivially then I do not >>> see any reason to implement it in the kernel. >> Er, I may not articulate in the earlier email, force_empty can not be >> offloaded from userspace *trivially*. IOWs the container scheduler may >> unexpectedly overcommit something due to the stall of synchronous force >> empty, which can't be figured out by userspace before it actually >> happens. The scheduler doesn't know how long force_empty would take. If >> the force_empty could be offloaded by kernel, it would make scheduler's >> life much easier. This is not something userspace could do. > If kernel workqueues are doing more work (i.e. force_empty processing), > then it seem like the time to offline could grow. I'm not sure if > that's important. One thing I can think of is this may slow down the recycling of memcg id. This may cause memcg id exhausted for some extreme workload. But, I don't see this as a problem in our workload. Thanks, Yang > > I assume that if we make force_empty an async side effect of rmdir then > user space scheduler would not be unable to immediately assume the > rmdir'd container memory is available without subjecting a new container > to direct reclaim. So it seems like user space would use a mechanism to > wait for reclaim: either the existing sync force_empty or polling > meminfo/etc waiting for free memory to appear. > >>>>> I think it is more important to discuss whether we want to introduce >>>>> force_empty in cgroup v2. 
>>>> We would prefer have it in v2 as well. >>> Then bring this up in a separate email thread please. >> Sure. Will prepare the patches later. >> >> Thanks, >> Yang ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty 2019-01-04 22:57 ` Yang Shi @ 2019-01-04 23:04 ` Yang Shi 0 siblings, 0 replies; 30+ messages in thread From: Yang Shi @ 2019-01-04 23:04 UTC (permalink / raw) To: Greg Thelen, Michal Hocko; +Cc: hannes, akpm, linux-mm, linux-kernel On 1/4/19 2:57 PM, Yang Shi wrote: > > > On 1/4/19 12:03 PM, Greg Thelen wrote: >> Yang Shi <yang.shi@linux.alibaba.com> wrote: >> >>> On 1/3/19 11:23 AM, Michal Hocko wrote: >>>> On Thu 03-01-19 11:10:00, Yang Shi wrote: >>>>> On 1/3/19 10:53 AM, Michal Hocko wrote: >>>>>> On Thu 03-01-19 10:40:54, Yang Shi wrote: >>>>>>> On 1/3/19 10:13 AM, Michal Hocko wrote: >>>> [...] >>>>>>>> Is there any reason for your scripts to be strictly sequential >>>>>>>> here? In >>>>>>>> other words why cannot you offload those expensive operations to a >>>>>>>> detached context in _userspace_? >>>>>>> I would say it has not to be strictly sequential. The above >>>>>>> script is just >>>>>>> an example to illustrate the pattern. But, sometimes it may hit >>>>>>> such pattern >>>>>>> due to the complicated cluster scheduling and container >>>>>>> scheduling in the >>>>>>> production environment, for example the creation process might >>>>>>> be scheduled >>>>>>> to the same CPU which is doing force_empty. I have to say I >>>>>>> don't know too >>>>>>> much about the internals of the container scheduling. >>>>>> In that case I do not see a strong reason to implement the offloding >>>>>> into the kernel. It is an additional code and semantic to maintain. >>>>> Yes, it does introduce some additional code and semantic, but >>>>> IMHO, it is >>>>> quite simple and very straight forward, isn't it? Just utilize the >>>>> existing >>>>> css offline worker. And, that a couple of lines of code do improve >>>>> some >>>>> throughput issues for some real usecases. >>>> I do not really care it is few LOC. It is more important that it is >>>> conflating force_empty into offlining logic. There was a good >>>> reason to >>>> remove reparenting/emptying the memcg during the offline. Considering >>>> that you can offload force_empty from userspace trivially then I do >>>> not >>>> see any reason to implement it in the kernel. >>> Er, I may not articulate in the earlier email, force_empty can not be >>> offloaded from userspace *trivially*. IOWs the container scheduler may >>> unexpectedly overcommit something due to the stall of synchronous force >>> empty, which can't be figured out by userspace before it actually >>> happens. The scheduler doesn't know how long force_empty would take. If >>> the force_empty could be offloaded by kernel, it would make scheduler's >>> life much easier. This is not something userspace could do. >> If kernel workqueues are doing more work (i.e. force_empty processing), >> then it seem like the time to offline could grow. I'm not sure if >> that's important. > > One thing I can think of is this may slow down the recycling of memcg > id. This may cause memcg id exhausted for some extreme workload. But, > I don't see this as a problem in our workload. Actually, sync force_empty should have the same side effect. Yang > > Thanks, > Yang > >> >> I assume that if we make force_empty an async side effect of rmdir then >> user space scheduler would not be unable to immediately assume the >> rmdir'd container memory is available without subjecting a new container >> to direct reclaim. 
So it seems like user space would use a mechanism to >> wait for reclaim: either the existing sync force_empty or polling >> meminfo/etc waiting for free memory to appear. >> >>>>>> I think it is more important to discuss whether we want to introduce >>>>>> force_empty in cgroup v2. >>>>> We would prefer have it in v2 as well. >>>> Then bring this up in a separate email thread please. >>> Sure. Will prepare the patches later. >>> >>> Thanks, >>> Yang > ^ permalink raw reply [flat|nested] 30+ messages in thread
end of thread, other threads:[~2019-01-04 23:06 UTC | newest] Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2019-01-02 20:05 [RFC PATCH 0/3] mm: memcontrol: delayed force empty Yang Shi 2019-01-02 20:05 ` [PATCH 1/3] doc: memcontrol: fix the obsolete content about " Yang Shi 2019-01-02 21:18 ` Shakeel Butt 2019-01-02 21:18 ` Shakeel Butt 2019-01-03 10:13 ` Michal Hocko 2019-01-02 20:05 ` [PATCH 2/3] mm: memcontrol: do not try to do swap when " Yang Shi 2019-01-02 21:45 ` Shakeel Butt 2019-01-02 21:45 ` Shakeel Butt 2019-01-03 16:56 ` Yang Shi 2019-01-03 17:03 ` Shakeel Butt 2019-01-03 17:03 ` Shakeel Butt 2019-01-03 18:19 ` Yang Shi 2019-01-02 20:05 ` [PATCH 3/3] mm: memcontrol: delay force empty to css offline Yang Shi 2019-01-03 10:12 ` [RFC PATCH 0/3] mm: memcontrol: delayed force empty Michal Hocko 2019-01-03 17:33 ` Yang Shi 2019-01-03 18:13 ` Michal Hocko 2019-01-03 18:40 ` Yang Shi 2019-01-03 18:53 ` Michal Hocko 2019-01-03 19:10 ` Yang Shi 2019-01-03 19:23 ` Michal Hocko 2019-01-03 19:49 ` Yang Shi 2019-01-03 20:01 ` Michal Hocko 2019-01-04 4:15 ` Yang Shi 2019-01-04 8:55 ` Michal Hocko 2019-01-04 16:46 ` Yang Shi 2019-01-04 20:03 ` Greg Thelen 2019-01-04 20:03 ` Greg Thelen 2019-01-04 21:41 ` Yang Shi 2019-01-04 22:57 ` Yang Shi 2019-01-04 23:04 ` Yang Shi