[RFC PATCH 0/3] mm: memcontrol: delayed force empty

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [RFC PATCH 0/3] mm: memcontrol: delayed force empty
@ 2019-01-02 20:05 Yang Shi
  2019-01-02 20:05 ` [PATCH 1/3] doc: memcontrol: fix the obsolete content about " Yang Shi
                   ` (3 more replies)
  0 siblings, 4 replies; 30+ messages in thread
From: Yang Shi @ 2019-01-02 20:05 UTC (permalink / raw)
  To: mhocko, hannes, akpm; +Cc: yang.shi, linux-mm, linux-kernel

Currently, force empty reclaims memory synchronously when writing to
memory.force_empty.  It may take some time to return and the afterwards
operations are blocked by it.  Although it can be interrupted by signal,
it still seems suboptimal.

Now css offline is handled by worker, and the typical usecase of force
empty is before memcg offline.  So, handling force empty in css offline
sounds reasonable.

The user may write into any value to memory.force_empty, but I'm
supposed the most used value should be 0 and 1.  To not break existing
applications, writing 0 or 1 still do force empty synchronously, any
other value will tell kernel to do force empty in css offline worker.

Patch #1: Fix some obsolete information about force_empty in the document
Patch #2: A minor improvement to skip swap for force_empty
Patch #3: Implement delayed force_empty

Yang Shi (3):
      doc: memcontrol: fix the obsolete content about force empty
      mm: memcontrol: do not try to do swap when force empty
      mm: memcontrol: delay force empty to css offline

 Documentation/cgroup-v1/memory.txt | 15 ++++++++++-----
 include/linux/memcontrol.h         |  2 ++
 mm/memcontrol.c                    | 20 +++++++++++++++++++-
 3 files changed, 31 insertions(+), 6 deletions(-)

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 1/3] doc: memcontrol: fix the obsolete content about force empty
  2019-01-02 20:05 [RFC PATCH 0/3] mm: memcontrol: delayed force empty Yang Shi
@ 2019-01-02 20:05 ` Yang Shi
  2019-01-02 21:18   ` Shakeel Butt
  2019-01-03 10:13   ` Michal Hocko
  2019-01-02 20:05 ` [PATCH 2/3] mm: memcontrol: do not try to do swap when " Yang Shi
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 30+ messages in thread
From: Yang Shi @ 2019-01-02 20:05 UTC (permalink / raw)
  To: mhocko, hannes, akpm; +Cc: yang.shi, linux-mm, linux-kernel

We don't do page cache reparent anymore when offlining memcg, so update
force empty related content accordingly.

Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 Documentation/cgroup-v1/memory.txt | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/Documentation/cgroup-v1/memory.txt b/Documentation/cgroup-v1/memory.txt
index 3682e99..8e2cb1d 100644
--- a/Documentation/cgroup-v1/memory.txt
+++ b/Documentation/cgroup-v1/memory.txt
@@ -70,7 +70,7 @@ Brief summary of control files.
  memory.soft_limit_in_bytes	 # set/show soft limit of memory usage
  memory.stat			 # show various statistics
  memory.use_hierarchy		 # set/show hierarchical account enabled
- memory.force_empty		 # trigger forced move charge to parent
+ memory.force_empty		 # trigger forced page reclaim
  memory.pressure_level		 # set memory pressure notifications
  memory.swappiness		 # set/show swappiness parameter of vmscan
 				 (See sysctl's vm.swappiness)
@@ -459,8 +459,9 @@ About use_hierarchy, see Section 6.
   the cgroup will be reclaimed and as many pages reclaimed as possible.
 
   The typical use case for this interface is before calling rmdir().
-  Because rmdir() moves all pages to parent, some out-of-use page caches can be
-  moved to the parent. If you want to avoid that, force_empty will be useful.
+  Though rmdir() offlines memcg, but the memcg may still stay there due to
+  charged file caches. Some out-of-use page caches may keep charged until
+  memory pressure happens. If you want to avoid that, force_empty will be useful.
 
   Also, note that when memory.kmem.limit_in_bytes is set the charges due to
   kernel pages will still be seen. This is not considered a failure and the
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/3] doc: memcontrol: fix the obsolete content about force empty
  2019-01-02 20:05 ` [PATCH 1/3] doc: memcontrol: fix the obsolete content about " Yang Shi
@ 2019-01-02 21:18   ` Shakeel Butt
  2019-01-02 21:18     ` Shakeel Butt
  2019-01-03 10:13   ` Michal Hocko
  1 sibling, 1 reply; 30+ messages in thread
From: Shakeel Butt @ 2019-01-02 21:18 UTC (permalink / raw)
  To: Yang Shi; +Cc: Michal Hocko, Johannes Weiner, Andrew Morton, Linux MM, LKML

On Wed, Jan 2, 2019 at 12:07 PM Yang Shi <yang.shi@linux.alibaba.com> wrote:
>
> We don't do page cache reparent anymore when offlining memcg, so update
> force empty related content accordingly.
>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>

Reviewed-by: Shakeel Butt <shakeelb@google.com>

> ---
>  Documentation/cgroup-v1/memory.txt | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/cgroup-v1/memory.txt b/Documentation/cgroup-v1/memory.txt
> index 3682e99..8e2cb1d 100644
> --- a/Documentation/cgroup-v1/memory.txt
> +++ b/Documentation/cgroup-v1/memory.txt
> @@ -70,7 +70,7 @@ Brief summary of control files.
>   memory.soft_limit_in_bytes     # set/show soft limit of memory usage
>   memory.stat                    # show various statistics
>   memory.use_hierarchy           # set/show hierarchical account enabled
> - memory.force_empty             # trigger forced move charge to parent
> + memory.force_empty             # trigger forced page reclaim
>   memory.pressure_level          # set memory pressure notifications
>   memory.swappiness              # set/show swappiness parameter of vmscan
>                                  (See sysctl's vm.swappiness)
> @@ -459,8 +459,9 @@ About use_hierarchy, see Section 6.
>    the cgroup will be reclaimed and as many pages reclaimed as possible.
>
>    The typical use case for this interface is before calling rmdir().
> -  Because rmdir() moves all pages to parent, some out-of-use page caches can be
> -  moved to the parent. If you want to avoid that, force_empty will be useful.
> +  Though rmdir() offlines memcg, but the memcg may still stay there due to
> +  charged file caches. Some out-of-use page caches may keep charged until
> +  memory pressure happens. If you want to avoid that, force_empty will be useful.
>
>    Also, note that when memory.kmem.limit_in_bytes is set the charges due to
>    kernel pages will still be seen. This is not considered a failure and the
> --
> 1.8.3.1
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/3] doc: memcontrol: fix the obsolete content about force empty
  2019-01-02 21:18   ` Shakeel Butt
@ 2019-01-02 21:18     ` Shakeel Butt
  0 siblings, 0 replies; 30+ messages in thread
From: Shakeel Butt @ 2019-01-02 21:18 UTC (permalink / raw)
  To: Yang Shi; +Cc: Michal Hocko, Johannes Weiner, Andrew Morton, Linux MM, LKML

On Wed, Jan 2, 2019 at 12:07 PM Yang Shi <yang.shi@linux.alibaba.com> wrote:
>
> We don't do page cache reparent anymore when offlining memcg, so update
> force empty related content accordingly.
>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>

Reviewed-by: Shakeel Butt <shakeelb@google.com>

> ---
>  Documentation/cgroup-v1/memory.txt | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/cgroup-v1/memory.txt b/Documentation/cgroup-v1/memory.txt
> index 3682e99..8e2cb1d 100644
> --- a/Documentation/cgroup-v1/memory.txt
> +++ b/Documentation/cgroup-v1/memory.txt
> @@ -70,7 +70,7 @@ Brief summary of control files.
>   memory.soft_limit_in_bytes     # set/show soft limit of memory usage
>   memory.stat                    # show various statistics
>   memory.use_hierarchy           # set/show hierarchical account enabled
> - memory.force_empty             # trigger forced move charge to parent
> + memory.force_empty             # trigger forced page reclaim
>   memory.pressure_level          # set memory pressure notifications
>   memory.swappiness              # set/show swappiness parameter of vmscan
>                                  (See sysctl's vm.swappiness)
> @@ -459,8 +459,9 @@ About use_hierarchy, see Section 6.
>    the cgroup will be reclaimed and as many pages reclaimed as possible.
>
>    The typical use case for this interface is before calling rmdir().
> -  Because rmdir() moves all pages to parent, some out-of-use page caches can be
> -  moved to the parent. If you want to avoid that, force_empty will be useful.
> +  Though rmdir() offlines memcg, but the memcg may still stay there due to
> +  charged file caches. Some out-of-use page caches may keep charged until
> +  memory pressure happens. If you want to avoid that, force_empty will be useful.
>
>    Also, note that when memory.kmem.limit_in_bytes is set the charges due to
>    kernel pages will still be seen. This is not considered a failure and the
> --
> 1.8.3.1
>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/3] doc: memcontrol: fix the obsolete content about force empty
  2019-01-02 20:05 ` [PATCH 1/3] doc: memcontrol: fix the obsolete content about " Yang Shi
  2019-01-02 21:18   ` Shakeel Butt
@ 2019-01-03 10:13   ` Michal Hocko
  1 sibling, 0 replies; 30+ messages in thread
From: Michal Hocko @ 2019-01-03 10:13 UTC (permalink / raw)
  To: Yang Shi; +Cc: hannes, akpm, linux-mm, linux-kernel

On Thu 03-01-19 04:05:31, Yang Shi wrote:
> We don't do page cache reparent anymore when offlining memcg, so update
> force empty related content accordingly.
> 
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>

Thanks for the clean up.

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  Documentation/cgroup-v1/memory.txt | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/cgroup-v1/memory.txt b/Documentation/cgroup-v1/memory.txt
> index 3682e99..8e2cb1d 100644
> --- a/Documentation/cgroup-v1/memory.txt
> +++ b/Documentation/cgroup-v1/memory.txt
> @@ -70,7 +70,7 @@ Brief summary of control files.
>   memory.soft_limit_in_bytes	 # set/show soft limit of memory usage
>   memory.stat			 # show various statistics
>   memory.use_hierarchy		 # set/show hierarchical account enabled
> - memory.force_empty		 # trigger forced move charge to parent
> + memory.force_empty		 # trigger forced page reclaim
>   memory.pressure_level		 # set memory pressure notifications
>   memory.swappiness		 # set/show swappiness parameter of vmscan
>  				 (See sysctl's vm.swappiness)
> @@ -459,8 +459,9 @@ About use_hierarchy, see Section 6.
>    the cgroup will be reclaimed and as many pages reclaimed as possible.
>  
>    The typical use case for this interface is before calling rmdir().
> -  Because rmdir() moves all pages to parent, some out-of-use page caches can be
> -  moved to the parent. If you want to avoid that, force_empty will be useful.
> +  Though rmdir() offlines memcg, but the memcg may still stay there due to
> +  charged file caches. Some out-of-use page caches may keep charged until
> +  memory pressure happens. If you want to avoid that, force_empty will be useful.
>  
>    Also, note that when memory.kmem.limit_in_bytes is set the charges due to
>    kernel pages will still be seen. This is not considered a failure and the
> -- 
> 1.8.3.1
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 2/3] mm: memcontrol: do not try to do swap when force empty
  2019-01-02 20:05 [RFC PATCH 0/3] mm: memcontrol: delayed force empty Yang Shi
  2019-01-02 20:05 ` [PATCH 1/3] doc: memcontrol: fix the obsolete content about " Yang Shi
@ 2019-01-02 20:05 ` Yang Shi
  2019-01-02 21:45   ` Shakeel Butt
  2019-01-02 20:05 ` [PATCH 3/3] mm: memcontrol: delay force empty to css offline Yang Shi
  2019-01-03 10:12 ` [RFC PATCH 0/3] mm: memcontrol: delayed force empty Michal Hocko
  3 siblings, 1 reply; 30+ messages in thread
From: Yang Shi @ 2019-01-02 20:05 UTC (permalink / raw)
  To: mhocko, hannes, akpm; +Cc: yang.shi, linux-mm, linux-kernel

The typical usecase of force empty is to try to reclaim as much as
possible memory before offlining a memcg.  Since there should be no
attached tasks to offlining memcg, the tasks anonymous pages would have
already been freed or uncharged.  Even though anonymous pages get
swapped out, but they still get charged to swap space.  So, it sounds
pointless to do swap for force empty.

I tried to dig into the history of this, it was introduced by
commit 8c7c6e34a125 ("memcg: mem+swap controller core"), but there is
not any clue about why it was done so at the first place.

The below simple test script shows slight file cache reclaim improvement
when swap is on.

echo 3 > /proc/sys/vm/drop_caches
mkdir /sys/fs/cgroup/memory/test
echo 30 > /sys/fs/cgroup/memory/test/memory.swappiness
echo $$ >/sys/fs/cgroup/memory/test/cgroup.procs
cat /proc/meminfo | grep ^Cached|awk -F" " '{print $2}'
dd if=/dev/zero of=/mnt/test bs=1M count=1024
ping localhost > /dev/null &
echo 1 > /sys/fs/cgroup/memory/test/memory.force_empty
killall ping
echo $$ >/sys/fs/cgroup/memory/cgroup.procs
cat /proc/meminfo | grep ^Cached|awk -F" " '{print $2}'
rmdir /sys/fs/cgroup/memory/test
cat /proc/meminfo | grep ^Cached|awk -F" " '{print $2}'

The number of page cache is:
			w/o		w/
before force empty    1088792        1088784
after force empty     41492          39428
reclaimed	      1047300        1049356

Without doing swap, force empty can reclaim 2MB more memory in 1GB page
cache.

Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 mm/memcontrol.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6e1469b..bbf39b5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2872,7 +2872,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
 			return -EINTR;
 
 		progress = try_to_free_mem_cgroup_pages(memcg, 1,
-							GFP_KERNEL, true);
+							GFP_KERNEL, false);
 		if (!progress) {
 			nr_retries--;
 			/* maybe some writeback is necessary */
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/3] mm: memcontrol: do not try to do swap when force empty
  2019-01-02 20:05 ` [PATCH 2/3] mm: memcontrol: do not try to do swap when " Yang Shi
@ 2019-01-02 21:45   ` Shakeel Butt
  2019-01-02 21:45     ` Shakeel Butt
  2019-01-03 16:56     ` Yang Shi
  0 siblings, 2 replies; 30+ messages in thread
From: Shakeel Butt @ 2019-01-02 21:45 UTC (permalink / raw)
  To: Yang Shi; +Cc: Michal Hocko, Johannes Weiner, Andrew Morton, Linux MM, LKML

On Wed, Jan 2, 2019 at 12:06 PM Yang Shi <yang.shi@linux.alibaba.com> wrote:
>
> The typical usecase of force empty is to try to reclaim as much as
> possible memory before offlining a memcg.  Since there should be no
> attached tasks to offlining memcg, the tasks anonymous pages would have
> already been freed or uncharged.

Anon pages can come from tmpfs files as well.

> Even though anonymous pages get
> swapped out, but they still get charged to swap space.  So, it sounds
> pointless to do swap for force empty.
>

I understand that force_empty is typically used before rmdir'ing a
memcg but it might be used differently by some users. We use this
interface to test memory reclaim behavior (anon and file).

Anyways, I am not against changing the behavior, we can adapt
internally but there might be other users using this interface
differently.

thanks,
Shakeel

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/3] mm: memcontrol: do not try to do swap when force empty
  2019-01-02 21:45   ` Shakeel Butt
@ 2019-01-02 21:45     ` Shakeel Butt
  2019-01-03 16:56     ` Yang Shi
  1 sibling, 0 replies; 30+ messages in thread
From: Shakeel Butt @ 2019-01-02 21:45 UTC (permalink / raw)
  To: Yang Shi; +Cc: Michal Hocko, Johannes Weiner, Andrew Morton, Linux MM, LKML

On Wed, Jan 2, 2019 at 12:06 PM Yang Shi <yang.shi@linux.alibaba.com> wrote:
>
> The typical usecase of force empty is to try to reclaim as much as
> possible memory before offlining a memcg.  Since there should be no
> attached tasks to offlining memcg, the tasks anonymous pages would have
> already been freed or uncharged.

Anon pages can come from tmpfs files as well.

> Even though anonymous pages get
> swapped out, but they still get charged to swap space.  So, it sounds
> pointless to do swap for force empty.
>

I understand that force_empty is typically used before rmdir'ing a
memcg but it might be used differently by some users. We use this
interface to test memory reclaim behavior (anon and file).

Anyways, I am not against changing the behavior, we can adapt
internally but there might be other users using this interface
differently.

thanks,
Shakeel


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/3] mm: memcontrol: do not try to do swap when force empty
  2019-01-02 21:45   ` Shakeel Butt
  2019-01-02 21:45     ` Shakeel Butt
@ 2019-01-03 16:56     ` Yang Shi
  2019-01-03 17:03       ` Shakeel Butt
  1 sibling, 1 reply; 30+ messages in thread
From: Yang Shi @ 2019-01-03 16:56 UTC (permalink / raw)
  To: Shakeel Butt; +Cc: Michal Hocko, Johannes Weiner, Andrew Morton, Linux MM, LKML



On 1/2/19 1:45 PM, Shakeel Butt wrote:
> On Wed, Jan 2, 2019 at 12:06 PM Yang Shi <yang.shi@linux.alibaba.com> wrote:
>> The typical usecase of force empty is to try to reclaim as much as
>> possible memory before offlining a memcg.  Since there should be no
>> attached tasks to offlining memcg, the tasks anonymous pages would have
>> already been freed or uncharged.
> Anon pages can come from tmpfs files as well.

Yes, but they are charged to swap space as regular anon pages.

>
>> Even though anonymous pages get
>> swapped out, but they still get charged to swap space.  So, it sounds
>> pointless to do swap for force empty.
>>
> I understand that force_empty is typically used before rmdir'ing a
> memcg but it might be used differently by some users. We use this
> interface to test memory reclaim behavior (anon and file).

Thanks for sharing your usecase. So, you uses this for test only?

>
> Anyways, I am not against changing the behavior, we can adapt
> internally but there might be other users using this interface
> differently.

Thanks.

Yang

>
> thanks,
> Shakeel

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/3] mm: memcontrol: do not try to do swap when force empty
  2019-01-03 16:56     ` Yang Shi
@ 2019-01-03 17:03       ` Shakeel Butt
  2019-01-03 17:03         ` Shakeel Butt
  2019-01-03 18:19         ` Yang Shi
  0 siblings, 2 replies; 30+ messages in thread
From: Shakeel Butt @ 2019-01-03 17:03 UTC (permalink / raw)
  To: Yang Shi; +Cc: Michal Hocko, Johannes Weiner, Andrew Morton, Linux MM, LKML

On Thu, Jan 3, 2019 at 8:57 AM Yang Shi <yang.shi@linux.alibaba.com> wrote:
>
>
>
> On 1/2/19 1:45 PM, Shakeel Butt wrote:
> > On Wed, Jan 2, 2019 at 12:06 PM Yang Shi <yang.shi@linux.alibaba.com> wrote:
> >> The typical usecase of force empty is to try to reclaim as much as
> >> possible memory before offlining a memcg.  Since there should be no
> >> attached tasks to offlining memcg, the tasks anonymous pages would have
> >> already been freed or uncharged.
> > Anon pages can come from tmpfs files as well.
>
> Yes, but they are charged to swap space as regular anon pages.
>

The point was the lifetime of tmpfs anon pages are not tied to any
task. Even though there aren't any task attached to a memcg, the tmpfs
anon pages will remain charged. Other than that, the old anon pages of
a task which have migrated away might still be charged to the old
memcg (if move_charge_at_immigrate is not set).

> >
> >> Even though anonymous pages get
> >> swapped out, but they still get charged to swap space.  So, it sounds
> >> pointless to do swap for force empty.
> >>
> > I understand that force_empty is typically used before rmdir'ing a
> > memcg but it might be used differently by some users. We use this
> > interface to test memory reclaim behavior (anon and file).
>
> Thanks for sharing your usecase. So, you uses this for test only?
>

Yes.

Shakeel

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/3] mm: memcontrol: do not try to do swap when force empty
  2019-01-03 17:03       ` Shakeel Butt
@ 2019-01-03 17:03         ` Shakeel Butt
  2019-01-03 18:19         ` Yang Shi
  1 sibling, 0 replies; 30+ messages in thread
From: Shakeel Butt @ 2019-01-03 17:03 UTC (permalink / raw)
  To: Yang Shi; +Cc: Michal Hocko, Johannes Weiner, Andrew Morton, Linux MM, LKML

On Thu, Jan 3, 2019 at 8:57 AM Yang Shi <yang.shi@linux.alibaba.com> wrote:
>
>
>
> On 1/2/19 1:45 PM, Shakeel Butt wrote:
> > On Wed, Jan 2, 2019 at 12:06 PM Yang Shi <yang.shi@linux.alibaba.com> wrote:
> >> The typical usecase of force empty is to try to reclaim as much as
> >> possible memory before offlining a memcg.  Since there should be no
> >> attached tasks to offlining memcg, the tasks anonymous pages would have
> >> already been freed or uncharged.
> > Anon pages can come from tmpfs files as well.
>
> Yes, but they are charged to swap space as regular anon pages.
>

The point was the lifetime of tmpfs anon pages are not tied to any
task. Even though there aren't any task attached to a memcg, the tmpfs
anon pages will remain charged. Other than that, the old anon pages of
a task which have migrated away might still be charged to the old
memcg (if move_charge_at_immigrate is not set).

> >
> >> Even though anonymous pages get
> >> swapped out, but they still get charged to swap space.  So, it sounds
> >> pointless to do swap for force empty.
> >>
> > I understand that force_empty is typically used before rmdir'ing a
> > memcg but it might be used differently by some users. We use this
> > interface to test memory reclaim behavior (anon and file).
>
> Thanks for sharing your usecase. So, you uses this for test only?
>

Yes.

Shakeel


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/3] mm: memcontrol: do not try to do swap when force empty
  2019-01-03 17:03       ` Shakeel Butt
  2019-01-03 17:03         ` Shakeel Butt
@ 2019-01-03 18:19         ` Yang Shi
  1 sibling, 0 replies; 30+ messages in thread
From: Yang Shi @ 2019-01-03 18:19 UTC (permalink / raw)
  To: Shakeel Butt; +Cc: Michal Hocko, Johannes Weiner, Andrew Morton, Linux MM, LKML



On 1/3/19 9:03 AM, Shakeel Butt wrote:
> On Thu, Jan 3, 2019 at 8:57 AM Yang Shi <yang.shi@linux.alibaba.com> wrote:
>>
>>
>> On 1/2/19 1:45 PM, Shakeel Butt wrote:
>>> On Wed, Jan 2, 2019 at 12:06 PM Yang Shi <yang.shi@linux.alibaba.com> wrote:
>>>> The typical usecase of force empty is to try to reclaim as much as
>>>> possible memory before offlining a memcg.  Since there should be no
>>>> attached tasks to offlining memcg, the tasks anonymous pages would have
>>>> already been freed or uncharged.
>>> Anon pages can come from tmpfs files as well.
>> Yes, but they are charged to swap space as regular anon pages.
>>
> The point was the lifetime of tmpfs anon pages are not tied to any
> task. Even though there aren't any task attached to a memcg, the tmpfs
> anon pages will remain charged. Other than that, the old anon pages of
> a task which have migrated away might still be charged to the old
> memcg (if move_charge_at_immigrate is not set).

Yes, my understanding is even though they are swapped out but they are 
still charged to memsw for cgroupv1. force_empty is supposed to reclaim 
as much as possible memory, here I'm supposed "reclaim" also means 
"uncharge".

Even though the anon pages are still charged to the old memcg, it will 
be moved the new memcg when the old one is deleted, or the pages will be 
just released if the task is killed.

So, IMHO, I don't see the point why swapping anon pages when doing 
force_empty.

Thanks,
Yang

>>>> Even though anonymous pages get
>>>> swapped out, but they still get charged to swap space.  So, it sounds
>>>> pointless to do swap for force empty.
>>>>
>>> I understand that force_empty is typically used before rmdir'ing a
>>> memcg but it might be used differently by some users. We use this
>>> interface to test memory reclaim behavior (anon and file).
>> Thanks for sharing your usecase. So, you uses this for test only?
>>
> Yes.
>
> Shakeel

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 3/3] mm: memcontrol: delay force empty to css offline
  2019-01-02 20:05 [RFC PATCH 0/3] mm: memcontrol: delayed force empty Yang Shi
  2019-01-02 20:05 ` [PATCH 1/3] doc: memcontrol: fix the obsolete content about " Yang Shi
  2019-01-02 20:05 ` [PATCH 2/3] mm: memcontrol: do not try to do swap when " Yang Shi
@ 2019-01-02 20:05 ` Yang Shi
  2019-01-03 10:12 ` [RFC PATCH 0/3] mm: memcontrol: delayed force empty Michal Hocko
  3 siblings, 0 replies; 30+ messages in thread
From: Yang Shi @ 2019-01-02 20:05 UTC (permalink / raw)
  To: mhocko, hannes, akpm; +Cc: yang.shi, linux-mm, linux-kernel

Currently, force empty reclaims memory synchronously when writing to
memory.force_empty.  It may take some time to return and the afterwards
operations are blocked by it.  Although it can be interrupted by signal,
it still seems suboptimal.

Now css offline is handled by worker, and the typical usecase of force
empty is before memcg offline.  So, handling force empty in css offline
sounds reasonable.

The user may write into any value to memory.force_empty, but I'm
supposed the most used value should be 0 and 1.  To not break existing
applications, writing 0 or 1 still do force empty synchronously, any
other value will tell kernel to do force empty in css offline worker.

Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 Documentation/cgroup-v1/memory.txt |  8 ++++++--
 include/linux/memcontrol.h         |  2 ++
 mm/memcontrol.c                    | 18 ++++++++++++++++++
 3 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/Documentation/cgroup-v1/memory.txt b/Documentation/cgroup-v1/memory.txt
index 8e2cb1d..313d45f 100644
--- a/Documentation/cgroup-v1/memory.txt
+++ b/Documentation/cgroup-v1/memory.txt
@@ -452,11 +452,15 @@ About use_hierarchy, see Section 6.
 
 5.1 force_empty
   memory.force_empty interface is provided to make cgroup's memory usage empty.
-  When writing anything to this
+  When writing 0 or 1 to this
 
   # echo 0 > memory.force_empty
 
-  the cgroup will be reclaimed and as many pages reclaimed as possible.
+  the cgroup will be reclaimed and as many pages reclaimed as possible
+  synchronously.
+
+  Writing any other value to this, the cgroup will delay the memory reclaim
+  to css offline.
 
   The typical use case for this interface is before calling rmdir().
   Though rmdir() offlines memcg, but the memcg may still stay there due to
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 7ab2120..48a5cf2 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -311,6 +311,8 @@ struct mem_cgroup {
 	struct list_head event_list;
 	spinlock_t event_list_lock;
 
+	bool delayed_force_empty;
+
 	struct mem_cgroup_per_node *nodeinfo[0];
 	/* WARNING: nodeinfo must be the last member here */
 };
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index bbf39b5..620b6c5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2888,10 +2888,25 @@ static ssize_t mem_cgroup_force_empty_write(struct kernfs_open_file *of,
 					    char *buf, size_t nbytes,
 					    loff_t off)
 {
+	unsigned long val;
+	ssize_t ret;
 	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
 
 	if (mem_cgroup_is_root(memcg))
 		return -EINVAL;
+
+	buf = strstrip(buf);
+
+	ret = kstrtoul(buf, 10, &val);
+	if (ret < 0)
+		return ret;
+
+	if (val != 0 && val != 1) {
+		memcg->delayed_force_empty = true;
+		return nbytes;
+	}
+
+	memcg->delayed_force_empty = false;
 	return mem_cgroup_force_empty(memcg) ?: nbytes;
 }
 
@@ -4531,6 +4546,9 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
 	struct mem_cgroup_event *event, *tmp;
 
+	if (memcg->delayed_force_empty)
+		mem_cgroup_force_empty(memcg);
+
 	/*
 	 * Unregister events and notify userspace.
 	 * Notify userspace about cgroup removing only after rmdir of cgroup
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty
  2019-01-02 20:05 [RFC PATCH 0/3] mm: memcontrol: delayed force empty Yang Shi
                   ` (2 preceding siblings ...)
  2019-01-02 20:05 ` [PATCH 3/3] mm: memcontrol: delay force empty to css offline Yang Shi
@ 2019-01-03 10:12 ` Michal Hocko
  2019-01-03 17:33   ` Yang Shi
  3 siblings, 1 reply; 30+ messages in thread
From: Michal Hocko @ 2019-01-03 10:12 UTC (permalink / raw)
  To: Yang Shi; +Cc: hannes, akpm, linux-mm, linux-kernel

On Thu 03-01-19 04:05:30, Yang Shi wrote:
> 
> Currently, force empty reclaims memory synchronously when writing to
> memory.force_empty.  It may take some time to return and the afterwards
> operations are blocked by it.  Although it can be interrupted by signal,
> it still seems suboptimal.

Why it is suboptimal? We are doing that operation on behalf of the
process requesting it. What should anybody else pay for it? In other
words why should we hide the overhead?

> Now css offline is handled by worker, and the typical usecase of force
> empty is before memcg offline.  So, handling force empty in css offline
> sounds reasonable.

Hmm, so I guess you are talking about
echo 1 > $MEMCG/force_empty
rmdir $MEMCG

and you are complaining that the operation takes too long. Right? Why do
you care actually?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty
  2019-01-03 10:12 ` [RFC PATCH 0/3] mm: memcontrol: delayed force empty Michal Hocko
@ 2019-01-03 17:33   ` Yang Shi
  2019-01-03 18:13     ` Michal Hocko
  0 siblings, 1 reply; 30+ messages in thread
From: Yang Shi @ 2019-01-03 17:33 UTC (permalink / raw)
  To: Michal Hocko; +Cc: hannes, akpm, linux-mm, linux-kernel

On 1/3/19 2:12 AM, Michal Hocko wrote:
> On Thu 03-01-19 04:05:30, Yang Shi wrote:
>> Currently, force empty reclaims memory synchronously when writing to
>> memory.force_empty.  It may take some time to return and the afterwards
>> operations are blocked by it.  Although it can be interrupted by signal,
>> it still seems suboptimal.
> Why it is suboptimal? We are doing that operation on behalf of the
> process requesting it. What should anybody else pay for it? In other
> words why should we hide the overhead?

Please see the below explanation.

>
>> Now css offline is handled by worker, and the typical usecase of force
>> empty is before memcg offline.  So, handling force empty in css offline
>> sounds reasonable.
> Hmm, so I guess you are talking about
> echo 1 > $MEMCG/force_empty
> rmdir $MEMCG
>
> and you are complaining that the operation takes too long. Right? Why do
> you care actually?

We have some usecases which create and remove memcgs very frequently, 
and the tasks in the memcg may just access the files which are unlikely 
accessed by anyone else. So, we prefer force_empty the memcg before 
rmdir'ing it to reclaim the page cache so that they don't get 
accumulated to incur unnecessary memory pressure. Since the memory 
pressure may incur direct reclaim to harm some latency sensitive 
applications.

And, the create/remove might be run in a script sequentially (there 
might be a lot scripts or applications are run in parallel to do this), i.e.
mkdir cg1
do something
echo 0 > cg1/memory.force_empty
rmdir cg1

mkdir cg2
...

The creation of the afterwards memcg might be blocked by the force_empty 
for long time if there are a lot page caches, so the overall throughput 
of the system may get hurt.
And, it is not that urgent to reclaim the page cache right away and it 
is not that important who pays the cost, we just need a mechanism to 
reclaim the pages soon in a short while. The overhead could be smoothed 
by background workqueue.

And, the patch still keeps the old behavior, just in case someone else 
still depends on it.

Thanks,
Yang

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty
  2019-01-03 17:33   ` Yang Shi
@ 2019-01-03 18:13     ` Michal Hocko
  2019-01-03 18:40       ` Yang Shi
  0 siblings, 1 reply; 30+ messages in thread
From: Michal Hocko @ 2019-01-03 18:13 UTC (permalink / raw)
  To: Yang Shi; +Cc: hannes, akpm, linux-mm, linux-kernel

On Thu 03-01-19 09:33:14, Yang Shi wrote:
> 
> 
> On 1/3/19 2:12 AM, Michal Hocko wrote:
> > On Thu 03-01-19 04:05:30, Yang Shi wrote:
> > > Currently, force empty reclaims memory synchronously when writing to
> > > memory.force_empty.  It may take some time to return and the afterwards
> > > operations are blocked by it.  Although it can be interrupted by signal,
> > > it still seems suboptimal.
> > Why it is suboptimal? We are doing that operation on behalf of the
> > process requesting it. What should anybody else pay for it? In other
> > words why should we hide the overhead?
> 
> Please see the below explanation.
> 
> > 
> > > Now css offline is handled by worker, and the typical usecase of force
> > > empty is before memcg offline.  So, handling force empty in css offline
> > > sounds reasonable.
> > Hmm, so I guess you are talking about
> > echo 1 > $MEMCG/force_empty
> > rmdir $MEMCG
> > 
> > and you are complaining that the operation takes too long. Right? Why do
> > you care actually?
> 
> We have some usecases which create and remove memcgs very frequently, and
> the tasks in the memcg may just access the files which are unlikely accessed
> by anyone else. So, we prefer force_empty the memcg before rmdir'ing it to
> reclaim the page cache so that they don't get accumulated to incur
> unnecessary memory pressure. Since the memory pressure may incur direct
> reclaim to harm some latency sensitive applications.

Yes, this makes sense to me.

> And, the create/remove might be run in a script sequentially (there might be
> a lot scripts or applications are run in parallel to do this), i.e.
> mkdir cg1
> do something
> echo 0 > cg1/memory.force_empty
> rmdir cg1
> 
> mkdir cg2
> ...
> 
> The creation of the afterwards memcg might be blocked by the force_empty for
> long time if there are a lot page caches, so the overall throughput of the
> system may get hurt.

Is there any reason for your scripts to be strictly sequential here? In
other words why cannot you offload those expensive operations to a
detached context in _userspace_?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty
  2019-01-03 18:13     ` Michal Hocko
@ 2019-01-03 18:40       ` Yang Shi
  2019-01-03 18:53         ` Michal Hocko
  0 siblings, 1 reply; 30+ messages in thread
From: Yang Shi @ 2019-01-03 18:40 UTC (permalink / raw)
  To: Michal Hocko; +Cc: hannes, akpm, linux-mm, linux-kernel



On 1/3/19 10:13 AM, Michal Hocko wrote:
> On Thu 03-01-19 09:33:14, Yang Shi wrote:
>>
>> On 1/3/19 2:12 AM, Michal Hocko wrote:
>>> On Thu 03-01-19 04:05:30, Yang Shi wrote:
>>>> Currently, force empty reclaims memory synchronously when writing to
>>>> memory.force_empty.  It may take some time to return and the afterwards
>>>> operations are blocked by it.  Although it can be interrupted by signal,
>>>> it still seems suboptimal.
>>> Why it is suboptimal? We are doing that operation on behalf of the
>>> process requesting it. What should anybody else pay for it? In other
>>> words why should we hide the overhead?
>> Please see the below explanation.
>>
>>>> Now css offline is handled by worker, and the typical usecase of force
>>>> empty is before memcg offline.  So, handling force empty in css offline
>>>> sounds reasonable.
>>> Hmm, so I guess you are talking about
>>> echo 1 > $MEMCG/force_empty
>>> rmdir $MEMCG
>>>
>>> and you are complaining that the operation takes too long. Right? Why do
>>> you care actually?
>> We have some usecases which create and remove memcgs very frequently, and
>> the tasks in the memcg may just access the files which are unlikely accessed
>> by anyone else. So, we prefer force_empty the memcg before rmdir'ing it to
>> reclaim the page cache so that they don't get accumulated to incur
>> unnecessary memory pressure. Since the memory pressure may incur direct
>> reclaim to harm some latency sensitive applications.
> Yes, this makes sense to me.
>
>> And, the create/remove might be run in a script sequentially (there might be
>> a lot scripts or applications are run in parallel to do this), i.e.
>> mkdir cg1
>> do something
>> echo 0 > cg1/memory.force_empty
>> rmdir cg1
>>
>> mkdir cg2
>> ...
>>
>> The creation of the afterwards memcg might be blocked by the force_empty for
>> long time if there are a lot page caches, so the overall throughput of the
>> system may get hurt.
> Is there any reason for your scripts to be strictly sequential here? In
> other words why cannot you offload those expensive operations to a
> detached context in _userspace_?

I would say it has not to be strictly sequential. The above script is 
just an example to illustrate the pattern. But, sometimes it may hit 
such pattern due to the complicated cluster scheduling and container 
scheduling in the production environment, for example the creation 
process might be scheduled to the same CPU which is doing force_empty. I 
have to say I don't know too much about the internals of the container 
scheduling.

Thanks,
Yang

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty
  2019-01-03 18:40       ` Yang Shi
@ 2019-01-03 18:53         ` Michal Hocko
  2019-01-03 19:10           ` Yang Shi
  0 siblings, 1 reply; 30+ messages in thread
From: Michal Hocko @ 2019-01-03 18:53 UTC (permalink / raw)
  To: Yang Shi; +Cc: hannes, akpm, linux-mm, linux-kernel

On Thu 03-01-19 10:40:54, Yang Shi wrote:
> 
> 
> On 1/3/19 10:13 AM, Michal Hocko wrote:
> > On Thu 03-01-19 09:33:14, Yang Shi wrote:
> > > 
> > > On 1/3/19 2:12 AM, Michal Hocko wrote:
> > > > On Thu 03-01-19 04:05:30, Yang Shi wrote:
> > > > > Currently, force empty reclaims memory synchronously when writing to
> > > > > memory.force_empty.  It may take some time to return and the afterwards
> > > > > operations are blocked by it.  Although it can be interrupted by signal,
> > > > > it still seems suboptimal.
> > > > Why it is suboptimal? We are doing that operation on behalf of the
> > > > process requesting it. What should anybody else pay for it? In other
> > > > words why should we hide the overhead?
> > > Please see the below explanation.
> > > 
> > > > > Now css offline is handled by worker, and the typical usecase of force
> > > > > empty is before memcg offline.  So, handling force empty in css offline
> > > > > sounds reasonable.
> > > > Hmm, so I guess you are talking about
> > > > echo 1 > $MEMCG/force_empty
> > > > rmdir $MEMCG
> > > > 
> > > > and you are complaining that the operation takes too long. Right? Why do
> > > > you care actually?
> > > We have some usecases which create and remove memcgs very frequently, and
> > > the tasks in the memcg may just access the files which are unlikely accessed
> > > by anyone else. So, we prefer force_empty the memcg before rmdir'ing it to
> > > reclaim the page cache so that they don't get accumulated to incur
> > > unnecessary memory pressure. Since the memory pressure may incur direct
> > > reclaim to harm some latency sensitive applications.
> > Yes, this makes sense to me.
> > 
> > > And, the create/remove might be run in a script sequentially (there might be
> > > a lot scripts or applications are run in parallel to do this), i.e.
> > > mkdir cg1
> > > do something
> > > echo 0 > cg1/memory.force_empty
> > > rmdir cg1
> > > 
> > > mkdir cg2
> > > ...
> > > 
> > > The creation of the afterwards memcg might be blocked by the force_empty for
> > > long time if there are a lot page caches, so the overall throughput of the
> > > system may get hurt.
> > Is there any reason for your scripts to be strictly sequential here? In
> > other words why cannot you offload those expensive operations to a
> > detached context in _userspace_?
> 
> I would say it has not to be strictly sequential. The above script is just
> an example to illustrate the pattern. But, sometimes it may hit such pattern
> due to the complicated cluster scheduling and container scheduling in the
> production environment, for example the creation process might be scheduled
> to the same CPU which is doing force_empty. I have to say I don't know too
> much about the internals of the container scheduling.

In that case I do not see a strong reason to implement the offloding
into the kernel. It is an additional code and semantic to maintain.

I think it is more important to discuss whether we want to introduce
force_empty in cgroup v2.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty
  2019-01-03 18:53         ` Michal Hocko
@ 2019-01-03 19:10           ` Yang Shi
  2019-01-03 19:23             ` Michal Hocko
  0 siblings, 1 reply; 30+ messages in thread
From: Yang Shi @ 2019-01-03 19:10 UTC (permalink / raw)
  To: Michal Hocko; +Cc: hannes, akpm, linux-mm, linux-kernel



On 1/3/19 10:53 AM, Michal Hocko wrote:
> On Thu 03-01-19 10:40:54, Yang Shi wrote:
>>
>> On 1/3/19 10:13 AM, Michal Hocko wrote:
>>> On Thu 03-01-19 09:33:14, Yang Shi wrote:
>>>> On 1/3/19 2:12 AM, Michal Hocko wrote:
>>>>> On Thu 03-01-19 04:05:30, Yang Shi wrote:
>>>>>> Currently, force empty reclaims memory synchronously when writing to
>>>>>> memory.force_empty.  It may take some time to return and the afterwards
>>>>>> operations are blocked by it.  Although it can be interrupted by signal,
>>>>>> it still seems suboptimal.
>>>>> Why it is suboptimal? We are doing that operation on behalf of the
>>>>> process requesting it. What should anybody else pay for it? In other
>>>>> words why should we hide the overhead?
>>>> Please see the below explanation.
>>>>
>>>>>> Now css offline is handled by worker, and the typical usecase of force
>>>>>> empty is before memcg offline.  So, handling force empty in css offline
>>>>>> sounds reasonable.
>>>>> Hmm, so I guess you are talking about
>>>>> echo 1 > $MEMCG/force_empty
>>>>> rmdir $MEMCG
>>>>>
>>>>> and you are complaining that the operation takes too long. Right? Why do
>>>>> you care actually?
>>>> We have some usecases which create and remove memcgs very frequently, and
>>>> the tasks in the memcg may just access the files which are unlikely accessed
>>>> by anyone else. So, we prefer force_empty the memcg before rmdir'ing it to
>>>> reclaim the page cache so that they don't get accumulated to incur
>>>> unnecessary memory pressure. Since the memory pressure may incur direct
>>>> reclaim to harm some latency sensitive applications.
>>> Yes, this makes sense to me.
>>>
>>>> And, the create/remove might be run in a script sequentially (there might be
>>>> a lot scripts or applications are run in parallel to do this), i.e.
>>>> mkdir cg1
>>>> do something
>>>> echo 0 > cg1/memory.force_empty
>>>> rmdir cg1
>>>>
>>>> mkdir cg2
>>>> ...
>>>>
>>>> The creation of the afterwards memcg might be blocked by the force_empty for
>>>> long time if there are a lot page caches, so the overall throughput of the
>>>> system may get hurt.
>>> Is there any reason for your scripts to be strictly sequential here? In
>>> other words why cannot you offload those expensive operations to a
>>> detached context in _userspace_?
>> I would say it has not to be strictly sequential. The above script is just
>> an example to illustrate the pattern. But, sometimes it may hit such pattern
>> due to the complicated cluster scheduling and container scheduling in the
>> production environment, for example the creation process might be scheduled
>> to the same CPU which is doing force_empty. I have to say I don't know too
>> much about the internals of the container scheduling.
> In that case I do not see a strong reason to implement the offloding
> into the kernel. It is an additional code and semantic to maintain.

Yes, it does introduce some additional code and semantic, but IMHO, it 
is quite simple and very straight forward, isn't it? Just utilize the 
existing css offline worker. And, that a couple of lines of code do 
improve some throughput issues for some real usecases.

>
> I think it is more important to discuss whether we want to introduce
> force_empty in cgroup v2.

We would prefer have it in v2 as well.

Thanks,
Yang

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty
  2019-01-03 19:10           ` Yang Shi
@ 2019-01-03 19:23             ` Michal Hocko
  2019-01-03 19:49               ` Yang Shi
  0 siblings, 1 reply; 30+ messages in thread
From: Michal Hocko @ 2019-01-03 19:23 UTC (permalink / raw)
  To: Yang Shi; +Cc: hannes, akpm, linux-mm, linux-kernel

On Thu 03-01-19 11:10:00, Yang Shi wrote:
> 
> 
> On 1/3/19 10:53 AM, Michal Hocko wrote:
> > On Thu 03-01-19 10:40:54, Yang Shi wrote:
> > > 
> > > On 1/3/19 10:13 AM, Michal Hocko wrote:
[...]
> > > > Is there any reason for your scripts to be strictly sequential here? In
> > > > other words why cannot you offload those expensive operations to a
> > > > detached context in _userspace_?
> > > I would say it has not to be strictly sequential. The above script is just
> > > an example to illustrate the pattern. But, sometimes it may hit such pattern
> > > due to the complicated cluster scheduling and container scheduling in the
> > > production environment, for example the creation process might be scheduled
> > > to the same CPU which is doing force_empty. I have to say I don't know too
> > > much about the internals of the container scheduling.
> > In that case I do not see a strong reason to implement the offloding
> > into the kernel. It is an additional code and semantic to maintain.
> 
> Yes, it does introduce some additional code and semantic, but IMHO, it is
> quite simple and very straight forward, isn't it? Just utilize the existing
> css offline worker. And, that a couple of lines of code do improve some
> throughput issues for some real usecases.

I do not really care it is few LOC. It is more important that it is
conflating force_empty into offlining logic. There was a good reason to
remove reparenting/emptying the memcg during the offline. Considering
that you can offload force_empty from userspace trivially then I do not
see any reason to implement it in the kernel.

> > I think it is more important to discuss whether we want to introduce
> > force_empty in cgroup v2.
> 
> We would prefer have it in v2 as well.

Then bring this up in a separate email thread please.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty
  2019-01-03 19:23             ` Michal Hocko
@ 2019-01-03 19:49               ` Yang Shi
  2019-01-03 20:01                 ` Michal Hocko
  2019-01-04 20:03                 ` Greg Thelen
  0 siblings, 2 replies; 30+ messages in thread
From: Yang Shi @ 2019-01-03 19:49 UTC (permalink / raw)
  To: Michal Hocko; +Cc: hannes, akpm, linux-mm, linux-kernel



On 1/3/19 11:23 AM, Michal Hocko wrote:
> On Thu 03-01-19 11:10:00, Yang Shi wrote:
>>
>> On 1/3/19 10:53 AM, Michal Hocko wrote:
>>> On Thu 03-01-19 10:40:54, Yang Shi wrote:
>>>> On 1/3/19 10:13 AM, Michal Hocko wrote:
> [...]
>>>>> Is there any reason for your scripts to be strictly sequential here? In
>>>>> other words why cannot you offload those expensive operations to a
>>>>> detached context in _userspace_?
>>>> I would say it has not to be strictly sequential. The above script is just
>>>> an example to illustrate the pattern. But, sometimes it may hit such pattern
>>>> due to the complicated cluster scheduling and container scheduling in the
>>>> production environment, for example the creation process might be scheduled
>>>> to the same CPU which is doing force_empty. I have to say I don't know too
>>>> much about the internals of the container scheduling.
>>> In that case I do not see a strong reason to implement the offloding
>>> into the kernel. It is an additional code and semantic to maintain.
>> Yes, it does introduce some additional code and semantic, but IMHO, it is
>> quite simple and very straight forward, isn't it? Just utilize the existing
>> css offline worker. And, that a couple of lines of code do improve some
>> throughput issues for some real usecases.
> I do not really care it is few LOC. It is more important that it is
> conflating force_empty into offlining logic. There was a good reason to
> remove reparenting/emptying the memcg during the offline. Considering
> that you can offload force_empty from userspace trivially then I do not
> see any reason to implement it in the kernel.

Er, I may not articulate in the earlier email, force_empty can not be 
offloaded from userspace *trivially*. IOWs the container scheduler may 
unexpectedly overcommit something due to the stall of synchronous force 
empty, which can't be figured out by userspace before it actually 
happens. The scheduler doesn't know how long force_empty would take. If 
the force_empty could be offloaded by kernel, it would make scheduler's 
life much easier. This is not something userspace could do.

>
>>> I think it is more important to discuss whether we want to introduce
>>> force_empty in cgroup v2.
>> We would prefer have it in v2 as well.
> Then bring this up in a separate email thread please.

Sure. Will prepare the patches later.

Thanks,
Yang

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty
  2019-01-03 19:49               ` Yang Shi
@ 2019-01-03 20:01                 ` Michal Hocko
  2019-01-04  4:15                   ` Yang Shi
  2019-01-04 20:03                 ` Greg Thelen
  1 sibling, 1 reply; 30+ messages in thread
From: Michal Hocko @ 2019-01-03 20:01 UTC (permalink / raw)
  To: Yang Shi; +Cc: hannes, akpm, linux-mm, linux-kernel

On Thu 03-01-19 11:49:32, Yang Shi wrote:
> 
> 
> On 1/3/19 11:23 AM, Michal Hocko wrote:
> > On Thu 03-01-19 11:10:00, Yang Shi wrote:
> > > 
> > > On 1/3/19 10:53 AM, Michal Hocko wrote:
> > > > On Thu 03-01-19 10:40:54, Yang Shi wrote:
> > > > > On 1/3/19 10:13 AM, Michal Hocko wrote:
> > [...]
> > > > > > Is there any reason for your scripts to be strictly sequential here? In
> > > > > > other words why cannot you offload those expensive operations to a
> > > > > > detached context in _userspace_?
> > > > > I would say it has not to be strictly sequential. The above script is just
> > > > > an example to illustrate the pattern. But, sometimes it may hit such pattern
> > > > > due to the complicated cluster scheduling and container scheduling in the
> > > > > production environment, for example the creation process might be scheduled
> > > > > to the same CPU which is doing force_empty. I have to say I don't know too
> > > > > much about the internals of the container scheduling.
> > > > In that case I do not see a strong reason to implement the offloding
> > > > into the kernel. It is an additional code and semantic to maintain.
> > > Yes, it does introduce some additional code and semantic, but IMHO, it is
> > > quite simple and very straight forward, isn't it? Just utilize the existing
> > > css offline worker. And, that a couple of lines of code do improve some
> > > throughput issues for some real usecases.
> > I do not really care it is few LOC. It is more important that it is
> > conflating force_empty into offlining logic. There was a good reason to
> > remove reparenting/emptying the memcg during the offline. Considering
> > that you can offload force_empty from userspace trivially then I do not
> > see any reason to implement it in the kernel.
> 
> Er, I may not articulate in the earlier email, force_empty can not be
> offloaded from userspace *trivially*. IOWs the container scheduler may
> unexpectedly overcommit something due to the stall of synchronous force
> empty, which can't be figured out by userspace before it actually happens.
> The scheduler doesn't know how long force_empty would take. If the
> force_empty could be offloaded by kernel, it would make scheduler's life
> much easier. This is not something userspace could do.

What exactly prevents
(
echo 1 > $memecg/force_empty
rmdir $memcg
) &

so that this sequence doesn't really block anything?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty
  2019-01-03 20:01                 ` Michal Hocko
@ 2019-01-04  4:15                   ` Yang Shi
  2019-01-04  8:55                     ` Michal Hocko
  0 siblings, 1 reply; 30+ messages in thread
From: Yang Shi @ 2019-01-04  4:15 UTC (permalink / raw)
  To: Michal Hocko; +Cc: hannes, akpm, linux-mm, linux-kernel



On 1/3/19 12:01 PM, Michal Hocko wrote:
> On Thu 03-01-19 11:49:32, Yang Shi wrote:
>>
>> On 1/3/19 11:23 AM, Michal Hocko wrote:
>>> On Thu 03-01-19 11:10:00, Yang Shi wrote:
>>>> On 1/3/19 10:53 AM, Michal Hocko wrote:
>>>>> On Thu 03-01-19 10:40:54, Yang Shi wrote:
>>>>>> On 1/3/19 10:13 AM, Michal Hocko wrote:
>>> [...]
>>>>>>> Is there any reason for your scripts to be strictly sequential here? In
>>>>>>> other words why cannot you offload those expensive operations to a
>>>>>>> detached context in _userspace_?
>>>>>> I would say it has not to be strictly sequential. The above script is just
>>>>>> an example to illustrate the pattern. But, sometimes it may hit such pattern
>>>>>> due to the complicated cluster scheduling and container scheduling in the
>>>>>> production environment, for example the creation process might be scheduled
>>>>>> to the same CPU which is doing force_empty. I have to say I don't know too
>>>>>> much about the internals of the container scheduling.
>>>>> In that case I do not see a strong reason to implement the offloding
>>>>> into the kernel. It is an additional code and semantic to maintain.
>>>> Yes, it does introduce some additional code and semantic, but IMHO, it is
>>>> quite simple and very straight forward, isn't it? Just utilize the existing
>>>> css offline worker. And, that a couple of lines of code do improve some
>>>> throughput issues for some real usecases.
>>> I do not really care it is few LOC. It is more important that it is
>>> conflating force_empty into offlining logic. There was a good reason to
>>> remove reparenting/emptying the memcg during the offline. Considering
>>> that you can offload force_empty from userspace trivially then I do not
>>> see any reason to implement it in the kernel.
>> Er, I may not articulate in the earlier email, force_empty can not be
>> offloaded from userspace *trivially*. IOWs the container scheduler may
>> unexpectedly overcommit something due to the stall of synchronous force
>> empty, which can't be figured out by userspace before it actually happens.
>> The scheduler doesn't know how long force_empty would take. If the
>> force_empty could be offloaded by kernel, it would make scheduler's life
>> much easier. This is not something userspace could do.
> What exactly prevents
> (
> echo 1 > $memecg/force_empty
> rmdir $memcg
> ) &
>
> so that this sequence doesn't really block anything?

We have "restarting the same name job" logic in our usecase (I'm not 
quite sure why they do so). Basically, it means to create memcg with the 
exact same name right after the old one is deleted, but may have 
different limit or other settings. The creation has to wait for rmdir is 
done. Even though rmdir is done in background like the above, the stall 
still exists since rmdir simply is waiting for force_empty.

Thanks,
Yang

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty
  2019-01-04  4:15                   ` Yang Shi
@ 2019-01-04  8:55                     ` Michal Hocko
  2019-01-04 16:46                       ` Yang Shi
  0 siblings, 1 reply; 30+ messages in thread
From: Michal Hocko @ 2019-01-04  8:55 UTC (permalink / raw)
  To: Yang Shi; +Cc: hannes, akpm, linux-mm, linux-kernel

On Thu 03-01-19 20:15:30, Yang Shi wrote:
> 
> 
> On 1/3/19 12:01 PM, Michal Hocko wrote:
> > On Thu 03-01-19 11:49:32, Yang Shi wrote:
> > > 
> > > On 1/3/19 11:23 AM, Michal Hocko wrote:
> > > > On Thu 03-01-19 11:10:00, Yang Shi wrote:
> > > > > On 1/3/19 10:53 AM, Michal Hocko wrote:
> > > > > > On Thu 03-01-19 10:40:54, Yang Shi wrote:
> > > > > > > On 1/3/19 10:13 AM, Michal Hocko wrote:
> > > > [...]
> > > > > > > > Is there any reason for your scripts to be strictly sequential here? In
> > > > > > > > other words why cannot you offload those expensive operations to a
> > > > > > > > detached context in _userspace_?
> > > > > > > I would say it has not to be strictly sequential. The above script is just
> > > > > > > an example to illustrate the pattern. But, sometimes it may hit such pattern
> > > > > > > due to the complicated cluster scheduling and container scheduling in the
> > > > > > > production environment, for example the creation process might be scheduled
> > > > > > > to the same CPU which is doing force_empty. I have to say I don't know too
> > > > > > > much about the internals of the container scheduling.
> > > > > > In that case I do not see a strong reason to implement the offloding
> > > > > > into the kernel. It is an additional code and semantic to maintain.
> > > > > Yes, it does introduce some additional code and semantic, but IMHO, it is
> > > > > quite simple and very straight forward, isn't it? Just utilize the existing
> > > > > css offline worker. And, that a couple of lines of code do improve some
> > > > > throughput issues for some real usecases.
> > > > I do not really care it is few LOC. It is more important that it is
> > > > conflating force_empty into offlining logic. There was a good reason to
> > > > remove reparenting/emptying the memcg during the offline. Considering
> > > > that you can offload force_empty from userspace trivially then I do not
> > > > see any reason to implement it in the kernel.
> > > Er, I may not articulate in the earlier email, force_empty can not be
> > > offloaded from userspace *trivially*. IOWs the container scheduler may
> > > unexpectedly overcommit something due to the stall of synchronous force
> > > empty, which can't be figured out by userspace before it actually happens.
> > > The scheduler doesn't know how long force_empty would take. If the
> > > force_empty could be offloaded by kernel, it would make scheduler's life
> > > much easier. This is not something userspace could do.
> > What exactly prevents
> > (
> > echo 1 > $memecg/force_empty
> > rmdir $memcg
> > ) &
> > 
> > so that this sequence doesn't really block anything?
> 
> We have "restarting the same name job" logic in our usecase (I'm not quite
> sure why they do so). Basically, it means to create memcg with the exact
> same name right after the old one is deleted, but may have different limit
> or other settings. The creation has to wait for rmdir is done. Even though
> rmdir is done in background like the above, the stall still exists since
> rmdir simply is waiting for force_empty.

OK, I see. This is an important detail you didn't mention previously (or
at least I didn't understand it). One thing is still not clear to me.
"Restarting the same job" sounds as if the memcg itself could be
recycled as well. You are saying that the setting might change but if
that is about limits then we should handle that just fine. Or what other
kind of setting changes that wouldn't work properly?

If the recycling is not possible then I would suggest to not reuse
force_empty interface but add wipe_on_destruction or similar new knob
which would enforce reclaim on offlining. It seems we have several
people asking for something like that already.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty
  2019-01-04  8:55                     ` Michal Hocko
@ 2019-01-04 16:46                       ` Yang Shi
  0 siblings, 0 replies; 30+ messages in thread
From: Yang Shi @ 2019-01-04 16:46 UTC (permalink / raw)
  To: Michal Hocko; +Cc: hannes, akpm, linux-mm, linux-kernel



On 1/4/19 12:55 AM, Michal Hocko wrote:
> On Thu 03-01-19 20:15:30, Yang Shi wrote:
>>
>> On 1/3/19 12:01 PM, Michal Hocko wrote:
>>> On Thu 03-01-19 11:49:32, Yang Shi wrote:
>>>> On 1/3/19 11:23 AM, Michal Hocko wrote:
>>>>> On Thu 03-01-19 11:10:00, Yang Shi wrote:
>>>>>> On 1/3/19 10:53 AM, Michal Hocko wrote:
>>>>>>> On Thu 03-01-19 10:40:54, Yang Shi wrote:
>>>>>>>> On 1/3/19 10:13 AM, Michal Hocko wrote:
>>>>> [...]
>>>>>>>>> Is there any reason for your scripts to be strictly sequential here? In
>>>>>>>>> other words why cannot you offload those expensive operations to a
>>>>>>>>> detached context in _userspace_?
>>>>>>>> I would say it has not to be strictly sequential. The above script is just
>>>>>>>> an example to illustrate the pattern. But, sometimes it may hit such pattern
>>>>>>>> due to the complicated cluster scheduling and container scheduling in the
>>>>>>>> production environment, for example the creation process might be scheduled
>>>>>>>> to the same CPU which is doing force_empty. I have to say I don't know too
>>>>>>>> much about the internals of the container scheduling.
>>>>>>> In that case I do not see a strong reason to implement the offloding
>>>>>>> into the kernel. It is an additional code and semantic to maintain.
>>>>>> Yes, it does introduce some additional code and semantic, but IMHO, it is
>>>>>> quite simple and very straight forward, isn't it? Just utilize the existing
>>>>>> css offline worker. And, that a couple of lines of code do improve some
>>>>>> throughput issues for some real usecases.
>>>>> I do not really care it is few LOC. It is more important that it is
>>>>> conflating force_empty into offlining logic. There was a good reason to
>>>>> remove reparenting/emptying the memcg during the offline. Considering
>>>>> that you can offload force_empty from userspace trivially then I do not
>>>>> see any reason to implement it in the kernel.
>>>> Er, I may not articulate in the earlier email, force_empty can not be
>>>> offloaded from userspace *trivially*. IOWs the container scheduler may
>>>> unexpectedly overcommit something due to the stall of synchronous force
>>>> empty, which can't be figured out by userspace before it actually happens.
>>>> The scheduler doesn't know how long force_empty would take. If the
>>>> force_empty could be offloaded by kernel, it would make scheduler's life
>>>> much easier. This is not something userspace could do.
>>> What exactly prevents
>>> (
>>> echo 1 > $memecg/force_empty
>>> rmdir $memcg
>>> ) &
>>>
>>> so that this sequence doesn't really block anything?
>> We have "restarting the same name job" logic in our usecase (I'm not quite
>> sure why they do so). Basically, it means to create memcg with the exact
>> same name right after the old one is deleted, but may have different limit
>> or other settings. The creation has to wait for rmdir is done. Even though
>> rmdir is done in background like the above, the stall still exists since
>> rmdir simply is waiting for force_empty.
> OK, I see. This is an important detail you didn't mention previously (or
> at least I didn't understand it). One thing is still not clear to me.

Sorry, I should articulated at the first place.

> "Restarting the same job" sounds as if the memcg itself could be
> recycled as well. You are saying that the setting might change but if
> that is about limits then we should handle that just fine. Or what other
> kind of setting changes that wouldn't work properly?

We did try resize limit, but it may also incur costly direct reclaim to 
block something. Other than this we also want to reset all the 
counters/stats to get clearer and cleaner resource isolation since the 
container may run different jobs although they use the same name.

>
> If the recycling is not possible then I would suggest to not reuse
> force_empty interface but add wipe_on_destruction or similar new knob
> which would enforce reclaim on offlining. It seems we have several
> people asking for something like that already.

We did have a new knob in our in-house implementation, it just did 
force_empty on offlining.

So, you mean to have a new knob to just do force empty offlining, and 
keep force_empty's behavior, right?

Thanks,
Yang

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty
  2019-01-03 19:49               ` Yang Shi
  2019-01-03 20:01                 ` Michal Hocko
@ 2019-01-04 20:03                 ` Greg Thelen
  2019-01-04 20:03                   ` Greg Thelen
                                     ` (2 more replies)
  1 sibling, 3 replies; 30+ messages in thread
From: Greg Thelen @ 2019-01-04 20:03 UTC (permalink / raw)
  To: Yang Shi, Michal Hocko; +Cc: hannes, akpm, linux-mm, linux-kernel

Yang Shi <yang.shi@linux.alibaba.com> wrote:

> On 1/3/19 11:23 AM, Michal Hocko wrote:
>> On Thu 03-01-19 11:10:00, Yang Shi wrote:
>>>
>>> On 1/3/19 10:53 AM, Michal Hocko wrote:
>>>> On Thu 03-01-19 10:40:54, Yang Shi wrote:
>>>>> On 1/3/19 10:13 AM, Michal Hocko wrote:
>> [...]
>>>>>> Is there any reason for your scripts to be strictly sequential here? In
>>>>>> other words why cannot you offload those expensive operations to a
>>>>>> detached context in _userspace_?
>>>>> I would say it has not to be strictly sequential. The above script is just
>>>>> an example to illustrate the pattern. But, sometimes it may hit such pattern
>>>>> due to the complicated cluster scheduling and container scheduling in the
>>>>> production environment, for example the creation process might be scheduled
>>>>> to the same CPU which is doing force_empty. I have to say I don't know too
>>>>> much about the internals of the container scheduling.
>>>> In that case I do not see a strong reason to implement the offloding
>>>> into the kernel. It is an additional code and semantic to maintain.
>>> Yes, it does introduce some additional code and semantic, but IMHO, it is
>>> quite simple and very straight forward, isn't it? Just utilize the existing
>>> css offline worker. And, that a couple of lines of code do improve some
>>> throughput issues for some real usecases.
>> I do not really care it is few LOC. It is more important that it is
>> conflating force_empty into offlining logic. There was a good reason to
>> remove reparenting/emptying the memcg during the offline. Considering
>> that you can offload force_empty from userspace trivially then I do not
>> see any reason to implement it in the kernel.
>
> Er, I may not articulate in the earlier email, force_empty can not be 
> offloaded from userspace *trivially*. IOWs the container scheduler may 
> unexpectedly overcommit something due to the stall of synchronous force 
> empty, which can't be figured out by userspace before it actually 
> happens. The scheduler doesn't know how long force_empty would take. If 
> the force_empty could be offloaded by kernel, it would make scheduler's 
> life much easier. This is not something userspace could do.

If kernel workqueues are doing more work (i.e. force_empty processing),
then it seem like the time to offline could grow.  I'm not sure if
that's important.

I assume that if we make force_empty an async side effect of rmdir then
user space scheduler would not be unable to immediately assume the
rmdir'd container memory is available without subjecting a new container
to direct reclaim.  So it seems like user space would use a mechanism to
wait for reclaim: either the existing sync force_empty or polling
meminfo/etc waiting for free memory to appear.

>>>> I think it is more important to discuss whether we want to introduce
>>>> force_empty in cgroup v2.
>>> We would prefer have it in v2 as well.
>> Then bring this up in a separate email thread please.
>
> Sure. Will prepare the patches later.
>
> Thanks,
> Yang

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty
  2019-01-04 20:03                 ` Greg Thelen
@ 2019-01-04 20:03                   ` Greg Thelen
  2019-01-04 21:41                   ` Yang Shi
  2019-01-04 22:57                   ` Yang Shi
  2 siblings, 0 replies; 30+ messages in thread
From: Greg Thelen @ 2019-01-04 20:03 UTC (permalink / raw)
  To: Yang Shi, Michal Hocko; +Cc: hannes, akpm, linux-mm, linux-kernel

Yang Shi <yang.shi@linux.alibaba.com> wrote:

> On 1/3/19 11:23 AM, Michal Hocko wrote:
>> On Thu 03-01-19 11:10:00, Yang Shi wrote:
>>>
>>> On 1/3/19 10:53 AM, Michal Hocko wrote:
>>>> On Thu 03-01-19 10:40:54, Yang Shi wrote:
>>>>> On 1/3/19 10:13 AM, Michal Hocko wrote:
>> [...]
>>>>>> Is there any reason for your scripts to be strictly sequential here? In
>>>>>> other words why cannot you offload those expensive operations to a
>>>>>> detached context in _userspace_?
>>>>> I would say it has not to be strictly sequential. The above script is just
>>>>> an example to illustrate the pattern. But, sometimes it may hit such pattern
>>>>> due to the complicated cluster scheduling and container scheduling in the
>>>>> production environment, for example the creation process might be scheduled
>>>>> to the same CPU which is doing force_empty. I have to say I don't know too
>>>>> much about the internals of the container scheduling.
>>>> In that case I do not see a strong reason to implement the offloding
>>>> into the kernel. It is an additional code and semantic to maintain.
>>> Yes, it does introduce some additional code and semantic, but IMHO, it is
>>> quite simple and very straight forward, isn't it? Just utilize the existing
>>> css offline worker. And, that a couple of lines of code do improve some
>>> throughput issues for some real usecases.
>> I do not really care it is few LOC. It is more important that it is
>> conflating force_empty into offlining logic. There was a good reason to
>> remove reparenting/emptying the memcg during the offline. Considering
>> that you can offload force_empty from userspace trivially then I do not
>> see any reason to implement it in the kernel.
>
> Er, I may not articulate in the earlier email, force_empty can not be 
> offloaded from userspace *trivially*. IOWs the container scheduler may 
> unexpectedly overcommit something due to the stall of synchronous force 
> empty, which can't be figured out by userspace before it actually 
> happens. The scheduler doesn't know how long force_empty would take. If 
> the force_empty could be offloaded by kernel, it would make scheduler's 
> life much easier. This is not something userspace could do.

If kernel workqueues are doing more work (i.e. force_empty processing),
then it seem like the time to offline could grow.  I'm not sure if
that's important.

I assume that if we make force_empty an async side effect of rmdir then
user space scheduler would not be unable to immediately assume the
rmdir'd container memory is available without subjecting a new container
to direct reclaim.  So it seems like user space would use a mechanism to
wait for reclaim: either the existing sync force_empty or polling
meminfo/etc waiting for free memory to appear.

>>>> I think it is more important to discuss whether we want to introduce
>>>> force_empty in cgroup v2.
>>> We would prefer have it in v2 as well.
>> Then bring this up in a separate email thread please.
>
> Sure. Will prepare the patches later.
>
> Thanks,
> Yang


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty
  2019-01-04 20:03                 ` Greg Thelen
  2019-01-04 20:03                   ` Greg Thelen
@ 2019-01-04 21:41                   ` Yang Shi
  2019-01-04 22:57                   ` Yang Shi
  2 siblings, 0 replies; 30+ messages in thread
From: Yang Shi @ 2019-01-04 21:41 UTC (permalink / raw)
  To: Greg Thelen, Michal Hocko; +Cc: hannes, akpm, linux-mm, linux-kernel



On 1/4/19 12:03 PM, Greg Thelen wrote:
> Yang Shi <yang.shi@linux.alibaba.com> wrote:
>
>> On 1/3/19 11:23 AM, Michal Hocko wrote:
>>> On Thu 03-01-19 11:10:00, Yang Shi wrote:
>>>> On 1/3/19 10:53 AM, Michal Hocko wrote:
>>>>> On Thu 03-01-19 10:40:54, Yang Shi wrote:
>>>>>> On 1/3/19 10:13 AM, Michal Hocko wrote:
>>> [...]
>>>>>>> Is there any reason for your scripts to be strictly sequential here? In
>>>>>>> other words why cannot you offload those expensive operations to a
>>>>>>> detached context in _userspace_?
>>>>>> I would say it has not to be strictly sequential. The above script is just
>>>>>> an example to illustrate the pattern. But, sometimes it may hit such pattern
>>>>>> due to the complicated cluster scheduling and container scheduling in the
>>>>>> production environment, for example the creation process might be scheduled
>>>>>> to the same CPU which is doing force_empty. I have to say I don't know too
>>>>>> much about the internals of the container scheduling.
>>>>> In that case I do not see a strong reason to implement the offloding
>>>>> into the kernel. It is an additional code and semantic to maintain.
>>>> Yes, it does introduce some additional code and semantic, but IMHO, it is
>>>> quite simple and very straight forward, isn't it? Just utilize the existing
>>>> css offline worker. And, that a couple of lines of code do improve some
>>>> throughput issues for some real usecases.
>>> I do not really care it is few LOC. It is more important that it is
>>> conflating force_empty into offlining logic. There was a good reason to
>>> remove reparenting/emptying the memcg during the offline. Considering
>>> that you can offload force_empty from userspace trivially then I do not
>>> see any reason to implement it in the kernel.
>> Er, I may not articulate in the earlier email, force_empty can not be
>> offloaded from userspace *trivially*. IOWs the container scheduler may
>> unexpectedly overcommit something due to the stall of synchronous force
>> empty, which can't be figured out by userspace before it actually
>> happens. The scheduler doesn't know how long force_empty would take. If
>> the force_empty could be offloaded by kernel, it would make scheduler's
>> life much easier. This is not something userspace could do.
> If kernel workqueues are doing more work (i.e. force_empty processing),
> then it seem like the time to offline could grow.  I'm not sure if
> that's important.

Yes, it would grow. I'm not sure, but it seems fine with our workloads.

The reclaim can be placed at the last step of offline, and it can be 
interrupted by some signals, i.e. fatal signal in current code.

>
> I assume that if we make force_empty an async side effect of rmdir then
> user space scheduler would not be unable to immediately assume the
> rmdir'd container memory is available without subjecting a new container
> to direct reclaim.  So it seems like user space would use a mechanism to
> wait for reclaim: either the existing sync force_empty or polling
> meminfo/etc waiting for free memory to appear.

Yes, it is expected side effect, the memory reclaim would happen in a 
short while. In this series I keep sync reclaim behavior of force_empty 
by checking the written value. Michal suggested a new knob do the 
offline reclaim, and keep force_empty intact.

I think using which knob is user's discretion.

Thanks,
Yang

>
>>>>> I think it is more important to discuss whether we want to introduce
>>>>> force_empty in cgroup v2.
>>>> We would prefer have it in v2 as well.
>>> Then bring this up in a separate email thread please.
>> Sure. Will prepare the patches later.
>>
>> Thanks,
>> Yang

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty
  2019-01-04 20:03                 ` Greg Thelen
  2019-01-04 20:03                   ` Greg Thelen
  2019-01-04 21:41                   ` Yang Shi
@ 2019-01-04 22:57                   ` Yang Shi
  2019-01-04 23:04                     ` Yang Shi
  2 siblings, 1 reply; 30+ messages in thread
From: Yang Shi @ 2019-01-04 22:57 UTC (permalink / raw)
  To: Greg Thelen, Michal Hocko; +Cc: hannes, akpm, linux-mm, linux-kernel



On 1/4/19 12:03 PM, Greg Thelen wrote:
> Yang Shi <yang.shi@linux.alibaba.com> wrote:
>
>> On 1/3/19 11:23 AM, Michal Hocko wrote:
>>> On Thu 03-01-19 11:10:00, Yang Shi wrote:
>>>> On 1/3/19 10:53 AM, Michal Hocko wrote:
>>>>> On Thu 03-01-19 10:40:54, Yang Shi wrote:
>>>>>> On 1/3/19 10:13 AM, Michal Hocko wrote:
>>> [...]
>>>>>>> Is there any reason for your scripts to be strictly sequential here? In
>>>>>>> other words why cannot you offload those expensive operations to a
>>>>>>> detached context in _userspace_?
>>>>>> I would say it has not to be strictly sequential. The above script is just
>>>>>> an example to illustrate the pattern. But, sometimes it may hit such pattern
>>>>>> due to the complicated cluster scheduling and container scheduling in the
>>>>>> production environment, for example the creation process might be scheduled
>>>>>> to the same CPU which is doing force_empty. I have to say I don't know too
>>>>>> much about the internals of the container scheduling.
>>>>> In that case I do not see a strong reason to implement the offloding
>>>>> into the kernel. It is an additional code and semantic to maintain.
>>>> Yes, it does introduce some additional code and semantic, but IMHO, it is
>>>> quite simple and very straight forward, isn't it? Just utilize the existing
>>>> css offline worker. And, that a couple of lines of code do improve some
>>>> throughput issues for some real usecases.
>>> I do not really care it is few LOC. It is more important that it is
>>> conflating force_empty into offlining logic. There was a good reason to
>>> remove reparenting/emptying the memcg during the offline. Considering
>>> that you can offload force_empty from userspace trivially then I do not
>>> see any reason to implement it in the kernel.
>> Er, I may not articulate in the earlier email, force_empty can not be
>> offloaded from userspace *trivially*. IOWs the container scheduler may
>> unexpectedly overcommit something due to the stall of synchronous force
>> empty, which can't be figured out by userspace before it actually
>> happens. The scheduler doesn't know how long force_empty would take. If
>> the force_empty could be offloaded by kernel, it would make scheduler's
>> life much easier. This is not something userspace could do.
> If kernel workqueues are doing more work (i.e. force_empty processing),
> then it seem like the time to offline could grow.  I'm not sure if
> that's important.

One thing I can think of is this may slow down the recycling of memcg 
id. This may cause memcg id exhausted for some extreme workload. But, I 
don't see this as a problem in our workload.

Thanks,
Yang

>
> I assume that if we make force_empty an async side effect of rmdir then
> user space scheduler would not be unable to immediately assume the
> rmdir'd container memory is available without subjecting a new container
> to direct reclaim.  So it seems like user space would use a mechanism to
> wait for reclaim: either the existing sync force_empty or polling
> meminfo/etc waiting for free memory to appear.
>
>>>>> I think it is more important to discuss whether we want to introduce
>>>>> force_empty in cgroup v2.
>>>> We would prefer have it in v2 as well.
>>> Then bring this up in a separate email thread please.
>> Sure. Will prepare the patches later.
>>
>> Thanks,
>> Yang

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty
  2019-01-04 22:57                   ` Yang Shi
@ 2019-01-04 23:04                     ` Yang Shi
  0 siblings, 0 replies; 30+ messages in thread
From: Yang Shi @ 2019-01-04 23:04 UTC (permalink / raw)
  To: Greg Thelen, Michal Hocko; +Cc: hannes, akpm, linux-mm, linux-kernel



On 1/4/19 2:57 PM, Yang Shi wrote:
>
>
> On 1/4/19 12:03 PM, Greg Thelen wrote:
>> Yang Shi <yang.shi@linux.alibaba.com> wrote:
>>
>>> On 1/3/19 11:23 AM, Michal Hocko wrote:
>>>> On Thu 03-01-19 11:10:00, Yang Shi wrote:
>>>>> On 1/3/19 10:53 AM, Michal Hocko wrote:
>>>>>> On Thu 03-01-19 10:40:54, Yang Shi wrote:
>>>>>>> On 1/3/19 10:13 AM, Michal Hocko wrote:
>>>> [...]
>>>>>>>> Is there any reason for your scripts to be strictly sequential 
>>>>>>>> here? In
>>>>>>>> other words why cannot you offload those expensive operations to a
>>>>>>>> detached context in _userspace_?
>>>>>>> I would say it has not to be strictly sequential. The above 
>>>>>>> script is just
>>>>>>> an example to illustrate the pattern. But, sometimes it may hit 
>>>>>>> such pattern
>>>>>>> due to the complicated cluster scheduling and container 
>>>>>>> scheduling in the
>>>>>>> production environment, for example the creation process might 
>>>>>>> be scheduled
>>>>>>> to the same CPU which is doing force_empty. I have to say I 
>>>>>>> don't know too
>>>>>>> much about the internals of the container scheduling.
>>>>>> In that case I do not see a strong reason to implement the offloding
>>>>>> into the kernel. It is an additional code and semantic to maintain.
>>>>> Yes, it does introduce some additional code and semantic, but 
>>>>> IMHO, it is
>>>>> quite simple and very straight forward, isn't it? Just utilize the 
>>>>> existing
>>>>> css offline worker. And, that a couple of lines of code do improve 
>>>>> some
>>>>> throughput issues for some real usecases.
>>>> I do not really care it is few LOC. It is more important that it is
>>>> conflating force_empty into offlining logic. There was a good 
>>>> reason to
>>>> remove reparenting/emptying the memcg during the offline. Considering
>>>> that you can offload force_empty from userspace trivially then I do 
>>>> not
>>>> see any reason to implement it in the kernel.
>>> Er, I may not articulate in the earlier email, force_empty can not be
>>> offloaded from userspace *trivially*. IOWs the container scheduler may
>>> unexpectedly overcommit something due to the stall of synchronous force
>>> empty, which can't be figured out by userspace before it actually
>>> happens. The scheduler doesn't know how long force_empty would take. If
>>> the force_empty could be offloaded by kernel, it would make scheduler's
>>> life much easier. This is not something userspace could do.
>> If kernel workqueues are doing more work (i.e. force_empty processing),
>> then it seem like the time to offline could grow.  I'm not sure if
>> that's important.
>
> One thing I can think of is this may slow down the recycling of memcg 
> id. This may cause memcg id exhausted for some extreme workload. But, 
> I don't see this as a problem in our workload.

Actually, sync force_empty should have the same side effect.

Yang

>
> Thanks,
> Yang
>
>>
>> I assume that if we make force_empty an async side effect of rmdir then
>> user space scheduler would not be unable to immediately assume the
>> rmdir'd container memory is available without subjecting a new container
>> to direct reclaim.  So it seems like user space would use a mechanism to
>> wait for reclaim: either the existing sync force_empty or polling
>> meminfo/etc waiting for free memory to appear.
>>
>>>>>> I think it is more important to discuss whether we want to introduce
>>>>>> force_empty in cgroup v2.
>>>>> We would prefer have it in v2 as well.
>>>> Then bring this up in a separate email thread please.
>>> Sure. Will prepare the patches later.
>>>
>>> Thanks,
>>> Yang
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2019-01-04 23:06 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-02 20:05 [RFC PATCH 0/3] mm: memcontrol: delayed force empty Yang Shi
2019-01-02 20:05 ` [PATCH 1/3] doc: memcontrol: fix the obsolete content about " Yang Shi
2019-01-02 21:18   ` Shakeel Butt
2019-01-02 21:18     ` Shakeel Butt
2019-01-03 10:13   ` Michal Hocko
2019-01-02 20:05 ` [PATCH 2/3] mm: memcontrol: do not try to do swap when " Yang Shi
2019-01-02 21:45   ` Shakeel Butt
2019-01-02 21:45     ` Shakeel Butt
2019-01-03 16:56     ` Yang Shi
2019-01-03 17:03       ` Shakeel Butt
2019-01-03 17:03         ` Shakeel Butt
2019-01-03 18:19         ` Yang Shi
2019-01-02 20:05 ` [PATCH 3/3] mm: memcontrol: delay force empty to css offline Yang Shi
2019-01-03 10:12 ` [RFC PATCH 0/3] mm: memcontrol: delayed force empty Michal Hocko
2019-01-03 17:33   ` Yang Shi
2019-01-03 18:13     ` Michal Hocko
2019-01-03 18:40       ` Yang Shi
2019-01-03 18:53         ` Michal Hocko
2019-01-03 19:10           ` Yang Shi
2019-01-03 19:23             ` Michal Hocko
2019-01-03 19:49               ` Yang Shi
2019-01-03 20:01                 ` Michal Hocko
2019-01-04  4:15                   ` Yang Shi
2019-01-04  8:55                     ` Michal Hocko
2019-01-04 16:46                       ` Yang Shi
2019-01-04 20:03                 ` Greg Thelen
2019-01-04 20:03                   ` Greg Thelen
2019-01-04 21:41                   ` Yang Shi
2019-01-04 22:57                   ` Yang Shi
2019-01-04 23:04                     ` Yang Shi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox