[PATCH] memcg: event control at vmpressure.

* [PATCH] memcg: event control at vmpressure.
@ 2013-06-10 11:14 Hyunhee Kim
  2013-06-10 14:09 ` Luiz Capitulino
  2013-06-10 15:12 ` Michal Hocko
  0 siblings, 2 replies; 12+ messages in thread
From: Hyunhee Kim @ 2013-06-10 11:14 UTC (permalink / raw)
  To: linux-mm; +Cc: 'Kyungmin Park'

In vmpressure, events are sent to user space continuously until the memory
state changes. This creates overhead for user-space modules and also
increases power consumption. With this patch, vmpressure remembers the
current level and sends an event only when the new memory state differs
from the current level.

Signed-off-by: Hyunhee Kim <hyunhee.kim@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
---
 include/linux/vmpressure.h |    2 ++
 mm/vmpressure.c            |    4 +++-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
index 76be077..fa0c0d2 100644
--- a/include/linux/vmpressure.h
+++ b/include/linux/vmpressure.h
@@ -20,6 +20,8 @@ struct vmpressure {
 	struct mutex events_lock;
 
 	struct work_struct work;
+
+	int current_level;
 };
 
 struct mem_cgroup;
diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index 736a601..5f6609c 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -152,9 +152,10 @@ static bool vmpressure_event(struct vmpressure *vmpr,
 	mutex_lock(&vmpr->events_lock);
 
 	list_for_each_entry(ev, &vmpr->events, node) {
-		if (level >= ev->level) {
+		if (level >= ev->level && level != vmpr->current_level) {
 			eventfd_signal(ev->efd, 1);
 			signalled = true;
+			vmpr->current_level = level;
 		}
 	}
 
@@ -371,4 +372,5 @@ void vmpressure_init(struct vmpressure *vmpr)
 	mutex_init(&vmpr->events_lock);
 	INIT_LIST_HEAD(&vmpr->events);
 	INIT_WORK(&vmpr->work, vmpressure_work_fn);
+	vmpr->current_level = -1;
 }
-- 
1.7.9.5


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: [PATCH] memcg: event control at vmpressure.
  2013-06-10 11:14 [PATCH] memcg: event control at vmpressure Hyunhee Kim
@ 2013-06-10 14:09 ` Luiz Capitulino
  2013-06-10 15:12 ` Michal Hocko
  1 sibling, 0 replies; 12+ messages in thread
From: Luiz Capitulino @ 2013-06-10 14:09 UTC (permalink / raw)
  To: Hyunhee Kim; +Cc: linux-mm, 'Kyungmin Park'

On Mon, 10 Jun 2013 20:14:13 +0900
Hyunhee Kim <hyunhee.kim@samsung.com> wrote:

> In vmpressure, events are sent to the user space continuously
> until the memory state changes. This becomes overheads for user space module
> and also consumes power consumption.

If the kernel is still under memory pressure, I think we do want to keep
sending the event to user-space. At least as a default behavior.

I think it would be fine to implement this change as an additional parameter
when registering for the event, but I also wonder if this shouldn't be
solved by the user-space app itself (eg. rate-limiting the event reception).



* Re: [PATCH] memcg: event control at vmpressure.
  2013-06-10 11:14 [PATCH] memcg: event control at vmpressure Hyunhee Kim
  2013-06-10 14:09 ` Luiz Capitulino
@ 2013-06-10 15:12 ` Michal Hocko
  2013-06-11  0:17   ` Anton Vorontsov
  1 sibling, 1 reply; 12+ messages in thread
From: Michal Hocko @ 2013-06-10 15:12 UTC (permalink / raw)
  To: Hyunhee Kim; +Cc: linux-mm, 'Kyungmin Park', Anton Vorontsov

[Let's CC Anton]

On Mon 10-06-13 20:14:13, Hyunhee Kim wrote:
> In vmpressure, events are sent to the user space continuously
> until the memory state changes. This becomes overheads for user space module
> and also consumes power consumption. So, with this patch, vmpressure
> remembers
> the current level and only sends the event only when new memory state is
> different from the current level.
> 
> Signed-off-by: Hyunhee Kim <hyunhee.kim@samsung.com>
> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
> ---
>  include/linux/vmpressure.h |    2 ++
>  mm/vmpressure.c            |    4 +++-
>  2 files changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
> index 76be077..fa0c0d2 100644
> --- a/include/linux/vmpressure.h
> +++ b/include/linux/vmpressure.h
> @@ -20,6 +20,8 @@ struct vmpressure {
>  	struct mutex events_lock;
>  
>  	struct work_struct work;
> +
> +	int current_level;
>  };
>  
>  struct mem_cgroup;
> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> index 736a601..5f6609c 100644
> --- a/mm/vmpressure.c
> +++ b/mm/vmpressure.c
> @@ -152,9 +152,10 @@ static bool vmpressure_event(struct vmpressure *vmpr,
>  	mutex_lock(&vmpr->events_lock);
>  
>  	list_for_each_entry(ev, &vmpr->events, node) {
> -		if (level >= ev->level) {
> +		if (level >= ev->level && level != vmpr->current_level) {
>  			eventfd_signal(ev->efd, 1);
>  			signalled = true;
> +			vmpr->current_level = level;

This would mean that you send a signal for, say, VMPRESSURE_LOW, then
the reclaim finishes and two days later when you hit the reclaim again
you would simply miss the event, right?

So, unless I am missing something, then this is plain wrong. If you are
worried about too many events then a time based throttling should be
implemented.
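
The missed-event scenario can be reproduced with a small user-space model
of the v1 check. This is only a sketch: `model_vmpr`, `v1_should_signal`
and the numeric level values are stand-ins invented for illustration, not
the kernel's data structures.

```c
#include <stdbool.h>

enum { LEVEL_LOW = 0, LEVEL_MEDIUM = 1, LEVEL_CRITICAL = 2 };

/* Hypothetical stand-in for the one field the v1 patch adds. */
struct model_vmpr {
	int current_level;	/* -1 until the first event, as in the patch */
};

/*
 * Mirror of the v1 condition for a single listener registered at
 * ev_level: signal only if the level qualifies AND differs from the
 * remembered one. Nothing resets current_level when reclaim ends.
 */
bool v1_should_signal(struct model_vmpr *vmpr, int ev_level, int level)
{
	if (level >= ev_level && level != vmpr->current_level) {
		vmpr->current_level = level;
		return true;
	}
	return false;
}
```

In this model the first LOW event is delivered, but a second LOW event is
suppressed no matter how much time has passed in between, because there is
no "no pressure" state that would clear the remembered level.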

>  		}
>  	}
>  
> @@ -371,4 +372,5 @@ void vmpressure_init(struct vmpressure *vmpr)
>  	mutex_init(&vmpr->events_lock);
>  	INIT_LIST_HEAD(&vmpr->events);
>  	INIT_WORK(&vmpr->work, vmpressure_work_fn);
> +	vmpr->current_level = -1;
>  }
> -- 
> 1.7.9.5
> 
> 

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] memcg: event control at vmpressure.
  2013-06-10 15:12 ` Michal Hocko
@ 2013-06-11  0:17   ` Anton Vorontsov
  2013-06-11  1:01     ` Kyungmin Park
  2013-06-11  6:21     ` Michal Hocko
  0 siblings, 2 replies; 12+ messages in thread
From: Anton Vorontsov @ 2013-06-11  0:17 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Hyunhee Kim, linux-mm, 'Kyungmin Park'

On Mon, Jun 10, 2013 at 05:12:58PM +0200, Michal Hocko wrote:
> > +		if (level >= ev->level && level != vmpr->current_level) {
> >  			eventfd_signal(ev->efd, 1);
> >  			signalled = true;
> > +			vmpr->current_level = level;
> 
> This would mean that you send a signal for, say, VMPRESSURE_LOW, then
> the reclaim finishes and two days later when you hit the reclaim again
> you would simply miss the event, right?
> 
> So, unless I am missing something, then this is plain wrong.

Yup, in its current version, it is not acceptable. For example, sometimes
we do want to see all the _LOW events, since _LOW level shows not just the
level itself, but the activity (i.e. reclaiming process).

There are a few ways to make both parties happy, though.

If the app wants to implement the time-based throttling, then just close
the fd and sleep for needed amount of time (or do not read from the
eventfd -- kernel then will just increment the eventfd counter, so there
won't be context switches at the least). Doing the time-based throttling
in the kernel won't buy us much, I believe.
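
A minimal user-space sketch of the "do not read from the eventfd" variant,
assuming a non-blocking eventfd created with eventfd(2). The helper names
`drain_pressure_events` and `fake_kernel_signal` are hypothetical; the
latter only emulates the kernel's eventfd_signal(ev->efd, 1) with write(2)
so that the counter behaviour can be observed without a cgroup.

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Read the eventfd once and return how many notifications were
 * coalesced into its counter since the previous read. Returns 0 when
 * nothing is pending (the fd is assumed non-blocking) or on error.
 */
uint64_t drain_pressure_events(int efd)
{
	uint64_t count = 0;

	if (read(efd, &count, sizeof(count)) != (ssize_t)sizeof(count))
		return 0;
	return count;
}

/*
 * Stand-in for the kernel side: every signal adds 1 to the eventfd
 * counter. A real listener would never call this itself.
 */
int fake_kernel_signal(int efd)
{
	uint64_t one = 1;

	return write(efd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}
```

A throttled listener would sleep for its chosen interval and then call
drain_pressure_events() once; any events raised while it slept show up as
a single count rather than as separate wakeups.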

Or, if you still want the "one-shot"/"edge-triggered" events (which might
make perfect sense for medium and critical levels), then I'd propose to
add some additional flag when you register the event, so that the old
behaviour would be still available for those who need it. This approach I
think is the best one.

Thanks!

Anton


* Re: [PATCH] memcg: event control at vmpressure.
  2013-06-11  0:17   ` Anton Vorontsov
@ 2013-06-11  1:01     ` Kyungmin Park
  2013-06-11  6:21     ` Michal Hocko
  1 sibling, 0 replies; 12+ messages in thread
From: Kyungmin Park @ 2013-06-11  1:01 UTC (permalink / raw)
  To: Anton Vorontsov; +Cc: Michal Hocko, Hyunhee Kim, linux-mm

On Tue, Jun 11, 2013 at 9:17 AM, Anton Vorontsov <anton@enomsg.org> wrote:

> On Mon, Jun 10, 2013 at 05:12:58PM +0200, Michal Hocko wrote:
> > > +           if (level >= ev->level && level != vmpr->current_level) {
> > >                     eventfd_signal(ev->efd, 1);
> > >                     signalled = true;
> > > +                   vmpr->current_level = level;
> >
> > This would mean that you send a signal for, say, VMPRESSURE_LOW, then
> > the reclaim finishes and two days later when you hit the reclaim again
> > you would simply miss the event, right?
> >
> > So, unless I am missing something, then this is plain wrong.
>
> Yup, in its current version, it is not acceptable. For example, sometimes
> we do want to see all the _LOW events, since _LOW level shows not just the
> level itself, but the activity (i.e. reclaiming process).
>
> There are a few ways to make both parties happy, though.
>
> If the app wants to implement the time-based throttling, then just close
> the fd and sleep for needed amount of time (or do not read from the
> eventfd -- kernel then will just increment the eventfd counter, so there
> won't be context switches at the least). Doing the time-based throttling
> in the kernel won't buy us much, I believe.
>
> Or, if you still want the "one-shot"/"edge-triggered" events (which might
> make perfect sense for medium and critical levels), then I'd propose to
> add some additional flag when you register the event, so that the old
> behaviour would be still available for those who need it. This approach I
> think is the best one.
>
OK, we will prepare it this way and resend it.

Thank you,
Kyungmin Park


* Re: [PATCH] memcg: event control at vmpressure.
  2013-06-11  0:17   ` Anton Vorontsov
  2013-06-11  1:01     ` Kyungmin Park
@ 2013-06-11  6:21     ` Michal Hocko
  2013-06-11  8:49       ` [PATCH v2] " Hyunhee Kim
                         ` (2 more replies)
  1 sibling, 3 replies; 12+ messages in thread
From: Michal Hocko @ 2013-06-11  6:21 UTC (permalink / raw)
  To: Anton Vorontsov; +Cc: Hyunhee Kim, linux-mm, 'Kyungmin Park'

On Mon 10-06-13 17:17:47, Anton Vorontsov wrote:
> On Mon, Jun 10, 2013 at 05:12:58PM +0200, Michal Hocko wrote:
> > > +		if (level >= ev->level && level != vmpr->current_level) {
> > >  			eventfd_signal(ev->efd, 1);
> > >  			signalled = true;
> > > +			vmpr->current_level = level;
> > 
> > This would mean that you send a signal for, say, VMPRESSURE_LOW, then
> > the reclaim finishes and two days later when you hit the reclaim again
> > you would simply miss the event, right?
> > 
> > So, unless I am missing something, then this is plain wrong.
> 
> > Yup, in its current version, it is not acceptable. For example, sometimes
> we do want to see all the _LOW events, since _LOW level shows not just the
> level itself, but the activity (i.e. reclaiming process).
> 
> There are a few ways to make both parties happy, though.
> 
> If the app wants to implement the time-based throttling, then just close
> the fd and sleep for needed amount of time (or do not read from the
> eventfd -- kernel then will just increment the eventfd counter, so there
> won't be context switches at the least).

That makes sense to me.

> Doing the time-based throttling in the kernel won't buy us much, I
> believe.

Yes.
 
> Or, if you still want the "one-shot"/"edge-triggered" events (which might
> make perfect sense for medium and critical levels), then I'd propose to
> add some additional flag when you register the event, so that the old
> behaviour would be still available for those who need it. This approach I
> think is the best one.

Hmm, how would one-shot even differ from a single open, register, read
and close?

-- 
Michal Hocko
SUSE Labs


* [PATCH v2] memcg: event control at vmpressure.
  2013-06-11  6:21     ` Michal Hocko
@ 2013-06-11  8:49       ` Hyunhee Kim
  2013-06-11 12:59         ` Michal Hocko
  2013-06-11 13:10       ` [PATCH] " Luiz Capitulino
  2013-06-11 13:13       ` Pekka Enberg
  2 siblings, 1 reply; 12+ messages in thread
From: Hyunhee Kim @ 2013-06-11  8:49 UTC (permalink / raw)
  To: 'Michal Hocko', 'Anton Vorontsov'
  Cc: linux-mm, 'Kyungmin Park'

In the original vmpressure, events are sent to user space continuously
until the memory state changes. This creates overhead for user-space
modules and also increases power consumption. With this patch, vmpressure
remembers the current level and sends an event only when the new memory
state differs from the current level. This behavior can be selected when
registering each event by writing a trigger option (0 or 1) next to the
level.

Change-Id: Ie075b7c510a9cea8c4a092ac4fa4680248139371
Signed-off-by: Hyunhee Kim <hyunhee.kim@samsung.com>
Reviewed-on: http://165.213.202.130:8080/55935
Reviewed-by: Kyungmin Park <kyungmin.park@samsung.com>
Tested-by: Kyungmin Park <kyungmin.park@samsung.com>
---
 Documentation/cgroups/memory.txt |   10 ++++++++--
 include/linux/vmpressure.h       |    2 ++
 mm/vmpressure.c                  |   35 ++++++++++++++++++++++++++++++-----
 3 files changed, 40 insertions(+), 7 deletions(-)

diff --git a/Documentation/cgroups/memory.txt
b/Documentation/cgroups/memory.txt
index ddf4f93..cc12aaa 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -791,6 +791,11 @@ way to trigger. Applications should do whatever they
can to help the
 system. It might be too late to consult with vmstat or any other
 statistics, so it's advisable to take an immediate action.
 
+Events can be triggered continuously or only when the level changes.
Trigger
+option is decided by writing it next to level. If "0", events are sent
+every time the reclaiming occurs. If "1", events are sent only when the
level
+is changed.
+
 The events are propagated upward until the event is handled, i.e. the
 events are not pass-through. Here is what this means: for example you have
 three cgroups: A->B->C. Now you set up an event listener on cgroups A, B
@@ -807,7 +812,8 @@ register a notification, an application must:
 
 - create an eventfd using eventfd(2);
 - open memory.pressure_level;
-- write string like "<event_fd> <fd of memory.pressure_level> <level>"
+- write string like
+	"<event_fd> <fd of memory.pressure_level> <level> <trigger_option>"
   to cgroup.event_control.
 
 Application will be notified through eventfd when memory pressure is at
@@ -823,7 +829,7 @@ Test:
    # cd /sys/fs/cgroup/memory/
    # mkdir foo
    # cd foo
-   # cgroup_event_listener memory.pressure_level low &
+   # cgroup_event_listener memory.pressure_level low 0 &
    # echo 8000000 > memory.limit_in_bytes
    # echo 8000000 > memory.memsw.limit_in_bytes
    # echo $$ > tasks
diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
index 76be077..fa0c0d2 100644
--- a/include/linux/vmpressure.h
+++ b/include/linux/vmpressure.h
@@ -20,6 +20,8 @@ struct vmpressure {
 	struct mutex events_lock;
 
 	struct work_struct work;
+
+	int current_level;
 };
 
 struct mem_cgroup;
diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index 736a601..0ffed76 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -137,6 +137,7 @@ static enum vmpressure_levels
vmpressure_calc_level(unsigned long scanned,
 struct vmpressure_event {
 	struct eventfd_ctx *efd;
 	enum vmpressure_levels level;
+	unsigned long edge_trigger;
 	struct list_head node;
 };
 
@@ -153,8 +154,11 @@ static bool vmpressure_event(struct vmpressure *vmpr,
 
 	list_for_each_entry(ev, &vmpr->events, node) {
 		if (level >= ev->level) {
+			if (ev->edge_trigger && level ==
vmpr->current_level)
+				continue;
 			eventfd_signal(ev->efd, 1);
 			signalled = true;
+			vmpr->current_level = level;
 		}
 	}
 
@@ -290,9 +294,11 @@ void vmpressure_prio(gfp_t gfp, struct mem_cgroup
*memcg, int prio)
  *
  * This function associates eventfd context with the vmpressure
  * infrastructure, so that the notifications will be delivered to the
- * @eventfd. The @args parameter is a string that denotes pressure level
+ * @eventfd. The @args parameters are a string that denotes pressure level
  * threshold (one of vmpressure_str_levels, i.e. "low", "medium", or
- * "critical").
+ * "critical") and a trigger option that decides whether events are
triggered
+ * continuously or only on edge (0 or 1 if 1, events are triggered only
when
+ * the level changes.
  *
  * This function should not be used directly, just pass it to (struct
  * cftype).register_event, and then cgroup core will handle everything by
@@ -303,14 +309,31 @@ int vmpressure_register_event(struct cgroup *cg,
struct cftype *cft,
 {
 	struct vmpressure *vmpr = cg_to_vmpressure(cg);
 	struct vmpressure_event *ev;
-	int level;
+	unsigned long trigger = 0;
+	int level, i = 0;
+	char *s[2], *p;
+
+	while ((p = strsep((char **)&args, " ")) != NULL) {
+		if (!*p)
+			continue;
+		s[i++] = p;
+
+		/* Prevent from inputing more than 2 args */
+		if (i == 2)
+			break;
+	}
+
+	if (i != 2)
+		return -EINVAL;
+
+	trigger = simple_strtoul(s[1], NULL, sizeof(s[1]));
 
 	for (level = 0; level < VMPRESSURE_NUM_LEVELS; level++) {
-		if (!strcmp(vmpressure_str_levels[level], args))
+		if (!strcmp(vmpressure_str_levels[level], s[0]))
 			break;
 	}
 
-	if (level >= VMPRESSURE_NUM_LEVELS)
+	if (trigger > 1 || level >= VMPRESSURE_NUM_LEVELS)
 		return -EINVAL;
 
 	ev = kzalloc(sizeof(*ev), GFP_KERNEL);
@@ -319,6 +342,7 @@ int vmpressure_register_event(struct cgroup *cg, struct
cftype *cft,
 
 	ev->efd = eventfd;
 	ev->level = level;
+	ev->edge_trigger = trigger;
 
 	mutex_lock(&vmpr->events_lock);
 	list_add(&ev->node, &vmpr->events);
@@ -371,4 +395,5 @@ void vmpressure_init(struct vmpressure *vmpr)
 	mutex_init(&vmpr->events_lock);
 	INIT_LIST_HEAD(&vmpr->events);
 	INIT_WORK(&vmpr->work, vmpressure_work_fn);
+	vmpr->current_level = -1;
 }
-- 
1.7.9.5


* Re: [PATCH v2] memcg: event control at vmpressure.
  2013-06-11  8:49       ` [PATCH v2] " Hyunhee Kim
@ 2013-06-11 12:59         ` Michal Hocko
  2013-06-12  5:42           ` Hyunhee Kim
  0 siblings, 1 reply; 12+ messages in thread
From: Michal Hocko @ 2013-06-11 12:59 UTC (permalink / raw)
  To: Hyunhee Kim; +Cc: 'Anton Vorontsov', linux-mm, 'Kyungmin Park'

On Tue 11-06-13 17:49:31, Hyunhee Kim wrote:
> In the original vmpressure, event is sent to the user space continuously
> until the memory state changes.

This is not correct AFAIU. Events are sent when the vm_pressure event is
triggered - aka when there is a reclaim activity.

> This becomes overheads to user space module
> and also consumes power consumption.

As Anton already pointed out, if nobody is listening then no events are
actually triggered, so power consumption should not increase. If you are
under reclaim activity then your system is hardly idle anyway.

> So, with this patch, vmpressure remembers the current level and only

I guess you meant "remembers the last level"

> sends the event only new memory state is different with the current
> level. This can be set when registering each event by writing a
> trigger option (0 or 1) next to the level.

What does 0 mean and what does 1 mean? I know I can go and check the code,
but the changelog should tell me without that.

> Change-Id: Ie075b7c510a9cea8c4a092ac4fa4680248139371

Please do not add references to an internal tracking system.

> Signed-off-by: Hyunhee Kim <hyunhee.kim@samsung.com>
> Reviewed-on: http://165.213.202.130:8080/55935
> Reviewed-by: Kyungmin Park <kyungmin.park@samsung.com>
> Tested-by: Kyungmin Park <kyungmin.park@samsung.com>
> ---
>  Documentation/cgroups/memory.txt |   10 ++++++++--
>  include/linux/vmpressure.h       |    2 ++
>  mm/vmpressure.c                  |   35 ++++++++++++++++++++++++++++++-----
>  3 files changed, 40 insertions(+), 7 deletions(-)
> 
> diff --git a/Documentation/cgroups/memory.txt
> b/Documentation/cgroups/memory.txt
> index ddf4f93..cc12aaa 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -791,6 +791,11 @@ way to trigger. Applications should do whatever they
> can to help the
>  system. It might be too late to consult with vmstat or any other
>  statistics, so it's advisable to take an immediate action.
>  
> +Events can be triggered continuously or only when the level changes.
> Trigger
> +option is decided by writing it next to level. If "0", events are sent
> +every time the reclaiming occurs. If "1", events are sent only when the
> level
> +is changed.
> +

The lines seem to be wrapped (maybe your email client does that).

Also what happens when somebody uses an existing application and `0' is
not added? The interface _has_ to be backward compatible. And is the
numeric interface appropriate at all?
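
One backward-compatible shape for the argument string is to treat the
trigger option as optional and default to the existing continuous
behaviour. The following is a user-space sketch under that assumption, not
the kernel implementation: `parse_pressure_args`, `struct parsed_args` and
`level_names` are hypothetical names, and the 0/1 encoding is simply
carried over from the patch. Note that the mode is parsed with an explicit
base of 10, whereas the hunk below passes sizeof(s[1]), i.e. the size of a
pointer, as the base argument of simple_strtoul().

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical mirror of the three level names used by vmpressure. */
static const char * const level_names[] = { "low", "medium", "critical" };

struct parsed_args {
	int level;		/* index into level_names */
	int edge_trigger;	/* 0 = continuous (default), 1 = on change */
};

/* Accepts "<level>" or "<level> <0|1>". Returns 0 or -EINVAL. */
int parse_pressure_args(const char *args, struct parsed_args *out)
{
	char lvl[16], mode[16], extra[2];
	char *end;
	unsigned long v;
	size_t i;
	int n;

	/* A third token means the caller passed too many arguments. */
	n = sscanf(args, "%15s %15s %1s", lvl, mode, extra);
	if (n < 1 || n > 2)
		return -EINVAL;

	out->level = -1;
	for (i = 0; i < sizeof(level_names) / sizeof(level_names[0]); i++) {
		if (!strcmp(level_names[i], lvl))
			out->level = (int)i;
	}
	if (out->level < 0)
		return -EINVAL;

	/* Existing applications write only the level: keep old behaviour. */
	out->edge_trigger = 0;
	if (n == 2) {
		v = strtoul(mode, &end, 10);
		if (*end != '\0' || v > 1)
			return -EINVAL;
		out->edge_trigger = (int)v;
	}
	return 0;
}
```

With this shape, an application that still writes just "low" keeps working
unchanged, while "low 1" opts in to change-only notifications.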

>  The events are propagated upward until the event is handled, i.e. the
>  events are not pass-through. Here is what this means: for example you have
>  three cgroups: A->B->C. Now you set up an event listener on cgroups A, B
> @@ -807,7 +812,8 @@ register a notification, an application must:
>  
>  - create an eventfd using eventfd(2);
>  - open memory.pressure_level;
> -- write string like "<event_fd> <fd of memory.pressure_level> <level>"
> +- write string like
> +	"<event_fd> <fd of memory.pressure_level> <level> <trigger_option>"
>    to cgroup.event_control.
>  
>  Application will be notified through eventfd when memory pressure is at
> @@ -823,7 +829,7 @@ Test:
>     # cd /sys/fs/cgroup/memory/
>     # mkdir foo
>     # cd foo
> -   # cgroup_event_listener memory.pressure_level low &
> +   # cgroup_event_listener memory.pressure_level low 0 &
>     # echo 8000000 > memory.limit_in_bytes
>     # echo 8000000 > memory.memsw.limit_in_bytes
>     # echo $$ > tasks
> diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
> index 76be077..fa0c0d2 100644
> --- a/include/linux/vmpressure.h
> +++ b/include/linux/vmpressure.h
> @@ -20,6 +20,8 @@ struct vmpressure {
>  	struct mutex events_lock;
>  
>  	struct work_struct work;
> +
> +	int current_level;

The name seems to be really inappropriate. This is the last_level in
fact, isn't it?

>  };
>  
>  struct mem_cgroup;
> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> index 736a601..0ffed76 100644
> --- a/mm/vmpressure.c
> +++ b/mm/vmpressure.c
> @@ -137,6 +137,7 @@ static enum vmpressure_levels
> vmpressure_calc_level(unsigned long scanned,
>  struct vmpressure_event {
>  	struct eventfd_ctx *efd;
>  	enum vmpressure_levels level;
> +	unsigned long edge_trigger;

Unsigned long? Why? level is an int so there is a nice 4B hole between
level and edge_trigger. I would also suggest using something like bool.
Do we have more modes that could be used?
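
The layout point can be checked with offsetof() on a user-space mirror of
the structure. This is a sketch only: `model_list_head`, `model_levels`,
`ev_ulong` and `ev_bool` are hypothetical stand-ins, with a void pointer in
place of struct eventfd_ctx *.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types. */
struct model_list_head { struct model_list_head *next, *prev; };
enum model_levels { MODEL_LOW, MODEL_MEDIUM, MODEL_CRITICAL };

/* Layout as proposed in the patch. */
struct ev_ulong {
	void *efd;
	enum model_levels level;
	unsigned long edge_trigger;
	struct model_list_head node;
};

/* Layout with the flag stored as a bool instead. */
struct ev_bool {
	void *efd;
	enum model_levels level;
	bool edge_trigger;
	struct model_list_head node;
};

/* Padding bytes between the end of level and the start of edge_trigger. */
size_t ev_ulong_hole(void)
{
	return offsetof(struct ev_ulong, edge_trigger) -
	       (offsetof(struct ev_ulong, level) + sizeof(enum model_levels));
}
```

On a typical LP64 build ev_ulong_hole() is expected to report the 4-byte
hole mentioned above, while the bool variant places the flag directly
after level.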

>  	struct list_head node;
>  };
>  
> @@ -153,8 +154,11 @@ static bool vmpressure_event(struct vmpressure *vmpr,
>  
>  	list_for_each_entry(ev, &vmpr->events, node) {
>  		if (level >= ev->level) {
> +			if (ev->edge_trigger && level ==
> vmpr->current_level)

Email client again.
But what confuses me is that the current_level is shared for all events
for the pressure group. Is this correct?
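
To illustrate why the sharing matters, here is a user-space model that
compares one shared level against a per-listener level. It is a sketch
under the assumption that the loop behaves as in the hunk below;
`model_event`, `notify_shared` and `notify_per_event` are hypothetical
names, and a counter stands in for eventfd_signal().

```c
#include <stdbool.h>
#include <stddef.h>

struct model_event {
	int level;		/* threshold the listener registered for */
	bool edge_trigger;
	int last_level;		/* per listener; -1 = nothing delivered yet */
	unsigned int signalled;	/* stand-in for eventfd_signal() */
};

/* Logic as in the patch: one remembered level for the whole group. */
void notify_shared(struct model_event *evs, size_t n, int *shared_last,
		   int level)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (level < evs[i].level)
			continue;
		if (evs[i].edge_trigger && level == *shared_last)
			continue;
		evs[i].signalled++;
		*shared_last = level;
	}
}

/* Alternative: every listener remembers the level it last received. */
void notify_per_event(struct model_event *evs, size_t n, int level)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (level < evs[i].level)
			continue;
		if (evs[i].edge_trigger && level == evs[i].last_level)
			continue;
		evs[i].signalled++;
		evs[i].last_level = level;
	}
}
```

In the shared variant, a listener that is signalled earlier in the list
updates the remembered level, so an edge-triggered listener later in the
same pass sees "no change" and misses its first event. The per-listener
variant does not depend on list order.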

> +				continue;
>  			eventfd_signal(ev->efd, 1);
>  			signalled = true;
> +			vmpr->current_level = level;
>  		}
>  	}
>  
> @@ -290,9 +294,11 @@ void vmpressure_prio(gfp_t gfp, struct mem_cgroup
> *memcg, int prio)
>   *
>   * This function associates eventfd context with the vmpressure
>   * infrastructure, so that the notifications will be delivered to the
> - * @eventfd. The @args parameter is a string that denotes pressure level
> + * @eventfd. The @args parameters are a string that denotes pressure level
>   * threshold (one of vmpressure_str_levels, i.e. "low", "medium", or
> - * "critical").
> + * "critical") and a trigger option that decides whether events are
> triggered
> + * continuously or only on edge (0 or 1 if 1, events are triggered only
> when
> + * the level changes.
>   *
>   * This function should not be used directly, just pass it to (struct
>   * cftype).register_event, and then cgroup core will handle everything by
> @@ -303,14 +309,31 @@ int vmpressure_register_event(struct cgroup *cg,
> struct cftype *cft,
>  {
>  	struct vmpressure *vmpr = cg_to_vmpressure(cg);
>  	struct vmpressure_event *ev;
> -	int level;
> +	unsigned long trigger = 0;
> +	int level, i = 0;
> +	char *s[2], *p;
> +
> +	while ((p = strsep((char **)&args, " ")) != NULL) {
> +		if (!*p)
> +			continue;
> +		s[i++] = p;
> +
> +		/* Prevent from inputing more than 2 args */
> +		if (i == 2)
> +			break;
> +	}
> +
> +	if (i != 2)
> +		return -EINVAL;

Ouch, this is just ugly.

> +
> +	trigger = simple_strtoul(s[1], NULL, sizeof(s[1]));
>  
>  	for (level = 0; level < VMPRESSURE_NUM_LEVELS; level++) {
> -		if (!strcmp(vmpressure_str_levels[level], args))
> +		if (!strcmp(vmpressure_str_levels[level], s[0]))
>  			break;
>  	}
>  
> -	if (level >= VMPRESSURE_NUM_LEVELS)
> +	if (trigger > 1 || level >= VMPRESSURE_NUM_LEVELS)
>  		return -EINVAL;
>  
>  	ev = kzalloc(sizeof(*ev), GFP_KERNEL);
> @@ -319,6 +342,7 @@ int vmpressure_register_event(struct cgroup *cg, struct
> cftype *cft,
>  
>  	ev->efd = eventfd;
>  	ev->level = level;
> +	ev->edge_trigger = trigger;
>  
>  	mutex_lock(&vmpr->events_lock);
>  	list_add(&ev->node, &vmpr->events);
> @@ -371,4 +395,5 @@ void vmpressure_init(struct vmpressure *vmpr)
>  	mutex_init(&vmpr->events_lock);
>  	INIT_LIST_HEAD(&vmpr->events);
>  	INIT_WORK(&vmpr->work, vmpressure_work_fn);
> +	vmpr->current_level = -1;
>  }
> -- 
> 1.7.9.5
> 

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] memcg: event control at vmpressure.
  2013-06-11  6:21     ` Michal Hocko
  2013-06-11  8:49       ` [PATCH v2] " Hyunhee Kim
@ 2013-06-11 13:10       ` Luiz Capitulino
  2013-06-11 13:13       ` Pekka Enberg
  2 siblings, 0 replies; 12+ messages in thread
From: Luiz Capitulino @ 2013-06-11 13:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Anton Vorontsov, Hyunhee Kim, linux-mm, 'Kyungmin Park'

On Tue, 11 Jun 2013 08:21:24 +0200
Michal Hocko <mhocko@suse.cz> wrote:

> On Mon 10-06-13 17:17:47, Anton Vorontsov wrote:
> > On Mon, Jun 10, 2013 at 05:12:58PM +0200, Michal Hocko wrote:
> > > > +		if (level >= ev->level && level != vmpr->current_level) {
> > > >  			eventfd_signal(ev->efd, 1);
> > > >  			signalled = true;
> > > > +			vmpr->current_level = level;
> > > 
> > > This would mean that you send a signal for, say, VMPRESSURE_LOW, then
> > > the reclaim finishes and two days later when you hit the reclaim again
> > > you would simply miss the event, right?
> > > 
> > > So, unless I am missing something, then this is plain wrong.
> > 
> > > Yup, in its current version, it is not acceptable. For example, sometimes
> > we do want to see all the _LOW events, since _LOW level shows not just the
> > level itself, but the activity (i.e. reclaiming process).
> > 
> > There are a few ways to make both parties happy, though.
> > 
> > If the app wants to implement the time-based throttling, then just close
> > the fd and sleep for needed amount of time (or do not read from the
> > eventfd -- kernel then will just increment the eventfd counter, so there
> > won't be context switches at the least).
> 
> That makes sense to me.
> 
> > Doing the time-based throttling in the kernel won't buy us much, I
> > believe.
> 
> Yes.
>  
> > Or, if you still want the "one-shot"/"edge-triggered" events (which might
> > make perfect sense for medium and critical levels), then I'd propose to
> > add some additional flag when you register the event, so that the old
> > behaviour would be still available for those who need it. This approach I
> > think is the best one.
> 
> Hmm, how would one-shot even differ from a single open, register, read
> and close?

Agreed.

A different solution would be to have a simple state machine and notify
user-space on state transitions.


* Re: [PATCH] memcg: event control at vmpressure.
  2013-06-11  6:21     ` Michal Hocko
  2013-06-11  8:49       ` [PATCH v2] " Hyunhee Kim
  2013-06-11 13:10       ` [PATCH] " Luiz Capitulino
@ 2013-06-11 13:13       ` Pekka Enberg
  2 siblings, 0 replies; 12+ messages in thread
From: Pekka Enberg @ 2013-06-11 13:13 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Anton Vorontsov, Hyunhee Kim, linux-mm, Kyungmin Park

On Tue, Jun 11, 2013 at 9:21 AM, Michal Hocko <mhocko@suse.cz> wrote:
>> Or, if you still want the "one-shot"/"edge-triggered" events (which might
>> make perfect sense for medium and critical levels), then I'd propose to
>> add some additional flag when you register the event, so that the old
>> behaviour would be still available for those who need it. This approach I
>> think is the best one.
>
> Hmm, how would one-shot even differ from a single open, register, read
> and close?

Yup, one-shot probably doesn't make sense but edge-triggered does.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [PATCH v2] memcg: event control at vmpressure.
  2013-06-11 12:59         ` Michal Hocko
@ 2013-06-12  5:42           ` Hyunhee Kim
  2013-06-12 13:09             ` Michal Hocko
  0 siblings, 1 reply; 12+ messages in thread
From: Hyunhee Kim @ 2013-06-12  5:42 UTC (permalink / raw)
  To: 'Michal Hocko'
  Cc: 'Anton Vorontsov', linux-mm, 'Kyungmin Park'

Thanks for your comments.
I have replied inline below.

Thanks,
Hyunhee Kim.

-----Original Message-----
From: owner-linux-mm@kvack.org [mailto:owner-linux-mm@kvack.org] On Behalf
Of Michal Hocko
Sent: Tuesday, June 11, 2013 9:59 PM
To: Hyunhee Kim
Cc: 'Anton Vorontsov'; linux-mm@kvack.org; 'Kyungmin Park'
Subject: Re: [PATCH v2] memcg: event control at vmpressure.

On Tue 11-06-13 17:49:31, Hyunhee Kim wrote:
> In the original vmpressure, event is sent to the user space continuously
> until the memory state changes.

This is not correct AFAIU. Events are sent when the vm_pressure event is
triggered - aka when there is a reclaim activity.

> This becomes overheads to user space module
> and also consumes power consumption.

As Anton already pointed out, if there is nobody listening then no
events are triggered, so power consumption should not increase. If you
are under reclaim activity then your system is hardly idle anyway.
=> Right. I'll update the changelog.

> So, with this patch, vmpressure remembers the current level and only

I guess you meant "remembers the last level"
=> I agree that "last level" is better than "current level".
I'll rename current_level to last_level.

> sends the event only new memory state is different with the current
> level. This can be set when registering each event by writing a
> trigger option (0 or 1) next to the level.

What does 0 and what does 1 mean? I know I can go and check the code but
the changelog should better tell me without that.
=> I'll add more explanation to the changelog.

> Change-Id: Ie075b7c510a9cea8c4a092ac4fa4680248139371

Please do not add references to an internal tracking system.
=> Mistake. I'll remove it.

> Signed-off-by: Hyunhee Kim <hyunhee.kim@samsung.com>
> Reviewed-on: http://165.213.202.130:8080/55935
> Reviewed-by: Kyungmin Park <kyungmin.park@samsung.com>
> Tested-by: Kyungmin Park <kyungmin.park@samsung.com>
> ---
>  Documentation/cgroups/memory.txt |   10 ++++++++--
>  include/linux/vmpressure.h       |    2 ++
>  mm/vmpressure.c                  |   35 ++++++++++++++++++++++++++++++-----
>  3 files changed, 40 insertions(+), 7 deletions(-)
> 
> diff --git a/Documentation/cgroups/memory.txt
> b/Documentation/cgroups/memory.txt
> index ddf4f93..cc12aaa 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -791,6 +791,11 @@ way to trigger. Applications should do whatever they
> can to help the
>  system. It might be too late to consult with vmstat or any other
>  statistics, so it's advisable to take an immediate action.
>  
> +Events can be triggered continuously or only when the level changes.
> Trigger
> +option is decided by writing it next to level. If "0", events are sent
> +every time the reclaiming occurs. If "1", events are sent only when the
> level
> +is changed.
> +

The lines seem to be wrapped (maybe your email client does that).

Also what happens when somebody uses an existing application and `0' is
not added? The interface _has_ to be backward compatible. And is the
numeric interface appropriate at all?
=> I'll modify it to support backward compatibility. When no trigger option
is given, it will behave like the original vmpressure by default.

>  The events are propagated upward until the event is handled, i.e. the
>  events are not pass-through. Here is what this means: for example you have
>  three cgroups: A->B->C. Now you set up an event listener on cgroups A, B
> @@ -807,7 +812,8 @@ register a notification, an application must:
>  
>  - create an eventfd using eventfd(2);
>  - open memory.pressure_level;
> -- write string like "<event_fd> <fd of memory.pressure_level> <level>"
> +- write string like
> +	"<event_fd> <fd of memory.pressure_level> <level> <trigger_option>"
>    to cgroup.event_control.
>  
>  Application will be notified through eventfd when memory pressure is at
> @@ -823,7 +829,7 @@ Test:
>     # cd /sys/fs/cgroup/memory/
>     # mkdir foo
>     # cd foo
> -   # cgroup_event_listener memory.pressure_level low &
> +   # cgroup_event_listener memory.pressure_level low 0 &
>     # echo 8000000 > memory.limit_in_bytes
>     # echo 8000000 > memory.memsw.limit_in_bytes
>     # echo $$ > tasks
> diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
> index 76be077..fa0c0d2 100644
> --- a/include/linux/vmpressure.h
> +++ b/include/linux/vmpressure.h
> @@ -20,6 +20,8 @@ struct vmpressure {
>  	struct mutex events_lock;
>  
>  	struct work_struct work;
> +
> +	int current_level;

The name seems to be really inappropriate. This is the last_level in
fact, isn't it?
=> Yes.

>  };
>  
>  struct mem_cgroup;
> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> index 736a601..0ffed76 100644
> --- a/mm/vmpressure.c
> +++ b/mm/vmpressure.c
> @@ -137,6 +137,7 @@ static enum vmpressure_levels
> vmpressure_calc_level(unsigned long scanned,
>  struct vmpressure_event {
>  	struct eventfd_ctx *efd;
>  	enum vmpressure_levels level;
> +	unsigned long edge_trigger;

Unsigned long? Why? level is an int so there is a nice 4B hole between
level and edge_trigger. I would also suggest using something like bool.
Do we have more modes that could be used?
=> Right. I'll use bool instead of unsigned long.

>  	struct list_head node;
>  };
>  
> @@ -153,8 +154,11 @@ static bool vmpressure_event(struct vmpressure *vmpr,
>  
>  	list_for_each_entry(ev, &vmpr->events, node) {
>  		if (level >= ev->level) {
> +			if (ev->edge_trigger && level ==
> vmpr->current_level)

Email client again.
But what confuses me is that the current_level is shared for all events
for the pressure group. Is this correct?
=> I think that it is correct. event lists are kept in vmpressure and so
the last level keeps one of them. Isn't it?

> +				continue;
>  			eventfd_signal(ev->efd, 1);
>  			signalled = true;
> +			vmpr->current_level = level;
>  		}
>  	}
>  
> @@ -290,9 +294,11 @@ void vmpressure_prio(gfp_t gfp, struct mem_cgroup
> *memcg, int prio)
>   *
>   * This function associates eventfd context with the vmpressure
>   * infrastructure, so that the notifications will be delivered to the
> - * @eventfd. The @args parameter is a string that denotes pressure level
> + * @eventfd. The @args parameters are a string that denotes pressure level
>   * threshold (one of vmpressure_str_levels, i.e. "low", "medium", or
> - * "critical").
> + * "critical") and a trigger option that decides whether events are triggered
> + * continuously or only on edge (0 or 1 if 1, events are triggered only when
> + * the level changes.
>   *
>   * This function should not be used directly, just pass it to (struct
>   * cftype).register_event, and then cgroup core will handle everything by
> @@ -303,14 +309,31 @@ int vmpressure_register_event(struct cgroup *cg,
> struct cftype *cft,
>  {
>  	struct vmpressure *vmpr = cg_to_vmpressure(cg);
>  	struct vmpressure_event *ev;
> -	int level;
> +	unsigned long trigger = 0;
> +	int level, i = 0;
> +	char *s[2], *p;
> +
> +	while ((p = strsep((char **)&args, " ")) != NULL) {
> +		if (!*p)
> +			continue;
> +		s[i++] = p;
> +
> +		/* Prevent from inputing more than 2 args */
> +		if (i == 2)
> +			break;
> +	}
> +
> +	if (i != 2)
> +		return -EINVAL;

Ouch, this is just ugly.

=> Since I'll parse only one argument (the original format) or two (the new
format), I think we can ignore the rest of the input and remove this check.
Is that okay?

> +
> +	trigger = simple_strtoul(s[1], NULL, sizeof(s[1]));
>  
>  	for (level = 0; level < VMPRESSURE_NUM_LEVELS; level++) {
> -		if (!strcmp(vmpressure_str_levels[level], args))
> +		if (!strcmp(vmpressure_str_levels[level], s[0]))
>  			break;
>  	}
>  
> -	if (level >= VMPRESSURE_NUM_LEVELS)
> +	if (trigger > 1 || level >= VMPRESSURE_NUM_LEVELS)
>  		return -EINVAL;
>  
>  	ev = kzalloc(sizeof(*ev), GFP_KERNEL);
> @@ -319,6 +342,7 @@ int vmpressure_register_event(struct cgroup *cg, struct
> cftype *cft,
>  
>  	ev->efd = eventfd;
>  	ev->level = level;
> +	ev->edge_trigger = trigger;
>  
>  	mutex_lock(&vmpr->events_lock);
>  	list_add(&ev->node, &vmpr->events);
> @@ -371,4 +395,5 @@ void vmpressure_init(struct vmpressure *vmpr)
>  	mutex_init(&vmpr->events_lock);
>  	INIT_LIST_HEAD(&vmpr->events);
>  	INIT_WORK(&vmpr->work, vmpressure_work_fn);
> +	vmpr->current_level = -1;
>  }
> -- 
> 1.7.9.5
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2] memcg: event control at vmpressure.
  2013-06-12  5:42           ` Hyunhee Kim
@ 2013-06-12 13:09             ` Michal Hocko
  0 siblings, 0 replies; 12+ messages in thread
From: Michal Hocko @ 2013-06-12 13:09 UTC (permalink / raw)
  To: Hyunhee Kim; +Cc: 'Anton Vorontsov', linux-mm, 'Kyungmin Park'

[Please try to convince your email client to quote properly.
I have fixed it this time but I won't do it again in the future.]

On Wed 12-06-13 14:42:41, Hyunhee Kim wrote:
> Thanks for your comments.
> I have replied inline below.
> 
> Thanks,
> Hyunhee Kim.
> 
> > -----Original Message-----
> > From: owner-linux-mm@kvack.org [mailto:owner-linux-mm@kvack.org] On Behalf
> > Of Michal Hocko
> > Sent: Tuesday, June 11, 2013 9:59 PM
> > To: Hyunhee Kim
> > Cc: 'Anton Vorontsov'; linux-mm@kvack.org; 'Kyungmin Park'
> > Subject: Re: [PATCH v2] memcg: event control at vmpressure.
> > 
> > On Tue 11-06-13 17:49:31, Hyunhee Kim wrote:
> > > In the original vmpressure, event is sent to the user space continuously
> > > until the memory state changes.
> > 
> > This is not correct AFAIU. Events are sent when the vm_pressure event is
> > triggered - aka when there is a reclaim activity.
> > 
> > > This becomes overheads to user space module
> > > and also consumes power consumption.
> > 
> > As Anton already pointed out, if there is nobody listening then no
> > events are triggered, so power consumption should not increase. If you
> > are under reclaim activity then your system is hardly idle anyway.
>
> Right. I'll update the changelog.
> 
> > > So, with this patch, vmpressure remembers the current level and only
> 
> > I guess you meant "remembers the last level"
>
> I agree that "last level" is better than "current level".  I'll rename
> current_level to last_level.
> 
> > > sends the event only new memory state is different with the current
> > > level. This can be set when registering each event by writing a
> > > trigger option (0 or 1) next to the level.
> > 
> > What does 0 and what does 1 mean? I know I can go and check the code but
> > the changelog should better tell me without that.
> I'll add more explanation to the changelog.
> 
> > > Change-Id: Ie075b7c510a9cea8c4a092ac4fa4680248139371
> 
> > Please do not add references to an internal tracking system.
>
> Mistake. I'll remove it.
> 
> > > Signed-off-by: Hyunhee Kim <hyunhee.kim@samsung.com>
> > > Reviewed-on: http://165.213.202.130:8080/55935
> > > Reviewed-by: Kyungmin Park <kyungmin.park@samsung.com>
> > > Tested-by: Kyungmin Park <kyungmin.park@samsung.com>
> > > ---
> > >  Documentation/cgroups/memory.txt |   10 ++++++++--
> > >  include/linux/vmpressure.h       |    2 ++
> > >  mm/vmpressure.c                  |   35 ++++++++++++++++++++++++++++++-----
> > >  3 files changed, 40 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/Documentation/cgroups/memory.txt
> > > b/Documentation/cgroups/memory.txt
> > > index ddf4f93..cc12aaa 100644
> > > --- a/Documentation/cgroups/memory.txt
> > > +++ b/Documentation/cgroups/memory.txt
> > > @@ -791,6 +791,11 @@ way to trigger. Applications should do whatever they
> > > can to help the
> > >  system. It might be too late to consult with vmstat or any other
> > >  statistics, so it's advisable to take an immediate action.
> > >  
> > > +Events can be triggered continuously or only when the level changes.
> > > Trigger
> > > +option is decided by writing it next to level. If "0", events are sent
> > > +every time the reclaiming occurs. If "1", events are sent only when the
> > > level
> > > +is changed.
> > > +
> > 
> > The lines seem to be wrapped (maybe your email client does that).
> > 
> > Also what happens when somebody uses an existing application and `0' is
> > not added? The interface _has_ to be backward compatible. And is the
> > numeric interface appropriate at all?
>
> I'll modify it to support backward compatibility. When no trigger option
> is given, it will behave like the original vmpressure by default.

Please also think about the interface as well. The level is provided as
a string value so it would be good if the new parameter was done the
same way.

> > >  The events are propagated upward until the event is handled, i.e. the
> > >  events are not pass-through. Here is what this means: for example you have
> > >  three cgroups: A->B->C. Now you set up an event listener on cgroups A, B
> > > @@ -807,7 +812,8 @@ register a notification, an application must:
> > >  
> > >  - create an eventfd using eventfd(2);
> > >  - open memory.pressure_level;
> > > -- write string like "<event_fd> <fd of memory.pressure_level> <level>"
> > > +- write string like
> > > +	"<event_fd> <fd of memory.pressure_level> <level> <trigger_option>"
> > >    to cgroup.event_control.
> > >  
> > >  Application will be notified through eventfd when memory pressure is at
> > > @@ -823,7 +829,7 @@ Test:
> > >     # cd /sys/fs/cgroup/memory/
> > >     # mkdir foo
> > >     # cd foo
> > > -   # cgroup_event_listener memory.pressure_level low &
> > > +   # cgroup_event_listener memory.pressure_level low 0 &
> > >     # echo 8000000 > memory.limit_in_bytes
> > >     # echo 8000000 > memory.memsw.limit_in_bytes
> > >     # echo $$ > tasks
> > > diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
> > > index 76be077..fa0c0d2 100644
> > > --- a/include/linux/vmpressure.h
> > > +++ b/include/linux/vmpressure.h
> > > @@ -20,6 +20,8 @@ struct vmpressure {
> > >  	struct mutex events_lock;
> > >  
> > >  	struct work_struct work;
> > > +
> > > +	int current_level;
> > 
> > The name seems to be really inappropriate. This is the last_level in
> > fact, isn't it?
>
> Yes.
> 
> > >  };
> > >  
> > >  struct mem_cgroup;
> > > diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> > > index 736a601..0ffed76 100644
> > > --- a/mm/vmpressure.c
> > > +++ b/mm/vmpressure.c
> > > @@ -137,6 +137,7 @@ static enum vmpressure_levels
> > > vmpressure_calc_level(unsigned long scanned,
> > >  struct vmpressure_event {
> > >  	struct eventfd_ctx *efd;
> > >  	enum vmpressure_levels level;
> > > +	unsigned long edge_trigger;
> > 
> > Unsigned long? Why? level is an int so there is a nice 4B hole between
> > level and edge_trigger. I would also suggest using something like bool.
> > Do we have more modes that could be used?
>
> Right. I'll use bool instead of unsigned long.
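
The "4B hole" remark can be checked with offsetof(). The following is a
user-space sketch under stated assumptions: struct list_node stands in for
the kernel's struct list_head, void * stands in for struct eventfd_ctx *,
and int stands in for the enum. The struct names and the HOLE() macro are
hypothetical. The 4-byte figure applies to a typical LP64 target, where
unsigned long is 8 bytes and 8-byte aligned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the kernel's struct list_head. */
struct list_node {
	struct list_node *next, *prev;
};

/* Layout as in the posted patch. */
struct event_ulong {
	void *efd;
	int level;
	unsigned long edge_trigger;
	struct list_node node;
};

/* Layout with the suggested bool. */
struct event_bool {
	void *efd;
	int level;
	bool edge_trigger;
	struct list_node node;
};

/* Padding bytes between the end of 'level' and the start of 'edge_trigger'. */
#define HOLE(type) \
	(offsetof(type, edge_trigger) - offsetof(type, level) - sizeof(int))

/* A bool has byte alignment, so it can follow the int directly. */
static_assert(HOLE(struct event_bool) == 0, "bool should follow level directly");
```

On such a target HOLE(struct event_ulong) evaluates to 4, and the bool
variant also makes the whole structure smaller because the flag fits into
space that would otherwise be padding.
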
> 
> > >  	struct list_head node;
> > >  };
> > >  
> > > @@ -153,8 +154,11 @@ static bool vmpressure_event(struct vmpressure *vmpr,
> > >  
> > >  	list_for_each_entry(ev, &vmpr->events, node) {
> > >  		if (level >= ev->level) {
> > > +			if (ev->edge_trigger && level ==
> > > vmpr->current_level)
> > 
> > Email client again.
> > But what confuses me is that the current_level is shared for all events
> > for the pressure group. Is this correct?
>
> I think that it is correct. event lists are kept in vmpressure and so
> the last level keeps one of them. Isn't it?

I do not understand what you are trying to say here.
What are the semantics when there are multiple events registered? I do not
think it is correct to signal only the first one. This could be fixed
easily by setting the last level only after all registered events have
been signaled.
But even then I am not sure what should happen if a new event is
registered _after_ somebody has been signaled already. It would be the
first time such an event happened for the new ev but it doesn't get
signaled. Is this really expected behavior?
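
One possible answer to both questions, sketched in user space and not
taken from any posted version of the patch: keep the last seen level per
registered event instead of per vmpressure. Every listener is then
evaluated independently, and a freshly registered listener starts at -1 so
its first matching event is delivered. The names pressure_event,
event_init() and deliver() are hypothetical, an array replaces the kernel
list, and a counter replaces eventfd_signal().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct pressure_event {
	int level;		/* threshold the listener registered for */
	bool edge_trigger;	/* true: notify only when the level changes */
	int last_level;		/* last level seen by this listener, -1 if none */
	unsigned int signals;	/* stand-in for eventfd_signal() calls */
};

static void event_init(struct pressure_event *ev, int level, bool edge)
{
	assert(ev);
	ev->level = level;
	ev->edge_trigger = edge;
	ev->last_level = -1;
	ev->signals = 0;
}

/* Deliver @level to every listener; return true if anyone was signalled. */
static bool deliver(struct pressure_event *evs, size_t n, int level)
{
	bool signalled = false;

	assert(evs || n == 0);
	for (size_t i = 0; i < n; i++) {
		struct pressure_event *ev = &evs[i];
		bool changed = level != ev->last_level;

		/* Track the level even when it is below the threshold. */
		ev->last_level = level;
		if (level < ev->level)
			continue;
		if (ev->edge_trigger && !changed)
			continue;
		ev->signals++;
		signalled = true;
	}
	return signalled;
}
```

The cost of this choice is one extra int per registered event; whether
that matches the intended semantics is exactly the open question in the
review.
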

> > > +				continue;
> > >  			eventfd_signal(ev->efd, 1);
> > >  			signalled = true;
> > > +			vmpr->current_level = level;
> > >  		}
> > >  	}
> > >  
> > > @@ -290,9 +294,11 @@ void vmpressure_prio(gfp_t gfp, struct mem_cgroup
> > > *memcg, int prio)
> > >   *
> > >   * This function associates eventfd context with the vmpressure
> > >   * infrastructure, so that the notifications will be delivered to the
> > > - * @eventfd. The @args parameter is a string that denotes pressure level
> > > + * @eventfd. The @args parameters are a string that denotes pressure level
> > >   * threshold (one of vmpressure_str_levels, i.e. "low", "medium", or
> > > - * "critical").
> > > + * "critical") and a trigger option that decides whether events are triggered
> > > + * continuously or only on edge (0 or 1 if 1, events are triggered only when
> > > + * the level changes.
> > >   *
> > >   * This function should not be used directly, just pass it to (struct
> > >   * cftype).register_event, and then cgroup core will handle everything by
> > > @@ -303,14 +309,31 @@ int vmpressure_register_event(struct cgroup *cg,
> > > struct cftype *cft,
> > >  {
> > >  	struct vmpressure *vmpr = cg_to_vmpressure(cg);
> > >  	struct vmpressure_event *ev;
> > > -	int level;
> > > +	unsigned long trigger = 0;
> > > +	int level, i = 0;
> > > +	char *s[2], *p;
> > > +
> > > +	while ((p = strsep((char **)&args, " ")) != NULL) {
> > > +		if (!*p)
> > > +			continue;
> > > +		s[i++] = p;
> > > +
> > > +		/* Prevent from inputing more than 2 args */
> > > +		if (i == 2)
> > > +			break;
> > > +	}
> > > +
> > > +	if (i != 2)
> > > +		return -EINVAL;
> > 
> > Ouch, this is just ugly.
> > 
> Since I'll parse only one argument (the original format) or two (the
> new format), I think we can ignore the rest of the input and remove
> this check.  Is that okay?

What about doing it properly instead? sscanf?
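
A rough user-space sketch of what an sscanf-based parser could look like,
combining it with the earlier request for a string-valued parameter and
for backward compatibility. The mode names "edge" and "continuous", the
function name parse_args() and the buffer sizes are assumptions made for
illustration; nothing in the thread fixes them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

static const char * const level_names[] = { "low", "medium", "critical" };

/*
 * Parse "<level>" or "<level> <mode>". Returns 0 and fills *level and
 * *edge on success, -1 on malformed input. A missing mode selects the
 * original (continuous) behaviour, so old callers keep working.
 */
static int parse_args(const char *args, int *level, bool *edge)
{
	char lvl[16], mode[16], extra;
	bool want_edge = false;
	int n, i;

	assert(args && level && edge);

	/* The trailing %c only matches if there is a third token. */
	n = sscanf(args, "%15s %15s %c", lvl, mode, &extra);
	if (n != 1 && n != 2)
		return -1;

	if (n == 2) {
		if (!strcmp(mode, "edge"))
			want_edge = true;
		else if (strcmp(mode, "continuous"))
			return -1;
	}

	for (i = 0; i < 3; i++) {
		if (!strcmp(level_names[i], lvl)) {
			*level = i;
			*edge = want_edge;
			return 0;
		}
	}
	return -1;
}
```

Compared with the strsep() loop, the accepted formats are visible in one
format string, and trailing garbage is rejected instead of being silently
ignored.
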
 
> > > +
> > > +	trigger = simple_strtoul(s[1], NULL, sizeof(s[1]));
> > >  
> > >  	for (level = 0; level < VMPRESSURE_NUM_LEVELS; level++) {
> > > -		if (!strcmp(vmpressure_str_levels[level], args))
> > > +		if (!strcmp(vmpressure_str_levels[level], s[0]))
> > >  			break;
> > >  	}
> > >  
> > > -	if (level >= VMPRESSURE_NUM_LEVELS)
> > > +	if (trigger > 1 || level >= VMPRESSURE_NUM_LEVELS)
> > >  		return -EINVAL;
> > >  
> > >  	ev = kzalloc(sizeof(*ev), GFP_KERNEL);
> > > @@ -319,6 +342,7 @@ int vmpressure_register_event(struct cgroup *cg, struct
> > > cftype *cft,
> > >  
> > >  	ev->efd = eventfd;
> > >  	ev->level = level;
> > > +	ev->edge_trigger = trigger;
> > >  
> > >  	mutex_lock(&vmpr->events_lock);
> > >  	list_add(&ev->node, &vmpr->events);
> > > @@ -371,4 +395,5 @@ void vmpressure_init(struct vmpressure *vmpr)
> > >  	mutex_init(&vmpr->events_lock);
> > >  	INIT_LIST_HEAD(&vmpr->events);
> > >  	INIT_WORK(&vmpr->work, vmpressure_work_fn);
> > > +	vmpr->current_level = -1;
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2013-06-12 13:09 UTC | newest]

Thread overview: 12+ messages
2013-06-10 11:14 [PATCH] memcg: event control at vmpressure Hyunhee Kim
2013-06-10 14:09 ` Luiz Capitulino
2013-06-10 15:12 ` Michal Hocko
2013-06-11  0:17   ` Anton Vorontsov
2013-06-11  1:01     ` Kyungmin Park
2013-06-11  6:21     ` Michal Hocko
2013-06-11  8:49       ` [PATCH v2] " Hyunhee Kim
2013-06-11 12:59         ` Michal Hocko
2013-06-12  5:42           ` Hyunhee Kim
2013-06-12 13:09             ` Michal Hocko
2013-06-11 13:10       ` [PATCH] " Luiz Capitulino
2013-06-11 13:13       ` Pekka Enberg
