* [PATCH mmotm] memcg: unmap KM_USER0 at shmem_map_and_free_swp if do_swap_account
@ 2008-11-18 9:07 Daisuke Nishimura
2008-11-18 9:26 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 13+ messages in thread
From: Daisuke Nishimura @ 2008-11-18 9:07 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, KAMEZAWA Hiroyuki, Balbir Singh, Pavel Emelianov,
Hugh Dickins, Li Zefan, nishimura
memswap controller uses KM_USER0 at swap_cgroup_record and lookup_swap_cgroup.
But delete_from_swap_cache, which eventually calls swap_cgroup_record, can be
called with KM_USER0 mapped in case of shmem.
So it should be unmapped before calling it.
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
After this patch, I think the memswap controller on x86_32 will be
on the same level as that on x86_64.
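For reference, the problematic nesting looks roughly like this (a sketch of
the call chain as I understand it; shmem_dir_map()/shmem_dir_unmap() wrap
kmap_atomic()/kunmap_atomic() on KM_USER0 when CONFIG_HIGHMEM is set, and the
path below shmem_free_swp() is my reading of the code):

  shmem_truncate_range()
    dir = shmem_dir_map(...)            /* kmap_atomic(..., KM_USER0) */
    shmem_map_and_free_swp()
      shmem_free_swp()
        free_swap_and_cache()
          delete_from_swap_cache()
            ...
              swap_cgroup_record()
                kmap_atomic(..., KM_USER0)   /* reuses the same slot while
                                                the dir mapping is still live */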
mm/shmem.c | 23 +++++++++++++++++++++++
1 files changed, 23 insertions(+), 0 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index bee8612..7aebc1b 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -171,6 +171,28 @@ static inline void shmem_unacct_size(unsigned long flags, loff_t size)
vm_unacct_memory(VM_ACCT(size));
}
+#if defined(CONFIG_CGROUP_MEM_RES_CTLR_SWAP) && defined(CONFIG_HIGHMEM)
+/*
+ * memswap controller uses KM_USER0, so dir should be unmapped
+ * before calling delete_from_swap_cache.
+ */
+static inline void swap_cgroup_map_prepare(struct page ***dir)
+{
+ if (!do_swap_account)
+ return;
+
+ if (*dir) {
+ shmem_dir_unmap(*dir);
+ *dir = NULL;
+ }
+}
+#else
+static inline void swap_cgroup_map_prepare(struct page ***dir)
+{
+ return;
+}
+#endif
+
/*
* ... whereas tmpfs objects are accounted incrementally as
* pages are allocated, in order to allow huge sparse files.
@@ -479,6 +501,7 @@ static int shmem_map_and_free_swp(struct page *subdir, int offset,
int size = limit - offset;
if (size > LATENCY_LIMIT)
size = LATENCY_LIMIT;
+ swap_cgroup_map_prepare(dir);
freed += shmem_free_swp(ptr+offset, ptr+offset+size,
punch_lock);
if (need_resched()) {
--
* Re: [PATCH mmotm] memcg: unmap KM_USER0 at shmem_map_and_free_swp if do_swap_account
2008-11-18 9:07 [PATCH mmotm] memcg: unmap KM_USER0 at shmem_map_and_free_swp if do_swap_account Daisuke Nishimura
@ 2008-11-18 9:26 ` KAMEZAWA Hiroyuki
2008-11-18 10:21 ` Daisuke Nishimura
0 siblings, 1 reply; 13+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-18 9:26 UTC (permalink / raw)
To: Daisuke Nishimura
Cc: Andrew Morton, linux-mm, Balbir Singh, Pavel Emelianov,
Hugh Dickins, Li Zefan
On Tue, 18 Nov 2008 18:07:21 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> memswap controller uses KM_USER0 at swap_cgroup_record and lookup_swap_cgroup.
>
> But delete_from_swap_cache, which eventually calls swap_cgroup_record, can be
> called with KM_USER0 mapped in case of shmem.
>
> So it should be unmapped before calling it.
>
> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Nice Catch!
Thank you for tests and patches.
brief comment below.
> ---
> After this patch, I think memswap controller of x86_32 will be
> on the same level with that of x86_64.
>
> mm/shmem.c | 23 +++++++++++++++++++++++
> 1 files changed, 23 insertions(+), 0 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index bee8612..7aebc1b 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -171,6 +171,28 @@ static inline void shmem_unacct_size(unsigned long flags, loff_t size)
> vm_unacct_memory(VM_ACCT(size));
> }
>
> +#if defined(CONFIG_CGROUP_MEM_RES_CTLR_SWAP) && defined(CONFIG_HIGHMEM)
> +/*
> + * memswap controller uses KM_USER0, so dir should be unmapped
> + * before calling delete_from_swap_cache.
> + */
> +static inline void swap_cgroup_map_prepare(struct page ***dir)
> +{
> + if (!do_swap_account)
> + return;
> +
> + if (*dir) {
> + shmem_dir_unmap(*dir);
> + *dir = NULL;
> + }
> +}
> +#else
> +static inline void swap_cgroup_map_prepare(struct page ***dir)
> +{
> + return;
> +}
> +#endif
> +
> /*
> * ... whereas tmpfs objects are accounted incrementally as
> * pages are allocated, in order to allow huge sparse files.
> @@ -479,6 +501,7 @@ static int shmem_map_and_free_swp(struct page *subdir, int offset,
> int size = limit - offset;
> if (size > LATENCY_LIMIT)
> size = LATENCY_LIMIT;
> + swap_cgroup_map_prepare(dir);
I think putting this before the "for() loop" is better.
Thanks,
-Kame
> freed += shmem_free_swp(ptr+offset, ptr+offset+size,
> punch_lock);
> if (need_resched()) {
>
--
* Re: [PATCH mmotm] memcg: unmap KM_USER0 at shmem_map_and_free_swp if do_swap_account
2008-11-18 9:26 ` KAMEZAWA Hiroyuki
@ 2008-11-18 10:21 ` Daisuke Nishimura
2008-11-18 12:08 ` Daisuke Nishimura
0 siblings, 1 reply; 13+ messages in thread
From: Daisuke Nishimura @ 2008-11-18 10:21 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: nishimura, Andrew Morton, linux-mm, Balbir Singh,
Pavel Emelianov, Hugh Dickins, Li Zefan
On Tue, 18 Nov 2008 18:26:37 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 18 Nov 2008 18:07:21 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
>
> > memswap controller uses KM_USER0 at swap_cgroup_record and lookup_swap_cgroup.
> >
> > But delete_from_swap_cache, which eventually calls swap_cgroup_record, can be
> > called with KM_USER0 mapped in case of shmem.
> >
> > So it should be unmapped before calling it.
> >
> > Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> Nice Catch!
>
> Thank you for tests and patches.
> brief comment below.
>
> > ---
> > After this patch, I think memswap controller of x86_32 will be
> > on the same level with that of x86_64.
> >
> > mm/shmem.c | 23 +++++++++++++++++++++++
> > 1 files changed, 23 insertions(+), 0 deletions(-)
> >
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index bee8612..7aebc1b 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -171,6 +171,28 @@ static inline void shmem_unacct_size(unsigned long flags, loff_t size)
> > vm_unacct_memory(VM_ACCT(size));
> > }
> >
> > +#if defined(CONFIG_CGROUP_MEM_RES_CTLR_SWAP) && defined(CONFIG_HIGHMEM)
> > +/*
> > + * memswap controller uses KM_USER0, so dir should be unmapped
> > + * before calling delete_from_swap_cache.
> > + */
> > +static inline void swap_cgroup_map_prepare(struct page ***dir)
> > +{
> > + if (!do_swap_account)
> > + return;
> > +
> > + if (*dir) {
> > + shmem_dir_unmap(*dir);
> > + *dir = NULL;
> > + }
> > +}
> > +#else
> > +static inline void swap_cgroup_map_prepare(struct page ***dir)
> > +{
> > + return;
> > +}
> > +#endif
> > +
> > /*
> > * ... whereas tmpfs objects are accounted incrementally as
> > * pages are allocated, in order to allow huge sparse files.
> > @@ -479,6 +501,7 @@ static int shmem_map_and_free_swp(struct page *subdir, int offset,
> > int size = limit - offset;
> > if (size > LATENCY_LIMIT)
> > size = LATENCY_LIMIT;
> > + swap_cgroup_map_prepare(dir);
> I think put this before "for() loop" is better.
>
Makes sense.
I'll post an updated version later.
Thanks,
Daisuke Nishimura.
--
* [PATCH mmotm] memcg: unmap KM_USER0 at shmem_map_and_free_swp if do_swap_account
2008-11-18 10:21 ` Daisuke Nishimura
@ 2008-11-18 12:08 ` Daisuke Nishimura
2008-11-18 12:48 ` Hugh Dickins
0 siblings, 1 reply; 13+ messages in thread
From: Daisuke Nishimura @ 2008-11-18 12:08 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, Balbir Singh, KAMEZAWA Hiroyuki, Pavel Emelianov,
Hugh Dickins, Li Zefan, nishimura
memswap controller uses KM_USER0 at swap_cgroup_record and lookup_swap_cgroup.
But delete_from_swap_cache, which eventually calls swap_cgroup_record, can be
called with KM_USER0 mapped in case of shmem.
So it should be unmapped before calling it.
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
After this patch, I think the memswap controller on x86_32 will be
nearly on the same level as that on x86_64.
mm/shmem.c | 23 +++++++++++++++++++++++
1 files changed, 23 insertions(+), 0 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index bee8612..09447b9 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -171,6 +171,28 @@ static inline void shmem_unacct_size(unsigned long flags, loff_t size)
vm_unacct_memory(VM_ACCT(size));
}
+#if defined(CONFIG_CGROUP_MEM_RES_CTLR_SWAP) && defined(CONFIG_HIGHMEM)
+/*
+ * memswap controller uses KM_USER0, so dir should be unmapped
+ * before calling delete_from_swap_cache.
+ */
+static inline void swap_cgroup_map_prepare(struct page ***dir)
+{
+ if (!do_swap_account)
+ return;
+
+ if (*dir) {
+ shmem_dir_unmap(*dir);
+ *dir = NULL;
+ }
+}
+#else
+static inline void swap_cgroup_map_prepare(struct page ***dir)
+{
+ return;
+}
+#endif
+
/*
* ... whereas tmpfs objects are accounted incrementally as
* pages are allocated, in order to allow huge sparse files.
@@ -474,6 +496,7 @@ static int shmem_map_and_free_swp(struct page *subdir, int offset,
swp_entry_t *ptr;
int freed = 0;
+ swap_cgroup_map_prepare(dir);
ptr = shmem_swp_map(subdir);
for (; offset < limit; offset += LATENCY_LIMIT) {
int size = limit - offset;
--
* Re: [PATCH mmotm] memcg: unmap KM_USER0 at shmem_map_and_free_swp if do_swap_account
2008-11-18 12:08 ` Daisuke Nishimura
@ 2008-11-18 12:48 ` Hugh Dickins
2008-11-18 15:17 ` Daisuke Nishimura
0 siblings, 1 reply; 13+ messages in thread
From: Hugh Dickins @ 2008-11-18 12:48 UTC (permalink / raw)
To: Daisuke Nishimura
Cc: Andrew Morton, linux-mm, Balbir Singh, KAMEZAWA Hiroyuki,
Pavel Emelianov, Li Zefan
On Tue, 18 Nov 2008, Daisuke Nishimura wrote:
> memswap controller uses KM_USER0 at swap_cgroup_record and lookup_swap_cgroup.
>
> But delete_from_swap_cache, which eventually calls swap_cgroup_record, can be
> called with KM_USER0 mapped in case of shmem.
>
> So it should be unmapped before calling it.
Excellent find, but ...
>
> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> ---
> After this patch, I think memswap controller of x86_32 will be
> on near level with that of x86_64.
>
> mm/shmem.c | 23 +++++++++++++++++++++++
> 1 files changed, 23 insertions(+), 0 deletions(-)
... sorry, no, please don't go around unmapping other people's kmaps
like this. If the memswap controller needs its own kmap_atomic()s at
a level below other users, then it needs to define a new KM_MEMSWAP
in arch/*/include/asm/kmap_types.h and include/asm-*/kmap_types.h.
That's a lot of files which you may not wish to update to get working
right now: I think page_cgroup.c can _probably_ reuse KM_PTE1 as a
temporary measure, but please verify that's safe first.
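Roughly, the KM_MEMSWAP route would mean adding one slot per architecture,
something like the sketch below (only a sketch of the idea; the neighbouring
entries shown are x86's, other architectures differ, and some wrap each entry
in a debug macro):

  enum km_type {
  	...
  	KM_SOFTIRQ0,
  	KM_SOFTIRQ1,
  	KM_MEMSWAP,	/* new: private slot for the memswap controller */
  	KM_TYPE_NR
  };

with swap_cgroup_record()/lookup_swap_cgroup() then using
kmap_atomic(mappage, KM_MEMSWAP) instead of KM_USER0.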
Hugh
--
* Re: [PATCH mmotm] memcg: unmap KM_USER0 at shmem_map_and_free_swp if do_swap_account
2008-11-18 12:48 ` Hugh Dickins
@ 2008-11-18 15:17 ` Daisuke Nishimura
2008-11-18 16:12 ` [PATCH mmotm] memcg: avoid using buggy kmap at swap_cgroup KAMEZAWA Hiroyuki
0 siblings, 1 reply; 13+ messages in thread
From: Daisuke Nishimura @ 2008-11-18 15:17 UTC (permalink / raw)
To: Hugh Dickins
Cc: Daisuke Nishimura, Andrew Morton, linux-mm, Balbir Singh,
KAMEZAWA Hiroyuki, Pavel Emelianov, Li Zefan
On Tue, 18 Nov 2008 12:48:52 +0000 (GMT)
Hugh Dickins <hugh@veritas.com> wrote:
> On Tue, 18 Nov 2008, Daisuke Nishimura wrote:
>
> > memswap controller uses KM_USER0 at swap_cgroup_record and lookup_swap_cgroup.
> >
> > But delete_from_swap_cache, which eventually calls swap_cgroup_record, can be
> > called with KM_USER0 mapped in case of shmem.
> >
> > So it should be unmapped before calling it.
>
> Excellent find, but ...
>
> >
> > Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> > ---
> > After this patch, I think memswap controller of x86_32 will be
> > on near level with that of x86_64.
> >
> > mm/shmem.c | 23 +++++++++++++++++++++++
> > 1 files changed, 23 insertions(+), 0 deletions(-)
>
> ... sorry, no, please don't go around unmapping other people's kmaps
> like this. If the memswap controller needs its own kmap_atomic()s at
> a level below other users, then it needs to define a new KM_MEMSWAP
> in arch/*/include/asm/kmap_types.h and include/asm-*/kmap_types.h.
>
> That's a lot of files which you may not wish to update to get working
> right now: I think page_cgroup.c can _probably_ reuse KM_PTE1 as a
> temporary measure, but please verify that's safe first.
>
Thank you for your comment.
Hmm, shmem_map_and_free_swp might unmap dir anyway, and the caller
(shmem_truncate_range) handles that case, but I agree it's bad manners
to unmap other people's kmaps.
I'll think about it some more.
(I don't have enough time till Friday, though.)
Thanks,
Daisuke Nishimura.
--
* [PATCH mmotm] memcg: avoid using buggy kmap at swap_cgroup
2008-11-18 15:17 ` Daisuke Nishimura
@ 2008-11-18 16:12 ` KAMEZAWA Hiroyuki
2008-11-18 16:40 ` Hugh Dickins
0 siblings, 1 reply; 13+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-18 16:12 UTC (permalink / raw)
To: nishimura
Cc: Hugh Dickins, Andrew Morton, linux-mm, Balbir Singh,
KAMEZAWA Hiroyuki, Pavel Emelianov, LiZefan
[-- Attachment #1: Type: text/plain, Size: 912 bytes --]
Daisuke Nishimura said:
> On Tue, 18 Nov 2008 12:48:52 +0000 (GMT)
> Hugh Dickins <hugh@veritas.com> wrote:
>> That's a lot of files which you may not wish to update to get working
>> right now: I think page_cgroup.c can _probably_ reuse KM_PTE1 as a
>> temporary measure, but please verify that's safe first.
>>
> Thank you for your comment.
>
> Hmm, shmem_map_and_free_swp might unmap dir anyway and caller
> (shmem_trancate_range) handles the case, but I do agree it's
> not good manner to unmap other people's kmaps.
>
Hmm...Sorry for my original implementation.
Okay, how about this direction ?
1. at first, remove kmap_atomic from page_cgroup.c and use GFP_KERNEL
to allocate buffer.
2. later, add kmap_atomic + HighMem buffer support in explicit style.
maybe KM_BOUNCE_READ...can be used.....
A patch for the BUGFIX is attached.
(Sorry, I have to use webmail and can't send it inline.)
Sorry,
-Kame
[-- Attachment #2: swapcg-kmap-fix.patch --]
[-- Type: application/octet-stream, Size: 1418 bytes --]
swap_cgroup's kmap logic conflicts with shmem's kmap logic.
Avoid using HIGHMEM for now and revisit this later.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
mm/page_cgroup.c | 8 +++-----
1 file changed, 3 insertions(+), 5 deletions(-)
Index: temp/mm/page_cgroup.c
===================================================================
--- temp.orig/mm/page_cgroup.c
+++ temp/mm/page_cgroup.c
@@ -306,7 +306,7 @@ static int swap_cgroup_prepare(int type)
ctrl = &swap_cgroup_ctrl[type];
for (idx = 0; idx < ctrl->length; idx++) {
- page = alloc_page(GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO);
+ page = alloc_page(GFP_KERNEL | __GFP_ZERO);
if (!page)
goto not_enough_page;
ctrl->map[idx] = page;
@@ -347,11 +347,10 @@ struct mem_cgroup *swap_cgroup_record(sw
mappage = ctrl->map[idx];
spin_lock_irqsave(&ctrl->lock, flags);
- sc = kmap_atomic(mappage, KM_USER0);
+ sc = page_address(mappage);
sc += pos;
old = sc->val;
sc->val = mem;
- kunmap_atomic((void *)sc, KM_USER0);
spin_unlock_irqrestore(&ctrl->lock, flags);
return old;
}
@@ -382,10 +381,9 @@ struct mem_cgroup *lookup_swap_cgroup(sw
mappage = ctrl->map[idx];
spin_lock_irqsave(&ctrl->lock, flags);
- sc = kmap_atomic(mappage, KM_USER0);
+ sc = page_address(mappage);
sc += pos;
ret = sc->val;
- kunmap_atomic((void *)sc, KM_USER0);
spin_unlock_irqrestore(&ctrl->lock, flags);
return ret;
}
* Re: [PATCH mmotm] memcg: avoid using buggy kmap at swap_cgroup
2008-11-18 16:12 ` [PATCH mmotm] memcg: avoid using buggy kmap at swap_cgroup KAMEZAWA Hiroyuki
@ 2008-11-18 16:40 ` Hugh Dickins
2008-11-18 16:56 ` Hugh Dickins
0 siblings, 1 reply; 13+ messages in thread
From: Hugh Dickins @ 2008-11-18 16:40 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Daisuke Nishimura, Andrew Morton, linux-mm, Balbir Singh,
Pavel Emelianov, LiZefan
On Wed, 19 Nov 2008, KAMEZAWA Hiroyuki wrote:
> Okay, how about this direction ?
> 1. at first, remove kmap_atomic from page_cgroup.c and use GFP_KERNEL
> to allocate buffer.
Yes, that's sensible for now.
> 2. later, add kmap_atomic + HighMem buffer support in explicit style.
> maybe KM_BOUNCE_READ...can be used.....
It's hardly appropriate (there's no bouncing here), and you could only
use it if you disable interrupts. Oh, you do disable interrupts:
why's that?
>
> patch for BUGFIX is attached.
> (Sorry, I have to use Web-Mail and can't make it inlined)
swap_cgroup's kmap logic conflicts with shmem's kmap logic.
Avoid using HIGHMEM for now and revisit this later.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Hugh Dickins <hugh@veritas.com>
---
mm/page_cgroup.c | 8 +++-----
1 file changed, 3 insertions(+), 5 deletions(-)
Index: temp/mm/page_cgroup.c
===================================================================
--- temp.orig/mm/page_cgroup.c
+++ temp/mm/page_cgroup.c
@@ -306,7 +306,7 @@ static int swap_cgroup_prepare(int type)
ctrl = &swap_cgroup_ctrl[type];
for (idx = 0; idx < ctrl->length; idx++) {
- page = alloc_page(GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO);
+ page = alloc_page(GFP_KERNEL | __GFP_ZERO);
if (!page)
goto not_enough_page;
ctrl->map[idx] = page;
@@ -347,11 +347,10 @@ struct mem_cgroup *swap_cgroup_record(sw
mappage = ctrl->map[idx];
spin_lock_irqsave(&ctrl->lock, flags);
- sc = kmap_atomic(mappage, KM_USER0);
+ sc = page_address(mappage);
sc += pos;
old = sc->val;
sc->val = mem;
- kunmap_atomic((void *)sc, KM_USER0);
spin_unlock_irqrestore(&ctrl->lock, flags);
return old;
}
@@ -382,10 +381,9 @@ struct mem_cgroup *lookup_swap_cgroup(sw
mappage = ctrl->map[idx];
spin_lock_irqsave(&ctrl->lock, flags);
- sc = kmap_atomic(mappage, KM_USER0);
+ sc = page_address(mappage);
sc += pos;
ret = sc->val;
- kunmap_atomic((void *)sc, KM_USER0);
spin_unlock_irqrestore(&ctrl->lock, flags);
return ret;
}
--
* Re: [PATCH mmotm] memcg: avoid using buggy kmap at swap_cgroup
2008-11-18 16:40 ` Hugh Dickins
@ 2008-11-18 16:56 ` Hugh Dickins
2008-11-19 12:44 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 13+ messages in thread
From: Hugh Dickins @ 2008-11-18 16:56 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Daisuke Nishimura, Andrew Morton, linux-mm, Balbir Singh,
Pavel Emelianov, LiZefan
On Tue, 18 Nov 2008, Hugh Dickins wrote:
> On Wed, 19 Nov 2008, KAMEZAWA Hiroyuki wrote:
>
> > 2. later, add kmap_atomic + HighMem buffer support in explicit style.
> > maybe KM_BOUNCE_READ...can be used.....
>
> It's hardly appropriate (there's no bouncing here), and you could only
> use it if you disable interrupts. Oh, you do disable interrupts:
> why's that?
In fact, why do you even need the spinlock? I can see that you would
need it if in future you reduce the size of the elements of the array
from pointers; but at present, aren't you already in trouble if there's
a race on the pointer?
Hugh
--
* Re: [PATCH mmotm] memcg: avoid using buggy kmap at swap_cgroup
2008-11-18 16:56 ` Hugh Dickins
@ 2008-11-19 12:44 ` KAMEZAWA Hiroyuki
2008-11-21 5:54 ` [PATCH] memcg-swap-cgroup-for-remembering-usage-v2.patch KAMEZAWA Hiroyuki
0 siblings, 1 reply; 13+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-19 12:44 UTC (permalink / raw)
To: Hugh Dickins
Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, Andrew Morton, linux-mm,
Balbir Singh, Pavel Emelianov, LiZefan
Hugh Dickins said:
> On Tue, 18 Nov 2008, Hugh Dickins wrote:
>> On Wed, 19 Nov 2008, KAMEZAWA Hiroyuki wrote:
>>
>> > 2. later, add kmap_atomic + HighMem buffer support in explicit style.
>> > maybe KM_BOUNCE_READ...can be used.....
>>
>> It's hardly appropriate (there's no bouncing here), and you could only
>> use it if you disable interrupts. Oh, you do disable interrupts:
>> why's that?
>
> In fact, why do you even need the spinlock? I can see that you would
> need it if in future you reduce the size of the elements of the array
> from pointers; but at present, aren't you already in trouble if there's
> a race on the pointer?
>
Hmm, I originally added it just for doing the exchange of an entry.
Now, lookup and exchange operations are implemented for swap_cgroup.
This field is touched when
- trying to map a swap cache page (try_charge_swapin) -> lookup
- after the swap cache page is mapped -> exchange
- the swap cache is read by shmem -> lookup/exchange
- the swap cache is dropped -> exchange
- the swap entry is freed -> exchange
...
Hmm.....
When accessed via SwapCache -> SwapCache is locked -> no race.
When accessed via swap_free -> no user of the swap entry -> no race.
Then maybe the lock is not needed... I'll review and prepare a rework patch
against patch-in-mmotm+fix1234.
Thank you for pointing this out.
-Kame
> Hugh
>
--
* [PATCH] memcg-swap-cgroup-for-remembering-usage-v2.patch
2008-11-19 12:44 ` KAMEZAWA Hiroyuki
@ 2008-11-21 5:54 ` KAMEZAWA Hiroyuki
2008-11-21 6:16 ` Daisuke Nishimura
2008-11-21 9:42 ` Daisuke Nishimura
0 siblings, 2 replies; 13+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-21 5:54 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Hugh Dickins, Daisuke Nishimura, Andrew Morton, linux-mm,
Balbir Singh, Pavel Emelianov, LiZefan
This patch is written as a replacement for
memcg-swap-cgroup-for-remembering-usage.patch
memcg-swap-cgroup-for-remembering-usage-fix.patch
memcg-swap-cgroup-for-remembering-usage-fix-2.patch
memcg-swap-cgroup-for-remembering-usage-fix-3.patch
memcg-swap-cgroup-for-remembering-usage-fix-4.patch
in mm queue. (sorry for low quality.)
I'm now testing this. (replace above 5 patches with this).
Nishimura-san, Could you try ?
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Unified patch + rework for memcg-swap-cgroup-for-remembering-usage.patch
For accounting swap, we need at least one record per swap entry.
This patch adds the following functions:
- swap_cgroup_swapon() .... called from swapon
- swap_cgroup_swapoff() ... called at the end of swapoff.
- swap_cgroup_record() .... record information about a swap entry.
- lookup_swap_cgroup() .... look up information about a swap entry.
This patch just implements "how to record information"; it provides no actual
method for limiting the usage of swap yet. These routines use a flat table for
recording and lookup. A "wise" lookup structure such as a radix tree requires
memory allocation when inserting new records, but swap-out is usually called
under memory shortage (or when memcg hits its limit), so I used static
allocation. (Dynamic allocation is probably not very hard, but it adds another
memory allocation to the memory-shortage path.)
Note1: For now we use a pointer to record the information, and this means
8 bytes per swap entry. I think we can reduce this once we can use a cgroup
"id" in the range of 0-65535 or 0-255.
Changelog: v1->v2
- make the mutex static (from Andrew Morton <akpm@linux-foundation.org>)
- fix a typo (from Balbir Singh <balbir@linux.vnet.ibm.com>)
- omit the HIGHMEM functions; put that on the TODO list.
- delete the spinlock around operations (added a comment on why we're safe).
Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Reported-by: Hugh Dickins <hugh@veritas.com>
Reported-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
include/linux/page_cgroup.h | 35 +++++++
mm/page_cgroup.c | 197 ++++++++++++++++++++++++++++++++++++++++++++
mm/swapfile.c | 8 +
3 files changed, 240 insertions(+)
Index: mmotm-2.6.28-Nov20/include/linux/page_cgroup.h
===================================================================
--- mmotm-2.6.28-Nov20.orig/include/linux/page_cgroup.h
+++ mmotm-2.6.28-Nov20/include/linux/page_cgroup.h
@@ -105,4 +105,39 @@ static inline void page_cgroup_init(void
}
#endif
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+#include <linux/swap.h>
+extern struct mem_cgroup *
+swap_cgroup_record(swp_entry_t ent, struct mem_cgroup *mem);
+extern struct mem_cgroup *lookup_swap_cgroup(swp_entry_t ent);
+extern int swap_cgroup_swapon(int type, unsigned long max_pages);
+extern void swap_cgroup_swapoff(int type);
+#else
+#include <linux/swap.h>
+
+static inline
+struct mem_cgroup *swap_cgroup_record(swp_entry_t ent, struct mem_cgroup *mem)
+{
+ return NULL;
+}
+
+static inline
+struct mem_cgroup *lookup_swap_cgroup(swp_entry_t ent)
+{
+ return NULL;
+}
+
+static inline int
+swap_cgroup_swapon(int type, unsigned long max_pages)
+{
+ return 0;
+}
+
+static inline void swap_cgroup_swapoff(int type)
+{
+ return;
+}
+
+#endif
#endif
Index: mmotm-2.6.28-Nov20/mm/page_cgroup.c
===================================================================
--- mmotm-2.6.28-Nov20.orig/mm/page_cgroup.c
+++ mmotm-2.6.28-Nov20/mm/page_cgroup.c
@@ -8,6 +8,7 @@
#include <linux/memory.h>
#include <linux/vmalloc.h>
#include <linux/cgroup.h>
+#include <linux/swapops.h>
static void __meminit
__init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
@@ -266,3 +267,199 @@ void __init pgdat_page_cgroup_init(struc
}
#endif
+
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+
+static DEFINE_MUTEX(swap_cgroup_mutex);
+struct swap_cgroup_ctrl {
+ struct page **map;
+ unsigned long length;
+};
+
+struct swap_cgroup_ctrl swap_cgroup_ctrl[MAX_SWAPFILES];
+
+/*
+ * This 8bytes seems big..maybe we can reduce this when we can use "id" for
+ * cgroup rather than pointer.
+ */
+struct swap_cgroup {
+ struct mem_cgroup *val;
+};
+#define SC_PER_PAGE (PAGE_SIZE/sizeof(struct swap_cgroup))
+#define SC_POS_MASK (SC_PER_PAGE - 1)
+
+/*
+ * SwapCgroup implements "lookup" and "exchange" operations.
+ * In typical usage, this swap_cgroup is accessed via memcg's charge/uncharge
+ * against SwapCache. At swap_free(), this is accessed directly from swap.
+ *
+ * This means,
+ * - we have no race in "exchange" when we're accessed via SwapCache because
+ * SwapCache(and its swp_entry) is under lock.
+ * - When called via swap_free(), there is no user of this entry and no race.
+ * Then, we don't need lock around "exchange".
+ *
+ * TODO: we can push these buffers out to HIGHMEM.
+ */
+
+/*
+ * allocate buffer for swap_cgroup.
+ */
+static int swap_cgroup_prepare(int type)
+{
+ struct page *page;
+ struct swap_cgroup_ctrl *ctrl;
+ unsigned long idx, max;
+
+ if (!do_swap_account)
+ return 0;
+ ctrl = &swap_cgroup_ctrl[type];
+
+ for (idx = 0; idx < ctrl->length; idx++) {
+ page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+ if (!page)
+ goto not_enough_page;
+ ctrl->map[idx] = page;
+ }
+ return 0;
+not_enough_page:
+ max = idx;
+ for (idx = 0; idx < max; idx++)
+ __free_page(ctrl->map[idx]);
+
+ return -ENOMEM;
+}
+
+/**
+ * swap_cgroup_record - record mem_cgroup for this swp_entry.
+ * @ent: swap entry to be recorded into
+ * @mem: mem_cgroup to be recorded
+ *
+ * Returns old value at success, NULL at failure.
+ * (Of course, old value can be NULL.)
+ */
+struct mem_cgroup *swap_cgroup_record(swp_entry_t ent, struct mem_cgroup *mem)
+{
+ int type = swp_type(ent);
+ unsigned long offset = swp_offset(ent);
+ unsigned long idx = offset / SC_PER_PAGE;
+ unsigned long pos = offset & SC_POS_MASK;
+ struct swap_cgroup_ctrl *ctrl;
+ struct page *mappage;
+ struct swap_cgroup *sc;
+ struct mem_cgroup *old;
+
+ if (!do_swap_account)
+ return NULL;
+
+ ctrl = &swap_cgroup_ctrl[type];
+
+ mappage = ctrl->map[idx];
+ sc = page_address(mappage);
+ sc += pos;
+ old = sc->val;
+ sc->val = mem;
+
+ return old;
+}
+
+/**
+ * lookup_swap_cgroup - lookup mem_cgroup tied to swap entry
+ * @ent: swap entry to be looked up.
+ *
+ * Returns pointer to mem_cgroup at success. NULL at failure.
+ */
+struct mem_cgroup *lookup_swap_cgroup(swp_entry_t ent)
+{
+ int type = swp_type(ent);
+ unsigned long offset = swp_offset(ent);
+ unsigned long idx = offset / SC_PER_PAGE;
+ unsigned long pos = offset & SC_POS_MASK;
+ struct swap_cgroup_ctrl *ctrl;
+ struct page *mappage;
+ struct swap_cgroup *sc;
+ struct mem_cgroup *ret;
+
+ if (!do_swap_account)
+ return NULL;
+
+ ctrl = &swap_cgroup_ctrl[type];
+ mappage = ctrl->map[idx];
+ sc = page_address(mappage);
+ sc += pos;
+ ret = sc->val;
+ return ret;
+}
+
+int swap_cgroup_swapon(int type, unsigned long max_pages)
+{
+ void *array;
+ unsigned long array_size;
+ unsigned long length;
+ struct swap_cgroup_ctrl *ctrl;
+
+ if (!do_swap_account)
+ return 0;
+
+ length = ((max_pages/SC_PER_PAGE) + 1);
+ array_size = length * sizeof(void *);
+
+ array = vmalloc(array_size);
+ if (!array)
+ goto nomem;
+
+ memset(array, 0, array_size);
+ ctrl = &swap_cgroup_ctrl[type];
+ mutex_lock(&swap_cgroup_mutex);
+ ctrl->length = length;
+ ctrl->map = array;
+ if (swap_cgroup_prepare(type)) {
+ /* memory shortage */
+ ctrl->map = NULL;
+ ctrl->length = 0;
+ vfree(array);
+ mutex_unlock(&swap_cgroup_mutex);
+ goto nomem;
+ }
+ mutex_unlock(&swap_cgroup_mutex);
+
+ printk(KERN_INFO
+ "swap_cgroup: uses %ld bytes of vmalloc for pointer array space"
+ " and %ld bytes to hold mem_cgroup pointers on swap\n",
+ array_size, length * PAGE_SIZE);
+ printk(KERN_INFO
+ "swap_cgroup can be disabled by noswapaccount boot option.\n");
+
+ return 0;
+nomem:
+ printk(KERN_INFO "couldn't allocate enough memory for swap_cgroup.\n");
+ printk(KERN_INFO
+ "swap_cgroup can be disabled by noswapaccount boot option\n");
+ return -ENOMEM;
+}
+
+void swap_cgroup_swapoff(int type)
+{
+ int i;
+ struct swap_cgroup_ctrl *ctrl;
+
+ if (!do_swap_account)
+ return;
+
+ mutex_lock(&swap_cgroup_mutex);
+ ctrl = &swap_cgroup_ctrl[type];
+ if (ctrl->map) {
+ for (i = 0; i < ctrl->length; i++) {
+ struct page *page = ctrl->map[i];
+ if (page)
+ __free_page(page);
+ }
+ vfree(ctrl->map);
+ ctrl->map = NULL;
+ ctrl->length = 0;
+ }
+ mutex_unlock(&swap_cgroup_mutex);
+}
+
+#endif
Index: mmotm-2.6.28-Nov20/mm/swapfile.c
===================================================================
--- mmotm-2.6.28-Nov20.orig/mm/swapfile.c
+++ mmotm-2.6.28-Nov20/mm/swapfile.c
@@ -32,6 +32,7 @@
#include <asm/pgtable.h>
#include <asm/tlbflush.h>
#include <linux/swapops.h>
+#include <linux/page_cgroup.h>
static DEFINE_SPINLOCK(swap_lock);
static unsigned int nr_swapfiles;
@@ -1345,6 +1346,9 @@ asmlinkage long sys_swapoff(const char _
spin_unlock(&swap_lock);
mutex_unlock(&swapon_mutex);
vfree(swap_map);
+ /* Destroy swap account information */
+ swap_cgroup_swapoff(type);
+
inode = mapping->host;
if (S_ISBLK(inode->i_mode)) {
struct block_device *bdev = I_BDEV(inode);
@@ -1669,6 +1673,10 @@ asmlinkage long sys_swapon(const char __
nr_good_pages = swap_header->info.last_page -
swap_header->info.nr_badpages -
1 /* header page */;
+
+ if (!error)
+ error = swap_cgroup_swapon(type, maxpages);
+
if (error)
goto bad_swap;
}
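For orientation, the intended call sites of these hooks, as described in the
comment block above (only a sketch; the actual memcg charge/uncharge wiring is
in the other patches of this series):

  swapon                  -> swap_cgroup_swapon(type, maxpages)  /* allocate the flat table */
  charge/uncharge of
  SwapCache (memcg side)  -> lookup_swap_cgroup() / swap_cgroup_record()
  swap_free()             -> swap_cgroup_record()                /* exchange/clear the record */
  swapoff                 -> swap_cgroup_swapoff(type)           /* free the table */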
--
* Re: [PATCH] memcg-swap-cgroup-for-remembering-usage-v2.patch
2008-11-21 5:54 ` [PATCH] memcg-swap-cgroup-for-remembering-usage-v2.patch KAMEZAWA Hiroyuki
@ 2008-11-21 6:16 ` Daisuke Nishimura
2008-11-21 9:42 ` Daisuke Nishimura
1 sibling, 0 replies; 13+ messages in thread
From: Daisuke Nishimura @ 2008-11-21 6:16 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: nishimura, Hugh Dickins, Andrew Morton, linux-mm, Balbir Singh,
Pavel Emelianov, LiZefan
On Fri, 21 Nov 2008 14:54:26 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> This patch is written as a replacement for
> memcg-swap-cgroup-for-remembering-usage.patch
> memcg-swap-cgroup-for-remembering-usage-fix.patch
> memcg-swap-cgroup-for-remembering-usage-fix-2.patch
> memcg-swap-cgroup-for-remembering-usage-fix-3.patch
> memcg-swap-cgroup-for-remembering-usage-fix-4.patch
>
> in mm queue. (sorry for low quality.)
>
> I'm now testing this. (replace above 5 patches with this). S
> Nishimura-san, Could you try ?
>
Sure.
I'll test with this patch applied.
Thanks,
Daisuke Nishimura.
> [snip]
--
* Re: [PATCH] memcg-swap-cgroup-for-remembering-usage-v2.patch
2008-11-21 5:54 ` [PATCH] memcg-swap-cgroup-for-remembering-usage-v2.patch KAMEZAWA Hiroyuki
2008-11-21 6:16 ` Daisuke Nishimura
@ 2008-11-21 9:42 ` Daisuke Nishimura
1 sibling, 0 replies; 13+ messages in thread
From: Daisuke Nishimura @ 2008-11-21 9:42 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: nishimura, Hugh Dickins, Andrew Morton, linux-mm, Balbir Singh,
Pavel Emelianov, LiZefan
On Fri, 21 Nov 2008 14:54:26 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> This patch is written as a replacement for
> memcg-swap-cgroup-for-remembering-usage.patch
> memcg-swap-cgroup-for-remembering-usage-fix.patch
> memcg-swap-cgroup-for-remembering-usage-fix-2.patch
> memcg-swap-cgroup-for-remembering-usage-fix-3.patch
> memcg-swap-cgroup-for-remembering-usage-fix-4.patch
>
> in mm queue. (sorry for low quality.)
>
> I'm now testing this. (replace above 5 patches with this). S
> Nishimura-san, Could you try ?
>
This patch looks good to me, and it works well in my x86_32/64 environments.
I think it would be better to replace the above patches with this patch
for further testing.
Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> [snip]
--