From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
To: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: "linux-mm@kvack.org" <linux-mm@kvack.org>,
	"balbir@linux.vnet.ibm.com" <balbir@linux.vnet.ibm.com>,
	"xemul@openvz.org" <xemul@openvz.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Dave Hansen <haveblue@us.ibm.com>,
	ryov@valinux.co.jp
Subject: Re: [PATCH 9/12] memcg allocate all page_cgroup at boot
Date: Fri, 26 Sep 2008 10:43:36 +0900
Message-ID: <20080926104336.d96ab5bd.kamezawa.hiroyu@jp.fujitsu.com>
In-Reply-To: <20080926100022.8bfb8d4d.nishimura@mxp.nes.nec.co.jp>

On Fri, 26 Sep 2008 10:00:22 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> On Thu, 25 Sep 2008 15:32:06 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > Allocate all page_cgroup structures at boot and remove the page_cgroup pointer
> > from struct page. This patch adds an interface:
> > 
> >  struct page_cgroup *lookup_page_cgroup(struct page*)
> > 
> > FLATMEM, DISCONTIGMEM, SPARSEMEM and MEMORY_HOTPLUG are all supported.
> > 
> > Removing the page_cgroup pointer reduces memory usage by
> >  - 4 bytes per page (32-bit), or
> >  - 8 bytes per page (64-bit)
> > even when the memory controller is disabled (but configured in).
> > The metadata overhead of this approach is not a problem on FLATMEM/DISCONTIGMEM.
> > On SPARSEMEM, it doubles the size of mem_section[].
> > 
> > On a typical 8GB x86-32 server, this saves 8MB of NORMAL_ZONE memory.
> > On my x86-64 server with 48GB of memory, it saves 96MB of memory
> > (and uses xx kbytes for mem_section).
> > I think this reduction makes sense.
> > 
> > By pre-allocating, the kmalloc/kfree calls in charge/uncharge are removed.
> > This means
> >   - we no longer need to worry about kmalloc failure
> >     (which can happen depending on the gfp_mask).
> >   - we avoid calling kmalloc/kfree.
> >   - we avoid allocating tons of small objects, which can cause fragmentation.
> >   - we know in advance how much memory will be used for this extra LRU handling.
> > 
> > I added a printk message:
> > 
> > 	"allocated %ld bytes of page_cgroup"
> >         "please try cgroup_disable=memory option if you don't want"
> > 
> > which should be informative enough for users.
> > 
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> >  include/linux/memcontrol.h  |   11 -
> >  include/linux/mm_types.h    |    4 
> >  include/linux/mmzone.h      |    9 +
> >  include/linux/page_cgroup.h |   90 +++++++++++++++
> >  mm/Makefile                 |    2 
> >  mm/memcontrol.c             |  258 ++++++++++++--------------------------------
> >  mm/page_alloc.c             |   12 --
> >  mm/page_cgroup.c            |  253 +++++++++++++++++++++++++++++++++++++++++++
> >  8 files changed, 431 insertions(+), 208 deletions(-)
> > 
> > Index: mmotm-2.6.27-rc7+/mm/page_cgroup.c
> > ===================================================================
> > --- /dev/null
> > +++ mmotm-2.6.27-rc7+/mm/page_cgroup.c
> > @@ -0,0 +1,253 @@
> > +#include <linux/mm.h>
> > +#include <linux/mmzone.h>
> > +#include <linux/bootmem.h>
> > +#include <linux/bit_spinlock.h>
> > +#include <linux/page_cgroup.h>
> > +#include <linux/hash.h>
> > +#include <linux/memory.h>
> > +
> > +static void __meminit
> > +__init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
> > +{
> > +	pc->flags = 0;
> > +	pc->mem_cgroup = NULL;
> > +	pc->page = pfn_to_page(pfn);
> > +}
> > +static unsigned long total_usage = 0;
> > +
> > +#ifdef CONFIG_FLAT_NODE_MEM_MAP
> > +
> > +
> > +void __init pgdat_page_cgroup_init(struct pglist_data *pgdat)
> > +{
> > +	pgdat->node_page_cgroup = NULL;
> > +}
> > +
> > +struct page_cgroup *lookup_page_cgroup(struct page *page)
> > +{
> > +	unsigned long pfn = page_to_pfn(page);
> > +	unsigned long offset;
> > +	struct page_cgroup *base;
> > +
> > +	base = NODE_DATA(page_to_nid(nid))->node_page_cgroup;
> page_to_nid(page) :)
> 
yes..
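i.e., that line should read something like:

	base = NODE_DATA(page_to_nid(page))->node_page_cgroup;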

> > +	if (unlikely(!base))
> > +		return NULL;
> > +
> > +	offset = pfn - NODE_DATA(page_to_nid(page))->node_start_pfn;
> > +	return base + offset;
> > +}
> > +
> > +static int __init alloc_node_page_cgroup(int nid)
> > +{
> > +	struct page_cgroup *base, *pc;
> > +	unsigned long table_size;
> > +	unsigned long start_pfn, nr_pages, index;
> > +
> > +	start_pfn = NODE_DATA(nid)->node_start_pfn;
> > +	nr_pages = NODE_DATA(nid)->node_spanned_pages;
> > +
> > +	table_size = sizeof(struct page_cgroup) * nr_pages;
> > +
> > +	base = __alloc_bootmem_node_nopanic(NODE_DATA(nid),
> > +			table_size, PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
> > +	if (!base)
> > +		return -ENOMEM;
> > +	for (index = 0; index < nr_pages; index++) {
> > +		pc = base + index;
> > +		__init_page_cgroup(pc, start_pfn + index);
> > +	}
> > +	NODE_DATA(nid)->node_page_cgroup = base;
> > +	total_usage += table_size;
> > +	return 0;
> > +}
> > +
> > +void __init free_node_page_cgroup(int nid)
> > +{
> > +	unsigned long table_size;
> > +	unsigned long nr_pages;
> > +	struct page_cgroup *base;
> > +
> > +	base = NODE_DATA(nid)->node_page_cgroup;
> > +	if (!base)
> > +		return;
> > +	nr_pages = NODE_DATA(nid)->node_spanned_pages;
> > +
> > +	table_size = sizeof(struct page_cgroup) * nr_pages;
> > +
> > +	free_bootmem_node(NODE_DATA(nid),
> > +			(unsigned long)base, table_size);
> > +	NODE_DATA(nid)->node_page_cgroup = NULL;
> > +}
> > +
> Hmm, who uses this function?
> 
Uh, ok, it's unnecessary. (In my first version, this allocation error
just showed a warning. Now, it panics.)

Apparently, the FLATMEM check is not enough...

> (snip)
> 
> > @@ -812,49 +708,41 @@ __mem_cgroup_uncharge_common(struct page
> >  
> >  	if (mem_cgroup_subsys.disabled)
> >  		return;
> > +	/* check the condition we can know from page */
> >  
> > -	/*
> > -	 * Check if our page_cgroup is valid
> > -	 */
> > -	lock_page_cgroup(page);
> > -	pc = page_get_page_cgroup(page);
> > -	if (unlikely(!pc))
> > -		goto unlock;
> > -
> > -	VM_BUG_ON(pc->page != page);
> > +	pc = lookup_page_cgroup(page);
> > +	if (unlikely(!pc || !PageCgroupUsed(pc)))
> > +		return;
> > +	preempt_disable();
> > +	lock_page_cgroup(pc);
> > +	if (unlikely(page_mapped(page))) {
> > +		unlock_page_cgroup(pc);
> > +		preempt_enable();
> > +		return;
> > +	}
> Just for clarification, in what sequence will the page be mapped here?
> mem_cgroup_uncharge_page checks whether the page is mapped.
> 
Please think about the following situation.

   There is a SwapCache page referenced by two processes, A and B.
   A maps it.
   B doesn't map it.

   And now, process A exits.

	CPU0 (process A)			CPU1 (process B)

    zap_pte_range()
    => page removed from rmap			=> charge() (do_swap_page)
	=> page->mapcount set to 0
		=> uncharge()			=> page->mapcount set to 1

This race is what patch 12/12 fixes.
This only happens with that cursed SwapCache.
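
To make the window concrete, here is the uncharge-side check from the hunk
above again, with comments spelling out what it guards against (a simplified
sketch, not the exact code):

	pc = lookup_page_cgroup(page);
	if (unlikely(!pc || !PageCgroupUsed(pc)))
		return;
	preempt_disable();
	lock_page_cgroup(pc);
	/*
	 * The page may already have been mapped again by CPU1
	 * (do_swap_page() in process B); uncharging now would drop the
	 * charge that B still needs, so back off.
	 */
	if (unlikely(page_mapped(page))) {
		unlock_page_cgroup(pc);
		preempt_enable();
		return;
	}
	/*
	 * The window that remains -- B has charged the page but has not
	 * mapped it yet -- is the one patch 12/12 closes.
	 */
	ClearPageCgroupUsed(pc);
	unlock_page_cgroup(pc);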


> > +	ClearPageCgroupUsed(pc);
> > +	unlock_page_cgroup(pc);
> >  
> > -	if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
> > -	    && ((PageCgroupCache(pc) || page_mapped(page))))
> > -		goto unlock;
> > -retry:
> >  	mem = pc->mem_cgroup;
> >  	mz = page_cgroup_zoneinfo(pc);
> > +
> >  	spin_lock_irqsave(&mz->lru_lock, flags);
> > -	if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED &&
> > -	    unlikely(mem != pc->mem_cgroup)) {
> > -		/* MAPPED account can be done without lock_page().
> > -		   Check race with mem_cgroup_move_account() */
> > -		spin_unlock_irqrestore(&mz->lru_lock, flags);
> > -		goto retry;
> > -	}
> By these changes, ctype becomes unnecessary so it can be removed.
> 
Uh, maybe it can be removed.
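If so, the wrappers would just call it without a type argument, e.g.
something like (hypothetical sketch only):

	void mem_cgroup_uncharge_page(struct page *page)
	{
		__mem_cgroup_uncharge_common(page);
	}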

> >  	__mem_cgroup_remove_list(mz, pc);
> >  	spin_unlock_irqrestore(&mz->lru_lock, flags);
> > -
> > -	page_assign_page_cgroup(page, NULL);
> > -	unlock_page_cgroup(page);
> > -
> > -
> > -	res_counter_uncharge(&mem->res, PAGE_SIZE);
> > +	pc->mem_cgroup = NULL;
> >  	css_put(&mem->css);
> > +	preempt_enable();
> > +	res_counter_uncharge(&mem->res, PAGE_SIZE);
> >  
> > -	kmem_cache_free(page_cgroup_cache, pc);
> >  	return;
> > -unlock:
> > -	unlock_page_cgroup(page);
> >  }
> >  

Thank you for the review.

Regards,
-Kame


