linux-mm.kvack.org archive mirror
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
To: balbir@linux.vnet.ibm.com
Cc: Nick Piggin <nickpiggin@yahoo.com.au>,
	Andrew Morton <akpm@linux-foundation.org>,
	hugh@veritas.com, menage@google.com, xemul@openvz.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [Approach #2] [RFC][PATCH] Remove cgroup member from struct page
Date: Wed, 10 Sep 2008 11:35:46 +0900	[thread overview]
Message-ID: <20080910113546.7e5b2fe8.kamezawa.hiroyu@jp.fujitsu.com> (raw)
In-Reply-To: <48C72CBD.6040602@linux.vnet.ibm.com>

On Tue, 09 Sep 2008 19:11:09 -0700
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > 1. This is nonsense... do you know the memory map of IBM's (maybe ppc) machines?
> > A node's memory is split into several pieces and is not ordered by node number.
> > example)
> >    Node 0 | Node 1 | Node 2 | Node 1 | Node 2 | 
> > 
> > This may seem special, but when I helped with SPARSEMEM and MEMORY_HOTPLUG,
> > I saw many kinds of memory maps. As you wrote, this should be re-designed.
> > 
> 
> Thanks, so that means we cannot predict the size of pcg_map[n] beforehand;
> we'll need to do incremental additions to pcg_map?
Or use some "allocate a chunk of page_cgroup for a chunk of contiguous pages"
approach. (This is the reason I mentioned SPARSEMEM.)
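
For example, a rough sketch of a per-section map (SPARSEMEM-style; the
names pcg_section_map, alloc_page_cgroup_section and lookup_page_cgroup
are illustrative, not from the patch):

static struct page_cgroup *pcg_section_map[NR_MEM_SECTIONS];

/* allocate one page_cgroup array per SPARSEMEM section, on its node */
static int __init alloc_page_cgroup_section(unsigned long section_nr, int nid)
{
	unsigned long size = PAGES_PER_SECTION * sizeof(struct page_cgroup);

	pcg_section_map[section_nr] = alloc_bootmem_node(NODE_DATA(nid), size);
	if (!pcg_section_map[section_nr])
		return -ENOMEM;
	return 0;
}

static inline struct page_cgroup *lookup_page_cgroup(unsigned long pfn)
{
	/* PAGES_PER_SECTION is a power of two, so the mask is the offset */
	return pcg_section_map[pfn_to_section_nr(pfn)] +
		(pfn & (PAGES_PER_SECTION - 1));
}

A hole between nodes then costs nothing but unused map slots, and a
hot-added section can get its chunk on demand.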

> 
> > 2. If pre-allocating everything is OK, I'll stop my work. Mine would be of no use.
> 
> One of the goals of this patch is refinement; it is a starting piece, something
> I shared very early. I am not asking you to stop your work. While I think
> pre-allocating is not the best way to do this, the trade-off is the sparseness
> of the machine. I don't mind doing it in other ways, but we'll still need to do
> some batched preallocation (of a smaller size, maybe).
> 
Hmm, maybe clarifying the trade-offs and comparing them is the first step.
I'll post my idea if one comes to me.

> > But you have to know that with pre-allocation, we can't avoid the lru-lock
> > by batching, as in the pagevec technique. We can't delay uncharge because a
> > page can be reused soon.
> > 
> > 
> 
> Care to elaborate on this? Why not? If the page is reused, we act on the batch
> and sync it up.
> 
And touch a vec on another cpu? The reason the "vec" is fast is that it's
per-cpu. If we want to use "delaying", we'll have to mark the page_cgroup
unused and not-on-lru when the page_cgroup's page is added to the free queue.
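
To illustrate why the per-cpu batch is cheap, a minimal sketch (the
drain_uncharge_pvec() helper is hypothetical; it would take the lru_lock
once and uncharge the whole batch):

static DEFINE_PER_CPU(struct pagevec, uncharge_pvec);

static void batch_uncharge(struct page *page)
{
	/* no shared lock here: this cpu owns its own pagevec */
	struct pagevec *pvec = &get_cpu_var(uncharge_pvec);

	if (!pagevec_add(pvec, page))
		/* batch full: take the lru_lock once for all entries */
		drain_uncharge_pvec(pvec);
	put_cpu_var(uncharge_pvec);
}

Reaching into another cpu's pagevec to flush a single reused page would
need cross-cpu synchronization and defeat the point.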

> > 
> > 
> >> +	pcg_map[n] = alloc_bootmem_node(pgdat, size);
> >> +	/*
> >> +	 * We can do smoother recovery
> >> +	 */
> >> +	BUG_ON(!pcg_map[n]);
> >> +	return 0;
> >>  }
> >>  
> >> -static int try_lock_page_cgroup(struct page *page)
> >> +void page_cgroup_init(int nid, unsigned long pfn)
> >>  {
> >> -	return bit_spin_trylock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
> >> +	unsigned long node_pfn;
> >> +	struct page_cgroup *pc;
> >> +
> >> +	if (mem_cgroup_subsys.disabled)
> >> +		return;
> >> +
> >> +	node_pfn = pfn - NODE_DATA(nid)->node_start_pfn;
> >> +	pc = &pcg_map[nid][node_pfn];
> >> +
> >> +	BUG_ON(!pc);
> >> +	pc->flags = PAGE_CGROUP_FLAG_VALID;
> >> +	INIT_LIST_HEAD(&pc->lru);
> >> +	pc->page = NULL;
> > This NULL is unnecessary. pc->page = pfn_to_page(pfn) always.
> > 
> 
> OK
> 
> > 
> >> +	pc->mem_cgroup = NULL;
> >>  }
> >>  
> >> -static void unlock_page_cgroup(struct page *page)
> >> +struct page_cgroup *__page_get_page_cgroup(struct page *page, bool lock,
> >> +						bool trylock)
> >>  {
> >> -	bit_spin_unlock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
> >> +	struct page_cgroup *pc;
> >> +	int ret;
> >> +	int node = page_to_nid(page);
> >> +	unsigned long pfn;
> >> +
> >> +	pfn = page_to_pfn(page) - NODE_DATA(node)->node_start_pfn;
> >> +	pc = &pcg_map[node][pfn];
> >> +	BUG_ON(!(pc->flags & PAGE_CGROUP_FLAG_VALID));
> >> +	if (lock)
> >> +		lock_page_cgroup(pc);
> >> +	else if (trylock) {
> >> +		ret = trylock_page_cgroup(pc);
> >> +		if (!ret)
> >> +			pc = NULL;
> >> +	}
> >> +
> >> +	return pc;
> >> +}
> >> +
> >> +/*
> >> + * Should be called with page_cgroup lock held. Any additions to pc->flags
> >> + * should be reflected here. This might seem ugly, refine it later.
> >> + */
> >> +void page_clear_page_cgroup(struct page_cgroup *pc)
> >> +{
> >> +	pc->flags &= ~PAGE_CGROUP_FLAG_INUSE;
> >>  }
> >>  
> >>  static void __mem_cgroup_remove_list(struct mem_cgroup_per_zone *mz,
> >> @@ -377,17 +443,15 @@ void mem_cgroup_move_lists(struct page *
> >>  	 * safely get to page_cgroup without it, so just try_lock it:
> >>  	 * mem_cgroup_isolate_pages allows for page left on wrong list.
> >>  	 */
> >> -	if (!try_lock_page_cgroup(page))
> >> +	pc = page_get_page_cgroup_trylock(page);
> >> +	if (!pc)
> >>  		return;
> >>  
> >> -	pc = page_get_page_cgroup(page);
> >> -	if (pc) {
> >> -		mz = page_cgroup_zoneinfo(pc);
> >> -		spin_lock_irqsave(&mz->lru_lock, flags);
> >> -		__mem_cgroup_move_lists(pc, lru);
> >> -		spin_unlock_irqrestore(&mz->lru_lock, flags);
> >> -	}
> >> -	unlock_page_cgroup(page);
> >> +	mz = page_cgroup_zoneinfo(pc);
> >> +	spin_lock_irqsave(&mz->lru_lock, flags);
> >> +	__mem_cgroup_move_lists(pc, lru);
> >> +	spin_unlock_irqrestore(&mz->lru_lock, flags);
> >> +	unlock_page_cgroup(pc);
> > 
> > This lock/unlock_page_cgroup is protecting against what?
> > 
> 
> We use page_cgroup_zoneinfo(pc); we want to make sure pc does not disappear or
> change from underneath us.
> 
> >>  }
> >>  
> >>  /*
> >> @@ -521,10 +585,6 @@ static int mem_cgroup_charge_common(stru
> >>  	unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> >>  	struct mem_cgroup_per_zone *mz;
> >>  
> >> -	pc = kmem_cache_alloc(page_cgroup_cache, gfp_mask);
> >> -	if (unlikely(pc == NULL))
> >> -		goto err;
> >> -
> >>  	/*
> >>  	 * We always charge the cgroup the mm_struct belongs to.
> >>  	 * The mm_struct's mem_cgroup changes on task migration if the
> >> @@ -567,43 +627,40 @@ static int mem_cgroup_charge_common(stru
> >>  		}
> >>  	}
> >>  
> >> +	pc = page_get_page_cgroup_locked(page);
> >> +	if (pc->flags & PAGE_CGROUP_FLAG_INUSE) {
> >> +		unlock_page_cgroup(pc);
> >> +		res_counter_uncharge(&mem->res, PAGE_SIZE);
> >> +		css_put(&mem->css);
> >> +		goto done;
> >> +	}
> >> +
> > Can this happen? Our direction should be
> > VM_BUG_ON(pc->flags & PAGE_CGROUP_FLAG_INUSE)
> > 
> 
> Yes, it can... several tasks may try to map the same page at once. Can't we
> race doing that?
> 
I'll dig into this. My version (lockless) already removed this check and uses VM_BUG_ON().
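
The window being discussed, roughly:

	cpu0: fault on page             cpu1: fault on the same page
	charge_common()                 charge_common()
	res_counter charge ok           res_counter charge ok
	lock_page_cgroup(pc)            lock_page_cgroup(pc) -> spins
	sees !INUSE, sets INUSE         ...
	unlock_page_cgroup(pc)          sees INUSE -> uncharge, goto done

With the lock held across the test-and-set, the loser backs out safely.
A lockless version has to guarantee each page is charged at most once
before VM_BUG_ON() can replace the check.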

> > 
> > 
> >>  	pc->mem_cgroup = mem;
> >>  	pc->page = page;
> >> +	pc->flags |= PAGE_CGROUP_FLAG_INUSE;
> >> +
> >>  	/*
> >>  	 * If a page is accounted as a page cache, insert to inactive list.
> >>  	 * If anon, insert to active list.
> >>  	 */
> >>  	if (ctype == MEM_CGROUP_CHARGE_TYPE_CACHE) {
> >> -		pc->flags = PAGE_CGROUP_FLAG_CACHE;
> >> +		pc->flags |= PAGE_CGROUP_FLAG_CACHE;
> >>  		if (page_is_file_cache(page))
> >>  			pc->flags |= PAGE_CGROUP_FLAG_FILE;
> >>  		else
> >>  			pc->flags |= PAGE_CGROUP_FLAG_ACTIVE;
> >>  	} else
> >> -		pc->flags = PAGE_CGROUP_FLAG_ACTIVE;
> >> -
> >> -	lock_page_cgroup(page);
> >> -	if (unlikely(page_get_page_cgroup(page))) {
> >> -		unlock_page_cgroup(page);
> >> -		res_counter_uncharge(&mem->res, PAGE_SIZE);
> >> -		css_put(&mem->css);
> >> -		kmem_cache_free(page_cgroup_cache, pc);
> >> -		goto done;
> >> -	}
> >> -	page_assign_page_cgroup(page, pc);
> >> +		pc->flags |= PAGE_CGROUP_FLAG_ACTIVE;
> >>  
> >>  	mz = page_cgroup_zoneinfo(pc);
> >>  	spin_lock_irqsave(&mz->lru_lock, flags);
> >>  	__mem_cgroup_add_list(mz, pc);
> >>  	spin_unlock_irqrestore(&mz->lru_lock, flags);
> >> -
> >> -	unlock_page_cgroup(page);
> >> +	unlock_page_cgroup(pc);
> > 
> > What kind of race is this lock/unlock_page_cgroup for?
> 
> For setting pc->flags, pc->page, and pc->mem_cgroup.
> 

Hmm... there is some confusion, maybe.

The page_cgroup is now 1:1 with struct page. Then, we can guarantee that

- There is no race between charge vs. uncharge.

The only problem is force_empty. (But it's difficult...)

This means pc->mem_cgroup is safe here.
And pc->flags should be atomic flags, anyway. I believe we will have to record
a "Dirty bit" there, among other things, later.

Thanks,
-Kame

