From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
To: balbir@linux.vnet.ibm.com
Cc: Nick Piggin <nickpiggin@yahoo.com.au>,
Andrew Morton <akpm@linux-foundation.org>,
hugh@veritas.com, menage@google.com, xemul@openvz.org,
linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [Approach #2] [RFC][PATCH] Remove cgroup member from struct page
Date: Wed, 10 Sep 2008 11:35:46 +0900
Message-ID: <20080910113546.7e5b2fe8.kamezawa.hiroyu@jp.fujitsu.com>
In-Reply-To: <48C72CBD.6040602@linux.vnet.ibm.com>
On Tue, 09 Sep 2008 19:11:09 -0700
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > 1. This is nonsense...do you know the memory map of IBM's (maybe ppc) machines?
> > A node's memory is split into several pieces and not ordered by node number.
> > example)
> > Node 0 | Node 1 | Node 2 | Node 1 | Node 2 |
> >
> > This seems special, but when I helped with SPARSEMEM and MEMORY_HOTPLUG,
> > I saw many kinds of memory maps. As you wrote, this should be re-designed.
> >
>
> Thanks, so that means that we cannot before hand predict the size of pcg_map[n],
> we'll need to do an incremental addition to pcg_map?
Or allocate one chunk of page_cgroup for each chunk of contiguous pages.
(This is the reason I mentioned SPARSEMEM.)
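
As a rough illustration of the chunked idea (a userspace sketch, not code
from this patch or from SPARSEMEM itself): the table is indexed by pfn chunk
rather than by node, so interleaved node ranges and holes need no size
prediction. All names and sizes here (PCG_SECTION_SHIFT, pcg_sections,
pcg_section_add, pcg_lookup, struct page_cgroup_sketch) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical names; sizes chosen for illustration only. */
#define PCG_SECTION_SHIFT 10UL                     /* 1024 pages per chunk */
#define PCG_PAGES_PER_SECTION (1UL << PCG_SECTION_SHIFT)
#define PCG_MAX_SECTIONS 64UL

struct page_cgroup_sketch {
	unsigned long flags;
	unsigned long pfn;
};

/* One slot per chunk of contiguous pfns; NULL means a hole or not yet added. */
static struct page_cgroup_sketch *pcg_sections[PCG_MAX_SECTIONS];

/* Allocate the page_cgroup array for the chunk that contains pfn. */
static int pcg_section_add(unsigned long pfn)
{
	unsigned long sec = pfn >> PCG_SECTION_SHIFT;
	unsigned long i;
	struct page_cgroup_sketch *base;

	if (sec >= PCG_MAX_SECTIONS)
		return -1;
	if (pcg_sections[sec])
		return 0;	/* chunk already present */

	base = calloc(PCG_PAGES_PER_SECTION, sizeof(*base));
	if (!base)
		return -1;
	for (i = 0; i < PCG_PAGES_PER_SECTION; i++)
		base[i].pfn = (sec << PCG_SECTION_SHIFT) + i;
	pcg_sections[sec] = base;
	return 0;
}

/* pfn -> page_cgroup, independent of which node owns the chunk. */
static struct page_cgroup_sketch *pcg_lookup(unsigned long pfn)
{
	unsigned long sec = pfn >> PCG_SECTION_SHIFT;

	if (sec >= PCG_MAX_SECTIONS || !pcg_sections[sec])
		return NULL;
	return &pcg_sections[sec][pfn & (PCG_PAGES_PER_SECTION - 1)];
}
```

Only chunks that actually hold memory cost anything, and a hot-added range
would just call pcg_section_add() for its pfns.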
>
> > 2. If pre-allocating all is ok, I stop my work. Mine is of-no-use.
>
> One of the goals of this patch is refinement, it is a starting piece, something
> I shared very early. I am not asking you to stop your work. While I think
> pre-allocating is not the best way to do this, the trade off is the sparseness
> of the machine. I don't mind doing it in other ways, but we'll still need to do
> some batch'ed preallocation (of a smaller size maybe).
>
Hmm, maybe clarifying the trade-offs and comparing them is the first step.
I'll post my idea if one comes to me.
> > But you have to know that by pre-allocating, we can't use avoid-lru-lock
> > by batch like page_vec technique. We can't delay uncharge because a page
> > can be reused soon.
> >
> >
>
> Care to elaborate on this? Why not? If the page is reused, we act on the batch
> and sync it up
>
And touch the vec on another cpu? The reason "vec" is fast is that it's per-cpu.
If we want to delay uncharge, we'll have to make the page_cgroup unused and
off the LRU when its page is added to the free queue.
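
To make the cost concrete, here is a small userspace sketch (not kernel code;
the per-cpu arrays are simulated with a plain array indexed by a cpu number,
and every name such as uncharge_vec, vec_add and vec_pending_elsewhere is
made up for illustration). Queuing and flushing touch only the local vec, but
finding out whether a reused page is still pending means walking the other
cpus' vecs.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_NR_CPUS  2
#define SKETCH_VEC_SIZE 4

/* Simulated per-cpu batch of delayed uncharges. */
struct uncharge_vec {
	int nr;
	unsigned long pfn[SKETCH_VEC_SIZE];
};

static struct uncharge_vec vecs[SKETCH_NR_CPUS];
static int lru_lock_taken;	/* counts simulated lru_lock acquisitions */
static int uncharged;		/* counts pages actually uncharged */

/* Drain one cpu's vec under a single simulated lru_lock hold. */
static void vec_flush(int cpu)
{
	struct uncharge_vec *v = &vecs[cpu];

	if (!v->nr)
		return;
	lru_lock_taken++;
	uncharged += v->nr;
	v->nr = 0;
}

/* Queue a delayed uncharge on the local cpu; flush only when full. */
static void vec_add(int cpu, unsigned long pfn)
{
	struct uncharge_vec *v = &vecs[cpu];

	v->pfn[v->nr++] = pfn;
	if (v->nr == SKETCH_VEC_SIZE)
		vec_flush(cpu);
}

/* The expensive case: is pfn still pending in some other cpu's vec? */
static bool vec_pending_elsewhere(int cpu, unsigned long pfn)
{
	int c, i;

	for (c = 0; c < SKETCH_NR_CPUS; c++) {
		if (c == cpu)
			continue;
		for (i = 0; i < vecs[c].nr; i++)
			if (vecs[c].pfn[i] == pfn)
				return true;
	}
	return false;
}
```

In a real kernel the cross-cpu walk would also need synchronization against
the owning cpu, which is exactly what the per-cpu design avoids.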
> >
> >
> >> + pcg_map[n] = alloc_bootmem_node(pgdat, size);
> >> + /*
> >> + * We can do smoother recovery
> >> + */
> >> + BUG_ON(!pcg_map[n]);
> >> + return 0;
> >> }
> >>
> >> -static int try_lock_page_cgroup(struct page *page)
> >> +void page_cgroup_init(int nid, unsigned long pfn)
> >> {
> >> - return bit_spin_trylock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
> >> + unsigned long node_pfn;
> >> + struct page_cgroup *pc;
> >> +
> >> + if (mem_cgroup_subsys.disabled)
> >> + return;
> >> +
> >> + node_pfn = pfn - NODE_DATA(nid)->node_start_pfn;
> >> + pc = &pcg_map[nid][node_pfn];
> >> +
> >> + BUG_ON(!pc);
> >> + pc->flags = PAGE_CGROUP_FLAG_VALID;
> >> + INIT_LIST_HEAD(&pc->lru);
> >> + pc->page = NULL;
> > > This NULL is unnecessary. pc->page = pfn_to_page(pfn) always.
> >
>
> OK
>
> >
> >> + pc->mem_cgroup = NULL;
> >> }
> >>
> >> -static void unlock_page_cgroup(struct page *page)
> >> +struct page_cgroup *__page_get_page_cgroup(struct page *page, bool lock,
> >> + bool trylock)
> >> {
> >> - bit_spin_unlock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
> >> + struct page_cgroup *pc;
> >> + int ret;
> >> + int node = page_to_nid(page);
> >> + unsigned long pfn;
> >> +
> >> + pfn = page_to_pfn(page) - NODE_DATA(node)->node_start_pfn;
> >> + pc = &pcg_map[node][pfn];
> >> + BUG_ON(!(pc->flags & PAGE_CGROUP_FLAG_VALID));
> >> + if (lock)
> >> + lock_page_cgroup(pc);
> >> + else if (trylock) {
> >> + ret = trylock_page_cgroup(pc);
> >> + if (!ret)
> >> + pc = NULL;
> >> + }
> >> +
> >> + return pc;
> >> +}
> >> +
> >> +/*
> >> + * Should be called with page_cgroup lock held. Any additions to pc->flags
> >> + * should be reflected here. This might seem ugly, refine it later.
> >> + */
> >> +void page_clear_page_cgroup(struct page_cgroup *pc)
> >> +{
> >> + pc->flags &= ~PAGE_CGROUP_FLAG_INUSE;
> >> }
> >>
> >> static void __mem_cgroup_remove_list(struct mem_cgroup_per_zone *mz,
> >> @@ -377,17 +443,15 @@ void mem_cgroup_move_lists(struct page *
> >> * safely get to page_cgroup without it, so just try_lock it:
> >> * mem_cgroup_isolate_pages allows for page left on wrong list.
> >> */
> >> - if (!try_lock_page_cgroup(page))
> >> + pc = page_get_page_cgroup_trylock(page);
> >> + if (!pc)
> >> return;
> >>
> >> - pc = page_get_page_cgroup(page);
> >> - if (pc) {
> >> - mz = page_cgroup_zoneinfo(pc);
> >> - spin_lock_irqsave(&mz->lru_lock, flags);
> >> - __mem_cgroup_move_lists(pc, lru);
> >> - spin_unlock_irqrestore(&mz->lru_lock, flags);
> >> - }
> >> - unlock_page_cgroup(page);
> >> + mz = page_cgroup_zoneinfo(pc);
> >> + spin_lock_irqsave(&mz->lru_lock, flags);
> >> + __mem_cgroup_move_lists(pc, lru);
> >> + spin_unlock_irqrestore(&mz->lru_lock, flags);
> >> + unlock_page_cgroup(pc);
> >
> > What does this lock/unlock_page_cgroup protect against?
> >
>
> We use page_cgroup_zoneinfo(pc), we want to make sure pc does not disappear or
> change from underneath us.
>
> >> }
> >>
> >> /*
> >> @@ -521,10 +585,6 @@ static int mem_cgroup_charge_common(stru
> >> unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> >> struct mem_cgroup_per_zone *mz;
> >>
> >> - pc = kmem_cache_alloc(page_cgroup_cache, gfp_mask);
> >> - if (unlikely(pc == NULL))
> >> - goto err;
> >> -
> >> /*
> >> * We always charge the cgroup the mm_struct belongs to.
> >> * The mm_struct's mem_cgroup changes on task migration if the
> >> @@ -567,43 +627,40 @@ static int mem_cgroup_charge_common(stru
> >> }
> >> }
> >>
> >> + pc = page_get_page_cgroup_locked(page);
> >> + if (pc->flags & PAGE_CGROUP_FLAG_INUSE) {
> >> + unlock_page_cgroup(pc);
> >> + res_counter_uncharge(&mem->res, PAGE_SIZE);
> >> + css_put(&mem->css);
> >> + goto done;
> >> + }
> >> +
> > Can this happen ? Our direction should be
> > VM_BUG_ON(pc->flags & PAGE_CGROUP_FLAG_INUSE)
> >
>
> Yes, it can.. several people trying to map the same page at once. Can't we race
> doing that?
>
I'll dig into this. My (lockless) version already removed this check and uses VM_BUG_ON().
> >
> >
> >> pc->mem_cgroup = mem;
> >> pc->page = page;
> >> + pc->flags |= PAGE_CGROUP_FLAG_INUSE;
> >> +
> >> /*
> >> * If a page is accounted as a page cache, insert to inactive list.
> >> * If anon, insert to active list.
> >> */
> >> if (ctype == MEM_CGROUP_CHARGE_TYPE_CACHE) {
> >> - pc->flags = PAGE_CGROUP_FLAG_CACHE;
> >> + pc->flags |= PAGE_CGROUP_FLAG_CACHE;
> >> if (page_is_file_cache(page))
> >> pc->flags |= PAGE_CGROUP_FLAG_FILE;
> >> else
> >> pc->flags |= PAGE_CGROUP_FLAG_ACTIVE;
> >> } else
> >> - pc->flags = PAGE_CGROUP_FLAG_ACTIVE;
> >> -
> >> - lock_page_cgroup(page);
> >> - if (unlikely(page_get_page_cgroup(page))) {
> >> - unlock_page_cgroup(page);
> >> - res_counter_uncharge(&mem->res, PAGE_SIZE);
> >> - css_put(&mem->css);
> >> - kmem_cache_free(page_cgroup_cache, pc);
> >> - goto done;
> >> - }
> >> - page_assign_page_cgroup(page, pc);
> >> + pc->flags |= PAGE_CGROUP_FLAG_ACTIVE;
> >>
> >> mz = page_cgroup_zoneinfo(pc);
> >> spin_lock_irqsave(&mz->lru_lock, flags);
> >> __mem_cgroup_add_list(mz, pc);
> >> spin_unlock_irqrestore(&mz->lru_lock, flags);
> >> -
> >> - unlock_page_cgroup(page);
> >> + unlock_page_cgroup(pc);
> >
> > What kind of race is this lock/unlock_page_cgroup for?
>
> for setting pc->flags and for setting pc->page and pc->mem_cgroup.
>
Hmm...there may be some confusion.
The page_cgroup is now 1:1 with struct page, so we can guarantee that
- There is no race between charge and uncharge.
The only problem is force_empty. (But that one is difficult..)
This means pc->mem_cgroup is safe here.
And pc->flags should be atomic flags anyway. I believe we will have to record
a "Dirty bit" and so on later.
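
A minimal sketch of what atomic flags could look like, written with C11
<stdatomic.h> in userspace rather than the kernel's own bitops; the names
(struct pcg_sketch, PCG_SKETCH_INUSE, pcg_try_claim and friends) are
hypothetical and not from either patch. The point is only that an atomic
read-modify-write lets one caller win the INUSE bit without a lock around
pc->flags.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bits, for illustration only. */
#define PCG_SKETCH_INUSE (1UL << 0)
#define PCG_SKETCH_CACHE (1UL << 1)

struct pcg_sketch {
	atomic_ulong flags;
};

/* Atomically set INUSE; true only for the caller that actually set it. */
static bool pcg_try_claim(struct pcg_sketch *pc)
{
	unsigned long old = atomic_fetch_or(&pc->flags, PCG_SKETCH_INUSE);

	return !(old & PCG_SKETCH_INUSE);
}

/* Set an additional bit without disturbing the others. */
static void pcg_set_flag(struct pcg_sketch *pc, unsigned long bit)
{
	atomic_fetch_or(&pc->flags, bit);
}

/* Clear INUSE, leaving every other bit untouched. */
static void pcg_release(struct pcg_sketch *pc)
{
	atomic_fetch_and(&pc->flags, ~PCG_SKETCH_INUSE);
}

static bool pcg_test_flag(struct pcg_sketch *pc, unsigned long bit)
{
	return (atomic_load(&pc->flags) & bit) != 0;
}
```

With something like this, a second charge of an already-claimed page shows up
as a failed pcg_try_claim(), which is where an assertion could go if that
case is really impossible.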
Thanks,
-Kame