From: Balbir Singh <balbir@linux.vnet.ibm.com>
To: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: akpm@linux-foundation.org, a.p.zijlstra@chello.nl, dev@sw.ru,
ebiederm@xmission.com, herbert@13thfloor.at, menage@google.com,
rientjes@google.com, svaidy@linux.vnet.ibm.com, xemul@openvz.org,
Linux Memory Management <linux-mm@kvack.org>
Subject: Re: + memory-controller-memory-accounting-v7.patch added to -mm tree
Date: Wed, 29 Aug 2007 13:45:03 +0530 [thread overview]
Message-ID: <46D52B07.6050809@linux.vnet.ibm.com> (raw)
In-Reply-To: <46D52030.9080605@yahoo.com.au>
Nick Piggin wrote:
> Balbir Singh wrote:
>> Nick Piggin wrote:
>
>>> Sure. And if all you intend is workload management, then that's probably
>>> fine. If you want guarantees, then its useless on its own.
>>>
>>
>>
>> Yes. Guarantees are hard to do with just this controller (due to
>> non-reclaimable
>> pages). It should be easy to add a mlock() controller on top of this one.
>> Pavel is planning to work on a kernel memory controller. With that in
>> place,
>> guarantees might be possible.
>
> So it doesn't sound like they are necessarily a requirement for linux
> kernel memory control -- that's fine, I just want to know where you
> stand with that.
>
>
>>> I don't mean to say it would be done on purpose, but without a coherent
>>> strategy then things might slowly just get added as people think they
>>> are
>>> needed. I'm fairly sure several people want to really guarantee memory
>>> resources in an untrusted environment, don't they? And that's obviously
>>> not going to scale by putting calls all throughout the kernel.
>>>
>>
>>
>> We sent out an RFC last year and got several comments from stake holders
>>
>> 1. Pavel Emelianov
>> 2. Paul Menage
>> 3. Vaidyanathan Srinivasan
>> 4. YAMAMOTO Takshi
>> 5. KAMEZAWA Hiroyuki
>>
>>
>> At the OLS resource management BoF, we had a broader participation and
>> several comments and suggestions.
>>
>> We've incorporated suggestions and comments from all stake holders
>
> :) I know everyone's really worked hard at this and is trying to do the
> right thing. But for example some of the non-container VM people may not
> know exactly what you are up to or planning (I don't). It can make things
> a bit harder to review IF there is an expectation that more functionality
> is going to be required (which is what my expectation is).
>
> Anyway, let's not continue this tangent. Instead, I'll try to be more
> constructive and ask for eg. your future plans if they are not clear.
>
Thanks, I am going to be present at the VM summit and we have a roadmap
as well. We can discuss it and we'll also send it out for comments.
Just FYI, we started with an RFC about a year ago and got several
comments from non-container folks. We'll try to discuss the roadmap
further and get in more inputs.
>
>>> But at this point you have already charged the container, and have put
>>> it in the page tables, if I read correctly. Nothing is going to fail
>>> at this point and the page could get uncharged when it is unmapped?
>>>
>>>
>>
>>
>> Here's the sequence of steps
>>
>> 1. Charge (we might need to block on reclaim)
>> 2. Check to see if there is a race in updating the PTE
>> 3. Add to the page table, update _mapcount under a lock
>>
>> Several times the error occurs in step 2, due to which we need to
>> uncharge
>> the memory container.
>
> Oh yeah, I understand all _those_ uncharging calls you do. But let me
> just quote the code again:
>
> diff -puN mm/rmap.c~memory-controller-memory-accounting-v7 mm/rmap.c
> --- a/mm/rmap.c~memory-controller-memory-accounting-v7
> +++ a/mm/rmap.c
> @@ -48,6 +48,7 @@
> #include <linux/rcupdate.h>
> #include <linux/module.h>
> #include <linux/kallsyms.h>
> +#include <linux/memcontrol.h>
>
> #include <asm/tlbflush.h>
>
> @@ -550,8 +551,14 @@ void page_add_anon_rmap(struct page *pag
> VM_BUG_ON(address < vma->vm_start || address >= vma->vm_end);
> if (atomic_inc_and_test(&page->_mapcount))
> __page_set_anon_rmap(page, vma, address);
> - else
> + else {
> __page_check_anon_rmap(page, vma, address);
> + /*
> + * We unconditionally charged during prepare, we uncharge here
> + * This takes care of balancing the reference counts
> + */
> + mem_container_uncharge_page(page);
> + }
> }
>
> At the point you are uncharging here, the pte has been updated and
> the page is in the pagetables of the process. I guess this uncharging
> works on the presumption that you are not the first container to map
> the page, but I thought that you already check for that in your
> accounting implementation.
>
> Now how does it take care of the refcounts? I guess it is because your
> rmap removal function also takes care of refcounts by only uncharging
> if mapcount has gone to zero... however that's polluting the VM with
> knowledge that your accounting scheme is a first-touch one, isn't it?
>
> Aside, I'm slightly suspicious of whether this is correct against
> mapcount races, but I didn't look closely yet. I remember you bringing
> that up with me, so I guess you've been careful there...
>
Very good review comment. Here's what we see
1. Page comes in through page cache, we increment the reference ount
2. Page comes into rmap, we increment the refcount again
3. We race in page_add.*rmap(), the problem we have is that for
the same page, for rmap(), step 2 would have taken place more than
once
4. That's why we uncharge
I think I need to add a big fat comment in the ref_cnt member. ref_cnt
is held once for the page in the page cache and once for the page mapped
anywhere in the page tables.
reference counting helps us correctly determine when to take the page
of the LRU (it moves from page tables to swap cache, but it's still
on the LRU).
>>> add_to_page_cache gets called with GFP_ATOMIC as well, and it gets
>>> called
>>> with GFP_NOFS for new pages.
>>>
>>> But I don't think you were suggesting that this isn't a problem, where
>>> you? Relying on implementation in the VM would signal more broken
>>> layering.
>>>
>>>
>>
>>
>> Good, thanks for spotting this. It should be possible to fix this case.
>> I'll work on a fix and send it out.
>
> OK.
>
>
>>> It would be so so much easier and cleaner for the VM if you did all the
>>> accounting in page alloc and freeing hooks, and just put the page
>>> on per-container LRUs when it goes on the regular LRU.
>>>
>>
>>
>> The advantage of the current approach is that not all page allocations
>> have
>> to worry about charges. Only user space pages (unmapped cache and
>> mapped RSS)
>> are accounted for.
>
> That could be true. OTOH, if you have a significant amount of non userspace
> page allocations happening, the chances are that your userspace-only
> accounting is going out the window too :)
>
> You'd still be able to avoid charging if a process isn't in a container
> or accounting is turned off, of course.
>
Yes, we need a mechanism to turn off the container at boot time. Currently
if the config is enabled, we charge :-)
>
>> We tried syncing the container and the global LRU. Pavel is working on
>> that approach. The problem with adding the page at the same time as the
>> regular LRU is that we end up calling the addition under the
>> zone->lru_lock.
>> Holding the entire zone's lru_lock for list addition might be an
>> overhead.
>> Although, we do the list isolation under the zone's lru lock.
>
> Well it may not have to be _exactly_ the same time as the page goes on the
> normal lru. You could try doing it in pagevec_lru_add, for example, after
> the lru locks have been released?
>
I'll play around with Pavel's patches or work with him to explore this
further.
>
>>> I'm still going to keep pushing for that approach until either someone
>>> explains why it can't be done or the current patch gets a fair bit
>>> cleaner. Has that already been tried and shown not to work? I would have
>>> thought so seeing as it would be the simplest patch, however I can't
>>> remember hearing about the actual problems with it.
>>>
>>
>>
>> I am all eyes and ears to patches/suggestions/improvements to the current
>> container.
>
> Well I have made a few. I'm actually not too interested in containers and
> resource control myself, but I suspect it may be a fair bit easier to play
> around with eg. my ideas for VM hooks with an implementation in -mm, so I
> might get motivated to try a patch...
>
Great! Thanks!
> Thanks,
> Nick
>
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2007-08-29 8:17 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <200708272119.l7RLJoOD028582@imap1.linux-foundation.org>
2007-08-28 6:35 ` Nick Piggin
2007-08-28 7:26 ` Balbir Singh
2007-08-28 9:29 ` Nick Piggin
2007-08-28 11:39 ` Balbir Singh
2007-08-29 7:28 ` Nick Piggin
2007-08-29 8:15 ` Balbir Singh [this message]
2007-08-30 7:39 ` Nick Piggin
2007-08-30 9:04 ` Balbir Singh
2007-08-31 0:35 ` Nick Piggin
2007-08-31 4:40 ` Balbir Singh
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=46D52B07.6050809@linux.vnet.ibm.com \
--to=balbir@linux.vnet.ibm.com \
--cc=a.p.zijlstra@chello.nl \
--cc=akpm@linux-foundation.org \
--cc=dev@sw.ru \
--cc=ebiederm@xmission.com \
--cc=herbert@13thfloor.at \
--cc=linux-mm@kvack.org \
--cc=menage@google.com \
--cc=nickpiggin@yahoo.com.au \
--cc=rientjes@google.com \
--cc=svaidy@linux.vnet.ibm.com \
--cc=xemul@openvz.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox