linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Balbir Singh <balbir@linux.vnet.ibm.com>
To: Hugh Dickins <hugh@veritas.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	YAMAMOTO Takashi <yamamoto@valinux.co.jp>,
	Paul Menage <menage@google.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Subject: Re: [0/2] memrlimit improve error handling
Date: Wed, 25 Jun 2008 06:31:05 +0530	[thread overview]
Message-ID: <486198D1.20500@linux.vnet.ibm.com> (raw)
In-Reply-To: <Pine.LNX.4.64.0806242240200.6804@blonde.site>

Hugh Dickins wrote:
> Hi Balbir,
> 
> Below I've a few comments on this particular patch, then report other
> problems I've seen with memrlimits configured into my 2.6.25-rc5-mm3
> kernel - I've not tried _using_ them.  My own opinion would be that
> -mm already contains enough breakage, we don't need memrlimits yet.
> 

Thanks for the review and feedback, I'll take a look and fix

> On Fri, 20 Jun 2008, Balbir Singh wrote:
>> memrlimit cgroup does not handle error cases after may_expand_vm(). This
>> BUG was reported by Kamezawa, with the test case below to reproduce it
>>
>> This patch adds better handling support to fix the reported problem.
>>
>> Reported-By: kamezawa.hiroyu@jp.fujitsu.com
>>
>> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
> 
> I wrote a similar patch when I saw Kame's report, because I was
> interested in getting that accounting right.  Comparing the two,
> a couple of notes on the details...
> 
> In mm/mmap.c:
>> @@ -1982,6 +1992,7 @@ unsigned long do_brk(unsigned long addr,
>>  	struct rb_node ** rb_link, * rb_parent;
>>  	pgoff_t pgoff = addr >> PAGE_SHIFT;
>>  	int error;
>> +	int ret = -ENOMEM;
> 
> I think we don't want int error returned from some parts of do_brk()
> and int ret returned from other parts: please use int error throughout.
> It would probably be nicer to add int error rather than int ret in
> acct_stack_growth() too, but that doesn't matter much.
> 

Fair enough, will fix

> In mm/mremap.c:
>> @@ -256,6 +257,7 @@ unsigned long do_mremap(unsigned long ad
>>  	struct vm_area_struct *vma;
>>  	unsigned long ret = -EINVAL;
>>  	unsigned long charged = 0;
>> +	int vm_expanded = 0;
>>  
>>  	if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE))
>>  		goto out;
>> @@ -349,6 +351,7 @@ unsigned long do_mremap(unsigned long ad
>>  		goto out;
>>  	}
>>  
>> +	vm_expanded = 1;
>>  	if (vma->vm_flags & VM_ACCOUNT) {
>>  		charged = (new_len - old_len) >> PAGE_SHIFT;
>>  		if (security_vm_enough_memory(charged))
>> @@ -411,6 +414,9 @@ out:
>>  	if (ret & ~PAGE_MASK)
>>  		vm_unacct_memory(charged);
>>  out_nc:
>> +	if (vm_expanded)
>> +		memrlimit_cgroup_uncharge_as(mm,
>> +				(new_len - old_len) >> PAGE_SHIFT);
>>  	return ret;
>>  }
> 
> See how vm_unacct_memory(charged) is only called if (ret & ~PAGE_MASK)?
> If ret is a valid address being returned, we do not want to uncharge.
> So I believe you need to do likewise with your uncharge_as().
> 
> And please handle them both in the same way: either follow the same
> "charged" style as is being used for vm_unacct_memory, rather than a
> boolean; or convert vm_unacct_memory over to use your boolean style:
> but it's unhelpful to have the two using different techniques.
> 

I did look at that and use that code, but we initialize ret = -EINVAL and then
we have the checks for

        if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE))
                goto out;

        if (addr & ~PAGE_MASK)
                goto out;

In out, we check for
	ret & ~PAGE_MASK

That's OK for vm_unacct_memory, since charged is set to 0, but not for
mem_cgroup_uncharge_as as we uncharge (new_len - old_len) >> PAGE_SHIFT);

I'll find a way to see if we can convert both over.

> In kernel/fork.c:
> nothing, but when I had a quick look there, again the error handling
> appeared to be broken e.g. if allocate_mm fails, where's the uncharge?
> But be careful: there's a particular point where enough of the new mm
> is set up that it gets torn down by normal exit_mmap.
> 

Yes, that needs fixing

> And while looking at copy_mm, do I see an extra down_write and
> up_write of mmap_sem, just to guard some memrlimit charging?

Yes, we need to prevent charging and migration race.

> Don't add overhead when memrlimits are CONFIGed off; but can't
> it be moved into dup_mm() where mmap_sem is already down_write?
> 

Yes, it can be moved there

> Please go through all your charges, again, to double check
> that you've got the uncharging right in the error cases.
> 

Will do


> I was interested in these because I find that after running kernel
> build on tmpfs swapping loads in a 700M memcg (but you may well hit
> other rc5-mm3 bugs if you try that, I've fixes to send out) for some

I've never tried that before

> hours on x86_64, my shutdown hits the kernel/res_counter.c:49
> res_counter_uncharge_locked() WARN_ON(counter->usage < val) several
> times, called from memrlimit_cgroup_uncharge_as called from exit_mmap.
> 
> I have no idea what the cause is; but I've not seen it on i386,
> and I'm not seeing it after shorter runs.  It could even be some
> error introduced by other patches in what I'm testing: so don't
> worry too much about it yet, but please keep an eye out for it.
> 

Sure, will do

> (If I'd thought harder, I'd have been less interested in the
> charge_as leaks: those WARN_ONs come when too much is being
> uncharged, not when too little.)
> 
> But the issue which worries me most, because it's mislocking
> which affects anyone with CONFIG_MM_OWNER=y, is seen when I
> have CONFIG_DEBUG_SPINLOCK_SLEEP=y: 
> 
>  BUG: sleeping function called from invalid context at kernel/rwsem.c:48
>  in_atomic():1, irqs_disabled():0
>  1 lock held by blogd/830:
>   #0:  (tasklist_lock){..--}, at: [<78125869>] mm_update_next_owner+0x1dd/0x1fa
>  Pid: 830, comm: blogd Not tainted 2.6.26-rc5-mm2 #2
>   [<78120ea2>] __might_sleep+0xed/0xf2
>   [<78367947>] down_write+0x13/0x43
>   [<781257b6>] mm_update_next_owner+0x12a/0x1fa
>   [<7812595a>] exit_mm+0xd4/0xe5
>   [<78125f9d>] do_exit+0x1e7/0x2bc
>   [<781260fb>] do_group_exit+0x61/0x8a
>   [<78126134>] sys_exit_group+0x10/0x13
>   [<78102dd9>] sysenter_past_esp+0x6a/0xa5
> 
> Your memrlimit-cgroup-mm-owner-callback-changes-to-add-task-info.patch
> thinks it can acquire mmap_sem while it has read_lock(&tasklist_lock).
> Sorry, no: you'll have to rework that somehow.
> 

Yes, that is nasty and a bad miss. It should be easy to fix, but will require
extensive testing.


> (In passing, I'll add that I'm not a great fan of these memrlimits:
> to me it's loony to be charging people for virtual address space,
> it's _virtual_, and process A can have as much as it likes without
> affecting process B in any way.  You're following the lead of RLIMIT_AS,
> but I've always thought RLIMIT_AS a lame attempt to move into the mmap
> decade, after RLIMIT_DATA and RLIMIT_STACK no longer made sense.
> 
> Taking Alan Cox's Committed_AS as a limited resource charged per mm makes
> much more sense to me: but yes, it's not perfect, and it is a lot harder
> to get its accounting right, and to maintain that down the line.  Okay,
> you've gone for the easier option of tracking total_vm, getting that
> right is a more achievable target.  And I accept that I may be too
> pessimistic about it: total_vm may often enough give a rough
> approximation to something else worth limiting.)

You seem to have read my mind, my motivation for memrlimits is

1. Administrators to set a limit and be sure that a cgroup cannot consume more
swap + RSS than the assigned virtual memory limit
2. It allows applications to fail gracefully or decide what parts to free up
to get more memory or change their allocation pattern (a scientific application
deciding what size of matrix to allocate for example).

Excellent feedback, thanks for the review!

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

      reply	other threads:[~2008-06-25  1:01 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-06-20 15:01 Balbir Singh
2008-06-20 15:01 ` [1/2] memrlimit handle attach_task() failure, add can_attach() callback Balbir Singh
2008-06-25 23:37   ` Andrew Morton
2008-06-25 23:52     ` Balbir Singh
2008-06-20 15:01 ` [2/2] memrlimit fix usage of tmp as a parameter name Balbir Singh
2008-06-25 23:41   ` Andrew Morton
2008-06-25 23:51     ` Balbir Singh
2008-06-24 23:11 ` [0/2] memrlimit improve error handling Hugh Dickins
2008-06-25  1:01   ` Balbir Singh [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=486198D1.20500@linux.vnet.ibm.com \
    --to=balbir@linux.vnet.ibm.com \
    --cc=akpm@linux-foundation.org \
    --cc=hugh@veritas.com \
    --cc=kamezawa.hiroyu@jp.fujitsu.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=menage@google.com \
    --cc=yamamoto@valinux.co.jp \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox