From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx112.postini.com [74.125.245.112]) by kanga.kvack.org (Postfix) with SMTP id 548D26B024C for ; Thu, 2 May 2013 00:57:52 -0400 (EDT) Received: by mail-qc0-f170.google.com with SMTP id i13so107935qcs.1 for ; Wed, 01 May 2013 21:57:51 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <20130501191033.GG1229@cmpxchg.org> References: <516B9B57.6050308@redhat.com> <20130416075047.GA4184@osiris> <1638103518.2400447.1366266465689.JavaMail.root@redhat.com> <20130418071303.GB4203@osiris> <20130424104255.GC4350@osiris> <20130424131851.GC31960@dhcp22.suse.cz> <20130424152043.GP2018@cmpxchg.org> <20130430172711.GE1229@cmpxchg.org> <20130501191033.GG1229@cmpxchg.org> Date: Wed, 1 May 2013 21:57:50 -0700 Message-ID: Subject: Re: [v3.9-rc8]: kernel BUG at mm/memcontrol.c:3994! (was: Re: [BUG][s390x] mm: system crashed) From: Hugh Dickins Content-Type: multipart/alternative; boundary=e89a8f64753f971e9e04dbb5131a Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: Michal Hocko , Heiko Carstens , Zhouping Liu , Andrew Morton , Glauber Costa , linux-mm , LKML , caiqian , Caspar Zhang , Martin Schwidefsky , Lingzhu Xiang --e89a8f64753f971e9e04dbb5131a Content-Type: text/plain; charset=UTF-8 On Wed, May 1, 2013 at 12:10 PM, Johannes Weiner wrote: > On Wed, May 01, 2013 at 08:28:30AM -0700, Hugh Dickins wrote: > > On Tue, 30 Apr 2013, Johannes Weiner wrote: > > > On Wed, Apr 24, 2013 at 08:50:01PM -0700, Hugh Dickins wrote: > > > > On Wed, 24 Apr 2013, Johannes Weiner wrote: > > > > > On Wed, Apr 24, 2013 at 03:18:51PM +0200, Michal Hocko wrote: > > > > > > On Wed 24-04-13 12:42:55, Heiko Carstens wrote: > > > > > > > On Thu, Apr 18, 2013 at 09:13:03AM +0200, Heiko Carstens wrote: > > > > > > > > > > > > > > [ 48.347963] ------------[ cut here ]------------ > > > > > > > [ 48.347972] kernel BUG at mm/memcontrol.c:3994! > > > > > > > __mem_cgroup_uncharge_common() triggers: > > > > > > > > > > > > > > [...] > > > > > > > if (mem_cgroup_disabled()) > > > > > > > return NULL; > > > > > > > > > > > > > > VM_BUG_ON(PageSwapCache(page)); > > > > > > > [...] > > > > > > > > I agree that the actual memcg uncharging should be okay, but the > memsw > > > > swap stats will go wrong (doesn't matter toooo much), and > mem_cgroup_put > > > > get missed (leaking a struct mem_cgroup). > > > > > > Ok, so I just went over this again. For the swapout path the memsw > > > uncharge is deferred, but if we "steal" this uncharge from the swap > > > code, we actually do uncharge memsw in mem_cgroup_do_uncharge(), so we > > > may prematurely unaccount the swap page, but we never leak a charge. > > > Good. > > > > > > Because of this stealing, we also don't do the following: > > > > > > if (do_swap_account && ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT) { > > > mem_cgroup_swap_statistics(memcg, true); > > > mem_cgroup_get(memcg); > > > } > > > > > > I.e. it does not matter that mem_cgroup_uncharge_swap() doesn't do the > > > put, we are also not doing the get. We should not leak references. > > > > > > So the only thing that I can see go wrong is that we may have a > > > swapped out page that is not charged to memsw and not accounted as > > > MEM_CGROUP_STAT_SWAP. But I don't know how likely that is, because we > > > check for PG_swapcache in this uncharge path after the last pte is > > > torn down, so even though the page is put on swap cache, it probably > > > won't be swapped. It would require that the PG_swapcache setting > > > would become visible only after the page has been added to the swap > > > cache AND rmap has established at least one swap pte for us to > > > uncharge a page that actually continues to be used. And that's a bit > > > of a stretch, I think. > > > > Sorry, our minds seem to work in different ways, > > I understood very little of what you wrote above :-( > > > > But once I try to disprove you with a counter-example, I seem to > > arrive at the same conclusion as you have (well, I haven't quite > > arrived there yet, but cannot give it any more time). > > I might be losing my mind. But since you are reaching the same > conclusion, and I see the same mental milestones in your thought > process described below, it's more likely that I suck at describing my > train of thought coherently. Or the third possibility: we're both > losing it! > > > Looking at it from my point of view, I concentrate on the racy > > if (PageSwapCache(page)) > > return; > > __mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_ANON, > false); > > in mem_cgroup_uncharge_page(). > > > > Now, that may or may not catch the case where last reference to page > > is unmapped at the same time as the page is added to swap: but being > > a MEM_CGROUP_CHARGE_TYPE_ANON call, it does not interfere with the > > memsw stats and get/put at all, those remain in balance. > > Yes, exactly. > > > And mem_cgroup_uncharge_swap() has all along been prepared to get > > a zero id from swap_cgroup_record(), if a SwapCache page should be > > uncharged when it was never quite charged as such. > > > > Yes, we may occasionally fail to charge a SwapCache page as such > > if its final unmap from userspace races with its being added to swap; > > but it's heading towards swap_writepage()'s try_to_free_swap() anyway, > > so I don't think that's anything to worry about. > > Agreed as well. If there are no pte references to the swap slot, it > will be freed either way. I didn't even think of the > try_to_free_swap() in the writeout call, but was looking at the > __remove_mapping later on in reclaim that will do a swapcache_free(). > > The only case I was worried about is the following: > > #0 #1 > page_remove_rmap() shrink_page_list() > if --page->mapcount == 0: add_to_swap() > mem_cgroup_uncharge_page() __add_to_swap_cache() > if PageSwapCache: SetPageSwapCache() > return try_to_unmap() > __mem_cgroup_uncharge_common() for each pte: > install swp_entry_t > page->mapcount-- > Thanks for spelling it out for me in more detail, this time I think I do grasp your concern. > > Looking at #1, I don't see anything that would force concurrent > threads to observe SetSwapCache ordered against the page->mapcount--. > My concern was that if those get reordered, #0 may see page->mapcount > == 1 AND !PageSwapcache, and then go ahead and uncharge the page while > there is actually a swp_entry_t pointing to it. The page will be a > proper long-term swap page without being charged as such. > But I don't see any problem with ordering here. #0 is using an atomic operation which returns a result on page->mapcount, so that amounts to (more than) an smp_rmb ensuring it reads mapcount before reading PageSwapCache flag. And in #1, there's at least an unlock of the radix_tree lock (after adding to swap tree) and a lock of the page table lock (before unmapping the page), and that pairing amounts to (more than) an smp_wmb. Hugh > > (If I had time to stop and read through that, I'd probably find it > > just as hard to understand as what you wrote!) > > > > > > > > Did I miss something? If not, I'll just send a patch that removes the > > > VM_BUG_ON() and adds a comment describing the scenarios and a note > > > that we may want to fix this in the future. > > > > I don't think you missed something. Yes, please just send Linus and > > Andrew a patch to remove the VM_BUG_ON() (with Cc stable tag), I now > > agree that's all that's really needed - thanks. > > Will do, thanks for taking them time to think through it again, even > after failing to decipher my ramblings... > > Johannes > --e89a8f64753f971e9e04dbb5131a Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable
On Wed, May 1, 2013 at 12:10 PM, Johannes Weiner <hannes= @cmpxchg.org> wrote:
On W= ed, May 01, 2013 at 08:28:30AM -0700, Hugh Dickins wrote:
> On Tue, 30 Apr 2013, Johannes Weiner wrote:
> > On Wed, Apr 24, 2013 at 08:50:01PM -0700, Hugh Dickins wrote:
> > > On Wed, 24 Apr 2013, Johannes Weiner wrote:
> > > > On Wed, Apr 24, 2013 at 03:18:51PM +0200, Michal Hocko = wrote:
> > > > > On Wed 24-04-13 12:42:55, Heiko Carstens wrote: > > > > > > On Thu, Apr 18, 2013 at 09:13:03AM +0200, Hei= ko Carstens wrote:
> > > > > >
> > > > > > [ =C2=A0 48.347963] ------------[ cut here ]-= -----------
> > > > > > [ =C2=A0 48.347972] kernel BUG at mm/memcontr= ol.c:3994!
> > > > > > __mem_cgroup_uncharge_common() triggers:
> > > > > >
> > > > > > [...]
> > > > > > =C2=A0 =C2=A0 =C2=A0 =C2=A0 if (mem_cgroup_di= sabled())
> > > > > > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 return NULL;
> > > > > >
> > > > > > =C2=A0 =C2=A0 =C2=A0 =C2=A0 VM_BUG_ON(PageSwa= pCache(page));
> > > > > > [...]
> > >
> > > I agree that the actual memcg uncharging should be okay, but= the memsw
> > > swap stats will go wrong (doesn't matter toooo much), an= d mem_cgroup_put
> > > get missed (leaking a struct mem_cgroup).
> >
> > Ok, so I just went over this again. =C2=A0For the swapout path th= e memsw
> > uncharge is deferred, but if we "steal" this uncharge f= rom the swap
> > code, we actually do uncharge memsw in mem_cgroup_do_uncharge(), = so we
> > may prematurely unaccount the swap page, but we never leak a char= ge.
> > Good.
> >
> > Because of this stealing, we also don't do the following:
> >
> > =C2=A0 =C2=A0 if (do_swap_account && ctype =3D=3D MEM_CGR= OUP_CHARGE_TYPE_SWAPOUT) {
> > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 mem_cgroup_swap_statist= ics(memcg, true);
> > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 mem_cgroup_get(memcg);<= br> > > =C2=A0 =C2=A0 }
> >
> > I.e. it does not matter that mem_cgroup_uncharge_swap() doesn'= ;t do the
> > put, we are also not doing the get. =C2=A0We should not leak refe= rences.
> >
> > So the only thing that I can see go wrong is that we may have a > > swapped out page that is not charged to memsw and not accounted a= s
> > MEM_CGROUP_STAT_SWAP. =C2=A0But I don't know how likely that = is, because we
> > check for PG_swapcache in this uncharge path after the last pte i= s
> > torn down, so even though the page is put on swap cache, it proba= bly
> > won't be swapped. =C2=A0It would require that the PG_swapcach= e setting
> > would become visible only after the page has been added to the sw= ap
> > cache AND rmap has established at least one swap pte for us to > > uncharge a page that actually continues to be used. =C2=A0And tha= t's a bit
> > of a stretch, I think.
>
> Sorry, our minds seem to work in different ways,
> I understood very little of what you wrote above :-(
>
> But once I try to disprove you with a counter-example, I seem to
> arrive at the same conclusion as you have (well, I haven't quite > arrived there yet, but cannot give it any more time).

I might be losing my mind. =C2=A0But since you are reaching the= same
conclusion, and I see the same mental milestones in your thought
process described below, it's more likely that I suck at describing my<= br> train of thought coherently. =C2=A0Or the third possibility: we're both=
losing it!

> Looking at it from my point of view, I concentrate on the racy
> =C2=A0 =C2=A0 =C2=A0 if (PageSwapCache(page))
> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 return;
> =C2=A0 =C2=A0 =C2=A0 __mem_cgroup_uncharge_common(page, MEM_CGROUP_CHA= RGE_TYPE_ANON, false);
> in mem_cgroup_uncharge_page().
>
> Now, that may or may not catch the case where last reference to page > is unmapped at the same time as the page is added to swap: but being > a MEM_CGROUP_CHARGE_TYPE_ANON call, it does not interfere with the
> memsw stats and get/put at all, those remain in balance.

Yes, exactly.

> And mem_cgroup_uncharge_swap() has all along been prepared to get
> a zero id from swap_cgroup_record(), if a SwapCache page should be
> uncharged when it was never quite charged as such.
>
> Yes, we may occasionally fail to charge a SwapCache page as such
> if its final unmap from userspace races with its being added to swap;<= br> > but it's heading towards swap_writepage()'s try_to_free_swap()= anyway,
> so I don't think that's anything to worry about.

Agreed as well. =C2=A0If there are no pte references to the swap slot= , it
will be freed either way. =C2=A0I didn't even think of the
try_to_free_swap() in the writeout call, but was looking at the
__remove_mapping later on in reclaim that will do a swapcache_free().

The only case I was worried about is the following:

#0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0#1
page_remove_rmap() =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0shrink_page_list()
=C2=A0 if --page->mapcount =3D=3D 0: =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 add_to_swap()
=C2=A0 =C2=A0 mem_cgroup_uncharge_page() =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0= =C2=A0 =C2=A0__add_to_swap_cache()
=C2=A0 =C2=A0 =C2=A0 if PageSwapCache: =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 SetPageSwapCache()
=C2=A0 =C2=A0 =C2=A0 =C2=A0 return =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0try_to_unmap() =C2=A0 =C2=A0 =C2=A0 __mem_cgroup_uncharge_common() =C2=A0 =C2=A0 =C2=A0 = =C2=A0for each pte:
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 install swp_entry_t
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 page->mapcount--

Thanks for spelling it out for me in more detail, this time I think I do g= rasp your concern.
=C2=A0

Looking at #1, I don't see anything that would force concurrent
threads to observe SetSwapCache ordered against the page->mapcount--. My concern was that if those get reordered, #0 may see page->mapcount =3D=3D 1 AND !PageSwapcache, and then go ahead and uncharge the page while<= br> there is actually a swp_entry_t pointing to it. =C2=A0The page will be a proper long-term swap page without being charged as such.
<= div>
But I don't see any problem with ordering here= . =C2=A0#0 is using an atomic operation which returns a result on page->= mapcount, so that amounts to (more than) an smp_rmb ensuring it reads mapco= unt before reading PageSwapCache flag. =C2=A0And in #1, there's at leas= t an unlock of the radix_tree lock (after adding to swap tree) and a lock o= f the page table lock (before unmapping the page), and that pairing amounts= to (more than) an smp_wmb.

Hugh


> (If I had time to stop and read through that, I'd probably find it=
> just as hard to understand as what you wrote!)
>
> >
> > Did I miss something? =C2=A0If not, I'll just send a patch th= at removes the
> > VM_BUG_ON() and adds a comment describing the scenarios and a not= e
> > that we may want to fix this in the future.
>
> I don't think you missed something. =C2=A0Yes, please just send Li= nus and
> Andrew a patch to remove the VM_BUG_ON() (with Cc stable tag), I now > agree that's all that's really needed - thanks.

Will do, thanks for taking them time to think through it again, even<= br> after failing to decipher my ramblings...

Johannes

--e89a8f64753f971e9e04dbb5131a-- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org