linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Feng Tang <feng.tang@intel.com>
To: Vlastimil Babka <vbabka@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	"Sang, Oliver" <oliver.sang@intel.com>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	"oe-lkp@lists.linux.dev" <oe-lkp@lists.linux.dev>,
	lkp <lkp@intel.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Jann Horn <jannh@google.com>,
	"Song, Youquan" <youquan.song@intel.com>,
	"Andrea Arcangeli" <aarcange@redhat.com>, Jan Kara <jack@suse.cz>,
	John Hubbard <jhubbard@nvidia.com>,
	"Kirill A . Shutemov" <kirill@shutemov.name>,
	"Matthew Wilcox" <willy@infradead.org>,
	Michal Hocko <mhocko@kernel.org>,
	Muchun Song <songmuchun@bytedance.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	Hyeonggon Yoo <42.hyeyoo@gmail.com>,
	"Yin, Fengwei" <fengwei.yin@intel.com>, <hongjiu.lu@intel.com>
Subject: Re: [linus:master] [hugetlb] 7118fc2906: kernel_BUG_at_lib/list_debug.c
Date: Wed, 18 Jan 2023 23:07:10 +0800	[thread overview]
Message-ID: <Y8gLHnUwnu5DkH0B@feng-clx> (raw)
In-Reply-To: <2b857e20-5e3a-13ec-a0b0-1f69d2d047a5@suse.cz>

On Wed, Jan 18, 2023 at 02:35:17PM +0100, Vlastimil Babka wrote:
> On 1/17/23 19:25, Linus Torvalds wrote:
> > On Tue, Jan 17, 2023 at 4:22 AM Feng Tang <feng.tang@intel.com> wrote:
> >>
> >> With the following patch to use 'O1' instead 'O2' gcc optoin for
> >> page_alloc.c, the list corruption issue can't be reproduced for
> >> commit 7118fc2906 in 1000 runs.
> > 
> > Ugh.
> > 
> > It would be lovely if you could just narrow it down with
> > 
> >   #pragma GCC optimize ("O1")
> >  ...
> >   #pragma GCC optimize ("O2")
> > 
> > around just that prep_compound_page(), but when I tried it myself I
> > get some function attribute mismatch errors.
> > 
> > 
> >> As is can't be reproduced with X86_64 build, it could be i386
> >> compiling related.
> > 
> > Your particular config causes a huge amount of nasty 64-bit arithmetic
> > according to the objdump code, with sequences like
> > 
> >   c13b3cbb:       83 05 d0 28 6c c5 01    addl   $0x1,0xc56c28d0
> >   c13b3cc2:       83 15 d4 28 6c c5 00    adcl   $0x0,0xc56c28d4
> > 
> > which seems to be just from some coverage profiling being on
> > (CONFIG_GCOV?), or something. It makes it very hard to read the code.
> 
> I think whatever this is, it's also responsible for the bug.
> Now from all the dumps I've seen so far, the struct page corruptions looked
> like prep_compound_page() is called more times than it should. Which didn't
> make sense as the code is very simple, a loop of 1 << order - 1 iterations.
> 
> If I look at Feng's disassembly of 48b8d744ea84 (the "good" parent commit),
> I can work out that edi holds the 1 << order, ebx is initialized as 1, and
> there's a nice clear "inc ebx, cmp ebx, edi, jne <loop>".
> 
> Now the disassembly of the "bad" commit is much harder to follow when it
> comes to the iteration control. There are xors and or's involved so I didn't
> even try to work out the meaning. Clearly it can't be deterministically
> wrong or it would crash always and fast. But crucially I think it uses
> memory addresses 0xc56c28f8 and 0xc56c28fc in a racy way and that could
> explain the bug.
> 
> Before the loop iteration itself, it initializes several registers from
> these addreses, and the values end up modifying also registers used for the
> loop iteration control (ebx/esi) And crucially in the loop iteration it also
> writes some values to the same addresses, and there seems to be no
> synchronization whatsoever. So it seems to me, even if these addresses are
> supposed to be "private" to prep_compound_page(), running two
> prep_compound_page() in parallel, or even due to interrupt (Hyeonggon did
> reproduce this with -smp 1 as well) can result in a race ultimately
> affecting the number of iterations taken.

Thanks for the decoding!

For the strange memory address 0xc56c28xx in the assembly, they are
GCOV related  memory for prep_compound_page(), as from the System.map:

	c56c28a0 b __gcov0.free_compound_page
	c56c28c0 b __gcov0.prep_compound_page
	c56c2918 b __gcov0.early_debug_pagealloc

And your analysis of racing of these region makes sense, that some
pointer could be changed when a race happens. From the memory dump
in previous thread, it was always one off issue, that page[2] for
order 1 and page[8] for order 3 were wrongly written.

Thanks,
Feng

> Attaching the disasm I annotated for more details. Everything register whose
> value is coming from 0xc56c28f8/0xc56c28fc in some way is annotated as
> "crap" there (sorry).
> 
> > You also have UBSAN enabled, which - again - makes for some really
> > grotty asm that hides any actual logic.
> > 
> > Finally, your objdump version also does some horrendous decoding, like
> > 
> >   c13b3e29:       8d b4 26 00 00 00 00    lea    0x0(%esi,%eiz,1),%esi
> > 
> > which is just a 7-byte 'nop' instruction, but again, it makes it
> > rather hard to actually read the code.
> > 
> > With the i386 defconfig, gcc generates a function that is just ~30
> > instructions for me, so this makes a huge difference in the legibility
> > of the code.
> > 
> > I wonder if you can recreate the issue with a much more
> > straightforward config. By all means, leave DEBUG_PAGEALLOC and SLUB
> > debugging on, but without the things like UBSAN and GCOV.
> > 
> >> I also objdumped 'prep_compound_page' for vmlinux of 7118fc2906 and
> >> its parent commit 48b8d744ea84, which have big difference than the
> >> simple 'set_page_count()' change, but I can't tell which part is
> >> abnormal, so attach them for further check.
> > 
> > Yeah, I can't make heads or tails of them either, see above on how
> > illegible the objdump files are. And that's despite not even having
> > all of prep_compound_page() in them (it's missing
> > prep_compound_page.cold, which is probably just UBSAN fixup code, but
> > who knows..)
> > 
> > That said, with the i386 defconfig, the only change from adding
> > set_page_count() to the loop seems to be exactly that:
> > 
> >  .L589:
> > -       movl    $1024, 12(%eax)
> > +       movl    $0, 28(%eax)
> >         addl    $32, %eax
> > +       movl    $1024, -20(%eax)
> >         movl    %esi, -28(%eax)
> >         movl    $0, -12(%eax)
> >         cmpl    %edx, %eax
> > 
> > (don't ask me why gcc does *one* access using the pre-incremented
> > pointer, and then the rest to the post-incremented ones, but whatever
> > - it means that it's not just "add a mov $0", it's also changing how
> > the
> > 
> >         p->mapping = TAIL_MAPPING;
> > 
> > instruction is done, which is that
> > 
> > -       movl    $1024, 12(%eax)
> > +       movl    $1024, -20(%eax)
> > 
> > part of the change)
> > 
> >              Linus

> inputs:
> eax: page
> edx: order
> 
> stack below ebp
> -0x04 saved edi
> -0x08 saved esi
> -0x0c saved ebp
> (sub    $0x14,%esp)
> -0x10  crap
> -0x14  crap
> -0x18  page|1 (compound_head value)
> -0x1c  copy of order
> -0x20  page
> -0x24
> esp
> 
> eax = page|1 (compound_head value) / crap
> edx = page / crap
> edi = page / crap
> ebx = 1 << order (=2) / 1 << order -2 (=0) / crap!
> ecx = page + 1 (first tail page)
> esi = 0 / crap
> 
> c13b3b90 <prep_compound_page>:
> c13b3b90:	55                   	push   %ebp
> c13b3b91:	89 e5                	mov    %esp,%ebp
> c13b3b93:	57                   	push   %edi
> c13b3b94:	89 c7                	mov    %eax,%edi                            edi = page
> c13b3b96:	56                   	push   %esi
> c13b3b97:	53                   	push   %ebx
> c13b3b98:	83 ec 14             	sub    $0x14,%esp
> c13b3b9b:	83 fa 1f             	cmp    $0x1f,%edx                           
> c13b3b9e:	89 55 e4             	mov    %edx,-0x1c(%ebp)                    
> c13b3ba1:	0f 87 33 31 ed 01    	ja     c3286cda <prep_compound_page.cold>   only when order > 31 (?) 
> c13b3ba7:	0f b6 4d e4          	movzbl -0x1c(%ebp),%ecx                     ecx = order
> c13b3bab:	bb 01 00 00 00       	mov    $0x1,%ebx                            ebx = 1
> c13b3bb0:	d3 e3                	shl    %cl,%ebx                             ebx = 1 << order  (=2)
> c13b3bb2:	83 3f ff             	cmpl   $0xffffffff,(%edi)
> c13b3bb5:	0f 84 65 02 00 00    	je     c13b3e20 <prep_compound_page+0x290>  only when page flags == 0xffffffff (?)
> c13b3bbb:	83 05 d0 28 6c c5 01 	addl   $0x1,0xc56c28d0
> c13b3bc2:	83 15 d4 28 6c c5 00 	adcl   $0x0,0xc56c28d4
> c13b3bc9:	0f ba 2f 10          	btsl   $0x10,(%edi)                         sets bit 0x10 (PG_head) in page flags
> c13b3bcd:	83 05 f0 28 6c c5 01 	addl   $0x1,0xc56c28f0
> c13b3bd4:	83 15 f4 28 6c c5 00 	adcl   $0x0,0xc56c28f4
> c13b3bdb:	83 fb 01             	cmp    $0x1,%ebx
> c13b3bde:	0f 8e 80 00 00 00    	jle    c13b3c64 <prep_compound_page+0xd4>   nr_pages <= 1 -> skip over the whole loop
> c13b3be4:	8d 47 01             	lea    0x1(%edi),%eax                       eax = page|1 (compound_head value)
> c13b3be7:	8b 15 fc 28 6c c5    	mov    0xc56c28fc,%edx                      edx now some crap, we don't know how this was initialized                                          !!!
> c13b3bed:	89 45 e8             	mov    %eax,-0x18(%ebp)                     saves the page|1                                                                                   
> c13b3bf0:	a1 f8 28 6c c5       	mov    0xc56c28f8,%eax                      eax now some crap, again we don't know how this was initialized                                    !!!
> c13b3bf5:	8d 4f 28             	lea    0x28(%edi),%ecx                      ecx = edi + 0x28 - first tail page
> c13b3bf8:	89 7d e0             	mov    %edi,-0x20(%ebp)                     saves page pointer
> c13b3bfb:	83 c0 01             	add    $0x1,%eax                            crap + 1 
> c13b3bfe:	89 45 ec             	mov    %eax,-0x14(%ebp)                     saves the crap
> c13b3c01:	83 d2 00             	adc    $0x0,%edx                            crap in crap out
> c13b3c04:	a1 f8 28 6c c5       	mov    0xc56c28f8,%eax                      reset as crap
> c13b3c09:	89 55 f0             	mov    %edx,-0x10(%ebp)                     save more crap
> c13b3c0c:	8b 15 fc 28 6c c5    	mov    0xc56c28fc,%edx                      reset as crap
> c13b3c12:	83 eb 02             	sub    $0x2,%ebx                            ebx is nr_pages - 2 (for order-1 was 2, now 0) - remaining tail pages after first iteration?
> c13b3c15:	31 f6                	xor    %esi,%esi                            esi is 0
> c13b3c17:	83 c0 02             	add    $0x2,%eax                            crap + 2
> c13b3c1a:	83 d2 00             	adc    $0x0,%edx                            crap
> c13b3c1d:	01 c3                	add    %eax,%ebx                            ebx is now modified by crap !!!                                                                    !!!
> c13b3c1f:	8b 45 ec             	mov    -0x14(%ebp),%eax                     reset as crap
> c13b3c22:	11 d6                	adc    %edx,%esi                            esi now also crap
> c13b3c24:	8b 55 f0             	mov    -0x10(%ebp),%edx                     reset as crap
> c13b3c27:	89 f7                	mov    %esi,%edi                            edi now also crap, oh my
> c13b3c29:	89 de                	mov    %ebx,%esi                            supposed to be the remaining tail pages, byt modified by crap
> c13b3c2b:	8d 74 26 00          	lea    0x0(%esi,%eiz,1),%esi                elaborate nop
> c13b3c2f:	90                   	nop                                         nop
> 
> // this should be the loop for prep_compound_tail
> c13b3c30:	a3 f8 28 6c c5       	mov    %eax,0xc56c28f8                      uh now we saved some value to the memory we read previously assuming some known good value?        !!!
> c13b3c35:	8b 5d e8             	mov    -0x18(%ebp),%ebx                     ebx = page|1 (compound_head value)
> c13b3c38:	83 c0 01             	add    $0x1,%eax                            still crap
> c13b3c3b:	89 15 fc 28 6c c5    	mov    %edx,0xc56c28fc                      and now we wrote to the other address as well?                                                     !!!
> c13b3c41:	83 d2 00             	adc    $0x0,%edx                            still crap
> c13b3c44:	83 c1 28             	add    $0x28,%ecx                           ecx pointed to first tail pages, now points to second tail page
> c13b3c47:	c7 41 e4 00 04 00 00 	movl   $0x400,-0x1c(%ecx)                   (first tail page)->mapping = TAIL_MAPPING;
> c13b3c4e:	89 59 dc             	mov    %ebx,-0x24(%ecx)                     (first tail page)->compound_head is now set
> c13b3c51:	89 fb                	mov    %edi,%ebx                            ebx now crap
> c13b3c53:	31 d3                	xor    %edx,%ebx                            ebx xored with more crap
> c13b3c55:	89 5d ec             	mov    %ebx,-0x14(%ebp)                     we might need that crap later
> c13b3c58:	89 f3                	mov    %esi,%ebx                            supposed to be the remaining tail pages, byt modified by crap
> c13b3c5a:	31 c3                	xor    %eax,%ebx                            yeah why not xor it with more crap
> c13b3c5c:	0b 5d ec             	or     -0x14(%ebp),%ebx                     and "or" it with the previously saved crap
> c13b3c5f:	75 cf                	jne    c13b3c30 <prep_compound_page+0xa0>   and that's how we decided if we should do another iteration :(
> 
> c13b3c61:	8b 7d e0             	mov    -0x20(%ebp),%edi                     restore edi = page
> 
> // here we land if we skip everything at c13b3bde
> c13b3c64:	c6 47 30 01          	movb   $0x1,0x30(%edi)
> // the rest shouldn't be important



  reply	other threads:[~2023-01-18 15:11 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-01-17  7:10 kernel test robot
2023-01-17  7:39 ` Vlastimil Babka
2023-01-17  7:47   ` Yin, Fengwei
2023-01-17  8:01   ` Feng Tang
2023-01-17 12:20     ` Feng Tang
2023-01-17 18:25       ` Linus Torvalds
2023-01-18  1:07         ` Feng Tang
2023-01-18 13:31         ` Feng Tang
2023-01-18 17:10           ` Linus Torvalds
2023-01-19  2:19             ` Feng Tang
2023-01-18 13:35         ` Vlastimil Babka
2023-01-18 15:07           ` Feng Tang [this message]
2023-01-17 18:50 ` Mike Kravetz

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Y8gLHnUwnu5DkH0B@feng-clx \
    --to=feng.tang@intel.com \
    --cc=42.hyeyoo@gmail.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=fengwei.yin@intel.com \
    --cc=hongjiu.lu@intel.com \
    --cc=jack@suse.cz \
    --cc=jannh@google.com \
    --cc=jhubbard@nvidia.com \
    --cc=kirill@shutemov.name \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lkp@intel.com \
    --cc=mhocko@kernel.org \
    --cc=mike.kravetz@oracle.com \
    --cc=oe-lkp@lists.linux.dev \
    --cc=oliver.sang@intel.com \
    --cc=songmuchun@bytedance.com \
    --cc=torvalds@linux-foundation.org \
    --cc=vbabka@suse.cz \
    --cc=willy@infradead.org \
    --cc=youquan.song@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox