From: Zhaoyang Huang <huangzhaoyang@gmail.com>
Date: Thu, 11 Apr 2024 15:04:32 +0800
Subject: Re: [RFC PATCH] mm: move xa forward when run across zombie page
To: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox, "zhaoyang.huang" <zhaoyang.huang@unisoc.com>,
 Andrew Morton, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 steve.kang@unisoc.com, baocong.liu@unisoc.com,
 linux-fsdevel@vger.kernel.org, Brian Foster, Christoph Hellwig,
 David Hildenbrand
In-Reply-To: <20221101071721.GV2703033@dread.disaster.area>
References: <1665725448-31439-1-git-send-email-zhaoyang.huang@unisoc.com>
 <20221018223042.GJ2703033@dread.disaster.area>
 <20221019220424.GO2703033@dread.disaster.area>
 <20221101071721.GV2703033@dread.disaster.area>
On Tue, Nov 1, 2022 at 3:17 PM Dave Chinner wrote:
>
> On Thu, Oct 20, 2022 at 10:52:14PM +0100, Matthew Wilcox wrote:
> > On Thu, Oct 20, 2022 at 09:04:24AM +1100, Dave Chinner wrote:
> > > On Wed, Oct 19, 2022 at 04:23:10PM +0100, Matthew Wilcox wrote:
> > > > On Wed, Oct 19, 2022 at 09:30:42AM +1100, Dave Chinner wrote:
> > > > > This is reading and writing the same amount of file data at the
> > > > > application level, but once the data has been written and kicked out
> > > > > of the page cache it seems to require an awful lot more read IO to
> > > > > get it back to the application. i.e.
> > > > > this looks like mmap() is
> > > > > readahead thrashing severely, and eventually it livelocks with this
> > > > > sort of report:
> > > > >
> > > > > [175901.982484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > > > > [175901.985095] rcu:     Tasks blocked on level-1 rcu_node (CPUs 0-15): P25728
> > > > > [175901.987996]  (detected by 0, t=97399871 jiffies, g=15891025, q=1972622 ncpus=32)
> > > > > [175901.991698] task:test_write      state:R  running task     stack:12784 pid:25728 ppid: 25696 flags:0x00004002
> > > > > [175901.995614] Call Trace:
> > > > > [175901.996090]
> > > > > [175901.996594]  ? __schedule+0x301/0xa30
> > > > > [175901.997411]  ? sysvec_apic_timer_interrupt+0xb/0x90
> > > > > [175901.998513]  ? sysvec_apic_timer_interrupt+0xb/0x90
> > > > > [175901.999578]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> > > > > [175902.000714]  ? xas_start+0x53/0xc0
> > > > > [175902.001484]  ? xas_load+0x24/0xa0
> > > > > [175902.002208]  ? xas_load+0x5/0xa0
> > > > > [175902.002878]  ? __filemap_get_folio+0x87/0x340
> > > > > [175902.003823]  ? filemap_fault+0x139/0x8d0
> > > > > [175902.004693]  ? __do_fault+0x31/0x1d0
> > > > > [175902.005372]  ? __handle_mm_fault+0xda9/0x17d0
> > > > > [175902.006213]  ? handle_mm_fault+0xd0/0x2a0
> > > > > [175902.006998]  ? exc_page_fault+0x1d9/0x810
> > > > > [175902.007789]  ? asm_exc_page_fault+0x22/0x30
> > > > > [175902.008613]
> > > > >
> > > > > Given that filemap_fault on XFS is probably trying to map large
> > > > > folios, I do wonder if this is a result of some kind of race with
> > > > > teardown of a large folio...
> > > >
> > > > It doesn't matter whether we're trying to map a large folio; it
> > > > matters whether a large folio was previously created in the cache.
> > > > Through the magic of readahead, it may well have been. I suspect
> > > > it's not teardown of a large folio, but splitting.
> > > > Removing a
> > > > page from the page cache stores to the pointer in the XArray
> > > > first (either NULL or a shadow entry), then decrements the refcount.
> > > >
> > > > We must be observing a frozen folio. There are a number of places
> > > > in the MM which freeze a folio, but the obvious one is splitting.
> > > > That looks like this:
> > > >
> > > >         local_irq_disable();
> > > >         if (mapping) {
> > > >                 xas_lock(&xas);
> > > >                 (...)
> > > >                 if (folio_ref_freeze(folio, 1 + extra_pins)) {
> > >
> > > But the lookup is not doing anything to prevent the split on the
> > > frozen page from making progress, right? It's not holding any folio
> > > references, and it's not holding the mapping tree lock, either. So
> > > how does the lookup in progress prevent the page split from making
> > > progress?
> >
> > My thinking was that it keeps hammering the ->refcount field in
> > struct folio. That might prevent a thread on a different socket
> > from making forward progress. In contrast, spinlocks are designed
> > to be fair under contention, so by spinning on an actual lock, we'd
> > remove contention on the folio.
> >
> > But I think the tests you've done refute that theory. I'm all out of
> > ideas at the moment. Either we have a frozen folio from somebody who
> > doesn't hold the lock, or we have someone who's left a frozen folio in
> > the page cache. I'm leaning towards that explanation at the moment,
> > but I don't have a good suggestion for debugging.
>
> It's something else. I got gdb attached to qemu and single stepped
> the looping lookup.
> The context I caught this time is truncate after
> unlink:
>
> (gdb) bt
> #0  find_get_entry (mark=, max=, xas=) at mm/filemap.c:2014
> #1  find_lock_entries (mapping=mapping@entry=0xffff8882445e2118, start=start@entry=25089, end=end@entry=18446744073709551614,
>     fbatch=fbatch@entry=0xffffc900082a7dd8, indices=indices@entry=0xffffc900082a7d60) at mm/filemap.c:2095
> #2  0xffffffff8128f024 in truncate_inode_pages_range (mapping=mapping@entry=0xffff8882445e2118, lstart=lstart@entry=0, lend=lend@entry=-1)
>     at mm/truncate.c:364
> #3  0xffffffff8128f452 in truncate_inode_pages (lstart=0, mapping=0xffff8882445e2118) at mm/truncate.c:452
> #4  0xffffffff8136335d in evict (inode=inode@entry=0xffff8882445e1f78) at fs/inode.c:666
> #5  0xffffffff813636cc in iput_final (inode=0xffff8882445e1f78) at fs/inode.c:1747
> #6  0xffffffff81355b8b in do_unlinkat (dfd=dfd@entry=10, name=0xffff88834170e000) at fs/namei.c:4326
> #7  0xffffffff81355cc3 in __do_sys_unlinkat (flag=, pathname=, dfd=) at fs/namei.c:4362
> #8  __se_sys_unlinkat (flag=, pathname=, dfd=) at fs/namei.c:4355
> #9  __x64_sys_unlinkat (regs=) at fs/namei.c:4355
> #10 0xffffffff81e92e35 in do_syscall_x64 (nr=, regs=0xffffc900082a7f58) at arch/x86/entry/common.c:50
> #11 do_syscall_64 (regs=0xffffc900082a7f58, nr=) at arch/x86/entry/common.c:80
> #12 0xffffffff82000087 in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:120
> #13 0x0000000000000000 in ?? ()
>
> The find_lock_entries() call is being asked to start at index
> 25089, and we are spinning on a folio we find because
> folio_try_get_rcu(folio) is failing - the folio ref count is zero.
>
> The xas state on lookup is:
>
> (gdb) p *xas
> $6 = {xa = 0xffff8882445e2120, xa_index = 25092, xa_shift = 0 '\000',
>   xa_sibs = 0 '\000', xa_offset = 4 '\004', xa_pad = 0 '\000',
>   xa_node = 0xffff888144c15918, xa_alloc = 0x0, xa_update = 0x0, xa_lru = 0x0
>
> indicating that we are trying to look up index 25092 (3 pages
> further in than the start of the batch), and the folio that this
> keeps returning is this:
>
> (gdb) p *folio
> $7 = {{{flags = 24769796876795904, {lru = {next = 0xffffea0005690008,
>   prev = 0xffff88823ffd5f50}, {__filler = 0xffffea0005690008,
>   mlock_count = 1073569616}}, mapping = 0x0, index = 18688, private = 0x8,
>   _mapcount = {counter = -129}, _refcount = {counter = 0}, memcg_data = 0},
>   page = {flags = 24769796876795904, {{{lru = {next = 0xffffea0005690008,
>   prev = 0xffff88823ffd5f50}, {__filler = 0xffffea0005690008,
>   mlock_count = 1073569616}, buddy_list = {next = 0xffffea0005690008,
>   prev = 0xffff88823ffd5f50}, pcp_list = {next = 0xffffea0005690008,
>   prev = 0xffff88823ffd5f50}}, mapping = 0x0, index = 18688, private = 8},
>   {pp_magic = 18446719884544507912, pp = 0xffff88823ffd5f50,
>   _pp_mapping_pad = 0, dma_addr = 18688, {dma_addr_upper = 8,
>   pp_frag_count = {counter = 8}}}, {compound_head = 18446719884544507912,
>   compound_dtor = 80 'P', compound_order = 95 '_',
>   compound_mapcount = {counter = -30590}, compound_pincount = {counter = 0},
>   compound_nr = 0}, {_compound_pad_1 = 18446719884544507912,
>   _compound_pad_2 = 18446612691733536592, deferred_list = {next = 0x0,
>   prev = 0x4900}}, {_pt_pad_1 = 18446719884544507912,
>   pmd_huge_pte = 0xffff88823ffd5f50, _pt_pad_2 = 0, {pt_mm = 0x4900,
>   pt_frag_refcount = {counter = 18688}}, ptl = 0x8},
>   {pgmap = 0xffffea0005690008, zone_device_data = 0xffff88823ffd5f50},
>   callback_head = {next = 0xffffea0005690008, func = 0xffff88823ffd5f50}},
>   {_mapcount = {counter = -129}, page_type = 4294967167},
>   _refcount = {counter = 0}, memcg_data = 0}}, _flags_1 = 24769796876795904,
>   __head = 0, _folio_dtor = 3 '\003', _folio_order = 8 '\b',
>   _total_mapcount = {counter = -1}, _pincount = {counter = 0},
>   _folio_nr_pages = 0}
>
> The folio has a NULL mapping, and an index of 18688, which means
> even if it was not a folio that has been invalidated or freed, the
> index is way outside the range we are looking for.
>
> If I step it round the lookup loop, xas does not change, and the
> same folio is returned every time through the loop. Perhaps
> the mapping tree itself might be corrupt???
>
> It's simple enough to stop the machine once it has become stuck to
> observe the iteration and dump structures, just tell me what you
> need to know from here...
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

This bug has emerged again, and I would like to propose a reproduction
sequence for it that has nothing to do with the scheduler (this could
be wrong, and sorry for wasting your time if so):

Thread_isolate:
1. alloc_contig_range->isolate_migratepages_block isolates a number of
   pages onto cc->migratepages (the folio has refcount 1 + n, from
   alloc_pages plus the page cache)
2. alloc_contig_range->migrate_pages->folio_ref_freeze(folio, 1 + extra_pins)
   sets the folio's refcount to 0
3. alloc_contig_range->migrate_pages->xas_split splits the folio into
   each slot, from slot[offset] to slot[offset + sibs]

Thread_truncate:
4. enters the livelock via the chain below:

        rcu_read_lock();
        find_get_entry
            folio = xas_find
            if (!folio_try_get_rcu)
                xas_reset;
        rcu_read_unlock();

4'.
alloc_contig_range->migrate_pages->__split_huge_page, which sets the
folio's refcount back to 2 and would break the livelock, but is itself
blocked by contention on lruvec->lock.

If the above call chain makes sense, could we solve this with the
modification below, which makes split_folio and __split_huge_page
atomic by taking lruvec->lock earlier than is done now?

int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
		unsigned int new_order)
{
+	lruvec = folio_lruvec_lock(folio);
	if (mapping) {
		int nr = folio_nr_pages(folio);

		xas_split(&xas, folio, folio_order(folio));
		if (folio_test_pmd_mappable(folio) &&
		    new_order < HPAGE_PMD_ORDER) {
			if (folio_test_swapbacked(folio)) {
				__lruvec_stat_mod_folio(folio, NR_SHMEM_THPS, -nr);
			} else {
				__lruvec_stat_mod_folio(folio, NR_FILE_THPS, -nr);
				filemap_nr_thps_dec(mapping);
			}
		}
	}

	__split_huge_page(page, list, end, new_order);
+	folio_lruvec_unlock(folio);