Message-ID: <6839133c-56ca-45dc-8115-2fbcc907938e@kernel.dk>
Date: Wed, 18 Sep 2024 23:20:33 -0600
Subject: Re: Known and unfixed active data loss bug in MM + XFS with large folios since Dec 2021 (any kernel from 6.1 upwards)
From: Jens Axboe
To: Matthew Wilcox
Cc: Linus Torvalds, Dave Chinner, Chris Mason, Christian Theune,
 linux-mm@kvack.org, "linux-xfs@vger.kernel.org",
 linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
 Daniel Dao, regressions@lists.linux.dev, regressions@leemhuis.info
References: <52d45d22-e108-400e-a63f-f50ef1a0ae1a@meta.com>
 <5bee194c-9cd3-47e7-919b-9f352441f855@kernel.dk>
 <459beb1c-defd-4836-952c-589203b7005c@meta.com>
 <8697e349-d22f-43a0-8469-beb857eb44a1@kernel.dk>
 <9e62898f-907e-439f-96f3-de2e29f57e37@kernel.dk>
In-Reply-To:
<9e62898f-907e-439f-96f3-de2e29f57e37@kernel.dk>

On 9/18/24 10:46 PM, Jens Axboe wrote:
> On 9/18/24 10:36 PM,
> Matthew Wilcox wrote:
>> On Wed, Sep 18, 2024 at 09:38:41PM -0600, Jens Axboe wrote:
>>> On 9/18/24 9:12 PM, Linus Torvalds wrote:
>>>> On Thu, 19 Sept 2024 at 05:03, Linus Torvalds
>>>> wrote:
>>>>>
>>>>> I think we should just do the simple one-liner of adding a
>>>>> "xas_reset()" to after doing xas_split_alloc() (or do it inside the
>>>>> xas_split_alloc()).
>>>>
>>>> .. and obviously that should be actually *verified* to fix the issue
>>>> not just with the test-case that Chris and Jens have been using, but
>>>> on Christian's real PostgreSQL load.
>>>>
>>>> Christian?
>>>>
>>>> Note that the xas_reset() needs to be done after the check for errors
>>>> - or like Willy suggested, xas_split_alloc() needs to be re-organized.
>>>>
>>>> So the simplest fix is probably to just add a
>>>>
>>>>                 if (xas_error(&xas))
>>>>                         goto error;
>>>>         }
>>>> +       xas_reset(&xas);
>>>>         xas_lock_irq(&xas);
>>>>         xas_for_each_conflict(&xas, entry) {
>>>>                 old = entry;
>>>>
>>>> in __filemap_add_folio() in mm/filemap.c
>>>>
>>>> (The above is obviously a whitespace-damaged pseudo-patch for the
>>>> pre-6758c1128ceb state. I don't actually carry a stable tree around on
>>>> my laptop, but I hope it's clear enough what I'm rambling about)
>>>
>>> I kicked off a quick run with this on 6.9 with my debug patch as well,
>>> and it still fails for me... I'll double check everything is sane. For
>>> reference, below is the 6.9 filemap patch.
>>>
>>> diff --git a/mm/filemap.c b/mm/filemap.c
>>> index 30de18c4fd28..88093e2b7256 100644
>>> --- a/mm/filemap.c
>>> +++ b/mm/filemap.c
>>> @@ -883,6 +883,7 @@ noinline int __filemap_add_folio(struct address_space *mapping,
>>>  		if (order > folio_order(folio))
>>>  			xas_split_alloc(&xas, xa_load(xas.xa, xas.xa_index),
>>>  					order, gfp);
>>> +		xas_reset(&xas);
>>>  		xas_lock_irq(&xas);
>>>  		xas_for_each_conflict(&xas, entry) {
>>>  			old = entry;
>>
>> My brain is still mushy, but I think there is still a problem (both with
>> the simple fix for 6.9 and indeed with 6.10).
>>
>> For splitting a folio, we have the folio locked, so we know it's not
>> going anywhere.  The tree may get rearranged around it while we don't
>> have the xa_lock, but we're somewhat protected.
>>
>> In this case we're splitting something that was, at one point, a shadow
>> entry.  There's no struct there to lock.  So I think we can have a
>> situation where we replicate 'old' (in 6.10) or xa_load() (in 6.9)
>> into the nodes we allocate in xas_split_alloc().  In 6.10, that's at
>> least guaranteed to be a shadow entry, but in 6.9, it might already be a
>> folio by this point because we've raced with something else also doing a
>> split.
>>
>> Probably xas_split_alloc() needs to just do the alloc, like the name
>> says, and drop the 'entry' argument.  ICBW, but I think it explains
>> what you're seeing?  Maybe it doesn't?
>
> Since I can hit it pretty reliably and quickly, I'm happy to test
> whatever you want on top of 6.9. From the other email, I backported:
>
> a4864671ca0b ("lib/xarray: introduce a new helper xas_get_order")
> 6758c1128ceb ("mm/filemap: optimize filemap folio adding")
>
> to 6.9 and kicked off a test with that 5 min ago, and it's still going.
> I'd say with 90% confidence that it should've hit already, but let's
> leave it churning for an hour and see what pops out the other end.

45 min later, I think I can conclusively call the backport of those two
on top of 6.9 good.
Below is what I'm running, which is those two commits (modulo the test
bits, for clarity). Rather than attempt to fix this differently for 6.9,
perhaps not a bad idea to just get those two into stable? It's not a lot
of churn, and at least that keeps it consistent rather than doing
something different for stable.

I'll try to do a patch that just ensures the order is consistent across
lock cycles, as Linus suggested, just to verify that this is indeed the
main issue. Will keep the xas_reset() as well.

diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index cb571dfcf4b1..da2f5bba7944 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -1548,6 +1551,7 @@ void xas_create_range(struct xa_state *);
 
 #ifdef CONFIG_XARRAY_MULTI
 int xa_get_order(struct xarray *, unsigned long index);
+int xas_get_order(struct xa_state *xas);
 void xas_split(struct xa_state *, void *entry, unsigned int order);
 void xas_split_alloc(struct xa_state *, void *entry, unsigned int order, gfp_t);
 #else
@@ -1556,6 +1560,11 @@ static inline int xa_get_order(struct xarray *xa, unsigned long index)
 	return 0;
 }
 
+static inline int xas_get_order(struct xa_state *xas)
+{
+	return 0;
+}
+
 static inline void xas_split(struct xa_state *xas, void *entry,
 		unsigned int order)
 {
diff --git a/lib/xarray.c b/lib/xarray.c
index 5e7d6334d70d..c0514fb16d33 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -1765,39 +1780,52 @@ void *xa_store_range(struct xarray *xa, unsigned long first,
 EXPORT_SYMBOL(xa_store_range);
 
 /**
- * xa_get_order() - Get the order of an entry.
- * @xa: XArray.
- * @index: Index of the entry.
+ * xas_get_order() - Get the order of an entry.
+ * @xas: XArray operation state.
+ *
+ * Called after xas_load, the xas should not be in an error state.
  *
  * Return: A number between 0 and 63 indicating the order of the entry.
  */
-int xa_get_order(struct xarray *xa, unsigned long index)
+int xas_get_order(struct xa_state *xas)
 {
-	XA_STATE(xas, xa, index);
-	void *entry;
 	int order = 0;
 
-	rcu_read_lock();
-	entry = xas_load(&xas);
-
-	if (!entry)
-		goto unlock;
-
-	if (!xas.xa_node)
-		goto unlock;
+	if (!xas->xa_node)
+		return 0;
 
 	for (;;) {
-		unsigned int slot = xas.xa_offset + (1 << order);
+		unsigned int slot = xas->xa_offset + (1 << order);
 
 		if (slot >= XA_CHUNK_SIZE)
 			break;
-		if (!xa_is_sibling(xas.xa_node->slots[slot]))
+		if (!xa_is_sibling(xa_entry(xas->xa, xas->xa_node, slot)))
 			break;
 		order++;
 	}
 
-	order += xas.xa_node->shift;
-unlock:
+	order += xas->xa_node->shift;
+	return order;
+}
+EXPORT_SYMBOL_GPL(xas_get_order);
+
+/**
+ * xa_get_order() - Get the order of an entry.
+ * @xa: XArray.
+ * @index: Index of the entry.
+ *
+ * Return: A number between 0 and 63 indicating the order of the entry.
+ */
+int xa_get_order(struct xarray *xa, unsigned long index)
+{
+	XA_STATE(xas, xa, index);
+	int order = 0;
+	void *entry;
+
+	rcu_read_lock();
+	entry = xas_load(&xas);
+	if (entry)
+		order = xas_get_order(&xas);
 	rcu_read_unlock();
 
 	return order;
diff --git a/mm/filemap.c b/mm/filemap.c
index 30de18c4fd28..b8d525825d3f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -852,7 +852,9 @@ noinline int __filemap_add_folio(struct address_space *mapping,
 		struct folio *folio, pgoff_t index, gfp_t gfp, void **shadowp)
 {
 	XA_STATE(xas, &mapping->i_pages, index);
-	bool huge = folio_test_hugetlb(folio);
+	void *alloced_shadow = NULL;
+	int alloced_order = 0;
+	bool huge;
 	bool charged = false;
 	long nr = 1;
 
@@ -869,6 +871,7 @@ noinline int __filemap_add_folio(struct address_space *mapping,
 	VM_BUG_ON_FOLIO(index & (folio_nr_pages(folio) - 1), folio);
 	xas_set_order(&xas, index, folio_order(folio));
+	huge = folio_test_hugetlb(folio);
 	nr = folio_nr_pages(folio);
 
 	gfp &= GFP_RECLAIM_MASK;
@@ -876,13 +879,10 @@ noinline int __filemap_add_folio(struct address_space *mapping,
 	folio->mapping = mapping;
 	folio->index = xas.xa_index;
 
-	do {
-		unsigned int order = xa_get_order(xas.xa, xas.xa_index);
+	for (;;) {
+		int order = -1, split_order = 0;
 		void *entry, *old = NULL;
 
-		if (order > folio_order(folio))
-			xas_split_alloc(&xas, xa_load(xas.xa, xas.xa_index),
-					order, gfp);
 		xas_lock_irq(&xas);
 		xas_for_each_conflict(&xas, entry) {
 			old = entry;
@@ -890,19 +890,33 @@ noinline int __filemap_add_folio(struct address_space *mapping,
 				xas_set_err(&xas, -EEXIST);
 				goto unlock;
 			}
+			/*
+			 * If a larger entry exists,
+			 * it will be the first and only entry iterated.
+			 */
+			if (order == -1)
+				order = xas_get_order(&xas);
+		}
+
+		/* entry may have changed before we re-acquire the lock */
+		if (alloced_order && (old != alloced_shadow || order != alloced_order)) {
+			xas_destroy(&xas);
+			alloced_order = 0;
 		}
 
 		if (old) {
-			if (shadowp)
-				*shadowp = old;
-			/* entry may have been split before we acquired lock */
-			order = xa_get_order(xas.xa, xas.xa_index);
-			if (order > folio_order(folio)) {
+			if (order > 0 && order > folio_order(folio)) {
 				/* How to handle large swap entries? */
 				BUG_ON(shmem_mapping(mapping));
+				if (!alloced_order) {
+					split_order = order;
+					goto unlock;
+				}
 				xas_split(&xas, old, order);
 				xas_reset(&xas);
 			}
+			if (shadowp)
+				*shadowp = old;
 		}
 
 		xas_store(&xas, folio);
@@ -918,9 +932,24 @@ noinline int __filemap_add_folio(struct address_space *mapping,
 			__lruvec_stat_mod_folio(folio, NR_FILE_THPS, nr);
 		}
 
+unlock:
 		xas_unlock_irq(&xas);
-	} while (xas_nomem(&xas, gfp));
+
+		/* split needed, alloc here and retry. */
+		if (split_order) {
+			xas_split_alloc(&xas, old, split_order, gfp);
+			if (xas_error(&xas))
+				goto error;
+			alloced_shadow = old;
+			alloced_order = split_order;
+			xas_reset(&xas);
+			continue;
+		}
+
+		if (!xas_nomem(&xas, gfp))
+			break;
+	}
 
 	if (xas_error(&xas))
 		goto error;

-- 
Jens Axboe