From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id CBABFC4829E for ; Thu, 15 Feb 2024 18:49:27 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 576506B0088; Thu, 15 Feb 2024 13:49:27 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 521F56B0093; Thu, 15 Feb 2024 13:49:27 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 39BAB6B0098; Thu, 15 Feb 2024 13:49:27 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0016.hostedemail.com [216.40.44.16]) by kanga.kvack.org (Postfix) with ESMTP id 260296B0088 for ; Thu, 15 Feb 2024 13:49:27 -0500 (EST) Received: from smtpin13.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay04.hostedemail.com (Postfix) with ESMTP id D82FA1A03D6 for ; Thu, 15 Feb 2024 18:49:26 +0000 (UTC) X-FDA: 81794926332.13.7B90A6C Received: from mail-lj1-f172.google.com (mail-lj1-f172.google.com [209.85.208.172]) by imf10.hostedemail.com (Postfix) with ESMTP id E4EE1C001B for ; Thu, 15 Feb 2024 18:49:24 +0000 (UTC) Authentication-Results: imf10.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=l82VU6if; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (imf10.hostedemail.com: domain of ryncsn@gmail.com designates 209.85.208.172 as permitted sender) smtp.mailfrom=ryncsn@gmail.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1708022965; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=mjmxs1/XFDuKA1CW6hbrEvgCagLuf21iVpMkEqqePsA=; b=B75NWBxM3F9ctcNyNqPcLgzuSPX5egLiiAoOFROKY2BVBRulzxL6tqHPsLXw19U9jB/eR9 4UBY/9zYs4iOgHnp6QkXaRBBZ/F2d4+vAqJOZl3Tai4cL2C4TEWCbhxKh5UxDqZJzzKg7O ST78RB5MisgRQygne8DqYZo5Rfkx+Mc= ARC-Authentication-Results: i=1; imf10.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=l82VU6if; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (imf10.hostedemail.com: domain of ryncsn@gmail.com designates 209.85.208.172 as permitted sender) smtp.mailfrom=ryncsn@gmail.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1708022965; a=rsa-sha256; cv=none; b=AOqpp2lg+yWugE/7ZsTfod8dSUrdUGMvGGNKP5lj5zBqIEBn4wQ+LBTDgGaMv2GPSdm53t OuWog3e8RHr5ODcWCCBQT6tqSsaA/zqS8UrHyVy3K6hPUTP2A9XnHuQult9Y3NUTs2+IMa K50aIG+x9DSaXf1kg6U2azYfig/Rs7I= Received: by mail-lj1-f172.google.com with SMTP id 38308e7fff4ca-2d0aabed735so15772371fa.0 for ; Thu, 15 Feb 2024 10:49:24 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1708022963; x=1708627763; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=mjmxs1/XFDuKA1CW6hbrEvgCagLuf21iVpMkEqqePsA=; b=l82VU6ifY0rmmxhth9CGINkXSxqclA9Xwk50qx9SlNlm3nmvcHdMnN6Ms7Y2CY4LUN wsC7pjU5I4xKBwHgc9X88jadp5u0IsV9ODXZAackXwp8ZFb8n2wT3OudzdlpAr/ll6G8 8jeLinU/MCh4RD89v9qTRROOcTwXDbiv8L6i8BhGG+/zTAQdJcctBqTOY21MyYXb6A/t Yr5yAWEh/WEvhEOtfN/SDKtOFildWIiYCVf9v90355SDNUHQw1AxhT9/0w738Wn1ZWQR 9ljP/JNRshwUcWIJLiP6Dg+iNISgPfbmx/UDjeN/e8CAxoI/I4tzawGKktNQ13UkEMN6 2jqg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1708022963; x=1708627763; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=mjmxs1/XFDuKA1CW6hbrEvgCagLuf21iVpMkEqqePsA=; b=UIU8H1ALDPrhHQ8WcQTM2+NKYRqkXaB5drBFrTc51HB3ycTPhLk9CqEm4purPpfefo rNNtVun936CkGznFpwx23aYdrIER06IDf2oNQ1SN7LtxZG77ChGs7qM7wPSGVSctTiMS daNHvvIVDrWJ06H7kpNre999pB9fxy6+K5dj7yDDI1RtA7eVX2EcMeLTFca9vIbWw+Is /Xc9uyqc5dL7P6fgb6fpFJH/MHIvw5uVMNvZ/FZw89aSiS45SdTpEpfGkEm8UmUyHVet iyy/M9wa2KGgwsxXRZ53/G2PrYykCvd5I8bifehGCxo7J+od9gJ85YfXLC87pJbcuh38 RT6w== X-Gm-Message-State: AOJu0YwpS7L/3tEuY8fnQv5Kl5rGCsH4JUSGmYuG2aHNgwSGsXrGBJgI 4xcOXXaUtxBDUqfsjw5k3BiWbBgot43SMZihopjsJDy68g8Wyj/5LV8cm4nfRVGB1XW4FFv1htf w4yHvLkP8CJ06WiXuHRLamdXsk8k= X-Google-Smtp-Source: AGHT+IGack4lWALoh4qTHm0VjSJHKO55NNraNbak/+vWzwZN8buGh6WtX8PXLzJGtR+arpXe5iDmxgFjm6KIbx9RgDI= X-Received: by 2002:a2e:a706:0:b0:2d0:b646:3bc9 with SMTP id s6-20020a2ea706000000b002d0b6463bc9mr1656235lje.23.1708022962756; Thu, 15 Feb 2024 10:49:22 -0800 (PST) MIME-Version: 1.0 References: <20240206182559.32264-1-ryncsn@gmail.com> <1d259a51-46e6-4d3b-9455-38dbcc17b168@redhat.com> In-Reply-To: <1d259a51-46e6-4d3b-9455-38dbcc17b168@redhat.com> From: Kairui Song Date: Fri, 16 Feb 2024 02:49:04 +0800 Message-ID: Subject: Re: [PATCH v2] mm/swap: fix race when skipping swapcache To: David Hildenbrand Cc: linux-mm@kvack.org, Andrew Morton , "Huang, Ying" , Chris Li , Minchan Kim , Yu Zhao , Barry Song , SeongJae Park , Hugh Dickins , Johannes Weiner , Matthew Wilcox , Michal Hocko , Yosry Ahmed , stable@vger.kernel.org, linux-kernel@vger.kernel.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Server: rspam09 X-Rspamd-Queue-Id: E4EE1C001B X-Stat-Signature: 9tt7ad5pbxx394okq5d9tg1phobww6ex X-Rspam-User: X-HE-Tag: 1708022964-38452 X-HE-Meta: U2FsdGVkX1/F1Wl+kRucT8l6vhYHqVSU96kkLKmtbFXQ6s/Qp3lX2JJc+WE4kqdwNBcCdYLuHYfGHEjMyHO3zx3WLobrEcCV5UfHbkF83SRsXSJNQZgtoHxTfFoE58ITG7mPo5Dmf13bUE5ZPo7sCMRWcYn+HOev6xbKFBlAThPMP/HORcxxmFylf3HZPzaR7qeXy6kiJV/tqA8OQi1MuNCvFwj1UvRWa+ss3F7ofLwVYmLSSppujvQVV8IuaTcL5NqrDBT5JI1y36eRZF7zGfbUYnxtrBZ5zdaTZ38i+yohQLam5/7KNEwbZMrLkAY5bJOXjREpVBZHk9bpi0oi3+iZS36rergOOaWNiso93sEB6nBZ9QFtcizXPmOs4jnB86xPIY3TJudBXd5K/eNsobhPWEaTQHL8zGPRx1NqkJLRgG13cZ/YRhCQxa/V8pRvs1r1z7jN0XF54+pHwua6gW0y2BtUvxhrsZaByPv2+PgxiFfccVgNQKEHnszPKPAvdMMqox8GjGg6ix/Z823QTUrQwiP2aSiSAt1b0SKIQVMlQUyBld21PMkUmy7tmw9TxlW2xASmcKabw0Ipcir3Hr9I3nE3pDsTV2Nv8JVxQipeMsOdihm9IfBjmuf/cI/WJEXz4dJloQO+weo+qfKntNum8VybtrVYETXcGstgWBGIDiqzTXHLpA1ylR6wCMJRWqY83ykadDU4071wiG3YxvDKcs7LgzRtWJ9668CmDkMfudd/aHVEDjP06sWgwhqk+Oko1li/kQvHUUiP3CIpMUYt2+/kJrTDJ6t8mNBVVwewEa3alCM7GoE4RKHotj1jswQGtnv99GpVVvIsBYD8UXkxofR0j09yMP9uHXZ+qFT9s59imWtVjU87yvCw37Mt6i8AKIU5q3BYNAzWc52IbyuY9KB0Y5I65v8fdd9QUlp03hRwu25KTgWDola0YyBgVkrOWrq3/1OfP5c51ux n999dhkn UPhE5MVGUOT6ZiSdympFqKnYRP+S9vlZbWPk0WrFZyoio7iD3Ki6am892oyn0qHwd8eyJX1A6tTjt7NwNczRrlavE9EGJ0YO4dI8IvzOcU19ZfgY/zphS074Ym0nkLpafjwHBK8LtbyDYoJEjvYCP5xotNVH0N829+UsvSI8rZ7u1fR2txYfBtn7nfx1nullulgxkAK5J/5RLQyCO+LlOh0CtVHZ7nRav+aCZ+zO/QSQlLDhYIYgLyo3NnN+ShvIB5Hq12Q9NfJZ0qTSsgVQUvdxC4O6SeIAekSXDZXvoibyyZE2FM/VbIoeujke4jmu2Mw2WKn/BzDrOQnFjMxI4CWjxwDMbYiOiKykxvYudhVJRDP3KaJmb3zNv3OlPAZrCfRZZ+sWfhRQueEGkdsILmA9xmA5WOHOoYGvZGqXyI8HL9dLvwPUZaf+SSs3cXCOhefBixEhAFhuHlQVwk03QsasDe1l6iXT7bEnDAolQEHBCL5oH0YDSN/rZ6b+GXgvR+Aomyat5md/R2homw2/VF9QLvpH+jYXIUBShS0D87keYUHbDh7K5Yac/Og== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Thu, Feb 15, 2024 at 11:36=E2=80=AFPM David Hildenbrand wrote: > On 06.02.24 19:25, Kairui Song wrote: > > From: Kairui Song > > > > When skipping swapcache for SWP_SYNCHRONOUS_IO, if two or more threads > > swapin the same entry at the same time, they get different pages (A, B)= . > > Before one thread (T0) finishes the swapin and installs page (A) > > to the PTE, another thread (T1) could finish swapin of page (B), > > swap_free the entry, then swap out the possibly modified page > > reusing the same entry. It breaks the pte_same check in (T0) because > > PTE value is unchanged, causing ABA problem. Thread (T0) will > > install a stalled page (A) into the PTE and cause data corruption. > > > > One possible callstack is like this: > > > > CPU0 CPU1 > > ---- ---- > > do_swap_page() do_swap_page() with same entry > > > > > > swap_read_folio() <- read to page A swap_read_folio() <- read to page = B > > > > ... set_pte_at() > > swap_free() <- entry is free > > > > > > pte_same() <- Check pass, PTE seems > > unchanged, but page A > > is stalled! > > swap_free() <- page B content lost! > > set_pte_at() <- staled page A installed! > > > > And besides, for ZRAM, swap_free() allows the swap device to discard > > the entry content, so even if page (B) is not modified, if > > swap_read_folio() on CPU0 happens later than swap_free() on CPU1, > > it may also cause data loss. > > > > To fix this, reuse swapcache_prepare which will pin the swap entry usin= g > > the cache flag, and allow only one thread to pin it. Release the pin > > after PT unlocked. Racers will simply busy wait since it's a rare > > and very short event. > > > > Other methods like increasing the swap count don't seem to be a good > > idea after some tests, that will cause racers to fall back to use the > > swap cache again. Parallel swapin using different methods leads to > > a much more complex scenario. > > > > Reproducer: > > > > This race issue can be triggered easily using a well constructed > > reproducer and patched brd (with a delay in read path) [1]: > > > > With latest 6.8 mainline, race caused data loss can be observed easily: > > $ gcc -g -lpthread test-thread-swap-race.c && ./a.out > > Polulating 32MB of memory region... > > Keep swapping out... > > Starting round 0... > > Spawning 65536 workers... > > 32746 workers spawned, wait for done... > > Round 0: Error on 0x5aa00, expected 32746, got 32743, 3 data loss! > > Round 0: Error on 0x395200, expected 32746, got 32743, 3 data loss! > > Round 0: Error on 0x3fd000, expected 32746, got 32737, 9 data loss! > > Round 0 Failed, 15 data loss! > > > > This reproducer spawns multiple threads sharing the same memory region > > using a small swap device. Every two threads updates mapped pages one b= y > > one in opposite direction trying to create a race, with one dedicated > > thread keep swapping out the data out using madvise. > > > > The reproducer created a reproduce rate of about once every 5 minutes, > > so the race should be totally possible in production. > > > > After this patch, I ran the reproducer for over a few hundred rounds > > and no data loss observed. > > > > Performance overhead is minimal, microbenchmark swapin 10G from 32G > > zram: > > > > Before: 10934698 us > > After: 11157121 us > > Non-direct: 13155355 us (Dropping SWP_SYNCHRONOUS_IO flag) > > > > Fixes: 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronou= s device") > > Reported-by: "Huang, Ying" > > Closes: https://lore.kernel.org/lkml/87bk92gqpx.fsf_-_@yhuang6-desk2.cc= r.corp.intel.com/ > > Link: https://github.com/ryncsn/emm-test-project/tree/master/swap-stres= s-race [1] > > Signed-off-by: Kairui Song > > Reviewed-by: "Huang, Ying" > > Acked-by: Yu Zhao > > > > --- > > Update from V1: > > - Add some words on ZRAM case, it will discard swap content on swap_fre= e so the race window is a bit different but cure is the same. [Barry Song] > > - Update comments make it cleaner [Huang, Ying] > > - Add a function place holder to fix CONFIG_SWAP=3Dn built [SeongJae Pa= rk] > > - Update the commit message and summary, refer to SWP_SYNCHRONOUS_IO in= stead of "direct swapin path" [Yu Zhao] > > - Update commit message. > > - Collect Review and Acks. > > > > include/linux/swap.h | 5 +++++ > > mm/memory.c | 15 +++++++++++++++ > > mm/swap.h | 5 +++++ > > mm/swapfile.c | 13 +++++++++++++ > > 4 files changed, 38 insertions(+) > > > > diff --git a/include/linux/swap.h b/include/linux/swap.h > > index 4db00ddad261..8d28f6091a32 100644 > > --- a/include/linux/swap.h > > +++ b/include/linux/swap.h > > @@ -549,6 +549,11 @@ static inline int swap_duplicate(swp_entry_t swp) > > return 0; > > } > > > > +static inline int swapcache_prepare(swp_entry_t swp) > > +{ > > + return 0; > > +} > > + > > static inline void swap_free(swp_entry_t swp) > > { > > } > > diff --git a/mm/memory.c b/mm/memory.c > > index 7e1f4849463a..1749c700823d 100644 > > --- a/mm/memory.c > > +++ b/mm/memory.c > > @@ -3867,6 +3867,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > if (!folio) { > > if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && > > __swap_count(entry) =3D=3D 1) { > > + /* > > + * Prevent parallel swapin from proceeding with > > + * the cache flag. Otherwise, another thread may > > + * finish swapin first, free the entry, and swapo= ut > > + * reusing the same entry. It's undetectable as > > + * pte_same() returns true due to entry reuse. > > + */ > > + if (swapcache_prepare(entry)) > > + goto out; > > + > > Is there anything that guarantees that the following won't > happen concurrently, and if it would happen, could it be a problem? > > Other thread MADV_DONTNEED's the swap entry, swap slot is freed. > Some other page gets swapped out, reuses that swap slot. Hi David, Thanks for adding more comments and sharing your thoughts! I'm not sure what you mean by "reuses that swap slot" here, I think you just mean reuse that swap entry (and the actual content on swap device)? > We call swapcache_prepare() on that slot that is being reused > elsewhere. (possibly some other thread in the context of the reuses > swap slot might do the same!) I think this kind of swapcache_prepare() or false swapin read is already happening quite frequently by swap readaheads. I've seen swap cluster readahead mess up working set activation and memory policy already. Swap cluster readahead simply read in nearby entries of target entry, regardless of whether they are owned by the reader or not. For this patch, similar issues also exist, I think it only hurts the performance, but that's a really rare thing to happen, so should not be a problem. > > We would detect later, that the PTE changed, but we would temporarily > mess with that swap slot that we might no longer "own". > > I was thinking about alternatives, it's tricky because of the concurrent > MADV_DONTNEED possibility. Something with another fake-swap entry type > (similar to migration entries) might work, but would require more changes= . Yeah, in the long term I also think more work is needed for the swap subsys= tem. In my opinion, for this particular issue, or, for cache bypassed swapin, a new swap map value similar to SWAP_MAP_BAD/SWAP_MAP_SHMEM might be needed, that may even help to simplify the swap count release routine for cache bypassed swapin, and improve the performance.