From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id E9911C48260 for ; Mon, 19 Feb 2024 07:29:54 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 473466B0083; Mon, 19 Feb 2024 02:29:54 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 4229A6B0085; Mon, 19 Feb 2024 02:29:54 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 2C2D86B0088; Mon, 19 Feb 2024 02:29:54 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0013.hostedemail.com [216.40.44.13]) by kanga.kvack.org (Postfix) with ESMTP id 1E9986B0083 for ; Mon, 19 Feb 2024 02:29:54 -0500 (EST) Received: from smtpin19.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay03.hostedemail.com (Postfix) with ESMTP id 97121A0285 for ; Mon, 19 Feb 2024 07:29:53 +0000 (UTC) X-FDA: 81807729066.19.7972A01 Received: from mail-lj1-f169.google.com (mail-lj1-f169.google.com [209.85.208.169]) by imf03.hostedemail.com (Postfix) with ESMTP id D755A20014 for ; Mon, 19 Feb 2024 07:29:50 +0000 (UTC) Authentication-Results: imf03.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=G9OzapqU; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (imf03.hostedemail.com: domain of ryncsn@gmail.com designates 209.85.208.169 as permitted sender) smtp.mailfrom=ryncsn@gmail.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1708327791; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=X4jRpIVVFkwHJggu+vIwvnZ/bia4Q4PW4RNzbM25CuI=; b=SPW0pEB2bprbhyjrlU+v2W1RA0vNeidy6p96gD5Juorf2IIyjZ6O8YZ7b30SY4qAoK/5sr P4LknsqufrfsItc7EK5VVDBoDhbUIBulNNmHYZIVYczF9QaCHv8BouZHJDLvP7ap44vkvv qUOs/IGXTO8zs4+zdLm9Ogxq+T/aC4M= ARC-Authentication-Results: i=1; imf03.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=G9OzapqU; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (imf03.hostedemail.com: domain of ryncsn@gmail.com designates 209.85.208.169 as permitted sender) smtp.mailfrom=ryncsn@gmail.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1708327791; a=rsa-sha256; cv=none; b=c1DS20VJz0w0Tih9xn3L4Tlta+78TflvUB0A9lVD9Dg0Y2FWHcIR/Y6C09vEJ+oaFD51th e98FU7Ihq/QKxPZvWXrgwRqfdIo5dWMK/jMQcMGZfGRdozyPIXtbfYV57BCyhzGo32SOe1 Gx50nFgURHjmXV7idtf6VzmHh8dhwbQ= Received: by mail-lj1-f169.google.com with SMTP id 38308e7fff4ca-2d23700df18so11016741fa.0 for ; Sun, 18 Feb 2024 23:29:50 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1708327789; x=1708932589; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=X4jRpIVVFkwHJggu+vIwvnZ/bia4Q4PW4RNzbM25CuI=; b=G9OzapqU2b0RqEIp+FDe1CKR3QWCIBkm8+kyWNWQX+KOlbkcosAg15CRGQ8cOo4Y7x STkQhzJ0u+vG6eBbBa0yIHZcCoUi07O4F+mrmb2+PCWE4l7XIW49TPMGX9GQV2T6vXxB OMugXglwqfZMnqgkIvbMDrkN2wQ95ev/cuLBqeHtS7oNnTNOprPWn+wFe5dwQpGr2SYj yPEcgdnmFQ5bg2Edmm3fA6Kr2psoT62i6U0ei0QOzVfWmt9Op0+fFr0HyzrguWbfO3wO EKQNfbKtz/prfiqm5MF38TGvVsIxBdbAFkvuSZMd8mNa6BBpd7W/uolORKsa/aPa6Ckn RKgA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1708327789; x=1708932589; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=X4jRpIVVFkwHJggu+vIwvnZ/bia4Q4PW4RNzbM25CuI=; b=BIFxZ9OZmk3tXlntnsY9xoJFrRHXVQQiB1/8sa3wMQdvYnEAc37CJXqSgyc6VbpPzY LqGgJNIP8q9Ckm2TWHOowLXpcaRP1h5g1gPE+YwxD3kPAaM0Aw7H2Xrp2fKbcj0FHpvV ZdE8lBtOdo1lLnMAXYXp0vuPNoGQKtKZUOP4GD4olTK4bZ1vmqoSNjXRp5zFukofvtVQ 4xj05ZkkqMd5ZWdO+F6AeVL0sN2pUA59POo6zGunKbq+7rHkOWw3urSIO/ri5N7EdHVE 4RNn4ZLboyM5J/+mg5DWk1BcQmvSI4r+cvVEO9BH4BC1HO/KQfoWJKvazYygv4zdaUNV Idfw== X-Gm-Message-State: AOJu0Yw5BFLpxWFwBfkH9AcVrhXKQo16bfbSJKMDNkOewugSVOB3m931 RrS1G4ufnXubiRZQhio/TtY1bt5u6vp8Kk1Iu8psgNVImh1DWqaDz8ve/o3jUUD3iaM1nGZIdVs e6cWtmmJCVofTHiMDIusIGZgMP8g= X-Google-Smtp-Source: AGHT+IEbNABZCLhbmvv2aI0SPEB1mHThVrIE6LTbEDZETMMxMcJH7KU150K/Nlpe3ibR+BNbbG2Etsnvg8qpouaTfX4= X-Received: by 2002:a2e:3318:0:b0:2d2:426b:455c with SMTP id d24-20020a2e3318000000b002d2426b455cmr185931ljc.0.1708327788596; Sun, 18 Feb 2024 23:29:48 -0800 (PST) MIME-Version: 1.0 References: <20240216095105.14502-1-ryncsn@gmail.com> <87wmr2rx4a.fsf@yhuang6-desk2.ccr.corp.intel.com> <87jzn1s2xe.fsf@yhuang6-desk2.ccr.corp.intel.com> <87frxprxm8.fsf@yhuang6-desk2.ccr.corp.intel.com> In-Reply-To: From: Kairui Song Date: Mon, 19 Feb 2024 15:29:31 +0800 Message-ID: Subject: Re: [PATCH v3] mm/swap: fix race when skipping swapcache To: "Huang, Ying" Cc: linux-mm@kvack.org, Andrew Morton , Chris Li , Minchan Kim , Yu Zhao , Barry Song , SeongJae Park , Hugh Dickins , Johannes Weiner , Matthew Wilcox , Michal Hocko , Yosry Ahmed , David Hildenbrand , stable@vger.kernel.org, linux-kernel@vger.kernel.org, Aaron Lu Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Server: rspam09 X-Rspamd-Queue-Id: D755A20014 X-Stat-Signature: m1rakucyj7aozfyk5751bu7dwn7otyxr X-Rspam-User: X-HE-Tag: 1708327790-849173 X-HE-Meta: U2FsdGVkX1+a6OvN+q2eoq6Xb7t69/yh5nh8Pg3EMbkw0b9TB0mBDrdrRkfZldoCdDujFHVOMttlX2Fj/fTsNyL4yfPk9WJXugymMl1GFVmZNB9EwOaqyqFsCFxoyASXRHDohOqT+CnCKcJBRYG3GkjfCthZ3ZmvUZnFhj03jj+v9iBUjpnT5cuBHHvQQVWI1NvN4RpqpHdevprFMXoy1jOjtVtbbDY1h8jAL2N9/QXSn0w0ITwpw36Da2jW1EIiUNYkHb0+dklg07foDzhiHnPgOPZy9rOBpX72oHnUYdHBqTixIEphmbJOAXG7Xi5ONqUFvDUrbwQFrCzZ4VM8cil1UBnMZbHs2bMeOjKIf7TdaTa/PEZQ2iBnu33ESLlAZrn9GUOLdjiFoecDd6eUKgBeNOF9W+wWMpuBabqrOzLM/S8DRN8GlR73iEYcmQB1QJrlOOzfCuv1Cj+LyhUEG6Wc/LteLjHtj/IDBlTqseo2nl60ZowwtLovnKkXPzn8Am3EbieTVlWbdh9mpGCScThsVHbSqq9CgK+HbnZie2DkduX8PEyDdcfWV6XR5FPbIV3AZhvj1PaW2xdtc+Ygp+i6r2g0YxFydDzdQNMHfDKC9qKXqZgYoFLAYoyhICr256S6uWXy7j299Enj9o+2c2RFqcLlFUgXVRqyrO5NU0CPitQJQ5/h1d1qQs8jK3cIrDJqDP0Vi4dkOgG6uRvHUiI4wNivc52D3rD5lgLz0Y3qA4GkGMHJBhscjCJQXNnH9EQXhRx7bDfUIVD0c0I/fTzeJl7h0UKYaS0i3A/ihd5ZypHolLsPHPEcA4hMggn1lbWxD7pRTUZ0i1ThhBgvUGo9r10assFB7SH58E8s6VBZwS+ZGEMkUqG12yIzABkxnO9+KDuqUtCx9C/B3n0qECDu8HpG9roRuTGBrMt+7tfIyEHNvA9DrLAqAbWlsBbQqSq4VrOB3H1FBuOHtWy CDgIB/6S D2TqXDAClvPsp7KpksB5+S/49dSdBctxzFxY1 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Mon, Feb 19, 2024 at 11:09=E2=80=AFAM Kairui Song wro= te: > > On Mon, Feb 19, 2024 at 10:35=E2=80=AFAM Huang, Ying wrote: > > "Huang, Ying" writes: > > > Kairui Song writes: > > > > > >> On Sun, Feb 18, 2024 at 4:34=E2=80=AFPM Huang, Ying wrote: > > >>> > > >>> Kairui Song writes: > > >>> > > >>> > From: Kairui Song > > >>> > > > >>> > When skipping swapcache for SWP_SYNCHRONOUS_IO, if two or more th= reads > > >>> > swapin the same entry at the same time, they get different pages = (A, B). > > >>> > Before one thread (T0) finishes the swapin and installs page (A) > > >>> > to the PTE, another thread (T1) could finish swapin of page (B), > > >>> > swap_free the entry, then swap out the possibly modified page > > >>> > reusing the same entry. It breaks the pte_same check in (T0) beca= use > > >>> > PTE value is unchanged, causing ABA problem. Thread (T0) will > > >>> > install a stalled page (A) into the PTE and cause data corruption= . > > >>> > > > >>> > One possible callstack is like this: > > >>> > > > >>> > CPU0 CPU1 > > >>> > ---- ---- > > >>> > do_swap_page() do_swap_page() with same ent= ry > > >>> > > > >>> > > > >>> > swap_read_folio() <- read to page A swap_read_folio() <- read to= page B > > >>> > > > >>> > ... set_pte_at() > > >>> > swap_free() <- entry is free > > >>> > > > >>> > > > >>> > pte_same() <- Check pass, PTE seems > > >>> > unchanged, but page A > > >>> > is stalled! > > >>> > swap_free() <- page B content lost! > > >>> > set_pte_at() <- staled page A installed! > > >>> > > > >>> > And besides, for ZRAM, swap_free() allows the swap device to disc= ard > > >>> > the entry content, so even if page (B) is not modified, if > > >>> > swap_read_folio() on CPU0 happens later than swap_free() on CPU1, > > >>> > it may also cause data loss. > > >>> > > > >>> > To fix this, reuse swapcache_prepare which will pin the swap entr= y using > > >>> > the cache flag, and allow only one thread to pin it. Release the = pin > > >>> > after PT unlocked. Racers will simply wait since it's a rare and = very > > >>> > short event. A schedule() call is added to avoid wasting too much= CPU > > >>> > or adding too much noise to perf statistics > > >>> > > > >>> > Other methods like increasing the swap count don't seem to be a g= ood > > >>> > idea after some tests, that will cause racers to fall back to use= the > > >>> > swap cache again. Parallel swapin using different methods leads t= o > > >>> > a much more complex scenario. > > >>> > > >>> The swap entry may be put in swap cache by some parallel code path > > >>> anyway. So, we always need to consider that when reasoning the cod= e. > > >>> > > >>> > Reproducer: > > >>> > > > >>> > This race issue can be triggered easily using a well constructed > > >>> > reproducer and patched brd (with a delay in read path) [1]: > > >>> > > > >>> > With latest 6.8 mainline, race caused data loss can be observed e= asily: > > >>> > $ gcc -g -lpthread test-thread-swap-race.c && ./a.out > > >>> > Polulating 32MB of memory region... > > >>> > Keep swapping out... > > >>> > Starting round 0... > > >>> > Spawning 65536 workers... > > >>> > 32746 workers spawned, wait for done... > > >>> > Round 0: Error on 0x5aa00, expected 32746, got 32743, 3 data lo= ss! > > >>> > Round 0: Error on 0x395200, expected 32746, got 32743, 3 data l= oss! > > >>> > Round 0: Error on 0x3fd000, expected 32746, got 32737, 9 data l= oss! > > >>> > Round 0 Failed, 15 data loss! > > >>> > > > >>> > This reproducer spawns multiple threads sharing the same memory r= egion > > >>> > using a small swap device. Every two threads updates mapped pages= one by > > >>> > one in opposite direction trying to create a race, with one dedic= ated > > >>> > thread keep swapping out the data out using madvise. > > >>> > > > >>> > The reproducer created a reproduce rate of about once every 5 min= utes, > > >>> > so the race should be totally possible in production. > > >>> > > > >>> > After this patch, I ran the reproducer for over a few hundred rou= nds > > >>> > and no data loss observed. > > >>> > > > >>> > Performance overhead is minimal, microbenchmark swapin 10G from 3= 2G > > >>> > zram: > > >>> > > > >>> > Before: 10934698 us > > >>> > After: 11157121 us > > >>> > Non-direct: 13155355 us (Dropping SWP_SYNCHRONOUS_IO flag) > > >>> > > > >>> > Fixes: 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of sync= hronous device") > > >>> > Link: https://github.com/ryncsn/emm-test-project/tree/master/swap= -stress-race [1] > > >>> > Reported-by: "Huang, Ying" > > >>> > Closes: https://lore.kernel.org/lkml/87bk92gqpx.fsf_-_@yhuang6-de= sk2.ccr.corp.intel.com/ > > >>> > Signed-off-by: Kairui Song > > >>> > Cc: stable@vger.kernel.org > > >>> > > > >>> > --- > > >>> > Update from V2: > > >>> > - Add a schedule() if raced to prevent repeated page faults wasti= ng CPU > > >>> > and add noise to perf statistics. > > >>> > - Use a bool to state the special case instead of reusing existin= g > > >>> > variables fixing error handling [Minchan Kim]. > > >>> > > > >>> > V2: https://lore.kernel.org/all/20240206182559.32264-1-ryncsn@gma= il.com/ > > >>> > > > >>> > Update from V1: > > >>> > - Add some words on ZRAM case, it will discard swap content on sw= ap_free so the race window is a bit different but cure is the same. [Barry = Song] > > >>> > - Update comments make it cleaner [Huang, Ying] > > >>> > - Add a function place holder to fix CONFIG_SWAP=3Dn built [Seong= Jae Park] > > >>> > - Update the commit message and summary, refer to SWP_SYNCHRONOUS= _IO instead of "direct swapin path" [Yu Zhao] > > >>> > - Update commit message. > > >>> > - Collect Review and Acks. > > >>> > > > >>> > V1: https://lore.kernel.org/all/20240205110959.4021-1-ryncsn@gmai= l.com/ > > >>> > > > >>> > include/linux/swap.h | 5 +++++ > > >>> > mm/memory.c | 20 ++++++++++++++++++++ > > >>> > mm/swap.h | 5 +++++ > > >>> > mm/swapfile.c | 13 +++++++++++++ > > >>> > 4 files changed, 43 insertions(+) > > >>> > > > >>> > diff --git a/include/linux/swap.h b/include/linux/swap.h > > >>> > index 4db00ddad261..8d28f6091a32 100644 > > >>> > --- a/include/linux/swap.h > > >>> > +++ b/include/linux/swap.h > > >>> > @@ -549,6 +549,11 @@ static inline int swap_duplicate(swp_entry_t= swp) > > >>> > return 0; > > >>> > } > > >>> > > > >>> > +static inline int swapcache_prepare(swp_entry_t swp) > > >>> > +{ > > >>> > + return 0; > > >>> > +} > > >>> > + > > >>> > static inline void swap_free(swp_entry_t swp) > > >>> > { > > >>> > } > > >>> > diff --git a/mm/memory.c b/mm/memory.c > > >>> > index 7e1f4849463a..7059230d0a54 100644 > > >>> > --- a/mm/memory.c > > >>> > +++ b/mm/memory.c > > >>> > @@ -3799,6 +3799,7 @@ vm_fault_t do_swap_page(struct vm_fault *vm= f) > > >>> > struct page *page; > > >>> > struct swap_info_struct *si =3D NULL; > > >>> > rmap_t rmap_flags =3D RMAP_NONE; > > >>> > + bool need_clear_cache =3D false; > > >>> > bool exclusive =3D false; > > >>> > swp_entry_t entry; > > >>> > pte_t pte; > > >>> > @@ -3867,6 +3868,20 @@ vm_fault_t do_swap_page(struct vm_fault *v= mf) > > >>> > if (!folio) { > > >>> > if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && > > >>> > __swap_count(entry) =3D=3D 1) { > > >>> > + /* > > >>> > + * Prevent parallel swapin from proceeding = with > > >>> > + * the cache flag. Otherwise, another threa= d may > > >>> > + * finish swapin first, free the entry, and= swapout > > >>> > + * reusing the same entry. It's undetectabl= e as > > >>> > + * pte_same() returns true due to entry reu= se. > > >>> > + */ > > >>> > + if (swapcache_prepare(entry)) { > > >>> > + /* Relax a bit to prevent rapid rep= eated page faults */ > > >>> > + schedule(); > > >>> > > >>> The current task may be chosen in schedule(). So, I think that we > > >>> should use cond_resched() here. > > >>> > > >> > > >> I think if we are worried about current task got chosen again we can > > >> use schedule_timeout_uninterruptible(1) here. Isn't cond_resched sti= ll > > >> __schedule() and and it can even get omitted, so it should be "weake= r" > > >> IIUC. > > > > > > schedule_timeout_uninterruptible(1) will introduce 1ms latency for th= e > > > second task. That may kill performance of some workloads. > > It actually calls schedule_timeout so it should be a 1 jiffy latency, > not 1ms, right? > > /** > * schedule_timeout - sleep until timeout > * @timeout: timeout value in jiffies > ... > > But I think what we really want here is actually the set_current_state > to force yield CPU for a short period. The latency should be mild. I just forgot 1 jiffy >=3D 1 ms here, and uninterruptible should make it unable to wakeup until timeout... > > Just found that the cond_sched() in __read_swap_cache_async() has been > > changed to schedule_timeout_uninterruptible(1) to fix some live lock. > > Details are in the description of commit 029c4628b2eb ("mm: swap: get > > rid of livelock in swapin readahead"). I think the similar issue may > > happen here too. So, we must use schedule_timeout_uninterruptible(1) > > here until some other better idea becomes available. > > Indeed, I'll switch to schedule_timeout_uninterruptible(1). I've > tested and posted the result with schedule_timeout_uninterruptible(1) > before, it looked fine, or even better. But this should be still the same though, the minor/major fault ratio in previous test result [1] shows the race on ZRAM even with threads set to race on purpose, the chance is low, and thanks for the info on mentioning another commit! [1] https://lore.kernel.org/all/CAMgjq7BvTJmxrWQOJvkLt4g_jnvmx07NdU63sGeRMG= de4Ov=3DgA@mail.gmail.com/