From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 45A6CC48297 for ; Wed, 7 Feb 2024 03:22:28 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id B89C76B007D; Tue, 6 Feb 2024 22:22:27 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id B61566B007E; Tue, 6 Feb 2024 22:22:27 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id A02056B0081; Tue, 6 Feb 2024 22:22:27 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12]) by kanga.kvack.org (Postfix) with ESMTP id 905DA6B007D for ; Tue, 6 Feb 2024 22:22:27 -0500 (EST) Received: from smtpin08.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay03.hostedemail.com (Postfix) with ESMTP id 3728BA039B for ; Wed, 7 Feb 2024 03:22:27 +0000 (UTC) X-FDA: 81763559934.08.F30B1AB Received: from mail-lj1-f176.google.com (mail-lj1-f176.google.com [209.85.208.176]) by imf20.hostedemail.com (Postfix) with ESMTP id 4EBD81C0023 for ; Wed, 7 Feb 2024 03:22:25 +0000 (UTC) Authentication-Results: imf20.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=LOw2ZJYs; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (imf20.hostedemail.com: domain of ryncsn@gmail.com designates 209.85.208.176 as permitted sender) smtp.mailfrom=ryncsn@gmail.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1707276145; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=8SfjAsyHe78ZWJcJN7VTTvSSClQB6d0Ch4i9jCFCMIU=; b=XGqIYVJCLKRco/zB7BrRrYnuvSSo9TEybDJ09ep1NLq7tCCLh1+fqVOzkTOgwrIEr4jv5v OHhlbzGFSsf2sF0ghmlddpVWURwQJLf+irOMiHVNJOUWQDV7MXnXLRw1yStZKW9enzxKzX zceyAUj8UjN9gTLmf88L1Mgmz//jIyE= ARC-Authentication-Results: i=1; imf20.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=LOw2ZJYs; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (imf20.hostedemail.com: domain of ryncsn@gmail.com designates 209.85.208.176 as permitted sender) smtp.mailfrom=ryncsn@gmail.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1707276145; a=rsa-sha256; cv=none; b=yIab6jCtDs7EJzDgikTJd3QAg2dXfSAEU3ctekyAIFrwyFuMOWDMhKawTXRRppudIaFY0j ySHYGHAR45BoqsUNJegSfQNDftVjoQDfRCPbqm5jm+WlCrpUCNjRTL8r7SkQB5xttBmIba vpJTViRA3J7nLlQDMbIrNZOYGL2cGws= Received: by mail-lj1-f176.google.com with SMTP id 38308e7fff4ca-2d0a236dae7so1645341fa.2 for ; Tue, 06 Feb 2024 19:22:24 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1707276144; x=1707880944; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=8SfjAsyHe78ZWJcJN7VTTvSSClQB6d0Ch4i9jCFCMIU=; b=LOw2ZJYsP2rC71hRTNSz9mBTJG1ja3zb+mdxQOVrVx6j2khpsNQQAwLJquzdE3RPW5 dWwcmfGj6S/8Cc5gDHuHR/vDjSW36h/lm6ktYL7oygusE5MbiO5MRhcwjayZJZZ7iwKm C34dBeXWlvved41iBYaZto68hIinfZVk/TelaxTDCb4wko1/CJMEnE+IMeXzjjQ1y6Fe DcGG0Bzu6QJyk+qrgRdIImsjlpCc/Caid/AeyspWGLY+YQ9y6MAlJuFWf0JirB8LO0HI DCPj2HOiqFI8ehtkZKKX3aaP+xhaLWo7KomxGLspY1dyMxBiL7r6EX+UUtjb4u60nH00 NYvg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1707276144; x=1707880944; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=8SfjAsyHe78ZWJcJN7VTTvSSClQB6d0Ch4i9jCFCMIU=; b=P/dpPHVvTE1Gj/XPXRYMOx9vZVL1uFt2mchcT3BP9pZHbnUzoU6uPV2rqlL/72lM7M iLXNwR7X6KzeYEkZFd9BRE8vz/g1aIpnDitMOThcPfuSjKmof9aQCukSajFr8n8lwT// vpwtUVgiQ0v9ly0xoz5wt43V8lp1BN0mwNUcyLTneFfkw0MRzmtDZhOBMX9winWp7wat Wz9SsZmwVGSSimkr+PRrUwiJ9ZwDt5wfdq9rbqEQXhXu/5yNtad7+mUkz9g07vI6kRI0 Z41NqAksW7TgQFtuuICwPBLUHEmvuDt94SDbJa5w+ST8jpKQjBjDnfFXYfphzJTjR9Qc ZprQ== X-Gm-Message-State: AOJu0YwRz3t/nD9x6dJck3zc/4qPjE0EKlpz0+5v68ulXB3V2J+JGgex fiICfBwNe7yPSahXW7uFEAjatOrvfG+woRaDBzm4rGt7hp0ecCJAbcYjIxc6Ic8F2tpYc4o3wgo kcsRBINJxiPw5FVGyZMI0bym525o= X-Google-Smtp-Source: AGHT+IGuCo/YzpCdSBEuSCbLbhhevXbaJ6FAZVtPsU8iP9rnc8AmpdYdhjzYQrwQPM4AktT5rfPdXI8pqSsxvVv9S1c= X-Received: by 2002:a2e:a273:0:b0:2d0:9322:8d0f with SMTP id k19-20020a2ea273000000b002d093228d0fmr2322153ljm.26.1707276143429; Tue, 06 Feb 2024 19:22:23 -0800 (PST) MIME-Version: 1.0 References: <20240206182559.32264-1-ryncsn@gmail.com> In-Reply-To: From: Kairui Song Date: Wed, 7 Feb 2024 11:22:05 +0800 Message-ID: Subject: Re: [PATCH v2] mm/swap: fix race when skipping swapcache To: Minchan Kim Cc: linux-mm@kvack.org, Andrew Morton , "Huang, Ying" , Chris Li , Yu Zhao , Barry Song , SeongJae Park , Hugh Dickins , Johannes Weiner , Matthew Wilcox , Michal Hocko , Yosry Ahmed , David Hildenbrand , stable@vger.kernel.org, linux-kernel@vger.kernel.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspam-User: X-Rspamd-Server: rspam12 X-Rspamd-Queue-Id: 4EBD81C0023 X-Stat-Signature: sjx793aopbbpiz1wa3z9rd9z56rcsyoq X-HE-Tag: 1707276145-857916 X-HE-Meta: U2FsdGVkX1+A0Gzv8j7bpjuT34BitD/U/v/t/97dQ4IdOGXIYj9/Wv84ty01cdSb54lC3OaoIfly7s9EPmgbDg0yRnUNrfqCAAvNsiyf3UszMxu995rV6d9kSNxoeVdsuRWdQWBzAc/IDyFSGbcqq6Q6Y2g05edWrhoUebI5gt6vIspmsjSDd/aALpSYepU1nW+9J3MQvKqfTk5WqMmXp2+MHS1578eHITF78gjKfdmnft+f38JcsUndih5wlEQZdd0bCNzqZE2gHsozSjtO2gXay9X1j3TlLMe9Q1TjcCHz6sX2Ysw4QSDln3OPfemLocgAfpNyJO/paRoxIV8dbAyXCICsGm+L7VShSB6aB3oFJ4nVxx1lurWa1k7PkJOdK+n3REB7QUJNK9VbMdp9gcGfh46OKHDvCnfXvLgjTBKCtRWhUW0XVSs5B2PfSAduTICbGf8EEsBeRiRglyN6V1/eZOkXHHXUKcfg1t6/tyy2oPOajK0U4ZdjbOKX43zfjxUfW/k2xlNcBIxpEOV7WrhD04HJINpw0zORZGm4GuTKFVeLpSSiuw2pJfzVUHfYbQOavaftRVTlfZD4xLbYh8S4gwuCNbKwfTE57wAmBPUC5fhrq4ajezmI+s9GYwCeA8DcE8oXlepAQ87PxD1ZJESR82dKFVyDnbvIJepII4mTu4iTFdzbsVAw8ClWthdCKtJeAkL3DRAYpJepftMIZBXUGdfg/3QIRJG/+HLnyuHEdxJMGHiAf4I6u1DSAxImnxxO6nJgkMBcGhohkQLEMDIsbFaWFBMIwiOu5XcZvQcBQWusOwd+KdNYtFt6IS8CWY9iOnNl6zlpSKxU4r2Am0DG28tUI4DqZtxsxnmNDNzn0wQDwezRJ+uhEZsUz9mh2v74mJ7Xs9iSJcmgAIVKAF9nTvbQb4vheiGTZntIgHrX9cB82whBqdrCKmEMIFM8kt5o2WrXpKoBA2cxtEh YyDV1jqx LxXfTCTlXK2AqJO2NfTxrkPfjR+9gWUSBKScJwS6p7WCx6374fcOsbZuCedi3oATbXGxOd920CO+WZ4d+DxMLBrwjA0IOByWFxAqAOZDAn3r6wv28HY8RD+Dsf4Z70Nor/01+1RbdHugGNJ4Tv93IG5pXltJrJI+jiHpJlv2KXUCibufAFqzh+WRg+v+DWn1aSDXa85jXCp+PvOcKV5a9mIrbaJLR8Yo6vPpBmskTzRI1R5M0ae7eI8VZA1nzgpWqoShE3UC3YEi9r1h7ybJhp6IHdE8uK6gPbh+mjwJZNFOLPL4cSKAdxR67MZDA03h6t6t2gVwoxpvr4UMphh+ufoo2/FkRnCCWI5N/f+UZeroyviX8e/fDEJ/sRsKAVN/DDk987QNYJblZWs1kyL3VbD5E9IQPjE5DyF61hOBrrDOdC4xhT7LzDxae2jo3+UQI/GA2nPa3kSdtEp3UcBbUj32krkKEcvY0VuIG05oWOApR1yN2x0lWdO2v7mbBqOVpeBcIwT9ZWl6JkJmKOd9UoZOFF07wj60SuPMK X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Wed, Feb 7, 2024 at 7:02=E2=80=AFAM Minchan Kim wro= te: > > On Wed, Feb 07, 2024 at 02:25:59AM +0800, Kairui Song wrote: > > From: Kairui Song > > > > When skipping swapcache for SWP_SYNCHRONOUS_IO, if two or more threads > > swapin the same entry at the same time, they get different pages (A, B)= . > > Before one thread (T0) finishes the swapin and installs page (A) > > to the PTE, another thread (T1) could finish swapin of page (B), > > swap_free the entry, then swap out the possibly modified page > > reusing the same entry. It breaks the pte_same check in (T0) because > > PTE value is unchanged, causing ABA problem. Thread (T0) will > > install a stalled page (A) into the PTE and cause data corruption. > > > > One possible callstack is like this: > > > > CPU0 CPU1 > > ---- ---- > > do_swap_page() do_swap_page() with same entry > > > > > > swap_read_folio() <- read to page A swap_read_folio() <- read to page = B > > > > ... set_pte_at() > > swap_free() <- entry is free > > ^^^ > nit: From the recent code, I see swap_free is called earlier than set_pte= _at Thanks, will update the message. > > > > > > > > pte_same() <- Check pass, PTE seems > > unchanged, but page A > > is stalled! > > swap_free() <- page B content lost! > > set_pte_at() <- staled page A installed! > > > > And besides, for ZRAM, swap_free() allows the swap device to discard > > the entry content, so even if page (B) is not modified, if > > swap_read_folio() on CPU0 happens later than swap_free() on CPU1, > > it may also cause data loss. > > Thanks for catching the issue, folks! > > > > > To fix this, reuse swapcache_prepare which will pin the swap entry usin= g > > the cache flag, and allow only one thread to pin it. Release the pin > > after PT unlocked. Racers will simply busy wait since it's a rare > > and very short event. > > > > Other methods like increasing the swap count don't seem to be a good > > idea after some tests, that will cause racers to fall back to use the > > swap cache again. Parallel swapin using different methods leads to > > a much more complex scenario. > > > > Reproducer: > > > > This race issue can be triggered easily using a well constructed > > reproducer and patched brd (with a delay in read path) [1]: > > > > With latest 6.8 mainline, race caused data loss can be observed easily: > > $ gcc -g -lpthread test-thread-swap-race.c && ./a.out > > Polulating 32MB of memory region... > > Keep swapping out... > > Starting round 0... > > Spawning 65536 workers... > > 32746 workers spawned, wait for done... > > Round 0: Error on 0x5aa00, expected 32746, got 32743, 3 data loss! > > Round 0: Error on 0x395200, expected 32746, got 32743, 3 data loss! > > Round 0: Error on 0x3fd000, expected 32746, got 32737, 9 data loss! > > Round 0 Failed, 15 data loss! > > > > This reproducer spawns multiple threads sharing the same memory region > > using a small swap device. Every two threads updates mapped pages one b= y > > one in opposite direction trying to create a race, with one dedicated > > thread keep swapping out the data out using madvise. > > > > The reproducer created a reproduce rate of about once every 5 minutes, > > so the race should be totally possible in production. > > > > After this patch, I ran the reproducer for over a few hundred rounds > > and no data loss observed. > > > > Performance overhead is minimal, microbenchmark swapin 10G from 32G > > zram: > > > > Before: 10934698 us > > After: 11157121 us > > Non-direct: 13155355 us (Dropping SWP_SYNCHRONOUS_IO flag) > > > > Fixes: 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronou= s device") > > Reported-by: "Huang, Ying" > > Closes: https://lore.kernel.org/lkml/87bk92gqpx.fsf_-_@yhuang6-desk2.cc= r.corp.intel.com/ > > Link: https://github.com/ryncsn/emm-test-project/tree/master/swap-stres= s-race [1] > > Signed-off-by: Kairui Song > > Reviewed-by: "Huang, Ying" > > Acked-by: Yu Zhao > > > > --- > > Update from V1: > > - Add some words on ZRAM case, it will discard swap content on swap_fre= e so the race window is a bit different but cure is the same. [Barry Song] > > - Update comments make it cleaner [Huang, Ying] > > - Add a function place holder to fix CONFIG_SWAP=3Dn built [SeongJae Pa= rk] > > - Update the commit message and summary, refer to SWP_SYNCHRONOUS_IO in= stead of "direct swapin path" [Yu Zhao] > > - Update commit message. > > - Collect Review and Acks. > > > > include/linux/swap.h | 5 +++++ > > mm/memory.c | 15 +++++++++++++++ > > mm/swap.h | 5 +++++ > > mm/swapfile.c | 13 +++++++++++++ > > 4 files changed, 38 insertions(+) > > > > diff --git a/include/linux/swap.h b/include/linux/swap.h > > index 4db00ddad261..8d28f6091a32 100644 > > --- a/include/linux/swap.h > > +++ b/include/linux/swap.h > > @@ -549,6 +549,11 @@ static inline int swap_duplicate(swp_entry_t swp) > > return 0; > > } > > > > +static inline int swapcache_prepare(swp_entry_t swp) > > +{ > > + return 0; > > +} > > + > > static inline void swap_free(swp_entry_t swp) > > { > > } > > diff --git a/mm/memory.c b/mm/memory.c > > index 7e1f4849463a..1749c700823d 100644 > > --- a/mm/memory.c > > +++ b/mm/memory.c > > @@ -3867,6 +3867,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > if (!folio) { > > if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && > > __swap_count(entry) =3D=3D 1) { > > + /* > > + * Prevent parallel swapin from proceeding with > > + * the cache flag. Otherwise, another thread may > > + * finish swapin first, free the entry, and swapo= ut > > + * reusing the same entry. It's undetectable as > > + * pte_same() returns true due to entry reuse. > > + */ > > + if (swapcache_prepare(entry)) > > + goto out; > > + > > /* skip swapcache */ > > folio =3D vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0= , > > vma, vmf->address, false)= ; > > @@ -4116,6 +4126,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > unlock: > > if (vmf->pte) > > pte_unmap_unlock(vmf->pte, vmf->ptl); > > + /* Clear the swap cache pin for direct swapin after PTL unlock */ > > + if (folio && !swapcache) > > + swapcache_clear(si, entry); > > out: > > if (si) > > put_swap_device(si); > > @@ -4124,6 +4137,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > if (vmf->pte) > > pte_unmap_unlock(vmf->pte, vmf->ptl); > > out_page: > > + if (!swapcache) > > + swapcache_clear(si, entry); > > folio_unlock(folio); > > out_release: > > folio_put(folio); > > What happens? > > do_swap_page > .. > swapcache_prepare() <- tured the cache flag on > > folio =3D vma_alloc_folio <- failed to allocate the folio > page =3D &foio->page; <- crash but it's out of scope from this patch > > .. > if (!folio) > goto unlock; > > .. > unlock: > swapcache_clear(si, entry) <- it's skipped this time. > > > Can we simply introduce a boolean flag to state the special case and > clear the cache state based on the flag? Good idea, that should make the code easier to understand.