From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id BD1A5C48297 for ; Tue, 6 Feb 2024 23:02:13 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 171BF6B0071; Tue, 6 Feb 2024 18:02:13 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 1215C6B0072; Tue, 6 Feb 2024 18:02:13 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 011076B0074; Tue, 6 Feb 2024 18:02:12 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12]) by kanga.kvack.org (Postfix) with ESMTP id E45DC6B0071 for ; Tue, 6 Feb 2024 18:02:12 -0500 (EST) Received: from smtpin08.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay02.hostedemail.com (Postfix) with ESMTP id 97459120BE6 for ; Tue, 6 Feb 2024 23:02:12 +0000 (UTC) X-FDA: 81762904104.08.EE26584 Received: from mail-pg1-f179.google.com (mail-pg1-f179.google.com [209.85.215.179]) by imf30.hostedemail.com (Postfix) with ESMTP id 8EB1780006 for ; Tue, 6 Feb 2024 23:02:10 +0000 (UTC) Authentication-Results: imf30.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=TIBDLT8Y; dmarc=fail reason="SPF not aligned (relaxed), DKIM not aligned (relaxed)" header.from=kernel.org (policy=none); spf=pass (imf30.hostedemail.com: domain of minchan.kim@gmail.com designates 209.85.215.179 as permitted sender) smtp.mailfrom=minchan.kim@gmail.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1707260530; h=from:from:sender:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=jfB77vC+4atj07bEaB/PEOGtH0kMjPFp0jYbulqqEQs=; b=wXr554jlmz85YcB+wUCMB/AvRQG4ChYctweyNEBOyq/YCSi35oQyCn1DPBF2XN7iZOzw7J xVeVJ0AS250QqPAMlZoAnf4to3ywX1OPRHGNuSdH6Ck61aDMmAvkG4yr6dF+MsAZ39UID8 eBFOAKH7R2vtOfukWotwSQ0uZe3Gge0= ARC-Authentication-Results: i=1; imf30.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=TIBDLT8Y; dmarc=fail reason="SPF not aligned (relaxed), DKIM not aligned (relaxed)" header.from=kernel.org (policy=none); spf=pass (imf30.hostedemail.com: domain of minchan.kim@gmail.com designates 209.85.215.179 as permitted sender) smtp.mailfrom=minchan.kim@gmail.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1707260530; a=rsa-sha256; cv=none; b=QQG1cXB2+Gj5BenV79IsQuqo+OmjEDWrj3WF8c/+f6zSCr/Vaw48ZdxT/Gcaevm0WW8dtu v6T2bBh9y4tfglhhoPmVyG7kAJdMco3j33H1iv8mYZwCAsqIpp49/9xj9qeSTs4o+jg+Id uhC/YucToTLGxQTaodlCowYvEjKq8X8= Received: by mail-pg1-f179.google.com with SMTP id 41be03b00d2f7-5cdf76cde78so5517215a12.1 for ; Tue, 06 Feb 2024 15:02:10 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1707260529; x=1707865329; darn=kvack.org; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:sender:from:to:cc:subject:date:message-id :reply-to; bh=jfB77vC+4atj07bEaB/PEOGtH0kMjPFp0jYbulqqEQs=; b=TIBDLT8Y85+CGrV+9WdjAd3Xb8EwEXs4I83Yi7pgJ6J2f7Pn64bGRPrkYJ0387hJ+F II59RmhjwRW4HNqRK2Gx2KY2P9/eiIp5fS2R5QfdyX6qhheEuOO3MP+NQ9poEn4Vn/mX 70LNEviCEspjj2OQFPZGK2fWjaydC97VcTPCirB5OAAzK1r/RKkbXufjvz5DJXpiulbr UupFV5gnGxsVznmtdpCAHwzntmA8QUNZ69H5IxDUi+TifvWFc6vFZuBAo0nmsBCJdP4Z Km+UZEoKHF8fAQYCgOkM9bNLB8sI2SAmdoEikQt2rboCj5LzrDEjrIMwYq1lwXi9VMEO BeKA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1707260529; x=1707865329; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:sender:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=jfB77vC+4atj07bEaB/PEOGtH0kMjPFp0jYbulqqEQs=; b=cS/bLuZI/vDCQa4vtt3WEgSq0U3uwpEGIxHi1QbfJ2SLzcnW4zVxC8BzflxOh14KKu 0pGhK1BmcRMR4zZUVb9UQBFIV35NY936eXTeH7vXtNY4RSPYMhUgdXJlj9N07ljlcRs3 9hLFgnv5yeKSowL0HeWlanBFIMEBvqWVbYP1AN18Gq9oqb1glqv/YpbSVOqdpLTYPei2 v54UQCAshCC+SjTtopxAYcx6SkN8O+JoSYNSqDZFUR6ogSOHAKVDpg9vTTVtzGiQhVZa /YrlR7ABAXJrMKcc1K6fCOKOxl/TYhU7s3HZQHwouveq5f/rbe2B7jsulOtcHJ8y2m43 vqkQ== X-Gm-Message-State: AOJu0Yxtz8sKH+yxKgEGS9ArGPGB3Cvj7CvcYvEfIi7eL56sYWEzUKO6 kj5amxKiT4sJDnV+iXJq7UIkWz4kXXisX66RRoWqqw+yABpnNVk3 X-Google-Smtp-Source: AGHT+IG9sAQLbgCZ49HtvvjlKzpnRjY2k/ZxhsoMj7HykxM2KwYB8Rix26rWUa7GeoQmU7ekJD9VcA== X-Received: by 2002:a05:6a20:5604:b0:19c:93b7:4a90 with SMTP id ir4-20020a056a20560400b0019c93b74a90mr2907659pzc.38.1707260529248; Tue, 06 Feb 2024 15:02:09 -0800 (PST) X-Forwarded-Encrypted: i=1; AJvYcCVJvrRRIoPsUmbKLAJEq6IyxnWrhlqJ7Mffj1snuJwfW+Lo0glumlocplCB5c0PdbAJgLdOjpkBGuSDP1QP9oLNL78WHf8UGTQY0XrIXi+tR3Ecd7q2Bldz/FFIRU1qZwgx/k99XmpZdF955tdJOmSnhrxcLUu1KXzTHtH7l13q2ePcF6B9EmheaOi2ykjRaT6UlA/fr3NFij47HOKW2biIs4ESz2wtZWEyzf+UXXbiPJ6GPCI+OOq/eJo11Gxa6OQBt7LpC2NhzlMh9pJLNkpSd+WC/HUuHgAKMs1KMcuXXvTgqO1ifVLqlAFZIQ1hwukoGSKlEKGKnph06pjOzjG9klT422ZcExVCV8dKefBwGF3MxP0u0SF3FzLGt5pk9lF8TbvBWvihFPAE0NMJXv3h6ZibVxXbJ3L/2+JEH6eKjgkfWmNmIhLJOTRW0JF+qOG+xyiVfMg= Received: from google.com ([2620:0:1000:8411:af44:f774:c4ae:c82a]) by smtp.gmail.com with ESMTPSA id z2-20020a62d102000000b006e0651ec052sm29048pfg.32.2024.02.06.15.02.07 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 06 Feb 2024 15:02:08 -0800 (PST) Date: Tue, 6 Feb 2024 15:02:05 -0800 From: Minchan Kim To: Kairui Song Cc: linux-mm@kvack.org, Andrew Morton , "Huang, Ying" , Chris Li , Yu Zhao , Barry Song , SeongJae Park , Hugh Dickins , Johannes Weiner , Matthew Wilcox , Michal Hocko , Yosry Ahmed , David Hildenbrand , stable@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH v2] mm/swap: fix race when skipping swapcache Message-ID: References: <20240206182559.32264-1-ryncsn@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20240206182559.32264-1-ryncsn@gmail.com> X-Rspamd-Queue-Id: 8EB1780006 X-Rspam-User: X-Rspamd-Server: rspam05 X-Stat-Signature: kxt49ai981rwcp46ijm4qxq1oza7fw9p X-HE-Tag: 1707260530-369127 X-HE-Meta: U2FsdGVkX1/NEpFUQFnYgSKOpXxrDB9JCE/SD+rGgwAULOnj1LrJXdYusTZXd/oYKXFicd98tg/Sw524S2w6iwLEykv20IQGyU7JJFojOaVls7VKKZOoP/5o9FpFl9bCf1vlVvELilhTiU944chNoWiYxsk4xORc7cb7akGzOnPFdWSv7vFSN6275FrZfz4248RsQvBE2pl4Ys+4ASHQpYl3jzugV9CdGbBkOBFqKvxvAQ8zgud7p4YCYFbOcZhpORuycsT64Ny4LxsGrUX//Hms3pct4DBfxloEMiigcqcsKoTkl7rQpctVfag8cMaPaM3Ash28EArtLi52als3R3ffCPhnvqEEIuMgeWEIxF+TnkHx3yCuJNpkBj61L3SQdK554vC6KH8rluIniz1RtFUfJsv7M8lGqC6CbY8MpcI1yPpJNcddNl/39Dxor6bLYySeIzFgYM3UdX5a3QTUjUcI72mvnR/piFB2uSj4feYx3UoOJc22OX7R6JHqLpG9dZS2AZmJyUTG7+yV+ifdG/aO0p9XxumRMRH2L8BIf+zBlAak9Hv4FvlM3GP35v+XrUGFtHEB00QbS/P4hwltJ4H3uTPWlNW0oFNRvqZNGkvUG37VL7V9fWtJu3KI3BLlYVf2BwrX1WKYFVEOEsXLBFKzFvkLVkZigBih/brVXaMArmQWwtq4/hc7AsPHXO5D1eDZ0EL35VZL/XduxnVvivoSnpreh4Nk+caiJcRlZV2DOXwcLPOHpI9YWqooFOQLdMEo8/3o5yimJqulK6ePCzY13716DMss5W/OT9HCdpu8HY7dHHozgFbr1tN7l9dgpldVnNQU4195Q5ToHu86mK2qdcK/w73UKk4MsT6tCD47uVWr65FCrmoIU2KBcUoqwFcuU5UEyWNEfm6AJ9aHihZql1jU55bMY2oqlCXBKkM2rIbSBxH+soO712FdJb/zhJmH5yqwGhNmA/XFRnF fTH7fEBD Ml0yp2u/kA008m6p7QICI7HzQq9JQ+j/r22/qvAkanjD4jEdxmo3sn0JJIK7ZiJbe5C/lQnG52putqk2TLbYFTtvlS5wxf7Gyh26kjX06Tcm1/K7uvDUj/agxSgqO2Z8XCpf+k1xMWCVMt8fQ4c4ksaVRYxoCep6QSE3ljHX8pDgGIbpvTaGKviF5K+zsKIn+SWWFg/C2mOtRXpwTsswgyLT9HDd8zaJnkeByx0ySGJ4RK1liFoWr1dCuptTesy5+0hWM+jZJS2c+m4F96z8SQYD7DIp+QFqrA94DDgcQdTOYhzHCdzl76ujm5fQfZWq0Be2o8sGKBsJFuxjHV+SLP1e6PeKBB+xGqImsAPaG24MAa7aO50xeKJ6Zig2IBjBFv9lr5kwWW5T0C3VH4zdAZFD47RLYHGOsKBnlfG7HZKGov5av8xseNgdVgGeJfVBayu8CrRQV8a93s9/PLZTwDdH0NZQKvXbFpn+foHHRAq3v+HGV2tabRt9BPNbgBiZSyN1MOURvSpDlZ1sn+NGhwngm/bkryAbiC8hzfsQnZooJf/AKsTXGN/vfCHF/di3hpETleYsWPUB4OJGXuLkniqoDkCc6GdsGScKE+3UIWyQwCwZatp3RTR/K6rotQpyetp/n1q297vxoJlg5FH+IQQpST7GED7txCqeaF3yGvjx27rjTO+bD3k0bCd22ugowQN7WWwUQbIgbpeNwzmZX25C7MPP19gLnt4DGFNR+oE++sqFSaypi5dGZlQrxWuOLDWr3+y5cfBjzcd52saJm4xa6a7p1NzBfXlDF X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Wed, Feb 07, 2024 at 02:25:59AM +0800, Kairui Song wrote: > From: Kairui Song > > When skipping swapcache for SWP_SYNCHRONOUS_IO, if two or more threads > swapin the same entry at the same time, they get different pages (A, B). > Before one thread (T0) finishes the swapin and installs page (A) > to the PTE, another thread (T1) could finish swapin of page (B), > swap_free the entry, then swap out the possibly modified page > reusing the same entry. It breaks the pte_same check in (T0) because > PTE value is unchanged, causing ABA problem. Thread (T0) will > install a stalled page (A) into the PTE and cause data corruption. > > One possible callstack is like this: > > CPU0 CPU1 > ---- ---- > do_swap_page() do_swap_page() with same entry > > > swap_read_folio() <- read to page A swap_read_folio() <- read to page B > > ... set_pte_at() > swap_free() <- entry is free ^^^ nit: From the recent code, I see swap_free is called earlier than set_pte_at > > > pte_same() <- Check pass, PTE seems > unchanged, but page A > is stalled! > swap_free() <- page B content lost! > set_pte_at() <- staled page A installed! > > And besides, for ZRAM, swap_free() allows the swap device to discard > the entry content, so even if page (B) is not modified, if > swap_read_folio() on CPU0 happens later than swap_free() on CPU1, > it may also cause data loss. Thanks for catching the issue, folks! > > To fix this, reuse swapcache_prepare which will pin the swap entry using > the cache flag, and allow only one thread to pin it. Release the pin > after PT unlocked. Racers will simply busy wait since it's a rare > and very short event. > > Other methods like increasing the swap count don't seem to be a good > idea after some tests, that will cause racers to fall back to use the > swap cache again. Parallel swapin using different methods leads to > a much more complex scenario. > > Reproducer: > > This race issue can be triggered easily using a well constructed > reproducer and patched brd (with a delay in read path) [1]: > > With latest 6.8 mainline, race caused data loss can be observed easily: > $ gcc -g -lpthread test-thread-swap-race.c && ./a.out > Polulating 32MB of memory region... > Keep swapping out... > Starting round 0... > Spawning 65536 workers... > 32746 workers spawned, wait for done... > Round 0: Error on 0x5aa00, expected 32746, got 32743, 3 data loss! > Round 0: Error on 0x395200, expected 32746, got 32743, 3 data loss! > Round 0: Error on 0x3fd000, expected 32746, got 32737, 9 data loss! > Round 0 Failed, 15 data loss! > > This reproducer spawns multiple threads sharing the same memory region > using a small swap device. Every two threads updates mapped pages one by > one in opposite direction trying to create a race, with one dedicated > thread keep swapping out the data out using madvise. > > The reproducer created a reproduce rate of about once every 5 minutes, > so the race should be totally possible in production. > > After this patch, I ran the reproducer for over a few hundred rounds > and no data loss observed. > > Performance overhead is minimal, microbenchmark swapin 10G from 32G > zram: > > Before: 10934698 us > After: 11157121 us > Non-direct: 13155355 us (Dropping SWP_SYNCHRONOUS_IO flag) > > Fixes: 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous device") > Reported-by: "Huang, Ying" > Closes: https://lore.kernel.org/lkml/87bk92gqpx.fsf_-_@yhuang6-desk2.ccr.corp.intel.com/ > Link: https://github.com/ryncsn/emm-test-project/tree/master/swap-stress-race [1] > Signed-off-by: Kairui Song > Reviewed-by: "Huang, Ying" > Acked-by: Yu Zhao > > --- > Update from V1: > - Add some words on ZRAM case, it will discard swap content on swap_free so the race window is a bit different but cure is the same. [Barry Song] > - Update comments make it cleaner [Huang, Ying] > - Add a function place holder to fix CONFIG_SWAP=n built [SeongJae Park] > - Update the commit message and summary, refer to SWP_SYNCHRONOUS_IO instead of "direct swapin path" [Yu Zhao] > - Update commit message. > - Collect Review and Acks. > > include/linux/swap.h | 5 +++++ > mm/memory.c | 15 +++++++++++++++ > mm/swap.h | 5 +++++ > mm/swapfile.c | 13 +++++++++++++ > 4 files changed, 38 insertions(+) > > diff --git a/include/linux/swap.h b/include/linux/swap.h > index 4db00ddad261..8d28f6091a32 100644 > --- a/include/linux/swap.h > +++ b/include/linux/swap.h > @@ -549,6 +549,11 @@ static inline int swap_duplicate(swp_entry_t swp) > return 0; > } > > +static inline int swapcache_prepare(swp_entry_t swp) > +{ > + return 0; > +} > + > static inline void swap_free(swp_entry_t swp) > { > } > diff --git a/mm/memory.c b/mm/memory.c > index 7e1f4849463a..1749c700823d 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -3867,6 +3867,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > if (!folio) { > if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && > __swap_count(entry) == 1) { > + /* > + * Prevent parallel swapin from proceeding with > + * the cache flag. Otherwise, another thread may > + * finish swapin first, free the entry, and swapout > + * reusing the same entry. It's undetectable as > + * pte_same() returns true due to entry reuse. > + */ > + if (swapcache_prepare(entry)) > + goto out; > + > /* skip swapcache */ > folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, > vma, vmf->address, false); > @@ -4116,6 +4126,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > unlock: > if (vmf->pte) > pte_unmap_unlock(vmf->pte, vmf->ptl); > + /* Clear the swap cache pin for direct swapin after PTL unlock */ > + if (folio && !swapcache) > + swapcache_clear(si, entry); > out: > if (si) > put_swap_device(si); > @@ -4124,6 +4137,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > if (vmf->pte) > pte_unmap_unlock(vmf->pte, vmf->ptl); > out_page: > + if (!swapcache) > + swapcache_clear(si, entry); > folio_unlock(folio); > out_release: > folio_put(folio); What happens? do_swap_page .. swapcache_prepare() <- tured the cache flag on folio = vma_alloc_folio <- failed to allocate the folio page = &foio->page; <- crash but it's out of scope from this patch .. if (!folio) goto unlock; .. unlock: swapcache_clear(si, entry) <- it's skipped this time. Can we simply introduce a boolean flag to state the special case and clear the cache state based on the flag? if (swapcache_prepare()) goto out; need_clear_cache = true; out_path: if (need_clear_cache) swapcache_clear