From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id C091AC48295 for ; Mon, 5 Feb 2024 14:16:59 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 1A13C6B0074; Mon, 5 Feb 2024 09:16:59 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 1525E6B0075; Mon, 5 Feb 2024 09:16:59 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 018CF6B0078; Mon, 5 Feb 2024 09:16:58 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0016.hostedemail.com [216.40.44.16]) by kanga.kvack.org (Postfix) with ESMTP id E6D8A6B0074 for ; Mon, 5 Feb 2024 09:16:58 -0500 (EST) Received: from smtpin26.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay07.hostedemail.com (Postfix) with ESMTP id A36B3160232 for ; Mon, 5 Feb 2024 14:16:58 +0000 (UTC) X-FDA: 81757951716.26.C92D6FF Received: from mail-lj1-f182.google.com (mail-lj1-f182.google.com [209.85.208.182]) by imf18.hostedemail.com (Postfix) with ESMTP id C13931C0002 for ; Mon, 5 Feb 2024 14:16:56 +0000 (UTC) Authentication-Results: imf18.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=T80qvgdn; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (imf18.hostedemail.com: domain of ryncsn@gmail.com designates 209.85.208.182 as permitted sender) smtp.mailfrom=ryncsn@gmail.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1707142616; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=y693qVlKnW7MrGgRCQ2yznDJT6ahGyd4dQiskVCDqY8=; b=46S8oWP4J16AVdJiyl4WsB8LnIcPLD8kEc1FuiTlG373ESf7JGM7Pqx3KRb2ef/7fDQ4Ut TNInQqRTdtUTV1KsFOXP7sqBK+5+1JR4cY+4brselmBophdhQR0iaOTcii7gHgsEk+3MzU 7oX8Ax60RlAA3i6GMhEjFBLxkRVvpHs= ARC-Authentication-Results: i=1; imf18.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=T80qvgdn; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (imf18.hostedemail.com: domain of ryncsn@gmail.com designates 209.85.208.182 as permitted sender) smtp.mailfrom=ryncsn@gmail.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1707142616; a=rsa-sha256; cv=none; b=hjVEw7zV9ZYQlTk5EXUn3LFDNH4JucbMlpodcE+UewTcexBuID8nQd21qVqmDCjGJqZu+4 xVflm+bJZOcApkaN4KbhKoWuSSrnwl+x/23akNMts4A/gNg8fjIe+5eEZJSCFD7jirFPol R97tkYnWThQhgKzl95vT6YRFb9U+78s= Received: by mail-lj1-f182.google.com with SMTP id 38308e7fff4ca-2d0a96bad85so17840571fa.1 for ; Mon, 05 Feb 2024 06:16:56 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1707142615; x=1707747415; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=y693qVlKnW7MrGgRCQ2yznDJT6ahGyd4dQiskVCDqY8=; b=T80qvgdntecmrtHG7gVdoVHtRuvPtydGbgPtqeduHun7/CcFaL2k6sZqlcf28BfOMU ZiJirpVzR3dqqQDP4q5xRW3MSkE6f0d1Ec0UfreBzs9baEkWBmUj+4HR949OvJTOfOMt wLN22FuyqIpdu71UUPvEZLQahIYoPrCx5bLZv0rPI25w/xjMP+8J9bvOj2hd9RXL9b5L 3q1ooIaMP1FbjXX2afMGZbHC6mOa0CRU1beQQlRP0bEJTeAK0VsYm8eGeZJ/l4sUDcQT aCjoto4ZRlkuRimUpqQm5wvjliuieAKbFLlMffpt8BJaaYzeGHthLPmdSoq5oJzzbbXb paXg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1707142615; x=1707747415; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=y693qVlKnW7MrGgRCQ2yznDJT6ahGyd4dQiskVCDqY8=; b=T44v7AQER79jQX+VDzc6TGS2yNBtZjpLpXyXKgJTSkBN1sxfEdFOdo9Yk4BQH5ysYC +s0PtMDQsPjZuwpUmu7FyiRQUl941UjgvMYNs6HSRIx1xE/8cnig6B8UhymZtzbtu7KU 66/p9Rnhu3xsRSUh69l1d36N0mrdyfPMRMgq/FhLqKvxr31yqCzjzFC6tQQkGWE47kPl xrMDvD0czZ81tojwnsFkVLCpYMj/pPk61ew+fE1HB5VA57qWx/JJ8OFlO7nUEObQSRWl 2KRKmxMU+rdp0AGp29wOeEydqxaoo2FlTFK1NBRVII/7HNrC68QyTOgrQCq6AD8n+W7Y Q0mA== X-Gm-Message-State: AOJu0YyPEA5TrG5zo0FxjxDt82Fa5Hztnk/zl+NcpcnMA0w+vyi9/Cbc IKANsFKOayc2WeOGE9t51Mth0BtQ5nf/jmiX5FDxTMqqQFv5p4nVcUUc47xDAel7ryCzOxJxwVS pomqfjbj4OfJxlifsSwXzUP2a6cM= X-Google-Smtp-Source: AGHT+IEUSj0T4591sSE8qKJHUcVrqra1VsOm7YnWe7xFhJQZ5Y+jhiVoNzLmFx68LCegotBhkQ3TwZ7pkDUoHSnNs1k= X-Received: by 2002:a2e:9d51:0:b0:2d0:50c4:cafe with SMTP id y17-20020a2e9d51000000b002d050c4cafemr7428765ljj.37.1707142614642; Mon, 05 Feb 2024 06:16:54 -0800 (PST) MIME-Version: 1.0 References: <20240205110959.4021-1-ryncsn@gmail.com> In-Reply-To: From: Kairui Song Date: Mon, 5 Feb 2024 22:16:36 +0800 Message-ID: Subject: Re: [PATCH] mm/swap: fix race condition in direct swapin path To: Barry Song <21cnbao@gmail.com> Cc: linux-mm@kvack.org, Andrew Morton , "Huang, Ying" , Chris Li , Minchan Kim , Hugh Dickins , Johannes Weiner , Matthew Wilcox , Michal Hocko , Yosry Ahmed , David Hildenbrand , Yu Zhao , linux-kernel@vger.kernel.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspam-User: X-Rspamd-Server: rspam12 X-Rspamd-Queue-Id: C13931C0002 X-Stat-Signature: 949ej51emndytzhww3j8psooqqjn6epd X-HE-Tag: 1707142616-537771 X-HE-Meta: U2FsdGVkX19ml6InuYcO3f+YrOMqrSrrAJr4/PyuItEHOu2SPKNgA3TtIDHsavpYfkY0d2YSp0Pf78l/8myNir8aYT6EH+cSg3/CkaRy3PliiBo8mSNTxEK+5ErvppQIbuAnYbxvUQJbdWPGQ1IeTJ/f+HckLPOpopSEYtW4hEyTL9CHL8e0s/KBsNnfNlVpOmjA9qB+K/gR9w6MqS+8ONodMTGgSxF7b5eUIm5mA8CmDERYsb3a/EvP4v20KvrTauSn4wW5zhBRBDKUmqkPeHhBVh82lsiIyVuHtPLLXN6dRSoRGYwhO+Pk8IuC3Jz5+qD4pCky0KPht6dXcoKeNY40GRMxtboiY+Yhyc+yWq6OvQypdlUfyY0J+kFW8idnaqhdybIMgVGQvFpRwXzb6EUxZdpHD5hrZgj/JIhAAfz1M7M6uBmYVhkgaOBsW8odzWO9KG4zYtRvHDxPjerCEGI4K6vPc2AsOj/xbp+Eaf4X1pr+PPti5tTaFKdJeVQUzI2ID9qoP6EbnP2j//xw+76sodVDekFFvUSKa0uy75Vei9ni2c7dZdBDvDdv5oPoQAE4vnmGN0WZwck0YzjgD1QMIG34lylpAVhNqYES3cEnDGkpyKtL7syGep1wOzTxDMXADx5KO9hNcqf8vcsjNZNkpL7Nz83j2Rxbhgrs293Tc7v2dnWC8TyfkqpV2kE7a1yWwQM9OUvw0pqo3pIpW7TwPG+sXs41eU+MQ/OJ+1MjbZ0+FRzMHAlfrQlwJuu+Gzeq60839YFT5s9Ot3RM6U6Vsql1Fj7wgn6xl4ZLw5loRgyhHK/Kywpp9p1Jepbgb95h3wrSMp3j3U82h2s0bvgUFsBM2RoIJrUBBmYQA9RAUYOLmrRc6kKiVJtk3phtGWhRs2srkbpGb3M0Q3Xvg26ITeH2ezJfNZBhZtxzIFtSvxfp0IxTWAkcfodfO1P4WjRYpSzUHUXXKg/B7Ae lFCWnUy6 Hh+rupmfto1F5muExPAXPL/DctXAn1fTFupLqtTzXAbMd/yfQ9Qkb4zkAhoMihmq/zjMNUPFVDdjW88ZyYeaiuZCC8y31UFCueW6255W119F5y/cl4OC6DPIYjr8H8WUcbtiTc5+XcOMSI3wGzjBDf08JzYXKfRBvOknoZawrGRXBTvai4qu26QNmdfqEqd5/cce5mTqGaZsqw785vL7FRqu+0cQF5kgWDo46wBm+HowSkjOcYQkO1C0nW2ebKGcveMtZc2fp1lPtX6vdLpUZJ2JnOIGcfU/QccAbfKhKPyL16E0S01gbOy9wRhOTTUyMQsUnjdugdC7hdnmRs/OyO7Rgo28iFDkQGzPaA5kAY4mn/iB65F4l6LvROTPWsAssSIbdfTGUXp9bQJ7lFxEGp9bkQERVk8hr0psF0XYWLxGTxcNbwVHGo7ZZxNoqY/HjSgEANlvI5c1dJEv2kRRPbhjcVYBBQ7l2DONO9lqn1I8xqRk1MJU9Y+J3ParI+JGU3Npg8YJZQu/JORlDluK1mCmFCt/DElC0BM+3tDnyYlLCL6qRmWo3roQu29YK/eRJBGIeRuUmaZUDa+s= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Mon, Feb 5, 2024 at 8:25=E2=80=AFPM Barry Song <21cnbao@gmail.com> wrote= : > Hi, Barry Thanks for the comments. > On Mon, Feb 5, 2024 at 7:10=E2=80=AFPM Kairui Song wro= te: > > > > From: Kairui Song > > > > In the direct swapin path, when two or more threads swapin the same ent= ry > > at the same time, they get different pages (A, B) because swap cache is > > skipped. Before one thread (T0) finishes the swapin and installs page (= A) > > to the PTE, another thread (T1) could finish swapin of page (B), > > swap_free the entry, then modify and swap-out the page again, using the > > Even if T0's swap_read_folio is later than T1, problems can still happen= . > after T1 swaps in and sets ptes, then frees the swap entry. T0 reads zRAM > later. it will get zero as zRAM will fill zero for freed slot, > > static int zram_read_from_zspool(struct zram *zram, struct page *page, > u32 index) > { > ... > > > value =3D handle ? zram_get_element(zram, index) : 0; > mem =3D kmap_local_page(page); > zram_fill_page(mem, PAGE_SIZE, value); > kunmap_local(mem); > return 0; > } > } > > Even though nobody modifies the data before the page is swapped out to th= e > same swap offset as before tT0's orig_pte, T0's pte_same check is still t= rue > and T0 will map filled zeroed page to pte. > > so there is more than one risk besides modified data losses. Thanks for the complement, I think this is true, and it shares the same problem of the entry reuse, so this patch also covered this potential race. I can add more words later to cover this case as well. > > > same entry. It break the pte_same check because PTE value is unchanged, > > causing ABA problem. Then thread (T0) will then install the stalled pag= e > > (A) into the PTE so new data in page (B) is lost, one possible callstac= k > > is like this: > > > > CPU0 CPU1 > > ---- ---- > > do_swap_page() do_swap_page() with same entry > > > > > > swap_readpage() <- read to page A swap_readpage() <- read to page B > > > > .. set_pte_at() > > swap_free() <- Now the entry is fre= ed. > > > > > > pte_same() <- Check pass, PTE seems > > unchanged, but page A > > is stalled! > > swap_free() <- page B content lost! > > set_pte_at() <- staled page A installed! > > > > To fix this, reuse swapcache_prepare which will pin the swap entry usin= g > > the cache flag, and allow only one thread to pin it. Release the pin > > after PT unlocked. Racers will simply busy wait since it's a rare > > and very short event. > > > > Other methods like increasing the swap count don't seem to be a good > > idea after some tests, that will cause racers to fall back to the > > cached swapin path, two swapin path being used at the same time > > leads to a much more complex scenario. > > > > Reproducer: > > > > This race issue can be triggered easily using a well constructed > > reproducer and patched brd (with a delay in read path) [1]: > > > > With latest 6.8 mainline, race caused data loss can be observed easily: > > $ gcc -g -lpthread test-thread-swap-race.c && ./a.out > > Polulating 32MB of memory region... > > Keep swapping out... > > Starting round 0... > > Spawning 65536 workers... > > 32746 workers spawned, wait for done... > > Round 0: Error on 0x5aa00, expected 32746, got 32743, 3 data loss! > > Round 0: Error on 0x395200, expected 32746, got 32743, 3 data loss! > > Round 0: Error on 0x3fd000, expected 32746, got 32737, 9 data loss! > > Round 0 Failed, 15 data loss! > > i am also reading these codes recently. It is quite unbelievable this > is really happening > now. as freeing swaps is returning slot to slots_ret, but allocating > swap is from slots. > so if swapfile is large, the chance that the newly allocated swap was > a recently freed swap > is close to 0%. but yes, the code does have the risk. Indeed, for reproducing I used a 32M swap device, and the data being swapped in/out is large enough to make full use of it. So the reproduce rate is increased by a lot. It's not a completely fictional test as some low end device do have smaller swaps, and real world race could happen in many strange ways. > > > > This reproducer spawns multiple threads sharing the same memory region > > using a small swap device. Every two threads updates mapped pages one b= y > > one in opposite direction trying to create a race, with one dedicated > > thread keep swapping out the data out using madvise. > > > > The reproducer created a reproduce rate of about once every 5 minutes, > > so the race should be totally possible in production. > > > > After this patch, I ran the reproducer for over a few hundred rounds > > and no data loss observed. > > > > Performance overhead is minimal, microbenchmark swapin 10G from 32G > > zram: > > > > Before: 10934698 us > > After: 11157121 us > > Non-direct: 13155355 us (Dropping SWP_SYNCHRONOUS_IO flag) > > > > Fixes: 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronou= s device") > > Link: https://github.com/ryncsn/emm-test-project/tree/master/swap-stres= s-race [1] > > Signed-off-by: Kairui Song > > I will also run your patch on my problem I reported today[1]. will update > the result to you this week. > > [1] https://lore.kernel.org/linux-mm/d4f602db-403b-4b1f-a3de-affeb40bc499= @arm.com/T/#m41701d0c0e127cdae636e97a13ab521364a810f4 > Thanks!