From: Yu Zhao
Date: Mon, 5 Feb 2024 23:02:48 -0700
Subject: Re: [PATCH] mm/swap: fix race condition in direct swapin path
To: Kairui Song
Cc: linux-mm@kvack.org, Andrew Morton, "Huang, Ying", Chris Li, Minchan Kim, Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko, Yosry Ahmed, David Hildenbrand, linux-kernel@vger.kernel.org
In-Reply-To: <20240205110959.4021-1-ryncsn@gmail.com>

On Mon, Feb 5, 2024 at 4:10 AM Kairui Song wrote:
>
> From: Kairui Song
>
> In the direct swapin path, when two or more threads swap in the same entry

There are no other places referring to that path as "direct" swapin.

I'd rephrase it as: "When skipping swapcache for SWP_SYNCHRONOUS_IO, ...",
and similarly for the subject: "mm: fix race when skipping swapcache".

> at the same time, they get different pages (A, B) because swap cache is
> skipped. Before one thread (T0) finishes the swapin and installs page (A)
> to the PTE, another thread (T1) could finish swapin of page (B),
> swap_free the entry, then modify and swap out the page again, using the
> same entry. This breaks the pte_same check because the PTE value is
> unchanged, causing an ABA problem. Thread (T0) will then install the stale
> page (A) into the PTE, so the new data in page (B) is lost. One possible
> callstack is like this:
>
> CPU0                                 CPU1
> ----                                 ----
> do_swap_page()                       do_swap_page() with same entry
>
>
> swap_readpage() <- read to page A    swap_readpage() <- read to page B
>
> ...                                  set_pte_at()
>                                      swap_free() <- Now the entry is freed.
>
>
> pte_same() <- Check passes, PTE seems
>               unchanged, but page A
>               is stale!
> swap_free() <- page B content lost!
> set_pte_at() <- stale page A installed!
>
> To fix this, reuse swapcache_prepare, which will pin the swap entry using
> the cache flag and allow only one thread to pin it. Release the pin
> after the PT is unlocked. Racers will simply busy wait since it's a rare
> and very short event.
>
> Other methods like increasing the swap count don't seem to be a good
> idea after some tests: that will cause racers to fall back to the
> cached swapin path, and two swapin paths being used at the same time
> leads to a much more complex scenario.
>
> Reproducer:
>
> This race issue can be triggered easily using a well constructed
> reproducer and patched brd (with a delay in read path) [1]:
>
> With latest 6.8 mainline, race-caused data loss can be observed easily:
> $ gcc -g -lpthread test-thread-swap-race.c && ./a.out
>   Polulating 32MB of memory region...
>   Keep swapping out...
>   Starting round 0...
>   Spawning 65536 workers...
>   32746 workers spawned, wait for done...
>   Round 0: Error on 0x5aa00, expected 32746, got 32743, 3 data loss!
>   Round 0: Error on 0x395200, expected 32746, got 32743, 3 data loss!
>   Round 0: Error on 0x3fd000, expected 32746, got 32737, 9 data loss!
>   Round 0 Failed, 15 data loss!
>
> This reproducer spawns multiple threads sharing the same memory region
> using a small swap device. Every two threads update mapped pages one by
> one in opposite directions trying to create a race, with one dedicated
> thread that keeps swapping the data out using madvise.
>
> The reproducer hit the race about once every 5 minutes, so the race
> should be entirely possible in production.
>
> After this patch, I ran the reproducer for a few hundred rounds
> and no data loss was observed.
>
> Performance overhead is minimal, microbenchmark swapin 10G from 32G
> zram:
>
> Before:     10934698 us
> After:      11157121 us
> Non-direct: 13155355 us (Dropping SWP_SYNCHRONOUS_IO flag)
>
> Fixes: 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous device")
> Link: https://github.com/ryncsn/emm-test-project/tree/master/swap-stress-race [1]
> Signed-off-by: Kairui Song

Cc: stable@vger.kernel.org

Acked-by: Yu Zhao
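
For readers following the thread, the interleaving in the quoted diagram can
be replayed in a small userspace toy model. This is only a sketch under
stated assumptions, not kernel code: struct model, try_pin(), unpin() and
run_interleaving() are hypothetical names, the steps run in one fixed order
on a single thread, and try_pin() merely stands in for the "only one thread
may pin the entry" behaviour that the patch description attributes to
swapcache_prepare.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one swap slot and one PTE. All names are illustrative. */
#define SWAP_ENTRY 0x1234UL /* PTE value meaning "swapped out to this slot" */
#define PRESENT    0x1UL    /* PTE value meaning "a page is mapped" */

struct model {
    unsigned long pte; /* current PTE value */
    int slot_data;     /* contents of the swap slot */
    int mapped_data;   /* contents of the page the PTE maps when PRESENT */
    bool pinned;       /* stand-in for the cache flag on the swap entry */
};

/* Stand-in for the pin: succeeds for only one holder at a time. */
static bool try_pin(struct model *m)
{
    if (m->pinned)
        return false;
    m->pinned = true;
    return true;
}

static void unpin(struct model *m)
{
    m->pinned = false;
}

/* T1 without any pin: swap in, modify, then swap out to the same entry. */
static void t1_swapin_write_swapout(struct model *m, int new_value)
{
    m->mapped_data = m->slot_data;  /* read slot into page B */
    m->pte = PRESENT;               /* install page B, entry is freed */
    m->mapped_data = new_value;     /* modify page B */
    m->slot_data = m->mapped_data;  /* swap page B out to the same slot */
    m->pte = SWAP_ENTRY;            /* PTE value is identical to before */
}

/* Replays the fixed interleaving; returns the data finally mapped. */
int run_interleaving(bool use_pin)
{
    struct model m = { .pte = SWAP_ENTRY, .slot_data = 1 };
    unsigned long t0_orig_pte = m.pte;  /* T0 snapshots the PTE */
    bool t0_pinned = use_pin && try_pin(&m);
    int page_a = m.slot_data;           /* T0 reads the slot into page A */
    bool t1_deferred = false;

    /* T0 is delayed here and T1 runs. */
    if (use_pin && !try_pin(&m))
        t1_deferred = true;             /* T1 backs off and retries later */
    else
        t1_swapin_write_swapout(&m, 2);

    /* T0 resumes: the pte_same()-style check, then install page A. */
    if (m.pte == t0_orig_pte) {
        m.pte = PRESENT;
        m.mapped_data = page_a;
    }
    if (t0_pinned)
        unpin(&m);

    /* T1's retry finds the page already mapped and writes to it. */
    if (t1_deferred) {
        assert(m.pte == PRESENT);
        m.mapped_data = 2;
    }
    return m.mapped_data;
}
```

In this model, run_interleaving(false) ends with the old value 1 mapped even
though T1 wrote 2, which is the lost update described above, while
run_interleaving(true) ends with 2 because T1 cannot start until T0 has
installed its page and dropped the pin.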