From: Chengming Zhou
Date: Tue, 20 Feb 2024 13:37:55 +0800
Subject: Re: [PATCH v4] mm/swap: fix race when skipping swapcache
To: Kairui Song
Cc: Barry Song <21cnbao@gmail.com>, "Huang, Ying", linux-mm, Andrew Morton,
    Chris Li, Minchan Kim, Barry Song, Yu Zhao, SeongJae Park,
    David Hildenbrand, Hugh Dickins, Johannes Weiner, Matthew Wilcox,
    Michal Hocko, Yosry Ahmed, Aaron Lu, stable@vger.kernel.org, LKML
References: <20240219082040.7495-1-ryncsn@gmail.com>
On 2024/2/20 13:32, Kairui Song wrote:
> On Tue, Feb 20, 2024 at 12:49 PM Chengming Zhou wrote:
>>
>> On 2024/2/20 06:10, Barry Song wrote:
>>> On Mon, Feb 19, 2024 at 9:21 PM Kairui Song wrote:
>>>>
>>>> From: Kairui Song
>>>>
>>>> When skipping the swapcache for SWP_SYNCHRONOUS_IO, if two or more threads
>>>> swap in the same entry at the same time, they get different pages (A, B).
>>>> Before one thread (T0) finishes the swapin and installs page (A)
>>>> in the PTE, another thread (T1) could finish the swapin of page (B),
>>>> swap_free the entry, then swap out the possibly modified page,
>>>> reusing the same entry. This breaks the pte_same check in (T0) because
>>>> the PTE value is unchanged, causing an ABA problem. Thread (T0) will
>>>> install a stale page (A) into the PTE and cause data corruption.
>>>>
>>>> One possible call stack is like this:
>>>>
>>>> CPU0                                 CPU1
>>>> ----                                 ----
>>>> do_swap_page()                       do_swap_page() with same entry
>>>> <direct swapin path>                 <direct swapin path>
>>>> <alloc page A>                       <alloc page B>
>>>> swap_read_folio() <- read to page A  swap_read_folio() <- read to page B
>>>> <slow on later locks or interrupt>   <finished swapin first>
>>>> ...                                  set_pte_at()
>>>>                                      swap_free() <- entry is free
>>>>                                      <write to page B, now page A stale>
>>>>                                      <swap out page B to same entry>
>>>> pte_same() <- Check passes, PTE seems
>>>>               unchanged, but page A
>>>>               is stale!
>>>> swap_free() <- page B content lost!
>>>> set_pte_at() <- stale page A installed!
>>>>
>>>> Besides, for ZRAM, swap_free() allows the swap device to discard
>>>> the entry content, so even if page (B) is not modified, if
>>>> swap_read_folio() on CPU0 happens later than swap_free() on CPU1,
>>>> it may also cause data loss.
>>>>
>>>> To fix this, reuse swapcache_prepare(), which pins the swap entry using
>>>> the cache flag and allows only one thread to swap it in, also preventing
>>>> any parallel code from putting the entry in the cache. Release the pin
>>>> after the page table is unlocked.
>>>>
>>>> Racers just loop and wait since it's a rare and very short event.
>>>> A schedule_timeout_uninterruptible(1) call is added to avoid repeated
>>>> page faults wasting too much CPU, causing livelock or adding too much
>>>> noise to perf statistics. A similar livelock issue was described in
>>>> commit 029c4628b2eb ("mm: swap: get rid of livelock in swapin readahead").
>>>>
>>>> Reproducer:
>>>>
>>>> This race issue can be triggered easily using a well-constructed
>>>> reproducer and a patched brd (with a delay in the read path) [1]:
>>>>
>>>> With the latest 6.8 mainline, race-caused data loss can be observed easily:
>>>> $ gcc -g -lpthread test-thread-swap-race.c && ./a.out
>>>> Polulating 32MB of memory region...
>>>> Keep swapping out...
>>>> Starting round 0...
>>>> Spawning 65536 workers...
>>>> 32746 workers spawned, wait for done...
>>>> Round 0: Error on 0x5aa00, expected 32746, got 32743, 3 data loss!
>>>> Round 0: Error on 0x395200, expected 32746, got 32743, 3 data loss!
>>>> Round 0: Error on 0x3fd000, expected 32746, got 32737, 9 data loss!
>>>> Round 0 Failed, 15 data loss!
>>>>
>>>> This reproducer spawns multiple threads sharing the same memory region
>>>> using a small swap device. Every two threads update mapped pages one by
>>>> one in opposite directions, trying to create a race, with one dedicated
>>>> thread keeping the data swapped out using madvise.
>>>>
>>>> The reproducer hit the race about once every 5 minutes, so the race
>>>> should be entirely possible in production.
>>>>
>>>> After this patch, I ran the reproducer for over a few hundred rounds
>>>> and observed no data loss.
>>>>
>>>> Performance overhead is minimal; microbenchmark of swapping in 10G from
>>>> a 32G zram:
>>>>
>>>> Before: 10934698 us
>>>> After:  11157121 us
>>>> Cached: 13155355 us (dropping the SWP_SYNCHRONOUS_IO flag)
>>>>
>>>> Fixes: 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous device")
>>>> Link: https://github.com/ryncsn/emm-test-project/tree/master/swap-stress-race [1]
>>>> Reported-by: "Huang, Ying"
>>>> Closes: https://lore.kernel.org/lkml/87bk92gqpx.fsf_-_@yhuang6-desk2.ccr.corp.intel.com/
>>>> Signed-off-by: Kairui Song
>>>> Cc: stable@vger.kernel.org
>>>>
>>>> ---
>>>> V3: https://lore.kernel.org/all/20240216095105.14502-1-ryncsn@gmail.com/
>>>> Update from V3:
>>>> - Use schedule_timeout_uninterruptible(1) for now instead of schedule() to
>>>>   prevent the busy faulting task from holding the CPU and livelocking [Huang, Ying]
>>>>
>>>> V2: https://lore.kernel.org/all/20240206182559.32264-1-ryncsn@gmail.com/
>>>> Update from V2:
>>>> - Add a schedule() if raced to prevent repeated page faults wasting CPU
>>>>   and adding noise to perf statistics.
>>>> - Use a bool to state the special case instead of reusing existing
>>>>   variables, fixing error handling [Minchan Kim].
>>>>
>>>> V1: https://lore.kernel.org/all/20240205110959.4021-1-ryncsn@gmail.com/
>>>> Update from V1:
>>>> - Add some words on the ZRAM case; it will discard swap content on swap_free,
>>>>   so the race window is a bit different but the cure is the same. [Barry Song]
>>>> - Update comments to make them cleaner [Huang, Ying]
>>>> - Add a function placeholder to fix the CONFIG_SWAP=n build [SeongJae Park]
>>>> - Update the commit message and summary, referring to SWP_SYNCHRONOUS_IO
>>>>   instead of "direct swapin path" [Yu Zhao]
>>>> - Update commit message.
>>>> - Collect Reviews and Acks.
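
Side note, not about the patch itself: the dedicated swap-out thread is the
key ingredient of the reproducer design described above. Below is only a
rough userspace sketch of that part, not the actual reproducer from [1]
(which also needs the patched brd with a read delay); the region size,
timing and names here are my own guesses:

#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21                 /* kernel UAPI value, for older libc headers */
#endif

#define REGION_SIZE (32UL << 20)        /* 32MB shared region, matching the log above */

/*
 * Dedicated thread: keep forcing the shared region out to the small swap
 * device, so the worker threads keep faulting the pages back in and racing
 * on the swapin path.
 */
static void *swapout_thread(void *arg)
{
        char *region = arg;

        for (;;) {
                madvise(region, REGION_SIZE, MADV_PAGEOUT);
                usleep(1000);           /* small pause between swap-out rounds */
        }
        return NULL;
}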
>>>>
>>>>  include/linux/swap.h |  5 +++++
>>>>  mm/memory.c          | 20 ++++++++++++++++++++
>>>>  mm/swap.h            |  5 +++++
>>>>  mm/swapfile.c        | 13 +++++++++++++
>>>>  4 files changed, 43 insertions(+)
>>>>
>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>>> index 4db00ddad261..8d28f6091a32 100644
>>>> --- a/include/linux/swap.h
>>>> +++ b/include/linux/swap.h
>>>> @@ -549,6 +549,11 @@ static inline int swap_duplicate(swp_entry_t swp)
>>>>         return 0;
>>>>  }
>>>>
>>>> +static inline int swapcache_prepare(swp_entry_t swp)
>>>> +{
>>>> +       return 0;
>>>> +}
>>>> +
>>>>  static inline void swap_free(swp_entry_t swp)
>>>>  {
>>>>  }
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 7e1f4849463a..a99f5e7be9a5 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -3799,6 +3799,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>         struct page *page;
>>>>         struct swap_info_struct *si = NULL;
>>>>         rmap_t rmap_flags = RMAP_NONE;
>>>> +       bool need_clear_cache = false;
>>>>         bool exclusive = false;
>>>>         swp_entry_t entry;
>>>>         pte_t pte;
>>>> @@ -3867,6 +3868,20 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>         if (!folio) {
>>>>                 if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
>>>>                     __swap_count(entry) == 1) {
>>>> +                       /*
>>>> +                        * Prevent parallel swapin from proceeding with
>>>> +                        * the cache flag. Otherwise, another thread may
>>>> +                        * finish swapin first, free the entry, and swapout
>>>> +                        * reusing the same entry. It's undetectable as
>>>> +                        * pte_same() returns true due to entry reuse.
>>>> +                        */
>>>> +                       if (swapcache_prepare(entry)) {
>>>> +                               /* Relax a bit to prevent rapid repeated page faults */
>>>> +                               schedule_timeout_uninterruptible(1);
>>>
>>> Not an ideal model. Imagine two tasks:
>>>
>>> task A - low priority, running on a LITTLE core
>>> task B - high priority with real-time requirements such as audio or
>>> graphics, running on a big core
>>>
>>> The original code lets B win even if it arrives a bit later than A, as its
>>> CPU is much faster at finishing swap_read_folio, for example from zRAM.
>>> Task B's swap-in can finish very soon.
>>>
>>> With the patch, B will wait a tick, and its real-time performance will be
>>> negatively affected from time to time once low-priority and high-priority
>>> tasks fault in the same PTE and the high-priority task is a bit later than
>>> the low-priority one. This is a kind of priority inversion.
>>>
>>> When we support large folio swap-in, things can get worse. For example,
>>> to swap in 16 or even more pages in one do_swap_page, the chance of
>>> task A and task B landing in the same range of 16 PTEs will increase,
>>> even though they are not located in the same PTE.
>>>
>>> Please consider this is not a blocker for this patch. But I will put the
>>> problem on my list and run some real tests on Android phones later.
>>
>> Good point. Late to the discussion, but I'm wondering why not get an extra
>> reference on the swap entry, instead of swapcache_prepare()? Then the faster
>> thread will succeed, but can't free the swap entry. Later, the slower thread
>> will find the changed pte value and fail, and free the swap entry. Maybe I
>> missed something?
>
> Hi, Chengming
>
> That was my initial approach. Then I found a lot of problems with it.
> Increasing the swap count here may race with another swap free and end up
> increasing the swap count of a freed entry.
>
> That can be fixed with audits and new helpers, but there are many other
> potential issues too.
> One major problem is that after the count bump, raced swap threads will
> fall back to cached swapin. Pages in the swapcache can be swapped out
> without allocating an entry, making the problem we were trying to resolve
> even more serious.

Thanks for your clarification! Right, there are many issues I just ignored...
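
As an aside, the synchronization the patch relies on can be pictured with a
tiny userspace analogue: a per-entry atomic flag where only the thread that
wins the compare-and-swap proceeds with the swapin, and every other racer
backs off and retries the fault. The naming below is mine and usleep() only
stands in for schedule_timeout_uninterruptible(1); it's an illustration of
the idea, not kernel code:

#include <stdatomic.h>
#include <stdbool.h>
#include <unistd.h>

struct swap_entry {
        atomic_bool has_cache;          /* plays the role of the SWAP_HAS_CACHE pin */
};

/* Analogue of swapcache_prepare(): try to pin the entry, fail if already pinned. */
static bool entry_pin(struct swap_entry *e)
{
        bool expected = false;

        return atomic_compare_exchange_strong(&e->has_cache, &expected, true);
}

/* Analogue of dropping the pin once the PTE is installed and unlocked. */
static void entry_unpin(struct swap_entry *e)
{
        atomic_store(&e->has_cache, false);
}

/*
 * Fault-path sketch: exactly one racer does the swapin; the others back off
 * briefly and retry the whole fault, where they will see the updated PTE.
 */
static void fault_one(struct swap_entry *e)
{
        if (!entry_pin(e)) {
                usleep(1000);           /* relax instead of hammering the fault path */
                return;                 /* retry the fault later */
        }
        /* ... read the page from swap, install the PTE, free the swap entry ... */
        entry_unpin(e);
}

In the kernel the pin is the existing SWAP_HAS_CACHE state that
swapcache_prepare() sets, so the fix needs no new locking.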