From: Chengming Zhou
Date: Tue, 20 Feb 2024 13:37:55 +0800
Subject: Re: [PATCH v4] mm/swap: fix race when skipping swapcache
To: Kairui Song
Cc: Barry Song <21cnbao@gmail.com>, "Huang, Ying", linux-mm, Andrew Morton,
    Chris Li, Minchan Kim, Barry Song, Yu Zhao, SeongJae Park,
    David Hildenbrand, Hugh Dickins, Johannes Weiner, Matthew Wilcox,
    Michal Hocko, Yosry Ahmed, Aaron Lu, stable@vger.kernel.org, LKML
References: <20240219082040.7495-1-ryncsn@gmail.com>
On 2024/2/20 13:32, Kairui Song wrote:
> On Tue, Feb 20, 2024 at 12:49 PM Chengming Zhou wrote:
>>
>> On 2024/2/20 06:10, Barry Song wrote:
>>> On Mon, Feb 19, 2024 at 9:21 PM Kairui Song wrote:
>>>>
>>>> From: Kairui Song
>>>>
>>>> When skipping the swapcache for SWP_SYNCHRONOUS_IO, if two or more threads
>>>> swap in the same entry at the same time, they get different pages (A, B).
>>>> Before one thread (T0) finishes the swapin and installs page (A)
>>>> in the PTE, another thread (T1) could finish the swapin of page (B),
>>>> swap_free the entry, then swap out the possibly modified page,
>>>> reusing the same entry. This breaks the pte_same check in (T0) because
>>>> the PTE value is unchanged, causing an ABA problem. Thread (T0) will
>>>> install a stale page (A) into the PTE and cause data corruption.
>>>>
>>>> One possible call stack is like this:
>>>>
>>>> CPU0                                 CPU1
>>>> ----                                 ----
>>>> do_swap_page()                       do_swap_page() with same entry
>>>> <direct swapin path>                 <direct swapin path>
>>>> <alloc page A>                       <alloc page B>
>>>> swap_read_folio() <- read to page A  swap_read_folio() <- read to page B
>>>> <slow on later locks or interrupt>   <finished swapin first>
>>>> ...                                  set_pte_at()
>>>>                                      swap_free() <- entry is free
>>>>                                      <write to page B, now page A stale>
>>>>                                      <swap out page B to same entry>
>>>> pte_same() <- Check passes, PTE seems
>>>>               unchanged, but page A
>>>>               is stale!
>>>> swap_free() <- page B content lost!
>>>> set_pte_at() <- stale page A installed!
>>>>
>>>> Besides, for ZRAM, swap_free() allows the swap device to discard
>>>> the entry content, so even if page (B) is not modified, if
>>>> swap_read_folio() on CPU0 happens later than swap_free() on CPU1,
>>>> it may also cause data loss.
>>>>
>>>> To fix this, reuse swapcache_prepare(), which pins the swap entry using
>>>> the cache flag and allows only one thread to swap it in, also preventing
>>>> any parallel code from putting the entry in the cache. Release the pin
>>>> after the page table is unlocked.
>>>>
>>>> Racers just loop and wait since it's a rare and very short event.
>>>> A schedule_timeout_uninterruptible(1) call is added to avoid repeated
>>>> page faults wasting too much CPU, causing livelock or adding too much
>>>> noise to perf statistics. A similar livelock issue was described in
>>>> commit 029c4628b2eb ("mm: swap: get rid of livelock in swapin readahead").
>>>>
>>>> Reproducer:
>>>>
>>>> This race issue can be triggered easily using a well-constructed
>>>> reproducer and a patched brd (with a delay in the read path) [1]:
>>>>
>>>> With the latest 6.8 mainline, race-caused data loss can be observed easily:
>>>> $ gcc -g -lpthread test-thread-swap-race.c && ./a.out
>>>> Polulating 32MB of memory region...
>>>> Keep swapping out...
>>>> Starting round 0...
>>>> Spawning 65536 workers...
>>>> 32746 workers spawned, wait for done...
>>>> Round 0: Error on 0x5aa00, expected 32746, got 32743, 3 data loss!
>>>> Round 0: Error on 0x395200, expected 32746, got 32743, 3 data loss!
>>>> Round 0: Error on 0x3fd000, expected 32746, got 32737, 9 data loss!
>>>> Round 0 Failed, 15 data loss!
>>>>
>>>> This reproducer spawns multiple threads sharing the same memory region
>>>> using a small swap device. Every two threads update mapped pages one by
>>>> one in opposite directions, trying to create a race, with one dedicated
>>>> thread keeping the data swapped out using madvise.
>>>>
>>>> The reproducer hit the race about once every 5 minutes, so the race
>>>> should be entirely possible in production.
>>>>
>>>> After this patch, I ran the reproducer for over a few hundred rounds
>>>> and observed no data loss.
>>>>
>>>> Performance overhead is minimal; microbenchmark of swapping in 10G from
>>>> a 32G zram:
>>>>
>>>> Before: 10934698 us
>>>> After:  11157121 us
>>>> Cached: 13155355 us (dropping the SWP_SYNCHRONOUS_IO flag)
>>>>
>>>> Fixes: 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous device")
>>>> Link: https://github.com/ryncsn/emm-test-project/tree/master/swap-stress-race [1]
>>>> Reported-by: "Huang, Ying"
>>>> Closes: https://lore.kernel.org/lkml/87bk92gqpx.fsf_-_@yhuang6-desk2.ccr.corp.intel.com/
>>>> Signed-off-by: Kairui Song
>>>> Cc: stable@vger.kernel.org
>>>>
>>>> ---
>>>> V3: https://lore.kernel.org/all/20240216095105.14502-1-ryncsn@gmail.com/
>>>> Update from V3:
>>>> - Use schedule_timeout_uninterruptible(1) for now instead of schedule() to
>>>>   prevent the busy faulting task from holding the CPU and livelocking [Huang, Ying]
>>>>
>>>> V2: https://lore.kernel.org/all/20240206182559.32264-1-ryncsn@gmail.com/
>>>> Update from V2:
>>>> - Add a schedule() if raced to prevent repeated page faults wasting CPU
>>>>   and adding noise to perf statistics.
>>>> - Use a bool to state the special case instead of reusing existing
>>>>   variables, fixing error handling [Minchan Kim].
>>>>
>>>> V1: https://lore.kernel.org/all/20240205110959.4021-1-ryncsn@gmail.com/
>>>> Update from V1:
>>>> - Add some words on the ZRAM case; it will discard swap content on swap_free,
>>>>   so the race window is a bit different but the cure is the same. [Barry Song]
>>>> - Update comments to make them cleaner [Huang, Ying]
>>>> - Add a function placeholder to fix the CONFIG_SWAP=n build [SeongJae Park]
>>>> - Update the commit message and summary, referring to SWP_SYNCHRONOUS_IO
>>>>   instead of "direct swapin path" [Yu Zhao]
>>>> - Update commit message.
>>>> - Collect Reviews and Acks.
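
Side note, not about the patch itself: the dedicated swap-out thread is the
key ingredient of the reproducer design described above. Below is only a
rough userspace sketch of that part, not the actual reproducer from [1]
(which also needs the patched brd with a read delay); the region size,
timing and names here are my own guesses:

#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21                 /* kernel UAPI value, for older libc headers */
#endif

#define REGION_SIZE (32UL << 20)        /* 32MB shared region, matching the log above */

/*
 * Dedicated thread: keep forcing the shared region out to the small swap
 * device, so the worker threads keep faulting the pages back in and racing
 * on the swapin path.
 */
static void *swapout_thread(void *arg)
{
        char *region = arg;

        for (;;) {
                madvise(region, REGION_SIZE, MADV_PAGEOUT);
                usleep(1000);           /* small pause between swap-out rounds */
        }
        return NULL;
}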
>>>>
>>>>  include/linux/swap.h |  5 +++++
>>>>  mm/memory.c          | 20 ++++++++++++++++++++
>>>>  mm/swap.h            |  5 +++++
>>>>  mm/swapfile.c        | 13 +++++++++++++
>>>>  4 files changed, 43 insertions(+)
>>>>
>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>>> index 4db00ddad261..8d28f6091a32 100644
>>>> --- a/include/linux/swap.h
>>>> +++ b/include/linux/swap.h
>>>> @@ -549,6 +549,11 @@ static inline int swap_duplicate(swp_entry_t swp)
>>>>         return 0;
>>>>  }
>>>>
>>>> +static inline int swapcache_prepare(swp_entry_t swp)
>>>> +{
>>>> +       return 0;
>>>> +}
>>>> +
>>>>  static inline void swap_free(swp_entry_t swp)
>>>>  {
>>>>  }
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 7e1f4849463a..a99f5e7be9a5 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -3799,6 +3799,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>         struct page *page;
>>>>         struct swap_info_struct *si = NULL;
>>>>         rmap_t rmap_flags = RMAP_NONE;
>>>> +       bool need_clear_cache = false;
>>>>         bool exclusive = false;
>>>>         swp_entry_t entry;
>>>>         pte_t pte;
>>>> @@ -3867,6 +3868,20 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>         if (!folio) {
>>>>                 if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
>>>>                     __swap_count(entry) == 1) {
>>>> +                       /*
>>>> +                        * Prevent parallel swapin from proceeding with
>>>> +                        * the cache flag. Otherwise, another thread may
>>>> +                        * finish swapin first, free the entry, and swapout
>>>> +                        * reusing the same entry. It's undetectable as
>>>> +                        * pte_same() returns true due to entry reuse.
>>>> +                        */
>>>> +                       if (swapcache_prepare(entry)) {
>>>> +                               /* Relax a bit to prevent rapid repeated page faults */
>>>> +                               schedule_timeout_uninterruptible(1);
>>>
>>> Not an ideal model. Imagine two tasks:
>>>
>>> task A - low priority, running on a LITTLE core
>>> task B - high priority with real-time requirements such as audio or
>>> graphics, running on a big core
>>>
>>> The original code lets B win even if it arrives a bit later than A, as its
>>> CPU is much faster at finishing swap_read_folio, for example from zRAM.
>>> Task B's swap-in can finish very soon.
>>>
>>> With the patch, B will wait a tick, and its real-time performance will be
>>> negatively affected from time to time once low-priority and high-priority
>>> tasks fault in the same PTE and the high-priority task is a bit later than
>>> the low-priority one. This is a kind of priority inversion.
>>>
>>> When we support large folio swap-in, things can get worse. For example,
>>> to swap in 16 or even more pages in one do_swap_page, the chance of
>>> task A and task B landing in the same range of 16 PTEs will increase,
>>> even though they are not located in the same PTE.
>>>
>>> Please consider this is not a blocker for this patch. But I will put the
>>> problem on my list and run some real tests on Android phones later.
>>
>> Good point. Late to the discussion, but I'm wondering why not get an extra
>> reference on the swap entry, instead of swapcache_prepare()? Then the faster
>> thread will succeed, but can't free the swap entry. Later, the slower thread
>> will find the changed pte value and fail, and free the swap entry. Maybe I
>> missed something?
>
> Hi, Chengming
>
> That was my initial approach. Then I found a lot of problems with it.
> Increasing the swap count here may race with another swap free and end up
> increasing the swap count of a freed entry.
>
> That can be fixed with audits and new helpers, but there are many other
> potential issues too.
> One major problem is that after the count bump, raced swap threads will
> fall back to cached swapin. Pages in the swapcache can be swapped out
> without allocating an entry, making the problem we were trying to resolve
> even more serious.

Thanks for your clarification! Right, there are many issues I just ignored...
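
As an aside, the synchronization the patch relies on can be pictured with a
tiny userspace analogue: a per-entry atomic flag where only the thread that
wins the compare-and-swap proceeds with the swapin, and every other racer
backs off and retries the fault. The naming below is mine and usleep() only
stands in for schedule_timeout_uninterruptible(1); it's an illustration of
the idea, not kernel code:

#include <stdatomic.h>
#include <stdbool.h>
#include <unistd.h>

struct swap_entry {
        atomic_bool has_cache;          /* plays the role of the SWAP_HAS_CACHE pin */
};

/* Analogue of swapcache_prepare(): try to pin the entry, fail if already pinned. */
static bool entry_pin(struct swap_entry *e)
{
        bool expected = false;

        return atomic_compare_exchange_strong(&e->has_cache, &expected, true);
}

/* Analogue of dropping the pin once the PTE is installed and unlocked. */
static void entry_unpin(struct swap_entry *e)
{
        atomic_store(&e->has_cache, false);
}

/*
 * Fault-path sketch: exactly one racer does the swapin; the others back off
 * briefly and retry the whole fault, where they will see the updated PTE.
 */
static void fault_one(struct swap_entry *e)
{
        if (!entry_pin(e)) {
                usleep(1000);           /* relax instead of hammering the fault path */
                return;                 /* retry the fault later */
        }
        /* ... read the page from swap, install the PTE, free the swap entry ... */
        entry_unpin(e);
}

In the kernel the pin is the existing SWAP_HAS_CACHE state that
swapcache_prepare() sets, so the fix needs no new locking.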