From: Barry Song <21cnbao@gmail.com>
Date: Mon, 5 Feb 2024 20:25:29 +0800
Subject: Re: [PATCH] mm/swap: fix race condition in direct swapin path
To: Kairui Song
Cc: linux-mm@kvack.org, Andrew Morton, "Huang, Ying", Chris Li,
 Minchan Kim, Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
 Yosry Ahmed, David Hildenbrand, Yu Zhao, linux-kernel@vger.kernel.org
References: <20240205110959.4021-1-ryncsn@gmail.com>
In-Reply-To: <20240205110959.4021-1-ryncsn@gmail.com>

On Mon, Feb 5, 2024 at 7:10 PM Kairui Song wrote:
>
> From: Kairui Song
>
> In the direct swapin path, when two or more threads swapin the same entry
> at the same time, they get different pages (A, B) because swap cache is
> skipped. Before one thread (T0) finishes the swapin and installs page (A)
> to the PTE, another thread (T1) could finish swapin of page (B),
> swap_free the entry, then modify and swap-out the page again, using the

Even if T0's swap_read_folio() runs later than T1's, problems can still
happen. After T1 swaps in, sets the PTEs and frees the swap entry, T0 reads
from zRAM. It will get zeros, because zRAM fills the page with zeros for a
freed slot:

static int zram_read_from_zspool(struct zram *zram, struct page *page,
                                 u32 index)
{
        ...
                value = handle ? zram_get_element(zram, index) : 0;
                mem = kmap_local_page(page);
                zram_fill_page(mem, PAGE_SIZE, value);
                kunmap_local(mem);
                return 0;
        }
}

Even if nobody modifies the data before the page is swapped out again to
the same swap offset as in T0's orig_pte, T0's pte_same() check still
passes and T0 will map the zero-filled page into the PTE. So there is more
than one risk here besides losing modified data.

> same entry. It break the pte_same check because PTE value is unchanged,
> causing ABA problem. Then thread (T0) will then install the stalled page
> (A) into the PTE so new data in page (B) is lost, one possible callstack
> is like this:
>
> CPU0                                 CPU1
> ----                                 ----
> do_swap_page()                       do_swap_page() with same entry
>
>
> swap_readpage() <- read to page A    swap_readpage() <- read to page B
>
> ..                                   set_pte_at()
>                                      swap_free() <- Now the entry is freed.
>
>
> pte_same() <- Check pass, PTE seems
>               unchanged, but page A
>               is stalled!
> swap_free() <- page B content lost!
> set_pte_at() <- staled page A installed!
>
> To fix this, reuse swapcache_prepare which will pin the swap entry using
> the cache flag, and allow only one thread to pin it. Release the pin
> after PT unlocked. Racers will simply busy wait since it's a rare
> and very short event.
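
To make the interleaving above concrete, here is a minimal userspace
sketch. It is not kernel code: it replays one fixed, single-threaded
ordering of T0 and T1 against a toy "PTE" and a toy swap slot. All
model_* names and MODEL_* constants are hypothetical; MODEL_HAS_CACHE
only stands in for SWAP_HAS_CACHE, and model_pin() returns a bool where
the patch tests swapcache_prepare() for a non-zero return.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
#define MODEL_HAS_CACHE	0x40	/* stands in for SWAP_HAS_CACHE */
#define MODEL_SLOT	5	/* the swap offset both threads fault on */
#define MODEL_PRESENT	(-1)	/* "PTE" value once a page is mapped */

struct model {
	int pte;		/* MODEL_SLOT while swapped out, else MODEL_PRESENT */
	int mapped;		/* data visible through the mapping */
	int slot_data;		/* data stored in the swap slot */
	unsigned char slot_map;	/* swap count, optionally | MODEL_HAS_CACHE */
};

/* Only one faulting thread may hold the pin at a time. */
static bool model_pin(struct model *m)
{
	if (m->slot_map & MODEL_HAS_CACHE)
		return false;
	m->slot_map |= MODEL_HAS_CACHE;
	return true;
}

static void model_unpin(struct model *m)
{
	m->slot_map &= (unsigned char)~MODEL_HAS_CACHE;
}

/* Install a page read from swap and drop the swap count. */
static void model_install(struct model *m, int page)
{
	/* An entry must still have a swap count when it is freed. */
	assert((m->slot_map & ~MODEL_HAS_CACHE) != 0);
	m->pte = MODEL_PRESENT;
	m->mapped = page;
	m->slot_map &= MODEL_HAS_CACHE;	/* count -> 0, pin (if any) stays */
}

/* Replays one fixed interleaving; returns the data finally mapped. */
static int model_race(bool use_pin)
{
	struct model m = { MODEL_SLOT, 0, 1, 1 };
	int orig_pte, page_a;
	bool t1_ran = true;

	/* T0: remember the PTE, optionally pin, then read page A. */
	orig_pte = m.pte;
	if (use_pin && !model_pin(&m))
		return -1;
	page_a = m.slot_data;

	/* T1 faults on the same entry while T0 is still busy. */
	if (use_pin && !model_pin(&m)) {
		t1_ran = false;			/* pin is taken: T1 backs off */
	} else {
		model_install(&m, m.slot_data);	/* page B mapped, entry freed */
		m.mapped = 2;			/* T1 modifies the page */
		m.slot_data = m.mapped;		/* swap-out reuses the slot */
		m.slot_map = 1;
		m.pte = MODEL_SLOT;		/* same PTE value as orig_pte */
	}

	/* T0: a pte_same()-style check cannot see the reuse. */
	if (m.pte == orig_pte)
		model_install(&m, page_a);
	if (use_pin)
		model_unpin(&m);

	/* T1 retries after backing off: the PTE is present, so it just writes. */
	if (!t1_ran)
		m.mapped = 2;
	return m.mapped;
}
```

In this model, model_race(false) returns 1: T1's write of 2 is replaced
by the stale page A. model_race(true) returns 2, because T1 cannot take
the pin, backs off, and only writes after T0 has installed its page.
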
>
> Other methods like increasing the swap count don't seem to be a good
> idea after some tests, that will cause racers to fall back to the
> cached swapin path, two swapin path being used at the same time
> leads to a much more complex scenario.
>
> Reproducer:
>
> This race issue can be triggered easily using a well constructed
> reproducer and patched brd (with a delay in read path) [1]:
>
> With latest 6.8 mainline, race caused data loss can be observed easily:
> $ gcc -g -lpthread test-thread-swap-race.c && ./a.out
>   Polulating 32MB of memory region...
>   Keep swapping out...
>   Starting round 0...
>   Spawning 65536 workers...
>   32746 workers spawned, wait for done...
>   Round 0: Error on 0x5aa00, expected 32746, got 32743, 3 data loss!
>   Round 0: Error on 0x395200, expected 32746, got 32743, 3 data loss!
>   Round 0: Error on 0x3fd000, expected 32746, got 32737, 9 data loss!
>   Round 0 Failed, 15 data loss!

I have also been reading this code recently. I find it quite surprising
that this really happens in practice: freeing a swap entry returns the
slot to slots_ret, while allocation takes slots from slots. So if the
swapfile is large, the chance that a newly allocated swap slot is a
recently freed one is close to 0%.

But yes, the code does have the risk.

>
> This reproducer spawns multiple threads sharing the same memory region
> using a small swap device. Every two threads updates mapped pages one by
> one in opposite direction trying to create a race, with one dedicated
> thread keep swapping out the data out using madvise.
>
> The reproducer created a reproduce rate of about once every 5 minutes,
> so the race should be totally possible in production.
>
> After this patch, I ran the reproducer for over a few hundred rounds
> and no data loss observed.
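
For readers without the reproducer at hand, the access pattern described
above can be sketched roughly as below. This is not the code from [1]:
it is a minimal illustration that assumes each worker bumps one counter
per page and that a page is "lost" when its counter ends up below the
number of workers. The sketch_* names and the 4096-byte page size are
assumptions, the memory comes from calloc() rather than a swappable
mapping, and the madvise swap-out thread and the patched brd are left
out, so on its own it should never report a loss.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096	/* assumed page size */
#define SKETCH_STRIDE (SKETCH_PAGE_SIZE / sizeof(atomic_uint))

struct sketch_worker {
	atomic_uint *region;	/* shared region, one counter per page */
	size_t npages;
	int reverse;		/* walk pages from the end if non-zero */
};

/* Bump the counter at the start of every page, in one direction. */
static void *sketch_worker_fn(void *arg)
{
	struct sketch_worker *w = arg;

	for (size_t i = 0; i < w->npages; i++) {
		size_t page = w->reverse ? w->npages - 1 - i : i;

		atomic_fetch_add(&w->region[page * SKETCH_STRIDE], 1);
	}
	return NULL;
}

/* Run up to nworkers threads over npages pages; return lost updates. */
static long sketch_run(size_t npages, unsigned int nworkers)
{
	atomic_uint *region = calloc(npages * SKETCH_STRIDE, sizeof(*region));
	pthread_t *tids = calloc(nworkers, sizeof(*tids));
	struct sketch_worker fwd = { region, npages, 0 };
	struct sketch_worker rev = { region, npages, 1 };
	unsigned int spawned = 0;
	long lost = 0;

	if (!region || !tids) {
		free(region);
		free(tids);
		return -1;
	}

	/* Odd and even workers walk the region in opposite directions. */
	for (unsigned int i = 0; i < nworkers; i++) {
		if (pthread_create(&tids[spawned], NULL, sketch_worker_fn,
				   (i & 1) ? &rev : &fwd))
			break;	/* fewer workers than requested may spawn */
		spawned++;
	}
	for (unsigned int i = 0; i < spawned; i++)
		pthread_join(tids[i], NULL);

	/* Every page should have been bumped once per spawned worker. */
	for (size_t p = 0; p < npages; p++)
		lost += (long)spawned -
			(long)atomic_load(&region[p * SKETCH_STRIDE]);

	free(region);
	free(tids);
	return lost;
}
```

A call such as sketch_run(64, 8) checks 64 pages with 8 workers. Turning
this into something that can hit the race would additionally need a
swap-backed mapping, a thread paging it out, and a slow swap device, as
the quoted description says.
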
>
> Performance overhead is minimal, microbenchmark swapin 10G from 32G
> zram:
>
> Before:     10934698 us
> After:      11157121 us
> Non-direct: 13155355 us (Dropping SWP_SYNCHRONOUS_IO flag)
>
> Fixes: 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous device")
> Link: https://github.com/ryncsn/emm-test-project/tree/master/swap-stress-race [1]
> Signed-off-by: Kairui Song

I will also run your patch against the problem I reported today [1], and
will update you with the result this week.

[1] https://lore.kernel.org/linux-mm/d4f602db-403b-4b1f-a3de-affeb40bc499@arm.com/T/#m41701d0c0e127cdae636e97a13ab521364a810f4

> ---
> Huge thanks to Huang Ying and Chris Li for help finding this issue!
>
>  mm/memory.c   | 19 +++++++++++++++++++
>  mm/swap.h     |  5 +++++
>  mm/swapfile.c | 16 ++++++++++++++++
>  3 files changed, 40 insertions(+)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 7e1f4849463a..fd7c55a292f1 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3867,6 +3867,20 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>         if (!folio) {
>                 if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
>                     __swap_count(entry) == 1) {
> +                       /*
> +                        * With swap count == 1, after we read the entry,
> +                        * other threads could finish swapin first, free
> +                        * the entry, then swapout the modified page using
> +                        * the same entry. Now the content we just read is
> +                        * stalled, and it's undetectable as pte_same()
> +                        * returns true due to entry reuse.
> +                        *
> +                        * So pin the swap entry using the cache flag even
> +                        * cache is not used.
> +                        */
> +                       if (swapcache_prepare(entry))
> +                               goto out;
> +
>                         /* skip swapcache */
>                         folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
>                                                 vma, vmf->address, false);
> @@ -4116,6 +4130,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  unlock:
>         if (vmf->pte)
>                 pte_unmap_unlock(vmf->pte, vmf->ptl);
> +       /* Clear the swap cache pin for direct swapin after PTL unlock */
> +       if (folio && !swapcache)
> +               swapcache_clear(si, entry);
>  out:
>         if (si)
>                 put_swap_device(si);
> @@ -4124,6 +4141,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>         if (vmf->pte)
>                 pte_unmap_unlock(vmf->pte, vmf->ptl);
>  out_page:
> +       if (!swapcache)
> +               swapcache_clear(si, entry);
>         folio_unlock(folio);
>  out_release:
>         folio_put(folio);
> diff --git a/mm/swap.h b/mm/swap.h
> index 758c46ca671e..fc2f6ade7f80 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -41,6 +41,7 @@ void __delete_from_swap_cache(struct folio *folio,
>  void delete_from_swap_cache(struct folio *folio);
>  void clear_shadow_from_swap_cache(int type, unsigned long begin,
>                                   unsigned long end);
> +void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry);
>  struct folio *swap_cache_get_folio(swp_entry_t entry,
>                 struct vm_area_struct *vma, unsigned long addr);
>  struct folio *filemap_get_incore_folio(struct address_space *mapping,
> @@ -97,6 +98,10 @@ static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
>         return 0;
>  }
>
> +static inline void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry)
> +{
> +}
> +
>  static inline struct folio *swap_cache_get_folio(swp_entry_t entry,
>                 struct vm_area_struct *vma, unsigned long addr)
>  {
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 556ff7347d5f..f7d4ad152a7f 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -3365,6 +3365,22 @@ int swapcache_prepare(swp_entry_t entry)
>         return __swap_duplicate(entry, SWAP_HAS_CACHE);
>  }
>
> +/*
> + * Clear the cache flag and release pinned entry.
> + */
> +void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry)
> +{
> +       struct swap_cluster_info *ci;
> +       unsigned long offset = swp_offset(entry);
> +       unsigned char usage;
> +
> +       ci = lock_cluster_or_swap_info(si, offset);
> +       usage = __swap_entry_free_locked(si, offset, SWAP_HAS_CACHE);
> +       unlock_cluster_or_swap_info(si, ci);
> +       if (!usage)
> +               free_swap_slot(entry);
> +}
> +
>  struct swap_info_struct *swp_swap_info(swp_entry_t entry)
>  {
>         return swap_type_to_swap_info(swp_type(entry));
> --
> 2.43.0
>
>

Thanks
Barry