From: Zhaoyang Huang <huangzhaoyang@gmail.com>
Date: Mon, 18 Mar 2024 16:01:09 +0800
Subject: Re: reply: [PATCH] mm: fix a race scenario in folio_isolate_lru
To: Matthew Wilcox
Cc: 黄朝阳 (Zhaoyang Huang), Andrew Morton, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org, 康纪滨 (Steve Kang)

On Mon, Mar 18, 2024 at 2:15 PM Zhaoyang Huang wrote:
>
> On Mon, Mar 18, 2024 at 11:28 AM Matthew Wilcox wrote:
> >
> > On Mon, Mar 18, 2024 at 01:37:04AM +0000, 黄朝阳 (Zhaoyang Huang) wrote:
> > > >On Sun, Mar 17, 2024 at 12:07:40PM +0800, Zhaoyang Huang wrote:
> > > >> Could it be this scenario, where the folio comes from the pte (thread 0),
> > > >> a local fbatch (thread 1) and the page cache (thread 2) concurrently, and
> > > >> the three proceed intermixed without the lock's protection? Actually, IMO,
> > > >> thread 1 could also see the folio with refcnt == 1 since it doesn't care
> > > >> whether the page is in the page cache or not.
> > > >>
> > > >> madivise_cold_and_pageout does no explicit folio_get since the
> > > >> folio comes from the pte, which implies it has one refcnt from the page cache
> > > >
> > > >Mmm, no.  It's implicit, but madvise_cold_or_pageout_pte_range()
> > > >does guarantee that the folio has at least one refcount.
> > > >
> > > >Since we get the folio from vm_normal_folio(vma, addr, ptent), we know that
> > > >there is at least one mapcount on the folio.  refcount is always >= mapcount.
> > > >Since we hold pte_offset_map_lock(), we know that mapcount (and therefore
> > > >refcount) cannot be decremented until we call pte_unmap_unlock(), which we
> > > >don't do until we have called folio_isolate_lru().
> > > >
> > > >Good try though, took me a few minutes of looking at it to convince myself that
> > > >it was safe.
> > > >
> > > >Something to bear in mind is that if the race you outline is real, failing to hold a
> > > >refcount on the folio leaves the caller susceptible to the
> > > >VM_BUG_ON_FOLIO(!folio_ref_count(folio), folio); if the other thread calls
> > > >folio_put().
> > > Resending the chart via Outlook.
> > > I think the problem relies on a special timing which is rare; I would like to
> > > list the steps below in timing sequence.
> > >
> > > 1. thread 0 calls folio_isolate_lru with refcnt == 1
> >
> > (i assume you mean refcnt == 2 here, otherwise none of this makes sense)
> >
> > > 2. thread 1 calls release_pages with refcnt == 2. (IMO, it could be 1 as release_pages doesn't care if the folio is used by page cache or fs)
> > > 3. thread 2 decreases refcnt to 1 by calling filemap_free_folio. (as I mentioned in 2, thread 2 is not mandatory here)
> > > 4. thread 1 calls folio_put_testzero and passes. (lruvec->lock has not been taken here)
> >
> > But there's already a bug here.
> >
> > Rearrange the order of this:
> >
> > 2. thread 1 calls release_pages with refcount == 2 (decreasing refcount to 1)
> > 3. thread 2 decreases refcount to 0 by calling filemap_free_folio
> > 1. thread 0 calls folio_isolate_lru() and hits the BUG().
> >
> > > 5. thread 0 clears the folio's PG_lru by calling folio_test_clear_lru. The folio_get behind it has no meaning there.
> > > 6. thread 1 fails folio_test_lru and leaves the folio on the LRU.
> > > 7. thread 1 wrongly adds the folio to pages_to_free, which could break the LRU list and will have the next folio experience list_del_invalid
> > >
> > > #thread 0(madivise_cold_and_pageout)   #1(lru_add_drain->fbatch_release_pages)   #2(read_pages->filemap_remove_folios)
> > > refcnt == 1(represent page cache)      refcnt == 2(another one represent LRU)    folio comes from page cache
> >
> > This is still illegible.
> > Try it this way:
> >
> > Thread 0                            Thread 1                Thread 2
> > madvise_cold_or_pageout_pte_range
> >                                     lru_add_drain
> >                                     fbatch_release_pages
> >                                                             read_pages
> >                                                             filemap_remove_folio
>
> Thread 0                            Thread 1                       Thread 2
> madvise_cold_or_pageout_pte_range
>                                     truncate_inode_pages_range
>                                     fbatch_release_pages
>                                                                    truncate_inode_pages_range
>                                                                    filemap_remove_folio
>
> Sorry for the confusion. I have rearranged the timing chart above
> according to the real panic's stack trace. Threads 1 and 2 both come from
> truncate_inode_pages_range (I think thread 2 (read_pages) is not
> mandatory here, as threads 0 and 1 could rely on the same refcnt == 1).
>
> > Some accuracy in your report would also be appreciated.  There's no
> > function called madivise_cold_and_pageout, nor is there a function called
> > filemap_remove_folios().  It's a little detail, but it's annoying for
> > me to try to find which function you're actually referring to.  I have
> > to guess, and it puts me in a bad mood.
> >
> > At any rate, these three functions cannot do what you're proposing.
> > In read_page(), when we call filemap_remove_folio(), the folio in
> > question will not have the uptodate flag set, so can never have been
> > put in the page tables, so cannot be found by madvise().
> >
> > Also, as I said in my earlier email, madvise_cold_or_pageout_pte_range()
> > does guarantee that the refcount on the folio is held and can never
> > decrease to zero while folio_isolate_lru() is running.  So that's two
> > ways this scenario cannot happen.
>
> The madvise_xxx scenario comes from my presumption, which has no proof.
> Whereas it looks like truncate_inode_pages_range() only cares about the
> page cache refcnt via folio_put_testzero, without noticing any task's VM
> state. Furthermore, I notice that move_folios_to_lru is safe, as it runs
> while holding lruvec->lock.
>
BTW, I think we need to protect every folio_test_clear_lru/folio_test_lru
caller by moving it under lruvec->lock, in functions such as
__page_cache_release and folio_activate. Otherwise, there is always a race
window between testing PG_lru and the actions that follow.
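Here is a minimal sketch of the locking pattern proposed above, assuming the
ordering is: take the lruvec lock first, then test-and-clear PG_lru, then
touch the LRU list. The function name example_release_lru() is invented for
illustration; the real __page_cache_release() also clears the other LRU flags
and handles mlocked folios, so this is only the locking shape, not an actual
patch.

    #include <linux/mm.h>
    #include <linux/mm_inline.h>
    #include <linux/memcontrol.h>

    /*
     * Illustrative only: with the PG_lru test done under lruvec->lru_lock,
     * a concurrent release_pages()/folio_isolate_lru() that follows the same
     * ordering cannot slip in between the flag test and the list handling.
     */
    static void example_release_lru(struct folio *folio)
    {
            struct lruvec *lruvec;
            unsigned long flags;

            lruvec = folio_lruvec_lock_irqsave(folio, &flags);
            if (folio_test_clear_lru(folio))
                    lruvec_del_folio(lruvec, folio);
            unlock_page_lruvec_irqrestore(lruvec, flags);
    }

The trade-off is that the lruvec lock is then taken even for folios that turn
out not to be on an LRU list, which is exactly what the current
test-the-flag-first code avoids.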
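For comparison, a rough sketch of the guarantee Matthew describes for
madvise_cold_or_pageout_pte_range(): the folio is obtained from a present PTE
while the PTE lock is held, so mapcount >= 1 and refcount >= mapcount until
pte_unmap_unlock(). This is simplified pseudo-kernel code, not the real
function; the helper name sketch_isolate_under_pte_lock() is invented, and it
assumes the code sits under mm/ so that folio_isolate_lru() and
folio_putback_lru() from internal.h are visible.

    #include <linux/mm.h>
    #include <linux/mm_inline.h>
    #include <linux/pgtable.h>
    #include "internal.h"   /* folio_isolate_lru(), folio_putback_lru() */

    static void sketch_isolate_under_pte_lock(struct vm_area_struct *vma,
                                              pmd_t *pmd, unsigned long addr)
    {
            spinlock_t *ptl;
            struct folio *folio;
            pte_t *pte;
            pte_t ptent;

            pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
            if (!pte)
                    return;

            ptent = ptep_get(pte);
            if (pte_present(ptent)) {
                    folio = vm_normal_folio(vma, addr, ptent);
                    /*
                     * The PTE maps the folio and the PTE lock is held, so
                     * the mapping pins at least one reference: the refcount
                     * cannot reach zero before pte_unmap_unlock() below.
                     */
                    if (folio && folio_isolate_lru(folio))
                            folio_putback_lru(folio);
            }
            pte_unmap_unlock(pte, ptl);
    }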