From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 368C5C021B2 for ; Sat, 22 Feb 2025 21:31:52 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 40FA26B007B; Sat, 22 Feb 2025 16:31:52 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 3BFA86B0082; Sat, 22 Feb 2025 16:31:52 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 290306B0083; Sat, 22 Feb 2025 16:31:52 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0013.hostedemail.com [216.40.44.13]) by kanga.kvack.org (Postfix) with ESMTP id 0CB136B007B for ; Sat, 22 Feb 2025 16:31:52 -0500 (EST) Received: from smtpin17.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay03.hostedemail.com (Postfix) with ESMTP id 5E107A43F4 for ; Sat, 22 Feb 2025 21:31:51 +0000 (UTC) X-FDA: 83148878022.17.524C9DC Received: from mail-ua1-f42.google.com (mail-ua1-f42.google.com [209.85.222.42]) by imf25.hostedemail.com (Postfix) with ESMTP id 82985A000E for ; Sat, 22 Feb 2025 21:31:49 +0000 (UTC) Authentication-Results: imf25.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=gSJtTWOI; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (imf25.hostedemail.com: domain of 21cnbao@gmail.com designates 209.85.222.42 as permitted sender) smtp.mailfrom=21cnbao@gmail.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1740259909; a=rsa-sha256; cv=none; b=a3+jHMivf1ARzw3DMc9+HoJwyMW3pB9MHZohXZIq5LXJ/xBayKzHMHK7XLBRsKiwVLoepF Hely6SvJgxtfciCDSYiJEf2WIunHIxrhvv3nsAilXXPYdvhasynK/VnL5JHcIthhHeFjUP zG2HSnJfEqG5L3L6tZrteb5VaLvEx8s= ARC-Authentication-Results: i=1; imf25.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=gSJtTWOI; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (imf25.hostedemail.com: domain of 21cnbao@gmail.com designates 209.85.222.42 as permitted sender) smtp.mailfrom=21cnbao@gmail.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1740259909; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=OFl0XQfY5EgWWOaf98TRRQVcDwnlYMlHf963+xnJa3Q=; b=4utIUYanFmqaK9i1NhjJWYVI3Cbe5JLlXIKinJMWiVL/VMyfhMoLfJjFhMDEPnkIHUifz5 A0NZn8NFo9eQtAEo9XWCjc8b1+UL2W7O7bGHMm9rL7qxEeT2JEOoCEd7yjJzKWmRlnvVNo DRqZka2mmdKJ/zDjls5MQcNIuukR8Ro= Received: by mail-ua1-f42.google.com with SMTP id a1e0cc1a2514c-868ddc4c6b6so934617241.2 for ; Sat, 22 Feb 2025 13:31:49 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1740259908; x=1740864708; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=OFl0XQfY5EgWWOaf98TRRQVcDwnlYMlHf963+xnJa3Q=; b=gSJtTWOIflN0ud2lWdEXigPetl8SNd9vlkFzyX4Ot+cBCTQG+MmFLJ6y/YWzQ1n1eM U8WKbby2nwphPk59m0Gzklq2MM1ncexMOuQFURG5N4ZiuYCYGKxFxWyGpXk+T+bAFvgm AMHArTK4PGSKU3kYgWivHptWYHp3Ywne8Nhf6zEs6vvFlcl7weS/BilbZ4b9c+QWnPcH ee/cGXeyE/7KAh2fjXnNs8mqYhyOpC7e6LOaXUSrVvyI6EHkA1IUqb82H+MmrugLgNzA Ku6D3DNhLk7FLM66tHzyeOpP/tKo4yI5wezevZyLpXjEbCPqnqVqkhyTFJ1kIreo1Klu Bl9w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1740259908; x=1740864708; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=OFl0XQfY5EgWWOaf98TRRQVcDwnlYMlHf963+xnJa3Q=; b=umP4LKaRBW9zBWajUvrlw0JueHDEfdJufbcRTMSvWxg34nQ7IflBcL5UEzFu6975U2 TOCPX8Kve7P/4/wGT+x5rv7UQr89+UAPbWD9DifCB0dVwBDGU+5uvZ2rr6CIzrO7Abnv 9mSNVf8yVSXMoq3DhZKM3vRseK/s9I38Z2q1TBAO3Q4spraveBFl6WsUVrIKVqatVkz7 NXramt7Ro83jqhmSuILTNgz2VFsC6+JaqyTWvlNxhm8XTIy1bOx8h8ieS9CIImceDmYh fO6YdRmX9CLKZ9AhR+F747nqy7Ufpyy10mirDLA0FPeiuqWYaGarerrOKWJGcBgKih4e M8ow== X-Forwarded-Encrypted: i=1; AJvYcCWI0yJa0mAojMYdQ/XVo1twQ620XdZmUzq3amkKJnBrQuGXvkGpv0eepsONgdQxvnmw8cbmel0Uxw==@kvack.org X-Gm-Message-State: AOJu0YxhF5JIUDjx44OE8Zn3fbFVX2LninlYXSSqOlJqhpVnkR6TYdXd WN4KtoAJIc+qenwV5GPtNLs5DuRISfG7FseNW+fJo30V1FGcYrXFmvZt1fwgtxPxfpm/LFL2dy4 w98j8PwZ0yyM0qGqN3++jEmy33kc= X-Gm-Gg: ASbGncu9BHrTrGYErkWnuccxnC/VC/UVdu+ewMYFSBhjZIafci0ptTfhx7gIbkth5m6 6BfefwjpnTbywpBC13gDWdlcIfyvuDJBQupu4ZqrJROgJxsqrvMUL0IGbsEKEsh4AF8vv6RE8YJ 0PXWa69eU= X-Google-Smtp-Source: AGHT+IHoyujMRhEdPKxeKf2mWlugvpAIglgbKM2uZWU15AhMHrumLw69+SgSol8v/MwdrCrAt4+D0bU4BZUUTD2wt7o= X-Received: by 2002:a05:6122:2529:b0:51b:b750:8303 with SMTP id 71dfb90a1353d-521ee49c7fbmr3952395e0c.11.1740259908445; Sat, 22 Feb 2025 13:31:48 -0800 (PST) MIME-Version: 1.0 References: <69dbca2b-cf67-4fd8-ba22-7e6211b3e7c4@redhat.com> <20250220092101.71966-1-21cnbao@gmail.com> In-Reply-To: From: Barry Song <21cnbao@gmail.com> Date: Sun, 23 Feb 2025 10:31:37 +1300 X-Gm-Features: AWEUYZno13prQTj8vWDPxZEpQGT6slHD7oIiAxSZ34dWeJnuJMR5z9qAQPqbza0 Message-ID: Subject: Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache To: Peter Xu Cc: david@redhat.com, Liam.Howlett@oracle.com, aarcange@redhat.com, akpm@linux-foundation.org, axelrasmussen@google.com, bgeffon@google.com, brauner@kernel.org, hughd@google.com, jannh@google.com, kaleshsingh@google.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org, lokeshgidra@google.com, mhocko@suse.com, ngeoffray@google.com, rppt@kernel.org, ryan.roberts@arm.com, shuah@kernel.org, surenb@google.com, v-songbaohua@oppo.com, viro@zeniv.linux.org.uk, willy@infradead.org, zhangpeng362@huawei.com, zhengtangquan@oppo.com, yuzhao@google.com, stable@vger.kernel.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Server: rspam02 X-Rspamd-Queue-Id: 82985A000E X-Stat-Signature: i44dasut3khfuog641s557ub9r9bijnj X-Rspam-User: X-HE-Tag: 1740259909-743607 X-HE-Meta: U2FsdGVkX1+TVGQi/Qo4tTBI5s9XGyzhMiW+R9B9rnFZnMQ3T7A0l6sB/E6k0gtdOKkbsPkjzdJlWwA0u0w4hNzXs4sFz3ihx8gDW4q/g5GIjYBFB/TAyIV7kOHHrMRAyTfn2qV4k/oX3DO2NcqpCLJzA6V/UGgyTiOkaAKhKc5jwYP7b0RLJigd6xBZhdosH6l1/3lr+DdanX8bXldiI3Izan/uvb85HAM3Ef6IuB+wz4yl4eJwTy5iSDFQA+zfYGDxUBq5oppDtRe9n7O9SX/+3ojdnSNoaJcpYTCr30OuPCOzrSgU8GDl2HSwxfwWqZVpj6GyOxPRpNVtxkkQ1Ux7HkH5BcK+/d7FEhfBIyOkKzdKoliHgLHMSKGWowSyETBx1SElnNjznw+3xGUn3/NJrgCmngKbqYRgb4YCaCec5jgBBnHKhwh/Ig+9p2Pm9wIjhrLmFO5az2rzZsqflKX7cmIXf8g2t925k4v3Pte3xkm72M8Atjy2SGLpQP+WCIEFDRoPGsGzoeu1y8rBO+3HhqotBoo4oyUXNVj5PEi0UYr5t6Pxj/TjFjLpY4yPa0HFxQ4BP8pKGCQPEpPPFihievtQzNR+mFM3Q+DfbRn9/n08JwYMMWmsb2xBMOFueHhX61vbxmxZHWRDXkZZUawshXKLFjQdhC8198OMriSmiG5R8/4wpTxE//oSArld1i19Kt6RcYfEc6FJivjZPyy7DCln5vktS4faPSiNsaDICTCPBDdFFcgK/cqX1yr7D+FBoMjnsnAsT1YGM8ntdX1ZV/3qnMKYK1up7yusUMcaihASYZoNrAqMEu0CMMfIxaC1xz6zsqTKGMfeb3U8tNTvPO03ppwbeeB6MWFh7NRSQZmkoDyyOAWtk31rMJ/X5MUlHuvxhobKwFVTJqPABbl4+H1yrdWb0ZymWgWJRW/dyAXiqCLOAX4sObVstkhHV7j7rfX5STXPD+kaV28 OURGTuzS XqByf/Pe8fDdSI4QOu4ybufFyM8H+cf23o+iEzvnF8TeSJDe/41Nm0+Z8mf4GTelHzSk0Err1RhG44HJihT3vbhCWhmVwhZXf4fkZRxobXxuT7YhTpqRcGueZREUbkWgraW5LUwHCV23oTj1T/2/n48f3Wowx/w/OnxipVVkqhOGUpcz8gTCu0+9/Dg+OXzxKh/s9VFvdZpYHNlHVVRSBvfhO9gxsKZfM8gilpPtANKMXQPpQNJ5sS/QcBIPG3pmLY/9lvosR2QMcGak1n2oC2ZC1294nCB/HrkcNfjl7+BaLW0TvU3QNGl4vIMuj1JNBbp6Vmzm0XjfQIF6Mz79FBy/YV5kLTuHcgy6RgOoIJfXOjmU+xeQ9Ly8Uskdv9I6b+uG7ZwL3t5Bkd3k= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000069, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Fri, Feb 21, 2025 at 2:49=E2=80=AFPM Peter Xu wrote: > > On Fri, Feb 21, 2025 at 01:07:24PM +1300, Barry Song wrote: > > On Fri, Feb 21, 2025 at 12:32=E2=80=AFPM Peter Xu w= rote: > > > > > > On Thu, Feb 20, 2025 at 10:21:01PM +1300, Barry Song wrote: > > > > 2. src_anon_vma and its lock =E2=80=93 swapcache doesn=E2=80=99t re= quire it=EF=BC=88folio is not mapped=EF=BC=89 > > > > > > Could you help explain what guarantees the rmap walk not happen on a > > > swapcache page? > > > > > > I'm not familiar with this path, though at least I see damon can star= t a > > > rmap walk on PageAnon almost with no locking.. some explanations wou= ld be > > > appreciated. > > > > I am observing the following in folio_referenced(), which the anon_vma = lock > > was originally intended to protect. > > > > if (!pra.mapcount) > > return 0; > > > > I assume all other rmap walks should do the same? > > Yes normally there'll be a folio_mapcount() check, however.. > > > > > int folio_referenced(struct folio *folio, int is_locked, > > struct mem_cgroup *memcg, unsigned long *vm_flags) > > { > > > > bool we_locked =3D false; > > struct folio_referenced_arg pra =3D { > > .mapcount =3D folio_mapcount(folio), > > .memcg =3D memcg, > > }; > > > > struct rmap_walk_control rwc =3D { > > .rmap_one =3D folio_referenced_one, > > .arg =3D (void *)&pra, > > .anon_lock =3D folio_lock_anon_vma_read, > > .try_lock =3D true, > > .invalid_vma =3D invalid_folio_referenced_vma, > > }; > > > > *vm_flags =3D 0; > > if (!pra.mapcount) > > return 0; > > ... > > } > > > > By the way, since the folio has been under reclamation in this case and > > isn't in the lru, this should also prevent the rmap walk, right? > > .. I'm not sure whether it's always working. > > The thing is anon doesn't even require folio lock held during (1) checkin= g > mapcount and (2) doing the rmap walk, in all similar cases as above. I s= ee > nothing blocks it from a concurrent thread zapping that last mapcount: > > thread 1 thread 2 > -------- -------- > [whatever scanner] > check folio_mapcount(), non-zero > zap the last map.. then m= apcount=3D=3D0 > rmap_walk() > > Not sure if I missed something. > > The other thing is IIUC swapcache page can also have chance to be faulted > in but only if a read not write. I actually had a feeling that your > reproducer triggered that exact path, causing a read swap in, reusing the > swapcache page, and hit the sanity check there somehow (even as mentioned > in the other reply, I don't yet know why the 1st check didn't seem to > work.. as we do check folio->index twice..). > > Said that, I'm not sure if above concern will happen in this specific cas= e, > as UIFFDIO_MOVE is pretty special, that we check exclusive bit first in s= wp > entry so we know it's definitely not mapped elsewhere, meanwhile if we ho= ld > pgtable lock so maybe it can't get mapped back.. it is just still tricky, > at least we do some dances all over releasing and retaking locks. > > We could either justify that's safe, or maybe still ok and simpler if we > could take anon_vma write lock, making sure nobody will be able to read t= he > folio->index when it's prone to an update. What prompted me to do the former is that folio_get_anon_vma() returns NULL for an unmapped folio. As for the latter, we need to carefully evaluat= e whether the change below is safe. --- a/mm/rmap.c +++ b/mm/rmap.c @@ -505,7 +505,7 @@ struct anon_vma *folio_get_anon_vma(const struct folio *folio) anon_mapping =3D (unsigned long)READ_ONCE(folio->mapping); if ((anon_mapping & PAGE_MAPPING_FLAGS) !=3D PAGE_MAPPING_ANON) goto out; - if (!folio_mapped(folio)) + if (!folio_mapped(folio) && !folio_test_swapcache(folio)) goto out; anon_vma =3D (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON)= ; @@ -521,7 +521,7 @@ struct anon_vma *folio_get_anon_vma(const struct folio *folio) * SLAB_TYPESAFE_BY_RCU guarantees that - so the atomic_inc_not_zer= o() * above cannot corrupt). */ - if (!folio_mapped(folio)) { + if (!folio_mapped(folio) && !folio_test_swapcache(folio)) { rcu_read_unlock(); put_anon_vma(anon_vma); return NULL; The above change, combined with the change below, has also resolved the mTH= P -EBUSY issue. diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index e5718835a964..1ef991b5c225 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -1333,6 +1333,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pte_unmap(&orig_src_pte); pte_unmap(&orig_dst_pte); src_pte =3D dst_pte =3D NULL; + folio_wait_writeback(src_folio); err =3D split_folio(src_folio); if (err) goto out; @@ -1343,7 +1344,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, goto retry; } - if (!src_anon_vma && pte_present(orig_src_pte)) { + if (!src_anon_vma) { /* * folio_referenced walks the anon_vma chain * without the folio lock. Serialize against it with split_folio() returns -EBUSY if the folio is under writeback or if folio_get_anon_vma() returns NULL. I have no issues with the latter, provided the change in folio_get_anon_vma= () is safe, as it also resolves the mTHP -EBUSY issue. We need to carefully consider the five places where folio_get_anon_vma() is called, as this patch will also be backported to stable. 1 2618 mm/huge_memory.c <> src_anon_vma =3D folio_get_anon_vma(src_folio); 2 3765 mm/huge_memory.c <<__folio_split>> anon_vma =3D folio_get_anon_vma(folio); 3 1280 mm/migrate.c <> anon_vma =3D folio_get_anon_vma(src); 4 1485 mm/migrate.c <> anon_vma =3D folio_get_anon_vma(src); 5 1354 mm/userfaultfd.c <> src_anon_vma =3D folio_get_anon_vma(src_folio); > > Thanks, > > -- > Peter Xu > Thanks barry