From: Yu Zhao
Date: Wed, 26 Jun 2024 22:47:39 -0600
Subject: Re: [RFC PATCH] mm/hugetlb_vmemmap: fix race with speculative PFN walkers
To: Muchun Song
Cc: Andrew Morton, linux-mm@kvack.org, David Hildenbrand, Frank van der Linden, "Matthew Wilcox (Oracle)"
In-Reply-To: <7380dad0-829a-48b6-a69e-e3a02ded30de@linux.dev>
References: <20240621213717.1099079-1-yuzhao@google.com> <7380dad0-829a-48b6-a69e-e3a02ded30de@linux.dev>

On Tue, Jun 25, 2024 at 8:38 PM Muchun Song wrote:
>
> On 2024/6/22 05:37, Yu Zhao wrote:
> > While investigating HVO for THPs [1], it turns out that speculative
> > PFN walkers like compaction can race with vmemmap modifications,
> > e.g.,
> >
> >   CPU 1 (vmemmap modifier)         CPU 2 (speculative PFN walker)
> >   -----------------------------    ------------------------------
> >   Allocates an LRU folio page1
> >                                    Sees page1
> >   Frees page1
> >
> >   Allocates a hugeTLB folio page2
> >   (page1 being a tail of page2)
> >
> >   Updates vmemmap mapping page1
> >                                    get_page_unless_zero(page1)
> >
> > Even though page1 has a zero refcnt after HVO, get_page_unless_zero()
> > can still try to modify its read-only struct page, resulting in a
> > crash.
> >
> > An independent report [2] confirmed this race.
>
> Right. Thanks for your continuous focus on this race.
>
> >
> > There are two discussed approaches to fix this race:
> > 1. Make RO vmemmap RW so that get_page_unless_zero() can fail without
> >    triggering a PF.
> > 2. Use RCU to make sure get_page_unless_zero() either sees zero
> >    refcnts through the old vmemmap or non-zero refcnts through the new
> >    one.
> >
> > The second approach is preferred here because:
> > 1. It can prevent illegal modifications to struct page[] that is HVO'ed;
> > 2. It can be generalized, in a way similar to ZERO_PAGE(), to fix
> >    similar races in other places, e.g., arch_remove_memory() on x86
> >    [3], which frees vmemmap mapping offlined struct page[].
> >
> > While adding synchronize_rcu(), the goal is to be surgical, rather
> > than optimized. Specifically, calls to synchronize_rcu() on the error
> > handling paths can be coalesced, but it is not done for the sake of
> > simplicity: noticeably, this fix removes ~50% more lines than it adds.
> >
> > [1] https://lore.kernel.org/20240229183436.4110845-4-yuzhao@google.com/
> > [2] https://lore.kernel.org/917FFC7F-0615-44DD-90EE-9F85F8EA9974@linux.dev/
> > [3] https://lore.kernel.org/be130a96-a27e-4240-ad78-776802f57cad@redhat.com/
> >
> > Signed-off-by: Yu Zhao
> > ---
> >   include/linux/page_ref.h |  8 ++++++-
> >   mm/hugetlb.c             | 50 +++++-----------------------------------
> >   mm/hugetlb_vmemmap.c     | 16 +++++++++++++
> >   3 files changed, 29 insertions(+), 45 deletions(-)
> >
> > diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
> > index 1acf5bac7f50..add92e8f31b2 100644
> > --- a/include/linux/page_ref.h
> > +++ b/include/linux/page_ref.h
> > @@ -230,7 +230,13 @@ static inline int folio_ref_dec_return(struct folio *folio)
> >
> >   static inline bool page_ref_add_unless(struct page *page, int nr, int u)
> >   {
> > -     bool ret = atomic_add_unless(&page->_refcount, nr, u);
> > +     bool ret = false;
> > +
> > +     rcu_read_lock();
> > +     /* avoid writing to the vmemmap area being remapped */
> > +     if (!page_is_fake_head(page) && page_ref_count(page) != u)
> > +             ret = atomic_add_unless(&page->_refcount, nr, u);
> > +     rcu_read_unlock();
> >
> >       if (page_ref_tracepoint_active(page_ref_mod_unless))
> >               __page_ref_mod_unless(page, nr, ret);
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index f35abff8be60..271d83a7cde0 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1629,9 +1629,8 @@ static inline void destroy_compound_gigantic_folio(struct folio *folio,
> >    *
> >    * Must be called with hugetlb lock held.
> >    */
> > -static void __remove_hugetlb_folio(struct hstate *h, struct folio *folio,
> > -                                                     bool adjust_surplus,
> > -                                                     bool demote)
> > +static void remove_hugetlb_folio(struct hstate *h, struct folio *folio,
> > +                                                     bool adjust_surplus)
> >   {
> >       int nid = folio_nid(folio);
> >
> > @@ -1661,33 +1660,13 @@ static void __remove_hugetlb_folio(struct hstate *h, struct folio *folio,
> >       if (!folio_test_hugetlb_vmemmap_optimized(folio))
> >               __folio_clear_hugetlb(folio);
> >
> > -     /*
> > -      * In the case of demote we do not ref count the page as it will soon
> > -      * be turned into a page of smaller size.
> > -      */
> > -     if (!demote)
> > -             folio_ref_unfreeze(folio, 1);
> > -
> >       h->nr_huge_pages--;
> >       h->nr_huge_pages_node[nid]--;
> >   }
> >
> > -static void remove_hugetlb_folio(struct hstate *h, struct folio *folio,
> > -                                                     bool adjust_surplus)
> > -{
> > -     __remove_hugetlb_folio(h, folio, adjust_surplus, false);
> > -}
> > -
> > -static void remove_hugetlb_folio_for_demote(struct hstate *h, struct folio *folio,
> > -                                                     bool adjust_surplus)
> > -{
> > -     __remove_hugetlb_folio(h, folio, adjust_surplus, true);
> > -}
> > -
> >   static void add_hugetlb_folio(struct hstate *h, struct folio *folio,
> >                                 bool adjust_surplus)
> >   {
> > -     int zeroed;
> >       int nid = folio_nid(folio);
> >
> >       VM_BUG_ON_FOLIO(!folio_test_hugetlb_vmemmap_optimized(folio), folio);
> > @@ -1711,21 +1690,6 @@ static void add_hugetlb_folio(struct hstate *h, struct folio *folio,
> >        */
> >       folio_set_hugetlb_vmemmap_optimized(folio);
> >
> > -     /*
> > -      * This folio is about to be managed by the hugetlb allocator and
> > -      * should have no users.  Drop our reference, and check for others
> > -      * just in case.
> > -      */
> > -     zeroed = folio_put_testzero(folio);
> > -     if (unlikely(!zeroed))
> > -             /*
> > -              * It is VERY unlikely soneone else has taken a ref
> > -              * on the folio.  In this case, we simply return as
> > -              * free_huge_folio() will be called when this other ref
> > -              * is dropped.
> > -              */
> > -             return;
> > -
> >       arch_clear_hugetlb_flags(folio);
> >       enqueue_hugetlb_folio(h, folio);
> >   }
> > @@ -1779,6 +1743,8 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
> >               spin_unlock_irq(&hugetlb_lock);
> >       }
> >
> > +     folio_ref_unfreeze(folio, 1);
> > +
> >       /*
> >        * Non-gigantic pages demoted from CMA allocated gigantic pages
> >        * need to be given back to CMA in free_gigantic_folio.
> > @@ -3079,11 +3045,8 @@ static int alloc_and_dissolve_hugetlb_folio(struct hstate *h,
> >
> >   free_new:
> >       spin_unlock_irq(&hugetlb_lock);
> > -     if (new_folio) {
> > -             /* Folio has a zero ref count, but needs a ref to be freed */
> > -             folio_ref_unfreeze(new_folio, 1);
> > +     if (new_folio)
> >               update_and_free_hugetlb_folio(h, new_folio, false);
> > -     }
>
> Look into this function, we have:
>
> dissolve_free_huge_page
>     retry:
>     if (!folio_test_hugetlb(folio))
>         return;
>     if (!folio_ref_count(folio))
>         if (unlikely(!folio_test_hugetlb_freed(folio)))
>             goto retry;
>         remove_hugetlb_folio(h, folio, false);
>
> Since you have not raised the refcount in remove_hugetlb_folio(), we will
> dissolve this page again if there is a concurrent dissolve_free_huge_page()
> processing routine. Then, the statistics will be wrong (like
> ->nr_huge_pages).

Thanks for pointing this out!

> A solution seems easy, we should call folio_clear_hugetlb_freed() in
> remove_hugetlb_folio().

Agreed.
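
To make that concrete, here is a rough sketch of what the suggested change could look like in remove_hugetlb_folio(). This is illustrative only and not verified against the actual follow-up patch: unrelated parts of the function are elided, and it only assumes the existing folio_test_hugetlb_freed()/folio_clear_hugetlb_freed() helpers already mentioned above.

static void remove_hugetlb_folio(struct hstate *h, struct folio *folio,
				 bool adjust_surplus)
{
	int nid = folio_nid(folio);

	lockdep_assert_held(&hugetlb_lock);

	list_del(&folio->lru);

	if (folio_test_hugetlb_freed(folio)) {
		/*
		 * The refcount is now left frozen at zero, so clear the
		 * "freed" flag here: a concurrent dissolve_free_huge_page()
		 * that still sees a zero refcount then fails its
		 * folio_test_hugetlb_freed() check and loops back to retry,
		 * instead of removing this folio a second time and
		 * double-decrementing nr_huge_pages.
		 */
		folio_clear_hugetlb_freed(folio);
		h->free_huge_pages--;
		h->free_huge_pages_node[nid]--;
	}

	/* ... rest of remove_hugetlb_folio() unchanged ... */
}

With something along these lines, the second walker keeps retrying under hugetlb_lock until the folio is no longer a hugetlb folio and then bails out, so the per-hstate statistics stay consistent.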