Date: Mon, 29 Jul 2024 12:39:54 -0400
From: Peter Xu <peterx@redhat.com>
To: David Hildenbrand
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andrew Morton,
 Muchun Song, Oscar Salvador, Qi Zheng, Hugh Dickins
Subject: Re: [PATCH v1 1/2] mm: let pte_lockptr() consume a pte_t pointer
References: <20240725183955.2268884-1-david@redhat.com>
 <20240725183955.2268884-2-david@redhat.com>
 <9e671388-a5c6-4de1-8c85-b7af8aee7f44@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
On Mon, Jul 29, 2024 at 12:26:00PM -0400, Peter Xu wrote:
> On Fri, Jul 26, 2024 at 11:48:01PM +0200, David Hildenbrand wrote:
> > On 26.07.24 23:28, Peter Xu wrote:
> > > On Fri, Jul 26, 2024 at 06:02:17PM +0200, David Hildenbrand wrote:
> > > > On 26.07.24 17:36, Peter Xu wrote:
> > > > > On Thu, Jul 25, 2024 at 08:39:54PM +0200, David Hildenbrand wrote:
> > > > > > pte_lockptr() is the only *_lockptr() function that doesn't consume
> > > > > > what would be expected: it consumes a pmd_t pointer instead of a pte_t
> > > > > > pointer.
> > > > > >
> > > > > > Let's change that. The two callers in pgtable-generic.c are easily
> > > > > > adjusted. Adjust khugepaged.c:retract_page_tables() to simply do a
> > > > > > pte_offset_map_nolock() to obtain the lock, even though we won't actually
> > > > > > be traversing the page table.
> > > > > >
> > > > > > This makes the code more similar to the other variants and avoids other
> > > > > > hacks to make the new pte_lockptr() version happy. pte_lockptr() users
> > > > > > reside now only in pgtable-generic.c.
> > > > > >
> > > > > > Maybe, using pte_offset_map_nolock() is the right thing to do because
> > > > > > the PTE table could have been removed in the meantime? At least it sounds
> > > > > > more future proof if we ever have other means of page table reclaim.
> > > > >
> > > > > I think it can't change, because anyone who wants to race against this
> > > > > should try to take the pmd lock first (which was held already)?
> > > >
> > > > That doesn't explain why it is safe for us to assume that after we took the
> > > > PMD lock that the PMD actually still points at a completely empty page
> > > > table. Likely it currently works by accident, because we only have a single
> > > > such user that makes this assumption. It might certainly be a different once
> > > > we asynchronously reclaim page tables.
> > >
> > > I think it's safe because find_pmd_or_thp_or_none() returned SUCCEED, and
> > > we're holding i_mmap lock for read. I don't see any way that this pmd can
> > > become a non-pgtable-page.
> > >
> > > I meant, AFAIU tearing down pgtable in whatever sane way will need to at
> > > least take both mmap write lock and i_mmap write lock (in this case, a file
> > > mapping), no?
> >
> > Skimming over [1] where I still owe a review I think we can now do it now
> > purely under the read locks, with the PMD lock held.
>
> Err, how I missed that.. yeah you're definitely right, and that's the
> context here where we're collapsing. I think I somehow forgot all Hugh's
> work when I replied there, sorry.
>
> > I think this is also what collapse_pte_mapped_thp() ends up doing: replace a
> > PTE table that maps a folio by a PMD (present or none, depends) that maps a
> > folio only while holding the mmap lock in read mode. Of course, here the
> > table is not empty but we need similar ways of making PT walkers aware of
> > concurrent page table retraction.
> >
> > IIRC, that was the magic added to __pte_offset_map(), such that
> > pte_offset_map_nolock/pte_offset_map_lock can fail on races.
>
> Said that, I still think current code (before this patch) is safe, same to
> a hard-coded line to lock the pte pgtable lock. Again, I'm fine if you
> prefer pte_offset_map_nolock() but I just think the rcu read lock and stuff
> can be avoided.
>
> I think it's because such collapse so far can only happen in such path
> where we need to hold the large folio (PMD-level) lock first. It means
> anyone who could change this pmd entry to be not a pte pgtable is blocked
> already, hence it must keeping being a pte pgtable page even if we don't
> take any rcu.
>
> >
> > But if we hold the PMD lock, nothing should actually change (so far my
> > understanding) -- we cannot suddenly rip out a page table.
> >
> > [1]
> > https://lkml.kernel.org/r/cover.1719570849.git.zhengqi.arch@bytedance.com
> >
> > > >
> > > > But yes, the PMD cannot get modified while we hold the PMD lock, otherwise
> > > > we'd be in trouble
> > > >
> > > > >
> > > > > I wonder an open coded "ptlock_ptr(page_ptdesc(pmd_page(*pmd)))" would be
> > > > > nicer here, but only if my understanding is correct.
> > > >
> > > > I really don't like open-coding that. Fortunately we were able to limit the
> > > > use of ptlock_ptr to a single user outside of arch/x86/xen/mmu_pv.c so far.
> > >
> > > I'm fine if you prefer like that; I don't see it a huge deal to me.
> >
> > Let's keep it like that, unless we can come up with something neater. At
> > least it makes the code also more consistent with similar code in that file
> > and the overhead should be minimal.
> >
> > I was briefly thinking about actually testing if the PT is full of
> > pte_none(), either as a debugging check or to also handle what is currently
> > handled via:
> >
> > if (likely(!vma->anon_vma && !userfaultfd_wp(vma))) {
> >
> > Seems wasteful just because some part of a VMA might have a private page
> > mapped / uffd-wp active to let all other parts suffer.
> >
> > Will think about if that is really worth it.
> >
> > ... also because I still want to understand why the PTL of the PMD table is
> > required at all. What if we lock it first and somebody else wants to lock it
> > after us while we already ripped it out? Sure there must be some reason for
> > the lock, I just don't understand it yet :/.
>
> IIUC the pte pgtable lock will be needed for checking anon_vma safely.
>
> e.g., consider if we don't take the pte pgtable lock, I think it's possible
> some thread tries to inject a private pte (and prepare anon_vma before
> that) concurrently with this thread trying to collapse the pgtable into a
> huge pmd. I mean, when without the pte pgtable lock held, I think it's
> racy to check this line:
>
> if (unlikely(vma->anon_vma || userfaultfd_wp(vma))) {
>     ...
> }
>
> On the 1st condition.

Hmm, right after I replied, I think it also guarantees safety on the 2nd
condition.

Note that one reason I still prefer a hard-coded line over
pte_offset_map_nolock() is that the new code seems to say we can treat the
pte pgtable page differently from the pte pgtable lock. But I think they're
really in the same realm. In short, AFAIU the rcu lock not only protects
the pte pgtable's existence, but also protects the pte lock.

From that POV, the below new code in this patch:

-	ptl = pte_lockptr(mm, pmd);
+
+	/*
+	 * No need to check the PTE table content, but we'll grab the
+	 * PTE table lock while we zap it.
+	 */
+	pte = pte_offset_map_nolock(mm, pmd, addr, &ptl);
+	if (!pte)
+		goto unlock_pmd;
 	if (ptl != pml)
 		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+	pte_unmap(pte);

could be very misleading, where it seems to say that it's fine to use the
pte pgtable lock even after the rcu unlock. It might make the code harder
to understand.

Thanks,

-- 
Peter Xu