From: Yosry Ahmed <yosryahmed@google.com>
Date: Tue, 18 Jul 2023 18:52:09 -0700
Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio
To: Yu Zhao
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, akpm@linux-foundation.org, willy@infradead.org, david@redhat.com, ryan.roberts@arm.com, shy828301@gmail.com, Yin Fengwei, Hugh Dickins

On Tue, Jul 18, 2023 at 6:32 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Tue, Jul 18, 2023 at 4:47 PM Yin Fengwei wrote:
> >
> >
> > On 7/19/23 06:48, Yosry Ahmed wrote:
> > > On Sun, Jul 16, 2023 at 6:58 PM Yin Fengwei wrote:
> > >>
> > >>
> > >> On 7/17/23 08:35, Yu Zhao wrote:
> > >>> On Sun, Jul 16, 2023 at 6:00 PM Yin, Fengwei wrote:
> > >>>>
> > >>>> On 7/15/2023 2:06 PM, Yu Zhao wrote:
> > >>>>> There is a problem here that I didn't have the time to elaborate:
> > >>>>> we can't mlock() a folio that is within the range but not fully
> > >>>>> mapped, because this folio can be on the deferred split queue.
> > >>>>> When the split happens, those unmapped folios (not mapped by this
> > >>>>> vma but mapped into other vmas) will be stranded on the
> > >>>>> unevictable lru.
> > >>>>
> > >>>> This should be fine unless I missed something. During a large folio
> > >>>> split, unmap_folio() will migrate (anon) or unmap (file) the folio.
> > >>>> The folio will be munlocked in unmap_folio(), so the head/tail
> > >>>> pages will always be evictable.
> > >>>
> > >>> It's close but not entirely accurate: munlock can fail on isolated
> > >>> folios.
> > >> Yes. The munlock just clears the PG_mlocked bit but leaves
> > >> PG_unevictable set.
> > >>
> > >> Could this also happen with a normal 4K page? I mean, when a user
> > >> tries to munlock a normal 4K page while that page is isolated, does
> > >> it become an unevictable page?
> > >
> > > Looks like it is possible. If cpu 1 is in __munlock_folio() and
> > > cpu 2 is isolating the folio for any purpose:
> > >
> > > cpu1                           cpu2
> > >                                isolate folio
> > > folio_test_clear_lru() // 0
> > >                                putback folio // add to unevictable list
> > > folio_test_clear_mlocked()
> > Yes. Yu showed this sequence to me in another email. I thought the
> > putback_lru() could correct the non-mlocked but unevictable folio. But
> > it doesn't, because of this race.
>
> (+Hugh Dickins for vis)
>
> Yu, I am not familiar with the split_folio() case, so I am not sure it
> is the same exact race I stated above.
>
> Can you confirm whether or not doing folio_test_clear_mlocked() before
> folio_test_clear_lru() would fix the race you are referring to? IIUC,
> in this case, we make sure we clear PG_mlocked before we try to clear
> PG_lru. If we fail to clear it, then someone else has the folio
> isolated after we clear PG_mlocked, so we can be sure that when they
> put the folio back it will be correctly made evictable.
>
> Is my understanding correct?

Hmm, actually this might not be enough. In folio_add_lru() we will call
folio_batch_add_and_move(), which calls lru_add_fn() and *then* sets
PG_lru. Since we check folio_evictable() in lru_add_fn(), the race can
still happen:

cpu1                           cpu2
                               folio_evictable() // false
folio_test_clear_mlocked()
folio_test_clear_lru() // false
                               folio_set_lru()

Relying on PG_lru for synchronization might not be enough with the
current code. We might need to revert 2262ace60713 ("mm/munlock:
delete smp_mb() from __pagevec_lru_add_fn()").

Sorry for going back and forth here, I am thinking out loud.
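To make the ordering concrete, the pairing I have in mind would look
roughly like this. This is an untested sketch, not actual kernel code:
munlock_side_sketch() and lru_add_side_sketch() are made-up names
standing in for __munlock_folio() and lru_add_fn(), and details like
fixing up a stale PG_unevictable or the mlock_count bookkeeping are
omitted:

/* munlock side: clear PG_mlocked *before* trying to isolate */
static void munlock_side_sketch(struct folio *folio)
{
	if (folio_test_clear_mlocked(folio))
		zone_stat_mod_folio(folio, NR_MLOCK, -folio_nr_pages(folio));

	smp_mb();	/* pairs with the barrier on the lru-add side */

	if (!folio_test_clear_lru(folio))
		return;	/* isolated elsewhere; the owner now sees !PG_mlocked */
	/* we isolated it: move it to the right list, then set PG_lru again */
}

/* lru-add side: set PG_lru *before* testing evictability */
static void lru_add_side_sketch(struct lruvec *lruvec, struct folio *folio)
{
	/* called with the lruvec lock held, like lru_add_fn() */
	folio_set_lru(folio);

	smp_mb();	/* pairs with the barrier on the munlock side */

	if (!folio_evictable(folio))	/* reads PG_mlocked */
		folio_set_unevictable(folio);
	lruvec_add_folio(lruvec, folio);
}

This is the classic store-buffering pattern: it cannot happen that the
munlock side fails folio_test_clear_lru() *and* the lru-add side still
sees PG_mlocked set, so at least one of the two sides always puts the
folio on the right list.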
>
> If yes, I can add this fix to my next version of the RFC series to
> rework mlock_count. It would be a lot more complicated with the
> current implementation (as I stated in a previous email).
>
> >
> > >
> > >
> > > The page would be stranded on the unevictable list in this case, no?
> > > Maybe we should only try to isolate the page (clear PG_lru) after we
> > > possibly clear PG_mlocked? In this case, if we fail to isolate, we
> > > know for sure that whoever has the page isolated will observe that
> > > PG_mlocked is clear and correctly make the page evictable.
> > >
> > > This probably would be complicated with the current implementation,
> > > as we first need to decrement mlock_count to determine if we want to
> > > clear PG_mlocked, and to do so we need to isolate the page, as
> > > mlock_count overlays page->lru. With the proposal in [1] to rework
> > > mlock_count, it might be much simpler as far as I can tell. I intend
> > > to refresh this proposal soon-ish.
> > >
> > > [1] https://lore.kernel.org/lkml/20230618065719.1363271-1-yosryahmed@google.com/
> > >
> > >>
> > >>
> > >> Regards
> > >> Yin, Fengwei
> > >>
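For anyone skimming: the overlay mentioned above is this union in
struct page (simplified from include/linux/mm_types.h; struct folio
mirrors the same slot):

	union {
		struct list_head lru;

		/* Or, for the Unevictable "LRU list" slot */
		struct {
			/* Always even, to negate PageTail */
			void *__filler;
			/* Count page's or folio's mlocks */
			unsigned int mlock_count;
		};
	};

lruvec_add_folio() and lruvec_del_folio() skip the actual list_add()
and list_del() for LRU_UNEVICTABLE, which is what makes the overlay
safe; but it also means mlock_count can only be touched while the folio
is isolated, i.e. after folio_test_clear_lru() succeeds, hence the
awkward isolate-first ordering in the current code.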