From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Fri, 14 Apr 2023 21:31:59 +0100
From: Matthew Wilcox <willy@infradead.org>
To: Suren Baghdasaryan
Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, mhocko@suse.com,
	josef@toxicpanda.com, jack@suse.cz, ldufour@linux.ibm.com,
	laurent.dufour@fr.ibm.com, michel@lespinasse.org,
	liam.howlett@oracle.com, jglisse@google.com, vbabka@suse.cz,
	minchan@google.com, dave@stgolabs.net, punit.agrawal@bytedance.com,
	lstoakes@gmail.com, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, kernel-team@android.com
Subject: Re: [PATCH 1/1] mm: handle swap page faults if the faulting page can be locked
Message-ID:
References: <20230414180043.1839745-1-surenb@google.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:

On Fri, Apr 14, 2023 at 12:48:54PM -0700, Suren Baghdasaryan wrote:
> > - We can call migration_entry_wait().  This will wait for PG_locked to
> > become clear (in migration_entry_wait_on_locked()).
> > As previously
> > discussed offline, I think this is safe to do while holding the VMA
> > locked.

Just to be clear, this particular use of PG_locked is not during I/O,
it's during page migration.  This is a few orders of magnitude different.

> > - We can call swap_readpage() if we allocate a new folio.  I haven't
> > traced through all this code to tell if it's OK.

... whereas this will wait for I/O.  If we decide that's not OK, we'll
need to test for FAULT_FLAG_VMA_LOCK and bail out of this path.

> > So ... I believe this is all OK, but we're definitely now willing to
> > wait for I/O from the swap device while holding the VMA lock when we
> > weren't before.  And maybe we should make a bigger deal of it in the
> > changelog.
> >
> > And maybe we shouldn't just be failing the folio_lock_or_retry(),
> > maybe we should be waiting for the folio lock with the VMA locked.
>
> Wouldn't that cause holding the VMA lock for the duration of swap I/O
> (something you said we want to avoid in the previous paragraph) and
> effectively undo d065bd810b6d ("mm: retry page fault when blocking on
> disk transfer") for VMA locks?

I'm not certain we want to avoid holding the VMA lock for the duration
of an I/O.  Here's how I understand the rationale for avoiding holding
the mmap_lock while we perform I/O (before the existence of the VMA lock):

 - If everybody is doing page faults, there is no specific problem;
   we all hold the lock for read and multiple page faults can be
   handled in parallel.
 - As soon as one thread attempts to manipulate the tree (eg calls
   mmap()), all new readers must wait (as the rwsem is fair), and the
   writer must wait for all existing readers to finish.  That's
   potentially milliseconds for an I/O during which time all page
   faults stop.

Now we have the per-VMA lock, faults which can be handled without taking
the mmap_lock can still be satisfied, as long as that VMA is not being
modified.
It is rare for a real application to take a page fault on a VMA
which is being modified.  So modifications to the tree will generally
not take VMA locks on VMAs which are currently handling faults, and
new faults will generally not find a VMA which is write-locked.

When we find a locked folio (presumably for I/O, although folios are
locked for other reasons), if we fall back to taking the mmap_lock for
read, we increase contention on the mmap_lock and make the page fault
wait on any mmap() operation.  If we simply sleep waiting for the I/O,
we make any mmap() operation _which touches this VMA_ wait for the I/O
to complete.  But I think that's OK, because new page faults can
continue to be serviced ... as long as they don't need to take the
mmap_lock.

So ... I think what we _really_ want here is ...

+++ b/mm/filemap.c
@@ -1690,7 +1690,8 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
 bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
 			 unsigned int flags)
 {
-	if (fault_flag_allow_retry_first(flags)) {
+	if (!(flags & FAULT_FLAG_VMA_LOCK) &&
+	    fault_flag_allow_retry_first(flags)) {
 		/*
 		 * CAUTION! In this case, mmap_lock is not released
 		 * even though return 0.
@@ -1710,7 +1711,8 @@ bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
 		ret = __folio_lock_killable(folio);
 		if (ret) {
-			mmap_read_unlock(mm);
+			if (!(flags & FAULT_FLAG_VMA_LOCK))
+				mmap_read_unlock(mm);
 			return false;
 		}
 	} else {