From mboxrd@z Thu Jan  1 00:00:00 1970
From: Suren Baghdasaryan <surenb@google.com>
Date: Mon, 1 May 2023 10:54:57 -0700
Subject: Re: [PATCH 1/1] mm: handle swap page faults if the faulting page
 can be locked
To: Alistair Popple
Cc: Matthew Wilcox, akpm@linux-foundation.org, hannes@cmpxchg.org,
 mhocko@suse.com, josef@toxicpanda.com, jack@suse.cz, ldufour@linux.ibm.com,
 laurent.dufour@fr.ibm.com, michel@lespinasse.org, liam.howlett@oracle.com,
 jglisse@google.com, vbabka@suse.cz, minchan@google.com, dave@stgolabs.net,
 punit.agrawal@bytedance.com, lstoakes@gmail.com, linux-mm@kvack.org,
 linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
 kernel-team@android.com
References: <20230414180043.1839745-1-surenb@google.com>
 <87sfczuxkc.fsf@nvidia.com> <877cuaulrm.fsf@nvidia.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"

On Mon, Apr 17, 2023 at 6:07 PM Suren Baghdasaryan wrote:
>
> On Mon, Apr 17, 2023 at 4:50 PM Suren Baghdasaryan wrote:
> >
> > On Mon, Apr 17, 2023 at 4:33 PM Alistair Popple wrote:
> > >
> > >
> > > Suren Baghdasaryan writes:
> > >
> > > > On Sun, Apr 16, 2023 at 6:06 PM Alistair Popple wrote:
> > > >>
> > > >>
> > > >> Matthew Wilcox writes:
> > > >>
> > > >> > On Fri, Apr 14, 2023 at 11:00:43AM -0700, Suren Baghdasaryan wrote:
> > > >> >> When a page fault is handled under VMA lock protection, all swap page
> > > >> >> faults are retried with mmap_lock because the folio_lock_or_retry
> > > >> >> implementation has to drop and reacquire mmap_lock if the folio could
> > > >> >> not be immediately locked.
> > > >> >> Instead of retrying all swap page faults, retry only when folio
> > > >> >> locking fails.
> > > >> >
> > > >> > Reviewed-by: Matthew Wilcox (Oracle)
> > > >> >
> > > >> > Let's just review what can now be handled under the VMA lock instead of
> > > >> > the mmap_lock, in case somebody knows better than me that it's not safe.
> > > >> >
> > > >> > - We can call migration_entry_wait(). This will wait for PG_locked to
> > > >> >   become clear (in migration_entry_wait_on_locked()). As previously
> > > >> >   discussed offline, I think this is safe to do while holding the VMA
> > > >> >   locked.
> > > >>
> > > >> Do we even need to be holding the VMA locked while in
> > > >> migration_entry_wait()? My understanding is we're just waiting for
> > > >> PG_locked to be cleared so we can return with a reasonable chance the
> > > >> migration entry is gone. If for example it has been unmapped or
> > > >> protections downgraded we will simply refault.
> > > >
> > > > If we drop the VMA lock before migration_entry_wait() then we would need
> > > > to lock_vma_under_rcu() again after the wait. In which case it might be
> > > > simpler to retry the fault with some special return code to indicate
> > > > that the VMA lock is not held anymore and we want to retry without taking
> > > > mmap_lock. I think it's similar to the last option Matthew suggested
> > > > earlier. In which case we can reuse the same retry mechanism for both
> > > > cases, here and in __folio_lock_or_retry.
> > >
> > > Good point. Agree there is no reason to re-take the VMA lock after the
> > > wait, although in this case we shouldn't need to retry the fault
> > > (i.e. return VM_FAULT_RETRY). Just skip calling vma_end_read() on the way
> > > out to userspace.
> >
> > Actually, __collapse_huge_page_swapin(), which calls do_swap_page(), can
> > use the VMA reference again inside its loop unless we return
> > VM_FAULT_RETRY or VM_FAULT_ERROR. That is not safe since we dropped
> > the VMA lock and the stability of the VMA is not guaranteed at that point.
> > So, we do need to return VM_FAULT_RETRY, maybe with another bit
> > indicating that the retry does not need to fall back to mmap_lock. Something
> > like "return VM_FAULT_RETRY | VM_FAULT_USE_VMA_LOCK".
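
To make the quoted proposal concrete, here is a minimal user-space sketch
(not the kernel's code) of how a caller of handle_mm_fault() could interpret
such a combined return value. VM_FAULT_USE_VMA_LOCK is the hypothetical bit
suggested above, and the numeric values are purely illustrative:

/*
 * Illustrative sketch only: VM_FAULT_USE_VMA_LOCK does not exist in the
 * kernel; it is the extra bit proposed in the discussion above.
 */
typedef unsigned int vm_fault_t;

#define VM_FAULT_RETRY        0x000400u   /* illustrative value          */
#define VM_FAULT_USE_VMA_LOCK 0x800000u   /* hypothetical proposed bit   */

enum retry_path { FAULT_DONE, RETRY_UNDER_VMA_LOCK, RETRY_UNDER_MMAP_LOCK };

static enum retry_path classify_fault(vm_fault_t ret)
{
        if (!(ret & VM_FAULT_RETRY))
                return FAULT_DONE;              /* handled, or a hard error  */
        if (ret & VM_FAULT_USE_VMA_LOCK)
                return RETRY_UNDER_VMA_LOCK;    /* VMA lock already dropped;
                                                   retry without mmap_lock   */
        return RETRY_UNDER_MMAP_LOCK;           /* today's fallback path     */
}

The point of the extra bit is that the caller's retry logic, not the fault
handler, would decide whether mmap_lock is needed on the second attempt.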
>
> False alarm. __collapse_huge_page_swapin() is always called under
> mmap_lock protection. I'll go over the code once more to make sure
> nothing else would use the VMA after we drop the VMA lock in the page
> fault path.

I posted a new series at
https://lore.kernel.org/all/20230501175025.36233-1-surenb@google.com/
It implements the suggestions discussed in this thread. Feedback is
appreciated!
Thanks!

>
> >
> > >
> > > >>
> > > >> > - We can call remove_device_exclusive_entry(). That calls
> > > >> >   folio_lock_or_retry(), which will fail if it can't get the VMA lock.
> > > >>
> > > >> Looks ok to me.
> > > >>
> > > >> > - We can call pgmap->ops->migrate_to_ram(). Perhaps somebody familiar
> > > >> >   with Nouveau and amdkfd could comment on how safe this is?
> > > >>
> > > >> Currently this won't work because drivers assume mmap_lock is held during
> > > >> pgmap->ops->migrate_to_ram(). Primarily this is because
> > > >> migrate_vma_setup()/migrate_vma_pages() is used to handle the fault and
> > > >> that asserts mmap_lock is taken in walk_page_range() and also in
> > > >> migrate_vma_insert_page().
> > > >>
> > > >> So I don't think we can call that case without mmap_lock.
> > > >>
> > > >> At a glance it seems it should be relatively easy to move to using
> > > >> lock_vma_under_rcu(). Drivers will need updating as well though because
> > > >> migrate_vma_setup() is called outside of fault handling paths, so drivers
> > > >> will currently take mmap_lock rather than the VMA lock when looking up the
> > > >> VMA. See for example nouveau_svmm_bind().
> > > >
> > > > Thanks for the pointers, Alistair! It does look like we need to be
> > > > more careful with the migrate_to_ram() path. For now I can fall back to
> > > > retrying with mmap_lock for this case, like we do with all cases
> > > > today. Afterwards this path can be made ready for working under the VMA
> > > > lock and we can remove that retry. Does that sound good?
> > >
> > > Sounds good to me. Fixing that shouldn't be too difficult but will need
> > > changes to at least Nouveau and amdkfd (and hmm-tests obviously). Happy
> > > to look at doing that if/when this change makes it in. Thanks.
> > >
> > > >>
> > > >> > - I believe we can't call handle_pte_marker() because we exclude UFFD
> > > >> >   VMAs earlier.
> > > >> > - We can call swap_readpage() if we allocate a new folio. I haven't
> > > >> >   traced through all this code to tell if it's OK.
> > > >> >
> > > >> > So ... I believe this is all OK, but we're definitely now willing to
> > > >> > wait for I/O from the swap device while holding the VMA lock when we
> > > >> > weren't before. And maybe we should make a bigger deal of it in the
> > > >> > changelog.
> > > >> >
> > > >> > And maybe we shouldn't just be failing the folio_lock_or_retry(),
> > > >> > maybe we should be waiting for the folio lock with the VMA locked.
> > > >
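
As a summary of the behaviour the patch at the top of the thread is after,
here is a small stand-alone sketch. The types and helpers below are
stand-ins for illustration only (the real path goes through do_swap_page()
and folio_lock_or_retry()); it models just the decision of when a swap
fault taken under the per-VMA lock should fall back to mmap_lock:

#include <stdbool.h>

/* Stand-in folio: only models whether its lock is currently held. */
struct folio_stub { bool locked; };

enum swap_fault_result {
        SWAP_FAULT_HANDLED,             /* completed under the per-VMA lock */
        SWAP_FAULT_RETRY_MMAP_LOCK,     /* caller retries under mmap_lock   */
};

static bool folio_trylock_stub(struct folio_stub *folio)
{
        if (folio->locked)
                return false;           /* someone else holds the folio lock */
        folio->locked = true;
        return true;
}

/*
 * Sketch of the retry policy the patch describes: under the per-VMA lock,
 * fall back to the mmap_lock path only when the folio cannot be locked
 * immediately, instead of for every swap fault.
 */
static enum swap_fault_result swap_fault_under_vma_lock(struct folio_stub *folio)
{
        if (!folio_trylock_stub(folio))
                return SWAP_FAULT_RETRY_MMAP_LOCK;      /* the only fallback case now */
        /* ... swap the page in while holding the folio lock ... */
        folio->locked = false;                          /* folio_unlock() equivalent  */
        return SWAP_FAULT_HANDLED;
}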