From mboxrd@z Thu Jan  1 00:00:00 1970
From: Suren Baghdasaryan <surenb@google.com>
Date: Mon, 1 May 2023 10:54:57 -0700
Subject: Re: [PATCH 1/1] mm: handle swap page faults if the faulting page
 can be locked
To: Alistair Popple
Cc: Matthew Wilcox, akpm@linux-foundation.org, hannes@cmpxchg.org,
 mhocko@suse.com, josef@toxicpanda.com, jack@suse.cz, ldufour@linux.ibm.com,
 laurent.dufour@fr.ibm.com, michel@lespinasse.org, liam.howlett@oracle.com,
 jglisse@google.com, vbabka@suse.cz, minchan@google.com, dave@stgolabs.net,
 punit.agrawal@bytedance.com, lstoakes@gmail.com, linux-mm@kvack.org,
 linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
 kernel-team@android.com
References: <20230414180043.1839745-1-surenb@google.com>
 <87sfczuxkc.fsf@nvidia.com> <877cuaulrm.fsf@nvidia.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"

On Mon, Apr 17, 2023 at 6:07 PM Suren Baghdasaryan wrote:
>
> On Mon, Apr 17, 2023 at 4:50 PM Suren Baghdasaryan wrote:
> >
> > On Mon, Apr 17, 2023 at 4:33 PM Alistair Popple wrote:
> > >
> > >
> > > Suren Baghdasaryan writes:
> > >
> > > > On Sun, Apr 16, 2023 at 6:06 PM Alistair Popple wrote:
> > > >>
> > > >>
> > > >> Matthew Wilcox writes:
> > > >>
> > > >> > On Fri, Apr 14, 2023 at 11:00:43AM -0700, Suren Baghdasaryan wrote:
> > > >> >> When a page fault is handled under VMA lock protection, all swap page
> > > >> >> faults are retried with mmap_lock because the folio_lock_or_retry
> > > >> >> implementation has to drop and reacquire mmap_lock if the folio could
> > > >> >> not be immediately locked.
> > > >> >> Instead of retrying all swap page faults, retry only when folio
> > > >> >> locking fails.
> > > >> >
> > > >> > Reviewed-by: Matthew Wilcox (Oracle)
> > > >> >
> > > >> > Let's just review what can now be handled under the VMA lock instead of
> > > >> > the mmap_lock, in case somebody knows better than me that it's not safe.
> > > >> >
> > > >> > - We can call migration_entry_wait(). This will wait for PG_locked to
> > > >> >   become clear (in migration_entry_wait_on_locked()). As previously
> > > >> >   discussed offline, I think this is safe to do while holding the VMA
> > > >> >   locked.
> > > >>
> > > >> Do we even need to be holding the VMA locked while in
> > > >> migration_entry_wait()? My understanding is we're just waiting for
> > > >> PG_locked to be cleared so we can return with a reasonable chance the
> > > >> migration entry is gone. If for example it has been unmapped or
> > > >> protections downgraded we will simply refault.
> > > >
> > > > If we drop the VMA lock before migration_entry_wait() then we would need
> > > > to lock_vma_under_rcu() again after the wait. In which case it might be
> > > > simpler to retry the fault with some special return code to indicate
> > > > that the VMA lock is not held anymore and we want to retry without taking
> > > > mmap_lock. I think it's similar to the last option Matthew suggested
> > > > earlier. In which case we can reuse the same retry mechanism for both
> > > > cases, here and in __folio_lock_or_retry.
> > >
> > > Good point. Agree there is no reason to re-take the VMA lock after the
> > > wait, although in this case we shouldn't need to retry the fault
> > > (i.e. return VM_FAULT_RETRY). Just skip calling vma_end_read() on the way
> > > out to userspace.
> >
> > Actually, __collapse_huge_page_swapin(), which calls do_swap_page(), can
> > use the VMA reference again inside its loop unless we return
> > VM_FAULT_RETRY or VM_FAULT_ERROR. That is not safe since we dropped
> > the VMA lock and the stability of the VMA is not guaranteed at that point.
> > So, we do need to return VM_FAULT_RETRY, maybe with another bit
> > indicating that the retry does not need to fall back to mmap_lock. Something
> > like "return VM_FAULT_RETRY | VM_FAULT_USE_VMA_LOCK".
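
To make the quoted proposal concrete, here is a minimal user-space sketch
(not the kernel's code) of how a caller of handle_mm_fault() could interpret
such a combined return value. VM_FAULT_USE_VMA_LOCK is the hypothetical bit
suggested above, and the numeric values are purely illustrative:

/*
 * Illustrative sketch only: VM_FAULT_USE_VMA_LOCK does not exist in the
 * kernel; it is the extra bit proposed in the discussion above.
 */
typedef unsigned int vm_fault_t;

#define VM_FAULT_RETRY        0x000400u   /* illustrative value          */
#define VM_FAULT_USE_VMA_LOCK 0x800000u   /* hypothetical proposed bit   */

enum retry_path { FAULT_DONE, RETRY_UNDER_VMA_LOCK, RETRY_UNDER_MMAP_LOCK };

static enum retry_path classify_fault(vm_fault_t ret)
{
        if (!(ret & VM_FAULT_RETRY))
                return FAULT_DONE;              /* handled, or a hard error  */
        if (ret & VM_FAULT_USE_VMA_LOCK)
                return RETRY_UNDER_VMA_LOCK;    /* VMA lock already dropped;
                                                   retry without mmap_lock   */
        return RETRY_UNDER_MMAP_LOCK;           /* today's fallback path     */
}

The point of the extra bit is that the caller's retry logic, not the fault
handler, would decide whether mmap_lock is needed on the second attempt.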
>
> False alarm. __collapse_huge_page_swapin() is always called under
> mmap_lock protection. I'll go over the code once more to make sure
> nothing else would use the VMA after we drop the VMA lock in the page
> fault path.

I posted a new series at
https://lore.kernel.org/all/20230501175025.36233-1-surenb@google.com/
It implements the suggestions discussed in this thread. Feedback is
appreciated!
Thanks!

>
> >
> > >
> > > >>
> > > >> > - We can call remove_device_exclusive_entry(). That calls
> > > >> >   folio_lock_or_retry(), which will fail if it can't get the VMA lock.
> > > >>
> > > >> Looks ok to me.
> > > >>
> > > >> > - We can call pgmap->ops->migrate_to_ram(). Perhaps somebody familiar
> > > >> >   with Nouveau and amdkfd could comment on how safe this is?
> > > >>
> > > >> Currently this won't work because drivers assume mmap_lock is held during
> > > >> pgmap->ops->migrate_to_ram(). Primarily this is because
> > > >> migrate_vma_setup()/migrate_vma_pages() is used to handle the fault and
> > > >> that asserts mmap_lock is taken in walk_page_range() and also in
> > > >> migrate_vma_insert_page().
> > > >>
> > > >> So I don't think we can call that case without mmap_lock.
> > > >>
> > > >> At a glance it seems it should be relatively easy to move to using
> > > >> lock_vma_under_rcu(). Drivers will need updating as well though because
> > > >> migrate_vma_setup() is called outside of fault handling paths, so drivers
> > > >> will currently take mmap_lock rather than the VMA lock when looking up the
> > > >> VMA. See for example nouveau_svmm_bind().
> > > >
> > > > Thanks for the pointers, Alistair! It does look like we need to be
> > > > more careful with the migrate_to_ram() path. For now I can fall back to
> > > > retrying with mmap_lock for this case, like we do with all cases
> > > > today. Afterwards this path can be made ready for working under the VMA
> > > > lock and we can remove that retry. Does that sound good?
> > >
> > > Sounds good to me. Fixing that shouldn't be too difficult but will need
> > > changes to at least Nouveau and amdkfd (and hmm-tests obviously). Happy
> > > to look at doing that if/when this change makes it in. Thanks.
> > >
> > > >>
> > > >> > - I believe we can't call handle_pte_marker() because we exclude UFFD
> > > >> >   VMAs earlier.
> > > >> > - We can call swap_readpage() if we allocate a new folio. I haven't
> > > >> >   traced through all this code to tell if it's OK.
> > > >> >
> > > >> > So ... I believe this is all OK, but we're definitely now willing to
> > > >> > wait for I/O from the swap device while holding the VMA lock when we
> > > >> > weren't before. And maybe we should make a bigger deal of it in the
> > > >> > changelog.
> > > >> >
> > > >> > And maybe we shouldn't just be failing the folio_lock_or_retry(),
> > > >> > maybe we should be waiting for the folio lock with the VMA locked.
> > > >
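
As a summary of the behaviour the patch at the top of the thread is after,
here is a small stand-alone sketch. The types and helpers below are
stand-ins for illustration only (the real path goes through do_swap_page()
and folio_lock_or_retry()); it models just the decision of when a swap
fault taken under the per-VMA lock should fall back to mmap_lock:

#include <stdbool.h>

/* Stand-in folio: only models whether its lock is currently held. */
struct folio_stub { bool locked; };

enum swap_fault_result {
        SWAP_FAULT_HANDLED,             /* completed under the per-VMA lock */
        SWAP_FAULT_RETRY_MMAP_LOCK,     /* caller retries under mmap_lock   */
};

static bool folio_trylock_stub(struct folio_stub *folio)
{
        if (folio->locked)
                return false;           /* someone else holds the folio lock */
        folio->locked = true;
        return true;
}

/*
 * Sketch of the retry policy the patch describes: under the per-VMA lock,
 * fall back to the mmap_lock path only when the folio cannot be locked
 * immediately, instead of for every swap fault.
 */
static enum swap_fault_result swap_fault_under_vma_lock(struct folio_stub *folio)
{
        if (!folio_trylock_stub(folio))
                return SWAP_FAULT_RETRY_MMAP_LOCK;      /* the only fallback case now */
        /* ... swap the page in while holding the folio lock ... */
        folio->locked = false;                          /* folio_unlock() equivalent  */
        return SWAP_FAULT_HANDLED;
}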