Date: Mon, 31 Jul 2023 12:10:31 -0400
From: Peter Xu <peterx@redhat.com>
To: David Hildenbrand
Cc: Linus Torvalds, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	linux-fsdevel@vger.kernel.org, Andrew Morton, liubo, Matthew Wilcox,
	Hugh Dickins, Jason Gunthorpe, John Hubbard, Mel Gorman
Subject: Re: [PATCH v1 0/4] smaps / mm/gup: fix gup_can_follow_protnone fallout
References: <20230727212845.135673-1-david@redhat.com>
	<412bb30f-0417-802c-3fc4-a4e9d5891c5d@redhat.com>
	<66e26ad5-982e-fe2a-e4cd-de0e552da0ca@redhat.com>
In-Reply-To: <66e26ad5-982e-fe2a-e4cd-de0e552da0ca@redhat.com>

On Sat, Jul 29, 2023 at 11:35:22AM +0200, David Hildenbrand wrote:
> On 29.07.23 00:35, David Hildenbrand wrote:
> > > The original reason for not setting FOLL_NUMA all the time is
> > > documented in commit 0b9d705297b2 ("mm: numa: Support NUMA hinting
> > > page faults from gup/gup_fast") from way back in 2012:
> > >
> > >  * If FOLL_FORCE and FOLL_NUMA are both set, handle_mm_fault
> > >  * would be called on PROT_NONE ranges. We must never invoke
> > >  * handle_mm_fault on PROT_NONE ranges or the NUMA hinting
> > >  * page faults would unprotect the PROT_NONE ranges if
> > >  * _PAGE_NUMA and _PAGE_PROTNONE are sharing the same pte/pmd
> > >  * bitflag. So to avoid that, don't set FOLL_NUMA if
> > >  * FOLL_FORCE is set.
> > >
> > > but I don't think the original reason for this is *true* any more.
> > >
> > > Because then two years later in 2014, in commit c46a7c817e66 ("x86:
> > > define _PAGE_NUMA by reusing software bits on the PMD and PTE levels")
> > > Mel made the code able to distinguish between PROT_NONE and NUMA
> > > pages, and he changed the comment above too.
> 
> Sleeping over it and looking into some nasty details, I realized the
> following things:
> 
> (1) The pte_protnone() comment in include/linux/pgtable.h is
>     either wrong or misleading.
> 
>     Especially the "For PROT_NONE VMAs, the PTEs are not marked
>     _PAGE_PROTNONE" is *wrong* nowadays on x86.
> 
>     Doing an mprotect(PROT_NONE) will also result in pte_protnone()
>     succeeding, because the pages *are* marked _PAGE_PROTNONE.
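
Right, and since _PAGE_PROTNONE covers both cases on x86, looking at the
VMA accessibility is what tells them apart.  Spelling that out in code
purely as an illustration (pte_is_numa_hint() is a made-up name here, not
an existing helper; the real logic is in your gup_can_follow_protnone()
further below):

static inline bool pte_is_numa_hint(struct vm_area_struct *vma, pte_t pte)
{
	/* Inaccessible (PROT_NONE) VMA: a protnone PTE is not NUMA hinting. */
	if (!vma_is_accessible(vma))
		return false;

	/* Accessible VMA: a protnone PTE can only come from NUMA hinting. */
	return pte_protnone(pte);
}
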
> 
>     The comment should be something like this
> 
>     /*
>      * In an inaccessible (PROT_NONE) VMA, pte_protnone() *may* indicate
>      * "yes". It is perfectly valid to indicate "no" in that case,
>      * which is why our default implementation defaults to "always no".
>      *
>      * In an accessible VMA, however, pte_protnone() *reliably*
>      * indicates PROT_NONE page protection due to NUMA hinting. NUMA
>      * hinting faults only apply in accessible VMAs.
>      *
>      * So, to reliably distinguish between PROT_NONE due to an
>      * inaccessible VMA and NUMA hinting, looking at the VMA
>      * accessibility is sufficient.
>      */
> 
>     I'll send that as a separate patch.
> 
> 
> (2) Consequently, commit c46a7c817e66 from 2014 does not tell the whole
>     story.
> 
>     commit 21d9ee3eda77 ("mm: remove remaining references to NUMA
>     hinting bits and helpers") from 2015 made the distinction again
>     impossible.
> 
>     Setting FOLL_FORCE | FOLL_HONOR_NUMA_HINT would end up never making
>     progress in GUP with an inaccessible (PROT_NONE) VMA.

If we also teach follow_page_mask() about vma_is_accessible(), we should
still be good, am I right?  Basically fast-gup will stop working on
protnone and always fall back to slow-gup.  Then it seems we're fine
decoupling FORCE from the NUMA hint.

I assume that's what you did below in the patch too, which looks right
to me.

>     (a) GUP sees the pte_protnone() and triggers a NUMA hinting fault,
>         although NUMA hinting does not apply.
> 
>     (b) handle_mm_fault() refuses to do anything with pte_protnone() in
>         an inaccessible VMA. And even if it would do something, the new
>         PTE would end up as pte_protnone() again.
>         So, GUP will keep retrying. I have a reproducer that triggers
>         that using ptrace read in an inaccessible VMA.
> 
>     It's easy to make that work in GUP, simply by looking at the VMA
>     accessibility.
> 
>     See my patch proposal, which cleanly decouples FOLL_FORCE from
>     FOLL_HONOR_NUMA_HINT.
> 
> 
> (3) follow_page() does not check VMA permissions and, therefore, my
>     "implied FOLL_FORCE" assumption is not actually completely wrong.
> 
>     And especially callers that don't pass FOLL_WRITE really expect
>     follow_page() to work even if the VMA is inaccessible.
> 
>     But the interaction with NUMA hinting is just nasty, absolutely
>     agreed.
> 
>     As raised in another comment, I'm looking into removing the
>     "foll_flags" parameter from follow_page() completely and cleanly
>     documenting the semantics of follow_page().
> 
>     IMHO, the less follow_page(), the better. Let's see what we can do
>     to improve that.
> 
> 
> So this would be the patch I would suggest as the first fix we can also
> backport to stable.
> 
> Gave it a quick test, also with my ptrace read reproducer (trigger
> FOLL_FORCE on inaccessible VMA; make sure it works and that the pages
> don't suddenly end up readable in the page table). Seems to work.
> 
> I'll follow up with cleanups and moving FOLL_HONOR_NUMA_HINT setting to
> the relevant callers (especially KVM). Further, I'll add a selftest to
> make sure that ptrace on inaccessible VMAs keeps working as expected.
> 
> 
> From 36c1aeb9aa3e859762f671776601a71179247d17 Mon Sep 17 00:00:00 2001
> From: David Hildenbrand
> Date: Fri, 28 Jul 2023 21:57:04 +0200
> Subject: [PATCH] mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT
> 
> As it turns out, unfortunately commit 474098edac26 ("mm/gup: replace
> FOLL_NUMA by gup_can_follow_protnone()") missed that follow_page() and
> follow_trans_huge_pmd() never set FOLL_NUMA because they really don't
> want NUMA hinting faults.
> 
> As spelled out in commit 0b9d705297b2 ("mm: numa: Support NUMA hinting
> page faults from gup/gup_fast"): "Other follow_page callers like KSM
> should not use FOLL_NUMA, or they would fail to get the pages if they
> use follow_page instead of get_user_pages."
> 
> While we didn't get BUG reports on the changed follow_page() semantics
> yet (and it's just a matter of time), liubo reported [1] that
> smaps_rollup results are imprecise, because they miss accounting of
> pages that are mapped PROT_NONE due to NUMA hinting.
> 
> As KVM really depends on these NUMA hinting faults, removing the
> pte_protnone()/pmd_protnone() handling in GUP code completely is not
> really an option.
> 
> To fix the issues at hand, let's revive FOLL_NUMA as
> FOLL_HONOR_NUMA_FAULT to restore the original behavior and add better
> comments.
> 
> Set FOLL_HONOR_NUMA_FAULT independent of FOLL_FORCE. To make that
> combination work in inaccessible VMAs, we have to perform proper VMA
> accessibility checks in gup_can_follow_protnone().
> 
> Move gup_can_follow_protnone() to internal.h which feels more
> appropriate and is required as long as FOLL_HONOR_NUMA_FAULT is an
> internal flag.
> 
> As Linus notes [2], this handling doesn't make sense for many GUP users.
> So we should really see if we instead let relevant GUP callers specify it
> manually, and not trigger NUMA hinting faults from GUP as default.
> 
> [1] https://lore.kernel.org/r/20230726073409.631838-1-liubo254@huawei.com
> [2] https://lore.kernel.org/r/CAHk-=wgRiP_9X0rRdZKT8nhemZGNateMtb366t37d8-x7VRs=g@mail.gmail.com
> 
> Reported-by: liubo
> Reported-by: Peter Xu
> Fixes: 474098edac26 ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()")
> Cc:
> Cc: Andrew Morton
> Cc: Linus Torvalds
> Cc: Peter Xu
> Cc: liubo
> Cc: Matthew Wilcox
> Cc: Hugh Dickins
> Cc: Jason Gunthorpe
> Cc: John Hubbard
> Signed-off-by: David Hildenbrand
> ---
>  include/linux/mm.h | 15 ---------------
>  mm/gup.c           | 18 ++++++++++++++----
>  mm/huge_memory.c   |  2 +-
>  mm/internal.h      | 31 +++++++++++++++++++++++++++++++
>  4 files changed, 46 insertions(+), 20 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 2dd73e4f3d8e..f8d7fa3c01c1 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3400,21 +3400,6 @@ static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags)
>  	return 0;
>  }
>  
> -/*
> - * Indicates whether GUP can follow a PROT_NONE mapped page, or whether
> - * a (NUMA hinting) fault is required.
> - */
> -static inline bool gup_can_follow_protnone(unsigned int flags)
> -{
> -	/*
> -	 * FOLL_FORCE has to be able to make progress even if the VMA is
> -	 * inaccessible. Further, FOLL_FORCE access usually does not represent
> -	 * application behaviour and we should avoid triggering NUMA hinting
> -	 * faults.
> -	 */
> -	return flags & FOLL_FORCE;
> -}
> -
>  typedef int (*pte_fn_t)(pte_t *pte, unsigned long addr, void *data);
>  extern int apply_to_page_range(struct mm_struct *mm, unsigned long address,
>  			       unsigned long size, pte_fn_t fn, void *data);
> diff --git a/mm/gup.c b/mm/gup.c
> index 76d222ccc3ff..54b8d77f3a3d 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -597,7 +597,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
>  	pte = ptep_get(ptep);
>  	if (!pte_present(pte))
>  		goto no_page;
> -	if (pte_protnone(pte) && !gup_can_follow_protnone(flags))
> +	if (pte_protnone(pte) && !gup_can_follow_protnone(vma, flags))
>  		goto no_page;
>  
>  	page = vm_normal_page(vma, address, pte);
> @@ -714,7 +714,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
>  	if (likely(!pmd_trans_huge(pmdval)))
>  		return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
>  
> -	if (pmd_protnone(pmdval) && !gup_can_follow_protnone(flags))
> +	if (pmd_protnone(pmdval) && !gup_can_follow_protnone(vma, flags))
>  		return no_page_table(vma, flags);
>  
>  	ptl = pmd_lock(mm, pmd);
> @@ -851,6 +851,10 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
>  	if (WARN_ON_ONCE(foll_flags & FOLL_PIN))
>  		return NULL;
>  
> +	/*
> +	 * We never set FOLL_HONOR_NUMA_FAULT because callers don't expect
> +	 * to fail on PROT_NONE-mapped pages.
> +	 */
>  	page = follow_page_mask(vma, address, foll_flags, &ctx);
>  	if (ctx.pgmap)
>  		put_dev_pagemap(ctx.pgmap);
> @@ -1200,6 +1204,12 @@ static long __get_user_pages(struct mm_struct *mm,
>  	VM_BUG_ON(!!pages != !!(gup_flags & (FOLL_GET | FOLL_PIN)));
>  
> +	/*
> +	 * For now, always trigger NUMA hinting faults. Some GUP users like
> +	 * KVM really require it to benefit from autonuma.
> +	 */
> +	gup_flags |= FOLL_HONOR_NUMA_FAULT;
> +
>  	do {
>  		struct page *page;
>  		unsigned int foll_flags = gup_flags;
> @@ -2551,7 +2561,7 @@ static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
>  		struct page *page;
>  		struct folio *folio;
>  
> -		if (pte_protnone(pte) && !gup_can_follow_protnone(flags))
> +		if (pte_protnone(pte) && !gup_can_follow_protnone(NULL, flags))
>  			goto pte_unmap;
>  
>  		if (!pte_access_permitted(pte, flags & FOLL_WRITE))
> @@ -2971,7 +2981,7 @@ static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned lo
>  		if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd) ||
>  			     pmd_devmap(pmd))) {
>  			if (pmd_protnone(pmd) &&
> -			    !gup_can_follow_protnone(flags))
> +			    !gup_can_follow_protnone(NULL, flags))
>  				return 0;
>  
>  			if (!gup_huge_pmd(pmd, pmdp, addr, next, flags,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index eb3678360b97..ef6bdc4a6fec 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1468,7 +1468,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
>  		return ERR_PTR(-EFAULT);
>  
>  	/* Full NUMA hinting faults to serialise migration in fault paths */
> -	if (pmd_protnone(*pmd) && !gup_can_follow_protnone(flags))
> +	if (pmd_protnone(*pmd) && !gup_can_follow_protnone(vma, flags))
>  		return NULL;
>  
>  	if (!pmd_write(*pmd) && gup_must_unshare(vma, flags, page))
> diff --git a/mm/internal.h b/mm/internal.h
> index a7d9e980429a..7db17259c51a 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -937,6 +937,8 @@ enum {
>  	FOLL_FAST_ONLY = 1 << 20,
>  	/* allow unlocking the mmap lock */
>  	FOLL_UNLOCKABLE = 1 << 21,
> +	/* Honor (trigger) NUMA hinting faults on PROT_NONE-mapped pages.
*/ > + FOLL_HONOR_NUMA_FAULT = 1 << 22, > }; > /* > @@ -1004,6 +1006,35 @@ static inline bool gup_must_unshare(struct vm_area_struct *vma, > return !PageAnonExclusive(page); > } > +/* > + * Indicates whether GUP can follow a PROT_NONE mapped page, or whether > + * a (NUMA hinting) fault is required. > + */ > +static inline bool gup_can_follow_protnone(struct vm_area_struct *vma, > + unsigned int flags) > +{ > + /* > + * If callers don't want to honor NUMA hinting faults, no need to > + * determine if we would actually have to trigger a NUMA hinting fault. > + */ > + if (!(flags & FOLL_HONOR_NUMA_FAULT)) > + return true; > + > + /* We really need the VMA ... */ > + if (!vma) > + return false; I'm not sure whether the compiler will be smart enough to inline this for fast-gup on pmd/pte. One way to guarantee this is we simply always bail out for fast-gup on protnone (ignoring calling gup_can_follow_protnone() with a comment), as discussed above. Then WARN_ON_ONCE(!vma) for all the rest callers, assuming that's a required knowledge to know what the protnone means. Thanks, > + > + /* > + * ... because NUMA hinting faults only apply in accessible VMAs. In > + * inaccessible (PROT_NONE) VMAs, NUMA hinting faults don't apply. > + * > + * Requiring a fault here even for inaccessible VMAs would mean that > + * FOLL_FORCE cannot make any progress, because handle_mm_fault() > + * refuses to process NUMA hinting faults in inaccessible VMAs. > + */ > + return !vma_is_accessible(vma); > +} > + > extern bool mirrored_kernelcore; > static inline bool vma_soft_dirty_enabled(struct vm_area_struct *vma) > -- > 2.41.0 > > > > -- > Cheers, > > David / dhildenb > -- Peter Xu