Re: [PATCH v1 0/4] smaps / mm/gup: fix gup_can_follow_protnone fallout

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: David Hildenbrand <david@redhat.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Xu <peterx@redhat.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	linux-fsdevel@vger.kernel.org,
	Andrew Morton <akpm@linux-foundation.org>,
	liubo <liubo254@huawei.com>, Matthew Wilcox <willy@infradead.org>,
	Hugh Dickins <hughd@google.com>, Jason Gunthorpe <jgg@ziepe.ca>,
	John Hubbard <jhubbard@nvidia.com>, Mel Gorman <mgorman@suse.de>
Subject: Re: [PATCH v1 0/4] smaps / mm/gup: fix gup_can_follow_protnone fallout
Date: Mon, 31 Jul 2023 21:00:06 +0200	[thread overview]
Message-ID: <a453d403-fc96-e4a0-71ee-c61d527e70da@redhat.com> (raw)
In-Reply-To: <CAHk-=wiREarX5MQx9AppxPzV6jXCCQRs5KVKgHoGYwATRL6nPg@mail.gmail.com>

On 31.07.23 20:23, Linus Torvalds wrote:
> On Mon, 31 Jul 2023 at 09:20, David Hildenbrand <david@redhat.com> wrote:
>>

Hi Linus,

>> I modified it slightly: FOLL_HONOR_NUMA_FAULT is now set in
>> is_valid_gup_args(), such that it will always be set for any GUP users,
>> including GUP-fast.
> 
> But do we actually want that? It is actively crazy to honor NUMA
> faulting at least for get_user_pages_remote().

This would only be for the stable backport that would go in first and 
where I want to be a bit careful.

Next step would be to let the callers (KVM) specify 
FOLL_HONOR_NUMA_FAULT, as suggested by you.

> 
> So right now, GUP-fast requires us to honor NUMA faults, because
> GUP-fast doesn't have a vma (which in turn is because GUP-fast doesn't
> take any locks).

With FOLL_HONOR_NUMA_FAULT moved to the GUP caller that would no longer 
be the case.

Anybody who

(1) doesn't specify FOLL_HONOR_NUMA_FAULT, which is the majority
(2) doesn't specify FOLL_WRITE

Would get GUP-fast just grabbing these pte_protnone() pages.

> 
> So GUP-fast can only look at the page table data, and as such *has* to
> fail if the page table is inaccessible.

gup_fast_only, yes, which is what KVM uses if a writable PFN is desired.

> 
> But GUP in general? Why would it want to honor numa faulting?
> Particularly by default, and _particularly_ for things like
> FOLL_REMOTE.

KVM currently does [virt/kvm/kvm_main.c]:

(1) hva_to_pfn_fast(): call get_user_page_fast_only(FOLL_WRITE) if a
     writable PFN is desired
(2) hva_to_pfn_slow(): call get_user_pages_unlocked()

So in the "!writable" case, we would always call 
get_user_pages_unlocked() and never honor NUMA faults.

Converting that to some other pattern might be possible (although KVM 
plays quite some tricks here!), but assuming we would always first do a 
get_user_page_fast_only(), then when not intending to write (!FOLL_WRITE)

(1) get_user_page_fast_only() would honor NUMA faults and fail
(2) get_user_pages() would not honor NUMA faults and succeed

Hmmm ... so we would have to use get_user_pages_fast()? It might be 
possible, but I am not sure if we want get_user_pages_fast() to always 
honor NUMA faults, because ...

> 
> In fact, I feel like this is what the real rule should be: we simply
> define that get_user_pages_fast() is about looking up the page in the
> page tables.
> 
> So if you want something that acts like a page table lookup, you use
> that "fast" thing.  It's literally how it is designed. The whole - and
> pretty much only - point of it is that it can be used with no locking
> at all, because it basically acts like the hardware lookup does.
> 

... I see what you mean (HW would similarly refuse to use such a page), 
but I do wonder if that makes the API clearer and if this is what we 
actually want.

We do have callers of pin_user_pages_fast() and friends that maybe 
*really* shouldn't care about NUMA hinting. 
iov_iter_extract_user_pages() is one example -- used for O_DIRECT nowadays.

Their logic is "if it's directly in the page table, create, hand it 
over. If not, please go the slow path.". In many cases user space just 
touched these pages so they are very likely in the page table.

Converting them to pin_user_pages() would mean they will just run slower 
in the common case.

Converting them to a manual pin_user_pages_fast_only() + 
pin_user_pages() doesn't seem very compelling.

... so we would need a new API? :/

> So then if KVM wants to look up a page in the page table, that is what
> kvm should use, and it automatically gets the "honor numa faults"
> behavior, not because it sets a magic flag, but simply because that is
> how GUP-fast *works*.
> 
> But if you use the "normal" get/pin_user_pages() function, which looks
> up the vma, at that point you are following things at a "software
> level", and it wouldn't do NUMA faulting, it would just get the page.

My main problem with that is that pin_user_pages_fast() and friends are 
used all over the place for a "likely already in the page table case, so 
just make everything faster as default".

Always honoring NUMA faults here does not sound like the improvement we 
wanted to have :) ... we actually *don't* want to honor NUMA faults here.

We just have to find a way to make the KVM special case happy.

Thanks!

-- 
Cheers,

David / dhildenb

next prev parent reply	other threads:[~2023-07-31 19:00 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-07-27 21:28 David Hildenbrand
2023-07-27 21:28 ` [PATCH v1 2/4] mm/gup: Make follow_page() succeed again on PROT_NONE PTEs/PMDs David Hildenbrand
2023-07-28  2:30   ` John Hubbard
2023-07-28  9:08     ` David Hildenbrand
2023-07-28 10:12       ` David Hildenbrand
2023-07-27 21:28 ` [PATCH v1 3/4] smaps: use vm_normal_page_pmd() instead of follow_trans_huge_pmd() David Hildenbrand
2023-07-27 21:28 ` [PATCH v1 4/4] mm/gup: document FOLL_FORCE behavior David Hildenbrand
2023-07-28 16:18 ` [PATCH v1 0/4] smaps / mm/gup: fix gup_can_follow_protnone fallout Linus Torvalds
2023-07-28 17:30   ` David Hildenbrand
2023-07-28 17:54     ` David Hildenbrand
2023-07-28 19:40     ` David Hildenbrand
2023-07-28 19:50       ` Peter Xu
2023-07-28 20:00         ` David Hildenbrand
2023-08-02 10:24     ` Mel Gorman
2023-07-28 19:39   ` Peter Xu
2023-07-28 19:52     ` David Hildenbrand
2023-07-28 20:23     ` Linus Torvalds
2023-07-28 20:33       ` David Hildenbrand
2023-07-28 20:50         ` Linus Torvalds
2023-07-28 21:02           ` David Hildenbrand
2023-07-28 21:20             ` Peter Xu
2023-07-28 21:31               ` David Hildenbrand
2023-07-28 22:14                 ` Jason Gunthorpe
2023-07-31 16:01                   ` Peter Xu
2023-07-28 21:32               ` John Hubbard
2023-07-28 21:49                 ` Peter Xu
2023-07-28 22:00                   ` John Hubbard
2023-07-31 16:05                     ` Peter Xu
     [not found]   ` <412bb30f-0417-802c-3fc4-a4e9d5891c5d@redhat.com>
2023-07-29  9:35     ` David Hildenbrand
2023-07-31 16:10       ` Peter Xu
2023-07-31 16:20         ` David Hildenbrand
2023-07-31 18:23           ` Linus Torvalds
2023-07-31 18:51             ` Peter Xu
2023-07-31 19:00             ` David Hildenbrand [this message]
2023-07-31 19:07               ` Linus Torvalds
2023-07-31 19:22                 ` David Hildenbrand
2023-08-01 13:05               ` Jason Gunthorpe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=a453d403-fc96-e4a0-71ee-c61d527e70da@redhat.com \
    --to=david@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=hughd@google.com \
    --cc=jgg@ziepe.ca \
    --cc=jhubbard@nvidia.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=liubo254@huawei.com \
    --cc=mgorman@suse.de \
    --cc=peterx@redhat.com \
    --cc=torvalds@linux-foundation.org \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox