linux-mm.kvack.org archive mirror
From: David Hildenbrand <david@redhat.com>
To: John Hubbard <jhubbard@nvidia.com>, Jason Gunthorpe <jgg@ziepe.ca>
Cc: Andrea Arcangeli <aarcange@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Yu Zhao <yuzhao@google.com>, Andy Lutomirski <luto@kernel.org>,
	Peter Xu <peterx@redhat.com>, Pavel Emelyanov <xemul@openvz.org>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Mike Rapoport <rppt@linux.vnet.ibm.com>,
	Minchan Kim <minchan@kernel.org>, Will Deacon <will@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Hugh Dickins <hughd@google.com>,
	"Kirill A. Shutemov" <kirill@shutemov.name>,
	Matthew Wilcox <willy@infradead.org>,
	Oleg Nesterov <oleg@redhat.com>, Jann Horn <jannh@google.com>,
	Kees Cook <keescook@chromium.org>,
	Leon Romanovsky <leonro@nvidia.com>, Jan Kara <jack@suse.cz>,
	Kirill Tkhai <ktkhai@virtuozzo.com>,
	Nadav Amit <nadav.amit@gmail.com>, Jens Axboe <axboe@kernel.dk>
Subject: Re: [PATCH 0/1] mm: restore full accuracy in COW page reuse
Date: Sat, 16 Jan 2021 12:42:10 +0100
Message-ID: <b46a0eb9-f80f-e459-d31d-ed9074e38ede@redhat.com>
In-Reply-To: <cba836c6-4054-c966-8254-a11915351b6b@nvidia.com>

On 16.01.21 04:40, John Hubbard wrote:
> On 1/15/21 11:46 AM, David Hildenbrand wrote:
>>>> 7) There is no easy way to detect if a page really was pinned: we might
>>>> have false positives. Further, there is no way to distinguish if it was
>>>> pinned with FOLL_WRITE or not (R vs R/W). To perform reliable tracking
>>>> we most probably would need more counters, which we cannot fit into
>>>> struct page. (AFAIU, for huge pages it's easier).
>>>
>>> I think this is the real issue. We can only store so much information,
>>> so we have to decide which things work and which things are broken. So
>>> far someone hasn't presented a way to record everything at least..
>>
>> I do wonder how many (especially long-term) GUP readers/writers we have
>> to expect, and especially, support for a single base page. Do we have a
>> rough estimate?
>>
>> With RDMA, I would assume we only need a single one (e.g., one RDMA
>> device; I'm pretty sure I'm wrong, sounds too easy).
>> With VFIO I guess we need one for each VFIO container (~ in the worst
>> case one for each passthrough device).
>> With direct I/O, vmsplice and other GUP users ?? No idea.
>>
>> If we could somehow put a limit on the #GUP we support, and fail further
>> GUP (e.g., -EAGAIN?) once a limit is reached, we could partition the
>> refcount into something like (assume max #15 GUP READ and #15 GUP R/W,
>> which is most probably a horribly bad choice)
>>
>> [ GUP READ ][ GUP R/W ] [  ordinary ]
>> 31  ...  28 27  ...  24 23   ....   0
>>
>> But due to saturate handling in "ordinary", we would lose further 2 bits
>> (AFAIU), leaving us "only" 22 bits for "ordinary". Now, I have no idea
>> how many bits we actually need in practice.
>>
>> Maybe we need less for GUP READ, because most users want GUP R/W? No idea.
>>
>> Just wild ideas. Most probably that has already been discussed, and most
>> probably people figured that it's impossible :)
>>
> 
> I proposed this exact idea a few days ago [1]. It's remarkable that we both
> picked nearly identical values for the layout! :)

Heh! Somehow I missed that. But well, there were *a lot* of mails :)

> 
> But as the responses show, security problems prevent pursuing that approach.

It still feels kind of wrong to waste valuable space in the memmap.


In an ideal world (well, one that still only allows for a 64-byte
struct page in the memmap :) ), we would:

1) Partition the refcount into separate fields that cannot overflow into
each other, similar to my example above, but maybe add even more fields.

2) Reject attempts that would overflow any field other than the
"ordinary" field (e.g., the GUP fields in my example above).

3) Cap the "ordinary" field at the highest value we ever expect from
sane workloads (e.g., 10 bits). In addition, reserve some bits (like the
saturate logic) that we treat as a "red zone".

4) For the "ordinary" field, as soon as we enter the red zone, we know
we have an attack going on. We continue on paths that we cannot fail
(e.g., get_page()) but eventually try stopping the attacker(s). AFAIU,
we know the attacker(s) are something (e.g., one or multiple processes)
that has direct access to the page in their address space. Of course,
the more paths we can reject, the better.
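
To make 1) to 4) a bit more concrete, here is a minimal userspace sketch
in C11. It is not kernel code and not a proposal for struct page. The
type and helper names (struct pref, pref_try_pin(), pref_get(), ...) are
made up for illustration. The field layout is the one from my example
above, and the 10-bit "sane" limit is just the guess from 3). It also
does not model the exact saturate handling of the kernel, so the red
zone here is simply everything above the sane limit.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: bits 31..28 GUP READ, 27..24 GUP R/W, 23..0 ordinary. */
#define ORD_BITS      24u
#define ORD_MASK      ((1u << ORD_BITS) - 1u)
#define GUP_RW_SHIFT  24u
#define GUP_RD_SHIFT  28u
#define GUP_FIELD_MAX 15u
/* Assumed limit for sane workloads (10 bits); above this is the red zone. */
#define ORD_SANE_MAX  ((1u << 10) - 1u)

static_assert(GUP_RD_SHIFT + 4u == 32u, "fields must fill exactly 32 bits");

struct pref { _Atomic uint32_t v; };   /* hypothetical partitioned refcount */

static inline unsigned ord_count(uint32_t v)    { return v & ORD_MASK; }
static inline unsigned gup_rw_count(uint32_t v) { return (v >> GUP_RW_SHIFT) & 0xfu; }
static inline unsigned gup_rd_count(uint32_t v) { return (v >> GUP_RD_SHIFT) & 0xfu; }

/* 2) A GUP pin fails instead of overflowing into a neighbouring field. */
static inline bool pref_try_pin(struct pref *p, bool write)
{
    unsigned shift = write ? GUP_RW_SHIFT : GUP_RD_SHIFT;
    uint32_t old = atomic_load(&p->v);

    do {
        if (((old >> shift) & 0xfu) == GUP_FIELD_MAX)
            return false;              /* caller could map this to -EAGAIN */
    } while (!atomic_compare_exchange_weak(&p->v, &old, old + (1u << shift)));
    return true;
}

/* Drops one pin; assumes it is balanced with a successful pref_try_pin(). */
static inline void pref_unpin(struct pref *p, bool write)
{
    atomic_fetch_sub(&p->v, 1u << (write ? GUP_RW_SHIFT : GUP_RD_SHIFT));
}

/* A path that is allowed to fail never enters the red zone. */
static inline bool pref_try_get(struct pref *p)
{
    uint32_t old = atomic_load(&p->v);

    do {
        if (ord_count(old) >= ORD_SANE_MAX)
            return false;
    } while (!atomic_compare_exchange_weak(&p->v, &old, old + 1u));
    return true;
}

/*
 * 4) A path that cannot fail. Returns true if the ordinary field is in
 * the red zone afterwards, so the caller can start dealing with the
 * attacker. At the very top the field saturates instead of carrying
 * into the GUP R/W field (the reference is then effectively leaked).
 */
static inline bool pref_get(struct pref *p)
{
    uint32_t old = atomic_load(&p->v);

    for (;;) {
        if (ord_count(old) == ORD_MASK)
            return true;               /* saturated, never carries over */
        if (atomic_compare_exchange_weak(&p->v, &old, old + 1u))
            return ord_count(old) + 1u > ORD_SANE_MAX;
    }
}
```

With this layout the 16th pin of either kind is rejected, pref_try_get()
starts failing once 1023 ordinary references are held, and pref_get()
keeps working but reports the red zone from the 1024th reference on. What
the caller does with that information is exactly the open question.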


Now, we would:

a) Have to know what sane upper limits on the "ordinary" field are. I
have no idea which values we can expect. Attacker vs. sane workload.

b) Need a way to identify the attacker(s). In the simplest case, this is
a single process. In the hard case, this involves many processes.

c) Need a way to stop the attacker(s). Doing that out of random context
is problematic. The last resort is doing this asynchronously from another
thread, which leaves more time for the attacker to do harm.


Of course, the problem gets more involved as soon as we might have a
malicious child process that uses a page from a well-behaving parent
process for the attack.

If we kill the relevant processes, we might end up killing someone
who's not responsible. And even if we don't kill, but instead reject
try_get_page(), we might degrade the well-behaving parent process AFAIKS.

An alternative to killing the process might be unmapping the problematic
page from the address space.

Reminds me a little of handling memory errors for a page, eventually
killing all users of that page. mm/memory-failure.c:kill_procs().


Complicated problem :)

-- 
Thanks,

David / dhildenb
