From: David Hildenbrand <david@redhat.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Hugh Dickins <hughd@google.com>,
	David Rientjes <rientjes@google.com>,
	Shakeel Butt <shakeelb@google.com>,
	John Hubbard <jhubbard@nvidia.com>,
	Jason Gunthorpe <jgg@nvidia.com>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Mike Rapoport <rppt@linux.ibm.com>,
	Yang Shi <shy828301@gmail.com>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	Matthew Wilcox <willy@infradead.org>,
	Vlastimil Babka <vbabka@suse.cz>, Jann Horn <jannh@google.com>,
	Michal Hocko <mhocko@kernel.org>, Nadav Amit <namit@vmware.com>,
	Rik van Riel <riel@surriel.com>, Roman Gushchin <guro@fb.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Peter Xu <peterx@redhat.com>, Donald Dutile <ddutile@redhat.com>,
	Christoph Hellwig <hch@lst.de>, Oleg Nesterov <oleg@redhat.com>,
	Jan Kara <jack@suse.cz>, Linux-MM <linux-mm@kvack.org>,
	"open list:KERNEL SELFTEST FRAMEWORK"
	<linux-kselftest@vger.kernel.org>,
	"open list:DOCUMENTATION" <linux-doc@vger.kernel.org>
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)
Date: Sat, 18 Dec 2021 10:57:54 +0100
Message-ID: <40e7e0ab-0828-b2e7-339f-35f68a228b3d@redhat.com>
In-Reply-To: <CAHk-=wjvoTRSb87R-D50yOXqX4mshjiiAyurAKCsdW0_J+sf7A@mail.gmail.com>

On 18.12.21 00:20, Linus Torvalds wrote:
> On Fri, Dec 17, 2021 at 2:43 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> The pages stay PageAnon(). swap-backed pages simply set a bit IIRC.
>> mapcount still applies.
> 
> Our code-base is too large for me to remember all the details, but if
> we still end up having PageAnon for swapbacked pages, then mapcount
> can increase from another process faulting in a pte with that swap
> entry.

"Our code-base is too large for me to remember all the details". I
second that.

You might have a valid point with the mapcount regarding concurrent
swapin in the current code. I'll have to think further about whether
that could be a problem and whether it can be handled without heavy
synchronization (I think the concern is that GUP unsharing could miss
doing an unshare because, when seeing mapcount == 1, it doesn't detect
that there are other page sharers that are expressed not in the
mapcount but via the swap code).
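
To make that concern concrete, here is a toy userspace model (not kernel
code; struct toy_page, its fields and the helper names are all made up
for illustration, and it simply assumes that swap sharers can be counted
at all). It only shows how a check that looks at the mapcount alone can
see mapcount == 1 while another process still holds a swap entry for the
same page:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model, not a kernel structure. */
struct toy_page {
	int mapcount;	/* page tables that map the page right now */
	int swap_refs;	/* swap entries in other page tables for this page */
};

/* What a mapcount-only check concludes. */
bool shared_by_mapcount(const struct toy_page *p)
{
	return p->mapcount > 1;
}

/* What we would actually have to know. */
bool shared_for_real(const struct toy_page *p)
{
	return p->mapcount + p->swap_refs > 1;
}

/* Another process faults on its swap entry: the entry becomes a mapping. */
void toy_swapin(struct toy_page *p)
{
	if (p->swap_refs > 0) {
		p->swap_refs--;
		p->mapcount++;
	}
}

/* Walk through the race; each assert documents the state at that step. */
void toy_swapin_demo(void)
{
	struct toy_page p = { .mapcount = 1, .swap_refs = 1 };

	/* GUP looks now: the mapcount says "exclusive", reality disagrees. */
	assert(!shared_by_mapcount(&p));
	assert(shared_for_real(&p));

	/* Only after the concurrent swapin does the mapcount show it. */
	toy_swapin(&p);
	assert(shared_by_mapcount(&p));
}
```

Calling toy_swapin_demo() runs through the sequence; the window between
the first check and toy_swapin() is the part that would need handling.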

Do you have any other concerns about the semantics/stability of the
following points? (As discussed, fork() is not the issue because it can
be handled via write_protect_seq or something comparable; handling
per-process thingies is not the problem.)

a) Using PageAnon(): It cannot possibly change in the pagefault path or
   in the gup-fast-only path (otherwise there would be use-after-free
   already).
b) Using PageKsm(): It cannot possibly change in the pagefault path or
   in the gup-fast path (otherwise there would be use-after-free
   already).
c) Using mapcount: It cannot possibly change due to fork() in a way we
   care about but cannot detect (mapcount going from == 1 to > 1
   concurrently) in the pagefault path or in the gup-fast path.

Your point for c) is that we might currently not handle swap
correctly. Any other concerns, especially regarding the mapcount, or is
that it?


IIUC, any GUP approach to detect necessary unsharing would at least
require a check for a) and b). What we're arguing about is c).
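
For reference, the "handled via write_protect_seq or something
comparable" part can be sketched roughly like this in userspace with C11
atomics. struct toy_mm and all toy_* helpers are hypothetical names, and
the real seqcount primitives come with their own barriers and lockdep
handling that this sketch does not model:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-mm sequence counter. */
struct toy_mm {
	atomic_uint write_protect_seq;	/* odd while fork() write-protects */
};

/* fork() side: bump the counter before and after write-protecting. */
void toy_fork_begin(struct toy_mm *mm)
{
	atomic_fetch_add(&mm->write_protect_seq, 1);
}

void toy_fork_end(struct toy_mm *mm)
{
	atomic_fetch_add(&mm->write_protect_seq, 1);
}

/* Lockless side: sample the counter before looking at the page. */
bool toy_fast_begin(struct toy_mm *mm, unsigned int *seq)
{
	*seq = atomic_load(&mm->write_protect_seq);
	return !(*seq & 1);	/* false: fork() in progress, use slow path */
}

/* Lockless side: the checks in between only count if nothing moved. */
bool toy_fast_end(struct toy_mm *mm, unsigned int seq)
{
	return atomic_load(&mm->write_protect_seq) == seq;
}

void toy_seq_demo(void)
{
	struct toy_mm mm;
	unsigned int seq;

	atomic_init(&mm.write_protect_seq, 0);

	/* No fork() around: the fast-path result can be trusted. */
	assert(toy_fast_begin(&mm, &seq));
	assert(toy_fast_end(&mm, seq));

	/* fork() runs in the middle: the result must be discarded. */
	assert(toy_fast_begin(&mm, &seq));
	toy_fork_begin(&mm);
	toy_fork_end(&mm);
	assert(!toy_fast_end(&mm, seq));

	/* fork() still in progress: do not even start. */
	toy_fork_begin(&mm);
	assert(!toy_fast_begin(&mm, &seq));
	toy_fork_end(&mm);
}
```

The PageAnon()/PageKsm()/mapcount checks would sit between
toy_fast_begin() and toy_fast_end(); whenever either returns false, the
lockless path gives up and the slow path under mmap_lock decides.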

> 
> And mmap_sem doesn't protect against that. Again, page_lock() does.
> 
> And taking the page lock was a big performance issue.
> 
> One of the reasons that new COW handling is so nice is that you can do
> things like
> 
>                 if (!trylock_page(page))
>                         goto copy;
> 
> exactly because in the a/b world order, the copy case is always safe.
> 
> In your model, as far as I can tell, you leave the page read-only and
> a subsequent COW fault _can_ happen, which means that now the
> subsequent COW needs to be very, very careful, because if it ever
> copies a page that was GUP'ed, you just broke the rules.
> 
> So COWing too much is a bug (because it breaks the page from the GUP),
> but COWing too little is an even worse problem (because it means that
> now the GUP user can see data it shouldn't have seen).

Good summary, I'll extend below.

> 
> Our old code literally COWed too little. It's why all those changes
> happened in the first place.

Let's see if we can agree on some things to get a common understanding.


What can happen with COW is:

1) Missed COW

We miss a COW, therefore someone has access to a wrong page.

This is the security issue as in patch #11, documented in [1].

2) Unnecessary COW

We do a COW, but there are no other valid users, so it's just overhead +
noise.

The performance issue documented in section 5 in [1].

3) Wrong COW

We do a COW but there are other valid users (-> GUP).

The memory corruption issue documented in section 2 and 3 in [1].

Most notably, the io_uring reproducer [2], which races with the
page_maybe_dma_pinned() check in current code, can trigger this easily,
and exactly this issue is what gives me nightmares.
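
Spelled out as a tiny decision table (a sketch of my reading of the
three cases; the enum and classify_cow() are made-up names, and the
three booleans are assumed to be known exactly, which is of course the
hard part in practice):

```c
#include <assert.h>
#include <stdbool.h>

enum cow_outcome {
	COW_OK,			/* the decision matched reality */
	COW_MISSED,		/* 1) no copy although someone else shares */
	COW_UNNECESSARY,	/* 2) copy although nobody else uses the page */
	COW_WRONG,		/* 3) copy although a GUP user relies on it */
};

/*
 * copied:        did the write fault copy the page?
 * other_sharers: do other processes share the page?
 * gup_users:     does a GUP user of this process rely on the page?
 */
enum cow_outcome classify_cow(bool copied, bool other_sharers,
			      bool gup_users)
{
	if (!copied)
		return other_sharers ? COW_MISSED : COW_OK;
	if (gup_users)
		return COW_WRONG;
	return other_sharers ? COW_OK : COW_UNNECESSARY;
}

void classify_cow_demo(void)
{
	assert(classify_cow(false, true, false) == COW_MISSED);
	assert(classify_cow(true, false, false) == COW_UNNECESSARY);
	assert(classify_cow(true, false, true) == COW_WRONG);
	assert(classify_cow(true, true, false) == COW_OK);
	assert(classify_cow(false, false, true) == COW_OK);
}
```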


Does that make sense? If we agree on the above, then here is how the
currently discussed approaches differ:

page_count != 1:
* 1) cannot happen
* 2) can happen easily (speculative references due to pagecache,
     migration, daemon, pagevec, ...)
* 3) can happen in the current code

mapcount > 1:
* 1) your concern is that this can happen due to concurrent swapin
* 2) cannot happen.
* 3) your concern is that this can happen due to concurrent swapin


If we can agree on that: I can see why you dislike mapcount. Can you
see why I dislike page_count?

Ideally we'd really have a fast and reliable check for "is this page
shared and could get used by multiple processes -- either multiple
processes are already mapping it R/O or could map it via the swap R/O
later".


> This is why I'm pushing that whole story line of
> 
>  (1) COW is based purely on refcounting, because that's the only thing
> that obviously can never COW too little.

I am completely missing how 2) or 3) could *ever* be handled properly
for page_count != 1. 3) is obviously more important and gives me nightmares.


And that's what I'm trying to communicate the whole time: page_count is
absolutely fragile, because anything that results in a page getting
mapped R/O into a page table can trigger 3). And as [2] proves, that can
even happen with *swap*.

(see how we're running into the same swap issues with both approaches?
Stupid swap :) )

> 
>  (2) GUP pre-COWs (the thing I called the "(a)" rule earlier) and then
> makes sure to not mark pinned pages COW again (that "(b)" rule).
> 
> and here "don't use page_mapcount()" really is about that (1).
> 
> You do seem to have kept (1) in that your COW rules don't seem to
> change (but maybe I missed it), but because your GUP-vs-COW semantics
> are very different indeed, I'm not at all convinced about (2).

Oh yes, sorry, not in the context of this series. The point is that the
current page_count != 1 covers mapcount > 1, so we can adjust that
separately later.


You mentioned "design", so let's assume we have a nice function:

/*
 * Check if an anon page is shared or exclusively used by a single
 * process: if shared, the page is shared by multiple processes either
 * mapping the page R/O ("active sharing") or having swap entries that
 * could result in the page getting mapped R/O ("inactive sharing").
 *
 * This function is safe to be called under mmap_lock in read/write mode
 * because it prevents concurrent fork() sharing the page.
 * This function is safe to be called from gup-fast-only in IRQ context,
 * as it detects concurrent fork() sharing the page.
 */
bool page_anon_shared(struct page *page);


Can we agree that that would be a suitable function for (1) and (2)
instead of using either the page_count or the mapcount directly? (Yes,
how to actually make it reliable due to swapin is to be discussed, but
it might be a problem worth solving if that's the way to go.)

For hugetlb, this would really have to use the mapcount as explained
(after all, fortunately there is no swap ...).
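
As a sketch of how (1) and (2) could both sit on top of such a function,
again as a toy userspace model: struct toy_page and the toy_* helpers
are hypothetical, the shared check just assumes that mappings and swap
entries can be counted reliably, and treating KSM pages as always shared
is my assumption for this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model, not a kernel structure. */
struct toy_page {
	bool anon;	/* stands in for PageAnon() */
	bool ksm;	/* stands in for PageKsm() */
	int mapcount;	/* "active sharing": R/O mappings */
	int swap_refs;	/* "inactive sharing": swap entries */
};

/* Toy version of the proposed check; assumes both counts are reliable. */
bool toy_page_anon_shared(const struct toy_page *p)
{
	return p->mapcount + p->swap_refs > 1;
}

/* (1) COW side: does a write fault on the R/O page have to copy? */
bool toy_cow_must_copy(const struct toy_page *p)
{
	return p->ksm || toy_page_anon_shared(p);
}

/* (2) GUP side: does the R/O page need unsharing before taking a ref? */
bool toy_gup_must_unshare(const struct toy_page *p)
{
	if (!p->anon)
		return false;
	return p->ksm || toy_page_anon_shared(p);
}

void toy_anon_shared_demo(void)
{
	struct toy_page excl = { .anon = true, .mapcount = 1 };
	struct toy_page mapped = { .anon = true, .mapcount = 2 };
	struct toy_page swapped = { .anon = true, .mapcount = 1,
				    .swap_refs = 1 };
	struct toy_page ksm = { .anon = true, .ksm = true, .mapcount = 1 };

	/* Exclusive page: GUP takes it as is, a later write reuses it. */
	assert(!toy_gup_must_unshare(&excl));
	assert(!toy_cow_must_copy(&excl));

	/* Shared in any form: GUP unshares first, COW would copy. */
	assert(toy_gup_must_unshare(&mapped) && toy_cow_must_copy(&mapped));
	assert(toy_gup_must_unshare(&swapped) && toy_cow_must_copy(&swapped));
	assert(toy_gup_must_unshare(&ksm) && toy_cow_must_copy(&ksm));
}
```

The point of using one predicate on both sides is that, for anon pages,
GUP and COW cannot disagree: a page that GUP took without unsharing is
exactly a page that a later write fault would reuse instead of copy.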



[1]
https://lore.kernel.org/all/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com/

[2]
https://gitlab.com/aarcange/kernel-testcases-for-v5.11/-/blob/main/io_uring_swap.c
-- 
Thanks,

David / dhildenb



