From: Linus Torvalds
Date: Mon, 28 Sep 2020 10:54:28 -0700
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned
To: Peter Xu
Cc: Jason Gunthorpe, Leon Romanovsky, John Hubbard, Linux-MM,
 Linux Kernel Mailing List, Andrew Morton, Jan Kara, Michal Hocko,
 Kirill Tkhai, Kirill Shutemov, Hugh Dickins, Christoph Hellwig,
 Andrea Arcangeli, Oleg Nesterov, Jann Horn
In-Reply-To: <20200928172256.GB59869@xz-x1>
References: <20200926004136.GJ9916@ziepe.ca> <20200927062337.GE2280698@unreal>
 <20200928124937.GN9916@ziepe.ca> <20200928172256.GB59869@xz-x1>
Content-Type: text/plain; charset="UTF-8"

On Mon, Sep 28, 2020 at 10:23 AM Peter Xu wrote:
>
> Yes... Actually I am also thinking about the complete solution to cover
> read-only fast-gups too, but now I start to doubt this, at least for the
> fork() path. E.g. if we'd finally like to use pte_protnone() to replace
> the current pte_wrprotect(), we'll be able to also block the read gups,
> but we'll suffer the same degradation on normal fork()s, or even more.
> Seems unacceptable.

So I think the real question about pinned read gups is what semantics
they should have. Because honestly, I think we have two options:

 - the current "it gets a shared copy from the page tables"

 - the "this is an exclusive pin, and it _will_ follow the source VM
   changes, and never break"

because honestly, if we get a shared copy at the time of the pinning
(like we do now), then "fork()" is entirely immaterial. The fork() can
have happened ages ago, that page is shared with other processes, and
any process writing to it - including very much the pinning one - will
cause a copy-on-write and get a private copy of the page.

IOW, the current - and past - semantics for read pinning are that you
get a copy of the page, but any changes made by the pinning process
may OR MAY NOT show up in your pinned copy.

Again: doing a concurrent fork() is entirely immaterial, because the
page can have been made a read-only COW page by _previous_ fork()
calls (or KSM logic or whatever).

In other words: read pinning gets a page efficiently, but there is
zero guarantee of any future coherence with the process doing
subsequent writes. That has always been the semantics, and FOLL_PIN
didn't change that at all.

You may have had things that worked almost by accident (ie you had
made the page private by writing to it after the fork, so the read
pinning _effectively_ gave you a page that was coherent), but even
that was always accidental rather than anything else. Afaik it could
easily be broken by KSM, for example.

In other words, a read pin isn't really any different from a read GUP:
you get a reference to a page that is valid at the time of the page
lookup, and absolutely nothing more.
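(To make that concrete: ordinary MAP_PRIVATE anonymous memory goes
through the same COW machinery, so the behavior is easy to see from
userspace. A minimal sketch - hypothetical demo code, not from this
thread; the sleep() is only a crude way to order the parent's write
before the child's read:)

  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <sys/types.h>
  #include <sys/wait.h>
  #include <unistd.h>

  int main(void)
  {
      /* One anonymous, private (i.e. COW-on-fork) page. */
      char *page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (page == MAP_FAILED) {
          perror("mmap");
          return 1;
      }
      strcpy(page, "before fork");

      pid_t pid = fork();
      if (pid == 0) {
          /* Child: shares the (now write-protected) page with the parent. */
          sleep(1);
          /* The parent's write below COWed onto the parent's own copy,
           * so the child still sees the contents of the old page. */
          printf("child sees:  \"%s\"\n", page);
          _exit(0);
      }

      /* Parent: the write faults, COW breaks, parent gets a private copy. */
      strcpy(page, "after fork");
      printf("parent sees: \"%s\"\n", page);
      wait(NULL);
      return 0;
  }

(A read pin taken before the write behaves like the child here: it
keeps a reference to the old page, and the writer simply diverges
onto its own copy.)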
Now, the alternative is to make a read pin have the same guarantees as
a write pin, and say "this will stay attached to this MM until unmap
or unpin".

But honestly, that is largely going to _be_ the same as a write pin,
because it absolutely needs to do a page COW at the time of the
pinning to get that initial exclusive guarantee in the first place.
Without that initial exclusivity, you cannot avoid future COW events
breaking the wrong way.

So I think the "you get a reference to the page at the time of the
pin, and the page _may_ or may not change under you if the original
process writes to it" semantics really are the only relevant ones.
Because if you need those exclusive semantics, you might as well just
use a write pin.

The downside of a write pin is that it not only makes that page
exclusive, it also (a) marks it dirty and (b) requires write access.

That can matter particularly for shared mappings. So if you know
you're doing the pin on a shared mmap, then a read pin is the right
thing, because the page will stay around - not because of the VM it
happens in, but because of the underlying file mapping!

See the difference?

> The other question is, whether we should emphasize and document somewhere
> that MADV_DONTFORK is still (and should always be) the preferred way,
> because changes like this series can potentially encourage the other way.

I really suspect that the concurrent fork() case is fundamentally hard
to handle.

Is it impossible? No. Even without any real locking, we could change
the code to use a seqcount_t, for example. The fast-gup code wouldn't
take a lock, but it would just fail and fall back to the slow path if
the sequence count check fails.

So the copy_page_range() code would do a write count around the copy:

        write_seqcount_begin(&mm->seq);
        .. do the copy ..
        write_seqcount_end(&mm->seq);

and the fast-gup code would do a

        seq = raw_read_seqcount(&mm->seq);
        if (seq & 1)
                return -EAGAIN;

at the top, and do a

        if (__read_seqcount_t_retry(&mm->seq, seq)) {
                .. Uhhuh, that failed, drop the ref to the page again ..
                return -EAGAIN;
        }

after getting the pin reference.

We could make this conditional on FOLL_PIN, or maybe even a new flag
("FOLL_FORK_CONSISTENT").

So I think we can serialize with fork() without serializing each and
every PTE. If we want to and really need to.

Hmm?

              Linus
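(For what it's worth, the retry pattern sketched above is easy to model
outside the kernel. Below is a hypothetical userspace toy - not the
kernel's seqcount_t API - using C11 atomics with default sequentially
consistent ordering: the writer stands in for copy_page_range() and the
reader for the lockless fast-gup path that bails out and retries when
it races with the copy.)

  #include <pthread.h>
  #include <stdatomic.h>
  #include <stdio.h>

  /* Toy even/odd sequence count: even = stable, odd = copy in progress.
   * Everything uses sequentially consistent atomics to keep the sketch
   * simple; the real seqcount_t uses finer-grained barriers. */
  static atomic_uint seq;
  static atomic_int val_a, val_b;      /* state the writer updates in pairs */

  static void *writer(void *arg)       /* plays copy_page_range()           */
  {
      (void)arg;
      for (int i = 1; i <= 200000; i++) {
          atomic_fetch_add(&seq, 1);   /* write_seqcount_begin(): odd       */
          atomic_store(&val_a, i);
          atomic_store(&val_b, -i);
          atomic_fetch_add(&seq, 1);   /* write_seqcount_end(): even again  */
      }
      return NULL;
  }

  static void *reader(void *arg)       /* plays the lockless fast-gup path  */
  {
      (void)arg;
      long fallbacks = 0;
      for (int i = 0; i < 200000; i++) {
          int a, b;
          unsigned int s;
          for (;;) {
              s = atomic_load(&seq);
              if (s & 1) {             /* writer active: the -EAGAIN case   */
                  fallbacks++;
                  continue;
              }
              a = atomic_load(&val_a);
              b = atomic_load(&val_b);
              if (atomic_load(&seq) == s)
                  break;               /* no writer overlapped the reads    */
              fallbacks++;             /* raced with the writer: retry      */
          }
          if (a != -b)
              printf("torn read: %d %d\n", a, b);    /* never triggers */
      }
      printf("reader ok, %ld retries\n", fallbacks);
      return NULL;
  }

  int main(void)
  {
      pthread_t w, r;
      pthread_create(&w, NULL, writer, NULL);
      pthread_create(&r, NULL, reader, NULL);
      pthread_join(w, NULL);
      pthread_join(r, NULL);
      return 0;
  }

(The point is the same one made above: the reader never takes a lock,
it only notices after the fact that a writer overlapped, and then
retries - or, in the fast-gup case, falls back to the slow path.)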