From: Linus Torvalds
Date: Mon, 28 Sep 2020 10:54:28 -0700
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned
To: Peter Xu
Cc: Jason Gunthorpe, Leon Romanovsky, John Hubbard, Linux-MM,
 Linux Kernel Mailing List, Andrew Morton, Jan Kara, Michal Hocko,
 Kirill Tkhai, Kirill Shutemov, Hugh Dickins, Christoph Hellwig,
 Andrea Arcangeli, Oleg Nesterov, Jann Horn
In-Reply-To: <20200928172256.GB59869@xz-x1>
References: <20200926004136.GJ9916@ziepe.ca> <20200927062337.GE2280698@unreal>
 <20200928124937.GN9916@ziepe.ca> <20200928172256.GB59869@xz-x1>
Content-Type: text/plain; charset="UTF-8"

On Mon, Sep 28, 2020 at 10:23 AM Peter Xu wrote:
>
> Yes... Actually I am also thinking about the complete solution to cover
> read-only fast-gups too, but now I start to doubt this, at least for the
> fork() path. E.g. if we'd finally like to use pte_protnone() to replace
> the current pte_wrprotect(), we'll be able to also block the read gups,
> but we'll suffer the same degradation on normal fork()s, or even more.
> Seems unacceptable.

So I think the real question about pinned read gups is what semantics
they should have. Because honestly, I think we have two options:

 - the current "it gets a shared copy from the page tables"

 - the "this is an exclusive pin, and it _will_ follow the source VM
   changes, and never break"

because honestly, if we get a shared copy at the time of the pinning
(like we do now), then "fork()" is entirely immaterial. The fork() can
have happened ages ago, that page is shared with other processes, and
any process writing to it - including very much the pinning one - will
cause a copy-on-write and get a private copy of the page.

IOW, the current - and past - semantics for read pinning are that you
get a copy of the page, but any changes made by the pinning process
may OR MAY NOT show up in your pinned copy.

Again: doing a concurrent fork() is entirely immaterial, because the
page can have been made a read-only COW page by _previous_ fork()
calls (or KSM logic or whatever).

In other words: read pinning gets a page efficiently, but there is
zero guarantee of any future coherence with the process doing
subsequent writes. That has always been the semantics, and FOLL_PIN
didn't change that at all.

You may have had things that worked almost by accident (ie you had
made the page private by writing to it after the fork, so the read
pinning _effectively_ gave you a page that was coherent), but even
that was always accidental rather than anything else. Afaik it could
easily be broken by KSM, for example.

In other words, a read pin isn't really any different from a read GUP:
you get a reference to a page that is valid at the time of the page
lookup, and absolutely nothing more.
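(To make that concrete: ordinary MAP_PRIVATE anonymous memory goes
through the same COW machinery, so the behavior is easy to see from
userspace. A minimal sketch - hypothetical demo code, not from this
thread; the sleep() is only a crude way to order the parent's write
before the child's read:)

  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <sys/types.h>
  #include <sys/wait.h>
  #include <unistd.h>

  int main(void)
  {
      /* One anonymous, private (i.e. COW-on-fork) page. */
      char *page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (page == MAP_FAILED) {
          perror("mmap");
          return 1;
      }
      strcpy(page, "before fork");

      pid_t pid = fork();
      if (pid == 0) {
          /* Child: shares the (now write-protected) page with the parent. */
          sleep(1);
          /* The parent's write below COWed onto the parent's own copy,
           * so the child still sees the contents of the old page. */
          printf("child sees:  \"%s\"\n", page);
          _exit(0);
      }

      /* Parent: the write faults, COW breaks, parent gets a private copy. */
      strcpy(page, "after fork");
      printf("parent sees: \"%s\"\n", page);
      wait(NULL);
      return 0;
  }

(A read pin taken before the write behaves like the child here: it
keeps a reference to the old page, and the writer simply diverges
onto its own copy.)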
Now, the alternative is to make a read pin have the same guarantees as
a write pin, and say "this will stay attached to this MM until unmap
or unpin".

But honestly, that is largely going to _be_ the same as a write pin,
because it absolutely needs to do a page COW at the time of the
pinning to get that initial exclusive guarantee in the first place.
Without that initial exclusivity, you cannot avoid future COW events
breaking the wrong way.

So I think the "you get a reference to the page at the time of the
pin, and the page _may_ or may not change under you if the original
process writes to it" semantics really are the only relevant ones.
Because if you need those exclusive semantics, you might as well just
use a write pin.

The downside of a write pin is that it not only makes that page
exclusive, it also (a) marks it dirty and (b) requires write access.

That can matter particularly for shared mappings. So if you know
you're doing the pin on a shared mmap, then a read pin is the right
thing, because the page will stay around - not because of the VM it
happens in, but because of the underlying file mapping!

See the difference?

> The other question is, whether we should emphasize and document somewhere
> that MADV_DONTFORK is still (and should always be) the preferred way,
> because changes like this series can potentially encourage the other way.

I really suspect that the concurrent fork() case is fundamentally hard
to handle.

Is it impossible? No. Even without any real locking, we could change
the code to use a seqcount_t, for example. The fast-gup code wouldn't
take a lock, but it would just fail and fall back to the slow path if
the sequence count check fails.

So the copy_page_range() code would do a write count around the copy:

        write_seqcount_begin(&mm->seq);
        .. do the copy ..
        write_seqcount_end(&mm->seq);

and the fast-gup code would do a

        seq = raw_read_seqcount(&mm->seq);
        if (seq & 1)
                return -EAGAIN;

at the top, and do a

        if (__read_seqcount_t_retry(&mm->seq, seq)) {
                .. Uhhuh, that failed, drop the ref to the page again ..
                return -EAGAIN;
        }

after getting the pin reference.

We could make this conditional on FOLL_PIN, or maybe even a new flag
("FOLL_FORK_CONSISTENT").

So I think we can serialize with fork() without serializing each and
every PTE. If we want to and really need to.

Hmm?

              Linus
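(For what it's worth, the retry pattern sketched above is easy to model
outside the kernel. Below is a hypothetical userspace toy - not the
kernel's seqcount_t API - using C11 atomics with default sequentially
consistent ordering: the writer stands in for copy_page_range() and the
reader for the lockless fast-gup path that bails out and retries when
it races with the copy.)

  #include <pthread.h>
  #include <stdatomic.h>
  #include <stdio.h>

  /* Toy even/odd sequence count: even = stable, odd = copy in progress.
   * Everything uses sequentially consistent atomics to keep the sketch
   * simple; the real seqcount_t uses finer-grained barriers. */
  static atomic_uint seq;
  static atomic_int val_a, val_b;      /* state the writer updates in pairs */

  static void *writer(void *arg)       /* plays copy_page_range()           */
  {
      (void)arg;
      for (int i = 1; i <= 200000; i++) {
          atomic_fetch_add(&seq, 1);   /* write_seqcount_begin(): odd       */
          atomic_store(&val_a, i);
          atomic_store(&val_b, -i);
          atomic_fetch_add(&seq, 1);   /* write_seqcount_end(): even again  */
      }
      return NULL;
  }

  static void *reader(void *arg)       /* plays the lockless fast-gup path  */
  {
      (void)arg;
      long fallbacks = 0;
      for (int i = 0; i < 200000; i++) {
          int a, b;
          unsigned int s;
          for (;;) {
              s = atomic_load(&seq);
              if (s & 1) {             /* writer active: the -EAGAIN case   */
                  fallbacks++;
                  continue;
              }
              a = atomic_load(&val_a);
              b = atomic_load(&val_b);
              if (atomic_load(&seq) == s)
                  break;               /* no writer overlapped the reads    */
              fallbacks++;             /* raced with the writer: retry      */
          }
          if (a != -b)
              printf("torn read: %d %d\n", a, b);    /* never triggers */
      }
      printf("reader ok, %ld retries\n", fallbacks);
      return NULL;
  }

  int main(void)
  {
      pthread_t w, r;
      pthread_create(&w, NULL, writer, NULL);
      pthread_create(&r, NULL, reader, NULL);
      pthread_join(w, NULL);
      pthread_join(r, NULL);
      return 0;
  }

(The point is the same one made above: the reader never takes a lock,
it only notices after the fact that a writer overlapped, and then
retries - or, in the fast-gup case, falls back to the slow path.)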