Date: Sat, 29 May 2021 13:15:43 -0700 (PDT)
From: Hugh Dickins
To: Linus Torvalds
cc: "Lin, Ming", Hugh Dickins, Simon Ser, Peter Xu, "Kirill A. Shutemov",
    Matthew Wilcox, Dan Williams, "Kirill A. Shutemov", Will Deacon,
    Linux Kernel Mailing List, David Herrmann, "linux-mm@kvack.org",
    Greg Kroah-Hartman, "tytso@mit.edu"
Subject: Re: Sealed memfd & no-fault mmap

On Sat, 29 May 2021, Linus Torvalds wrote:
> On Fri, May 28, 2021 at 9:31 PM Lin, Ming wrote:
> >
> > I should check the vma is not writable.
> >
> > -       if (!IS_NOFAULT(inode))
> > +       if (!IS_NOFAULT(inode) || (vma->vm_flags & VM_WRITE))
> >                 return -EINVAL;
>
> That might be enough, yes.
>
> But if this is sufficient for the compositor needs, and the rule is
> that this only works for read-only mappings, then I think the flag in
> the inode becomes the wrong thing to do.
>
> Because if it's a read-only mapping, and we only ever care about
> inserting zero pages into the page tables - and they never become part
> of the shared memory region itself, then it really is purely about
> that mmap, not about the shm inode.
>
> So then it really does become purely about one particular mmap, and it
> really should be a "madvise()" issue, not a "mark inode as no-fault".

Yes, madvise or mmap flag: the recipient of this fd ought not to be
(even capable of) interfering with other maps of the shared object.

And IIUC it would have to be the recipient (Wayland compositor) doing
the NOFAULT business, because (going back to the original mail) we are
only considering this so that Wayland might satisfy clients who predate
or refuse Linux-only APIs.  So, an ioctl (or fcntl, as sealing chose)
at the client end cannot be expected; and could not be relied on anyway.

>
> I'd almost be inclined to just add a new "flags" field to the vma.
> We've been running out of vma flags for a long time, to the point that
> some of them are only available on 64-bit architectures.
>
> I get the feeling that we should just bite the bullet and make
> "vm_flags" be an u64. Or possibly make it two explicitly 32-bit
> entities (vm_flags and vm_extra). Get rid of the silly 64-bit-only
> "high" flags, and get rid of our artificial "we don't have enough
> bits".

u64 saves messing around in the vma_merge() area,
which has to consider whether adjacent vm_flags are identical.

>
> Because we already in practice *do* have enough bits, we've just
> artificially limited ourselves to "on 32-bit architectures we only
> have 32 bits in that field".

Yes, that artificial limitation to 32-bit has been silly all along.

>
> But all of this is very much dependent on that "this NOFAULT case
> really only works for reads, not for writes".
>
> (Alternatively, we could allow the *mapping* itself to be writable,
> but always fault on writes, and only insert a read-only zero page)

NOFAULT?
Does BSD use "fault" differently, and in Linux terms we would say
NOSIGBUS to mean the same?

Can someone point to a specification of BSD's __MAP_NOFAULT?
Searching just found me references to bugs.

What mainly worries me about the suggestion is: what happens to the
zero page inserted into NOFAULT mappings, when later a page for that
offset is created and added to page cache?

Treating it as an opaque blob of zeroes, that stays there ever after,
hiding the subsequent data: easy to implement, but a hack that we would
probably regret.  (And I notice that even the quote from David Herrmann
in the original post allows for the possibility that client may want to
expand the object.)

I believe the correct behaviour would be to unmap the nofault page
then, allowing the proper page to be faulted in after.  That is
certainly doable (the old mm/filemap_xip.c used to do so), but might
get into some awkward race territory, with filesystem dependence
(reminiscent of hole punch, in reverse).

shmem could operate that way, and be the better for it: but I wouldn't
want to add that, without also cleaning away all the shmem_recalc_inode()
stuff.

Hugh
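
For anyone wanting to see today's behaviour that the discussion above
starts from, here is a minimal userspace sketch (not kernel code, and not
anyone's proposed patch).  It assumes Linux with glibc's memfd_create();
the helper names read_raises_sigbus() and probe_memfd() are made up for
illustration.  It maps two pages of a one-page memfd read-only, checks
that reading the second page raises SIGBUS, then grows the object and
writes to it, and checks that the same mapping now shows the new data.
That second step is the case a permanently sticky zero page would hide.

```c
#define _GNU_SOURCE
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf sigbus_jmp;

/* SIGBUS handler: jump back to the probe instead of killing the process. */
static void on_sigbus(int sig)
{
	(void)sig;
	siglongjmp(sigbus_jmp, 1);
}

/* Read one byte at addr; return 1 if the read raised SIGBUS, else 0. */
static int read_raises_sigbus(const volatile char *addr)
{
	struct sigaction sa, old;
	int faulted;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_sigbus;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGBUS, &sa, &old);

	/* savemask=1 so the signal mask is restored after the jump. */
	if (sigsetjmp(sigbus_jmp, 1) == 0) {
		(void)*addr;		/* volatile read, may fault */
		faulted = 0;
	} else {
		faulted = 1;
	}
	sigaction(SIGBUS, &old, NULL);
	return faulted;
}

/*
 * Returns 0 on success and fills in:
 *   *before: 1 if reading page 1 faulted while the object was one page long
 *   *after:  the byte read from page 1 once the object was grown and written
 */
static int probe_memfd(int *before, char *after)
{
	long page = sysconf(_SC_PAGESIZE);
	int fd = memfd_create("nofault-demo", MFD_CLOEXEC);
	char *map;

	if (fd < 0)
		return -1;
	if (ftruncate(fd, page) < 0)
		goto fail;

	/* The "recipient" maps more than the object currently holds. */
	map = mmap(NULL, 2 * page, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto fail;

	*before = read_raises_sigbus(map + page);

	/* The "client" grows the object and stores data in the new page. */
	if (ftruncate(fd, 2 * page) < 0 || pwrite(fd, "X", 1, page) != 1) {
		munmap(map, 2 * page);
		goto fail;
	}
	*after = read_raises_sigbus(map + page) ? 0 : map[page];

	munmap(map, 2 * page);
	close(fd);
	return 0;
fail:
	close(fd);
	return -1;
}
```

Calling probe_memfd(&before, &after) from a small main() should report a
fault for the first read and the written byte for the second.  A no-fault
(or NOSIGBUS) mapping would change only the first result; whether the
second result stays as it is now is exactly the open question above.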