From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 74DA7C52D7D for ; Fri, 16 Aug 2024 21:52:30 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id F08F68D00B4; Fri, 16 Aug 2024 17:52:29 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id EB90E8D00A2; Fri, 16 Aug 2024 17:52:29 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id D7FDF8D00B4; Fri, 16 Aug 2024 17:52:29 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17]) by kanga.kvack.org (Postfix) with ESMTP id B98428D00A2 for ; Fri, 16 Aug 2024 17:52:29 -0400 (EDT) Received: from smtpin29.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay05.hostedemail.com (Postfix) with ESMTP id 720204030E for ; Fri, 16 Aug 2024 21:52:29 +0000 (UTC) X-FDA: 82459458018.29.A76175E Received: from mail-yb1-f202.google.com (mail-yb1-f202.google.com [209.85.219.202]) by imf04.hostedemail.com (Postfix) with ESMTP id AE12440003 for ; Fri, 16 Aug 2024 21:52:27 +0000 (UTC) Authentication-Results: imf04.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=g58KRg+h; spf=pass (imf04.hostedemail.com: domain of 3Gsq_ZgsKCOkLNVPcWPjeYRRZZRWP.NZXWTYfi-XXVgLNV.ZcR@flex--ackerleytng.bounces.google.com designates 209.85.219.202 as permitted sender) smtp.mailfrom=3Gsq_ZgsKCOkLNVPcWPjeYRRZZRWP.NZXWTYfi-XXVgLNV.ZcR@flex--ackerleytng.bounces.google.com; dmarc=pass (policy=reject) header.from=google.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1723845134; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=b4vUtlej2XubFdEQblSjm6EcKT3Hh6Wknx4PIkCBxnA=; b=x5DcWlix4W72RhogNZUb6iYziiTBv/3FwNwo+APHSS/wM9vSyH2PgmIlXcr6yoWGEUIfBX wXIeajjn9BvHKvzCpQH7rbW9deAjWNuGwqX4+dDpf2G4Gxh2NE+jM2D2+9CudxTAEEjQ6B HOeTsBEya80g95NKRSUHEm7DygJSJw8= ARC-Authentication-Results: i=1; imf04.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=g58KRg+h; spf=pass (imf04.hostedemail.com: domain of 3Gsq_ZgsKCOkLNVPcWPjeYRRZZRWP.NZXWTYfi-XXVgLNV.ZcR@flex--ackerleytng.bounces.google.com designates 209.85.219.202 as permitted sender) smtp.mailfrom=3Gsq_ZgsKCOkLNVPcWPjeYRRZZRWP.NZXWTYfi-XXVgLNV.ZcR@flex--ackerleytng.bounces.google.com; dmarc=pass (policy=reject) header.from=google.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1723845134; a=rsa-sha256; cv=none; b=4Dg8ZJCGuPuyy95JuK3X+Qh6aVA4TCg0HTB4DROPoCvk9hsW4sHyH4/EYfG5ZHpDfsaUaa KL2qN/vlPlYdgRPaut8prh2rb2QGvPlH8gK1NpuuWNXnwAAxd0CdQfKEbeMyt2gRpEMTsY oz9rTLAOFFrja3UvW0R48mS2dRYWH8Y= Received: by mail-yb1-f202.google.com with SMTP id 3f1490d57ef6-e0bbd1ca079so4247283276.2 for ; Fri, 16 Aug 2024 14:52:27 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1723845147; x=1724449947; darn=kvack.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=b4vUtlej2XubFdEQblSjm6EcKT3Hh6Wknx4PIkCBxnA=; b=g58KRg+h8j0gy4ePYh2u+RI8juKckEuMvbNXMmuQCjlIfxIsimgXBY0sTtIw3HKQz0 33MzrDmG/6i228sXBFRVDb1rCs6C4S2TlxYA4H6UFtu37rp9hrTsXsf/bGQ7KdtimPW7 cNY2VDIdgBfz6upb7Ici4SucV+F/23TUtwXTCu9sX2LkLbiSAgv2IYglLFYl8DO1lo3h exDWwSdPWUAJqLpIcD5nWJmhZ07uLRILfmWBorynFX0lCqqIVF7JKnUQViLtk3tTq7Qv 44dQpnS5mJagObRl2zm8mer1s+kvFmky38qmOMXAPqOelHfrUpu8mFfsS7bFowgArzmi wLKQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1723845147; x=1724449947; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=b4vUtlej2XubFdEQblSjm6EcKT3Hh6Wknx4PIkCBxnA=; b=IpLVDlkiptnGufq7M5+VhFKtgUebgnBxxtnI5gYLB5UviPRS9c8P8G+pA/dXt4iP5d 9A+fVtNL8UwQLJ+39HmplrRZbkG+w7pxEqRbf6oS7Q514gm4rCSsoCCVcf5BuDLj4zLY 1QkAgqJK/uhS8e8raE4saV08gLcLwuKOLOWBBipxEaGZ4mrbtg3mJJdWc4V0ME8fGVBs 2C+K8/PRqqpgm8Y77rnseFz6T2HpGAIt/36mJ0oeKU5j5BOgYZG8OlMIZ2fz2RFqPFoT yNPpZ2NZwosXtgGH2LH6iwYjwnoNfhrWlFCzQN1dCY6HGce+PtQi6HzPK5h8Hj6ruRdy xJKA== X-Forwarded-Encrypted: i=1; AJvYcCW2bqnhwZBLYGWD9HW4inp5+9EUb5aP9AYvfc+Ud0L0n1OEQM1CYL/E3NHyo0MUrfZmtEtuQBkYUg==@kvack.org X-Gm-Message-State: AOJu0Yx3QftiscAfVmOmZF4S1bXdIccPDYr+uXfQ/Ta73GOuzdVrt1t0 0Z3xga5qLdQb3U1wHdASYW5iCc6zsiGqR3xLKuUeytvAqhrUddpZNs8ovfIAZHQ89qec/0yxeg/ p3zJrvR6NPAmL8Z2YAQ7FAw== X-Google-Smtp-Source: AGHT+IEs8zyvsoWuEvbphvAg5cMjHgbJYgPfPLSoGdtpOJDuLoPEAib3HGlcNl6V4/E+t5mIJiod+fTrhqt78lV+hg== X-Received: from ackerleytng-ctop.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:13f8]) (user=ackerleytng job=sendgmr) by 2002:a25:8751:0:b0:e11:6fcc:b656 with SMTP id 3f1490d57ef6-e1180e97369mr34695276.6.1723845146700; Fri, 16 Aug 2024 14:52:26 -0700 (PDT) Date: Fri, 16 Aug 2024 21:52:25 +0000 In-Reply-To: <94c5d735-821c-40ba-ae85-1881c6f4445d@redhat.com> Mime-Version: 1.0 References: <20240805-guest-memfd-lib-v1-0-e5a29a4ff5d7@quicinc.com> <20240805-guest-memfd-lib-v1-4-e5a29a4ff5d7@quicinc.com> <4cdd93ba-9019-4c12-a0e6-07b430980278@redhat.com> <94c5d735-821c-40ba-ae85-1881c6f4445d@redhat.com> Message-ID: Subject: Re: [PATCH RFC 4/4] mm: guest_memfd: Add ability for mmap'ing pages From: Ackerley Tng To: David Hildenbrand , Fuad Tabba Cc: Elliot Berman , Andrew Morton , Paolo Bonzini , Sean Christopherson , Patrick Roy , qperret@google.com, linux-coco@lists.linux.dev, linux-arm-msm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, kvm@vger.kernel.org Content-Type: text/plain; charset="UTF-8" X-Rspam-User: X-Stat-Signature: hs43xywnrf8wdizbdjdb833nsznzdis5 X-Rspamd-Queue-Id: AE12440003 X-Rspamd-Server: rspam11 X-HE-Tag: 1723845147-711966 X-HE-Meta: U2FsdGVkX1/NHjEt6PGEr+jirT8Mtse22dQIbd1OklNuGrNYd7AwqPzU3aw7dBjZRPnYLOvq/WK9x02HpOiTpYCN+C+vbbbNLg+9nb3C321O7k4Nmp4YzvlL+5Ej+XZHNF/3m6UmjGsQ2D+O4UBNoOTgCt/yrbVOqyy2IVlV5k8KDxOBhzULNRudxxpVfJ5eo9amA94YeuGYtP5Dy59NGJybL65sGMeeacXAwTtUHue8aknBMm/icv36pTqx9823U0bo510FVXVHoFtvaqhO3GJC3I2fYx1J0jVufPzmrva1TfI7Zup3ipJ6AAdfOpyMCSgc6TrtSF/VfPHrp7R0sKuKUazOYq5oozvfrtDbpFCFqS317LmKUvx8HqNSdaJqrkYUf4CJBom3QPNebOlnOQTNzj7RkOdfkQG7WPK54/qGAGxtLBlypVKE8HBaL4qj+YBGfGZs6Xp+uv1VF2GAuKE3h3OGEpEQlJfQswQJXL9AzdvAcQlqkArBTjQceonLK/zAX7qJbhA3RwTRRJd8+uFKCV1bBA31XXwYZmDakzdwV8GA2ozCwwThPduFXbBCWooaJ+cF+spjgKwnWLmZ9J2xYpy5Az9lrtOHre3jSWkW1IgMa+vSe0Lw4YQPM+UzyTKcpV+PUo/6ePs+6v5ul2pVXJqJLooYX4/jBi9q1v0Ue4gtCL23ZOhY5RzXTs/uwIYnsXrcvsDXEBePZaSxhhGlFN1+Ld/nrjoiiYMacZcUfkvr8Axcub5GGYfYtCrus51/NJEmG2QdFLYm475peswIO8Fb1tT0QPSVwDAC1SpMHQ6MOebqnp72rS4fbzwwqcW6uxuN1Rau1cUHQPDgkbmWMTrQlmlQtqkjk/BWuCqkAnSU4m/N23ZjvX+UHsXd/oJQYTD65AaH/KDE7MaEsqw099uqsXfoDNiidRGTuX2+TYa7Mb0a0L5DOv8twiXXCGIW1b6TtlGJR2tYX5C BFpsJEyS yI5z6uMdM+baI2a9OyBJbbfcv0B5hMmzhnD7YSiWx0juPSKNa6NF/YS3gBwExvCUzFSeophEo4HbkLJGO3ROlnS1SvLJ5UR9eQ/Kki2T0S/zqJGoy1SiiepnWbCopgocAkAUNxgvhv1vGxUHQQfdhH0/kiczOQKc8E/DF3UDQLfyj4+NvsZCt80HU6hlw1NzGBOPH5yeOQDRnuaZ8ARRETBNlNgguv3zcGCas6GV+G3UMwUXrlY8vWhjyuZSM+pRWJI6EniO7T5DUVdwHBCnn/ezyzVkNEtkq1wW6gDFkZsCCwZo9ab2uP6YkYP4UxI/F/xlAty4Gx6mxTjw9tlZcW/Z5TH/9U8Y4kVszbEYl+logQ1afxWZlSuf+an1HRhacCTBejLarMsfmYDnac85XECUF0r5qV1ynCvlW4PJG8fYLex8qw6gaAfGbIsRopUfqjicJ1ET5Hg2hh1AXByoQOTF1weLbqezvT8bhFUEEJbNvCbWEnJdeNvJo0O0aOX88ArC2zIVoxbRJYuzcESU4zKphl0+m7ugfA1JtsItCgBkJTrcuxzKcjJ+/a2jPrB5XZrpaz6biSDWyInq0TS4G3DgUtw== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000001, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: David Hildenbrand writes: > On 16.08.24 19:45, Ackerley Tng wrote: >> >> >> >> IIUC folio_lock() isn't a prerequisite for taking a refcount on the >> folio. > > Right, to do folio_lock() you only have to guarantee that the folio > cannot get freed concurrently. So you piggyback on another reference > (you hold indirectly). > >> >> Even if we are able to figure out a "safe" refcount, and check that the >> current refcount == "safe" refcount before removing from direct map, >> what's stopping some other part of the kernel from taking a refcount >> just after the check happens and causing trouble with the folio's >> removal from direct map? > > Once the page was unmapped from user space, and there were no additional > references (e.g., GUP, whatever), any new references can only be > (should, unless BUG :) ) temporary speculative references that should > not try accessing page content, and that should back off if the folio is > not deemed interesting or cannot be locked. (e.g., page > migration/compaction/offlining). I thought about it again - I think the vmsplice() cases are taken care of once we check that the folios are not mapped into userspace, since vmsplice() reads from a mapping. splice() reads from the fd directly, but that's taken care since guest_memfd doesn't have a .splice_read() handler. Reading /proc/pid/mem also requires the pages to first be mapped, IIUC, otherwise the pages won't show up, so checking that there are no more mappings to userspace takes care of this. > > Of course, there are some corner cases (kgdb, hibernation, /proc/kcore), > but most of these can be dealt with in one way or the other (make these > back off and not read/write page content, similar to how we handled it > for secretmem). Does that really leave us with these corner cases? And so perhaps we could get away with just taking the folio_lock() to keep away the speculative references? So something like 1. Check that the folio is not mapped and not pinned. 2. folio_lock() all the folios about to be removed from direct map -- With the lock, all other accesses should be speculative -- 3. Check that the refcount == "safe" refcount 3a. Unlock and return to userspace with -EAGAIN 4. Remove from direct map 5. folio_unlock() all those folios Perhaps a very naive question: can the "safe" refcount be statically determined by walking through the code and counting where refcount is expected to be incremented? Or perhaps the "safe" refcount may differ based on kernel config. Could we perhaps have a single static variable safe_refcount, and whenever a new guest_memfd folio is allocated, do safe_refcount = min(new_folio_refcount, safe_refcount) > > These (kgdb, /proc/kcore) might not even take a folio reference, they > just "access stuff" and we only have to teach them to "not access that". > >> >>> (noting that also folio_maybe_dma_pinned() can have false positives in >>> some cases due to speculative references or *many* references). >> >> Are false positives (speculative references) okay since it's better to >> be safe than remove from direct map prematurely? > > folio_maybe_dma_pinned() is primarily used in fork context. Copying more > (if the folio maybe pinned and, therefore, must not get COW-shared with > other processes and must instead create a private page copy) is the > "better safe than sorry". So false positives (that happen rarely) are > tolerable. > > Regading the directmap, it would -- just like with additional references > -- detect that the page cannot currently be removed from the direct map. > It's similarly "better safe than sorry", but here means that we likely > must retry if we cannot easily fallback to something else like for the > fork+COW case. > > -- > Cheers, > > David / dhildenb