Date: Tue, 16 Jul 2024 09:03:00 -0700
Subject: Re: [PATCH RFC 0/5] mm/gup: Introduce exclusive GUP pinning
From: Sean Christopherson
To: Ackerley Tng
In-Reply-To: <20240712232937.2861788-1-ackerleytng@google.com>
References: <20240618-exclusive-gup-v1-0-30472a19c5d1@quicinc.com> <20240712232937.2861788-1-ackerleytng@google.com>
Cc: quic_eberman@quicinc.com, akpm@linux-foundation.org, david@redhat.com, kvm@vger.kernel.org, linux-arm-msm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org, linux-mm@kvack.org, maz@kernel.org, pbonzini@redhat.com, shuah@kernel.org, tabba@google.com, willy@infradead.org, vannapurve@google.com, hch@infradead.org, jgg@nvidia.com, rientjes@google.com, jhubbard@nvidia.com, qperret@google.com,
smostafa@google.com, fvdl@google.com, hughd@google.com
Thanks for doing the dirty work!

On Fri, Jul 12, 2024, Ackerley Tng wrote:
> Here’s an update from the Linux MM Alignment Session on July 10 2024, 9-10am
> PDT:
>
> The current direction is:
>
> + Allow mmap() of ranges that cover both shared and private memory, but disallow
>   faulting in of private pages
>   + On access to private pages, userspace will get some error, perhaps SIGBUS
>   + On shared to private conversions, unmap the page and decrease refcounts

Note, I would strike the "decrease refcounts" part, as putting references is a
natural consequence of unmapping memory, not an explicit action guest_memfd
will take when converting from shared=>private.

And more importantly, guest_memfd will wait for the refcount to hit zero (or
whatever the baseline refcount is).

> + To support huge pages, guest_memfd will take ownership of the hugepages, and
>   provide interested parties (userspace, KVM, iommu) with pages to be used.
>   + guest_memfd will track usage of (sub)pages, for both private and shared
>     memory
>   + Pages will be broken into smaller (probably 4K) chunks at creation time to
>     simplify implementation (as opposed to splitting at runtime when private to
>     shared conversion is requested by the guest)

FWIW, I doubt we'll ever release a version with mmap()+guest_memfd support that
shatters pages at creation. I can see it being an intermediate step, e.g. to
prove correctness and provide a bisection point, but shattering hugepages at
creation would effectively make hugepage support useless.

I don't think we need to sort this out now though, as when the shattering (and
potential reconstitution) occurs doesn't affect the overall direction in any way
(AFAIK).
I'm chiming in purely to stave off complaints that this would break hugepage
support :-)

> + Core MM infrastructure will still be used to track page table mappings in
>   mapcounts and other references (refcounts) per subpage
> + HugeTLB vmemmap Optimization (HVO) is lost when pages are broken up - to
>   be optimized later. Suggestions:
>   + Use a tracking data structure other than struct page
>   + Remove the memory for struct pages backing private memory from the
>     vmemmap, and re-populate the vmemmap on conversion from private to
>     shared
> + Implementation pointers for huge page support
>   + Consensus was that getting core MM to do tracking seems wrong
>   + Maintaining special page refcounts for guest_memfd pages is difficult to
>     get working and requires weird special casing in many places. This was
>     tried for FS DAX pages and did not work out: [1]
>
> + Implementation suggestion: use infrastructure similar to what ZONE_DEVICE
>   uses, to provide the huge page to interested parties
>   + TBD: how to actually get huge pages into guest_memfd
>   + TBD: how to provide/convert the huge pages to ZONE_DEVICE
>     + Perhaps reserve them at boot time like in HugeTLB
>
> + Line of sight to compaction/migration:
>   + Compaction here means making memory contiguous
>   + Compaction/migration scope:
>     + In scope for 4K pages
>     + Out of scope for 1G pages and anything managed through ZONE_DEVICE
>     + Out of scope for an initial implementation
>   + Ideas for future implementations
>     + Reuse the non-LRU page migration framework as used by memory ballooning
>     + Have userspace drive compaction/migration via ioctls
>   + Having line of sight to optimizing lost HVO means avoiding being locked
>     in to any implementation requiring struct pages
>     + Without struct pages, it is hard to reuse core MM’s
>       compaction/migration infrastructure
>
> + Discuss more details at LPC in Sep 2024, such as how to use huge pages,
>   shared/private conversion, huge page splitting
>
> This addresses the prerequisites set out by Fuad and Elliott at the beginning of
> the session, which were:
>
> 1. Non-destructive shared/private conversion
>    + Through having guest_memfd manage and track both shared/private memory
> 2. Huge page support with the option of converting individual subpages
>    + Splitting of pages will be managed by guest_memfd
> 3. Line of sight to compaction/migration of private memory
>    + Possibly driven by userspace using guest_memfd ioctls
> 4. Loading binaries into guest (private) memory before VM starts
>    + This was identified as a special case of (1.) above
> 5. Non-protected guests in pKVM
>    + Not discussed during session, but this is a goal of guest_memfd, for all VM
>      types [2]
>
> David Hildenbrand summarized this during the meeting at t=47m25s [3].
>
> [1]: https://lore.kernel.org/linux-mm/cover.66009f59a7fe77320d413011386c3ae5c2ee82eb.1719386613.git-series.apopple@nvidia.com/
> [2]: https://lore.kernel.org/lkml/ZnRMn1ObU8TFrms3@google.com/
> [3]: https://drive.google.com/file/d/17lruFrde2XWs6B1jaTrAy9gjv08FnJ45/view?t=47m25s&resourcekey=0-LiteoxLd5f4fKoPRMjMTOw
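For anyone skimming the thread, here is a toy userspace model (Python, all names
hypothetical, not kernel code and not the actual guest_memfd API) of the
semantics discussed above: every 4K chunk is tracked as shared or private,
faulting a private page fails (the real plan is SIGBUS to userspace), unmapping
puts the mapping's reference as a side effect, and shared=>private conversion
completes only once the refcount is back at the baseline.

```python
BASELINE_REFCOUNT = 1  # guest_memfd's own long-term reference (assumption)


class PrivateFault(Exception):
    """Stand-in for the SIGBUS userspace would receive on a private page."""


class GuestMemfdModel:
    def __init__(self, npages):
        # All pages start private, holding only the baseline reference.
        self.state = ["private"] * npages
        self.refcount = [BASELINE_REFCOUNT] * npages

    def fault(self, pfn):
        # mmap() may cover the whole range, but only shared pages fault in.
        if self.state[pfn] == "private":
            raise PrivateFault(f"page {pfn} is private")
        self.refcount[pfn] += 1  # the new mapping takes a reference
        return pfn

    def unmap(self, pfn):
        # Putting the reference is a natural consequence of unmapping,
        # not a separate step of the conversion.
        if self.refcount[pfn] > BASELINE_REFCOUNT:
            self.refcount[pfn] -= 1

    def convert_to_private(self, pfn):
        # Unmap, then "wait" until only the baseline reference remains;
        # the real implementation would block on transient pins here.
        self.unmap(pfn)
        assert self.refcount[pfn] == BASELINE_REFCOUNT, "outstanding pins"
        self.state[pfn] = "private"

    def convert_to_shared(self, pfn):
        self.state[pfn] = "shared"
```

This is only meant to make the ordering argument concrete: the refcount drop
happens in unmap(), and convert_to_private() merely waits for the baseline.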