From: Axel Rasmussen
Date: Thu, 30 May 2024 09:59:44 -0700
Subject: Re: [RFC] Huge remap_pfn_range for vfio-pci
To: Peter Xu
Cc: Jason Gunthorpe, David Hildenbrand, Sean Christopherson, Linux MM, Alex Williamson

On Fri, May 24, 2024 at 4:31 PM Peter Xu wrote:
>
> On Fri, May 24, 2024 at 01:54:20PM -0700, Axel Rasmussen wrote:
> > Hi,
>
> Hi, Axel,
>
> >
> > I'm interested in extending remap_pfn_range to allow it to map the
> > range hugely (using PUDs or PMDs).
> > The initial user I have in mind is vfio-pci; I'm thinking when we're
> > mapping large ranges for GPUs, we can get both a performance and host
> > overhead win by doing this hugely.
> >
> > Another thing I have in the back of my mind is adding something KVM
> > can re-use to simplify its whole host_pfn_mapping_level /
> > hva_to_pfn_remapped / get_user_page_fast_only thing.
>
> IIUC kvm should be prepared for it, as host_pfn_mapping_level() can detect
> any huge mappings using the *_leaf() apis.

Right, the KVM code works as is. Sean had been suggesting, though, that
if follow_pte() (or its replacement) returned the level, and had an
option to work locklessly, KVM could just re-use it and delete some
code. I think we could also avoid doing two page table walks (once for
follow_pte, and once to determine the level).

Then again, I think it is somewhat debatable what exactly such an API
would look like, or whether it would be too KVM-specific to expose
generally.

>
> >
> > I know Peter and David are working on some related things (hugetlbfs
> > unification and follow_pte et al improvements, respectively). Although
> > I have a hacky proof of concept that works, I thought it best to get
> > some consensus on the design before I post something, so I don't
> > conflict with this existing / upcoming work.
>
> Yes we're working on that, mostly with Alex. There's a testing branch but
> half baked so far:
>
> https://github.com/xzpeter/linux/commits/huge-pfnmap/

Ah, I hadn't been aware of this; it looks like you're already well on
your way to implementing exactly what I was thinking of. :)

In that case I'll mostly plan on trying out this branch and offering any
feedback / fixes I find. It would be counterproductive to spend time
building my own implementation.

>
> >
> > Changing remap_pfn_range to install PUDs or PMDs is straightforward.
> > The hairy part is the fault / follow side of things:
>
> I'm surprised you thought about the fault() path, even if Alex just
> officially proposed it yesterday. Maybe you followed the previous
> discussions. It's here:
>
> https://lore.kernel.org/r/20240523195629.218043-1-alex.williamson@redhat.com
>
> >
> > 1. follow_pte clearly doesn't work for this, since the leaf might be a
> > PUD or PMD instead. Most callers don't care about the PTE itself, they
> > care about the pgprot or flags it has set, so my idea was to add a new
> > interface which just yields those bits, instead of the actual PTE.
>
> See:
>
> https://github.com/xzpeter/linux/commit/2cb4702418a1b740129fc7b379b52e16e57032e1

Ah! Thanks for the pointer. This is relatively close to what I had in
mind.

>
> >
> > Peter, I think hugetlbfs unification may run into similar issues, do
> > you have some plan already to deal with PUD/PMD/PTE being different
> > types?
>
> Exactly. There'll be some shared work between the two projects on fork(),
> mprotect, etc. And yes I plan to cover them all but I'll start with the
> pfnmap thing, paving way for hugetlb, while we have Oscar (from SUSE kernel
> team) working concurrently on other paths of hugetlb.
>
> >
> > 2. vfio-pci relies on vm_ops->fault. This is a problem because the
> > normal fault handler path doesn't call this until after it has walked
> > down to the PTE level, installing PUDs/PMDs along the way. I have only
> > gross ideas for how to deal with this:
> >
> > - Add a VM_HUGEPFNMAP VMA flag indicating vm_ops->fault should be
> > called earlier in __handle_mm_fault
> > - Add a vm_ops->hugepfn_fault (name not important) which should be
> > called earlier in __handle_mm_fault
> > - Go ahead and let remap_pfn_range overwrite existing PUDs/PMDs
>
> I actually don't know what exactly you meant here, but Alex already worked
> on that with huge_fault().
> See:
>
> https://github.com/awilliam/linux-vfio/commit/ec6c970f8374f91df0ebfe180cd388ba31187942
>
> So far I don't yet understand why we need a new vma flag.

Ah, I had discounted huge_fault(), thinking it was specific to hugetlbfs
or THPs. I should have spent more time reading that code; I agree it
looks like it avoids all of what I'm talking about here. :)

>
> >
> > I wonder which of these folks find least offensive? Or is there a
> > better way I haven't thought of?
> >
> > 3. That's also an issue for CoW faults, but I don't know of any real
> > use case for CoW huge pfn mappings, so I thought we can just keep the
> > existing small mapping behavior for CoW VMAs. Any objections?
>
> I think we should keep the pud/pmd transparent, so that the old pte
> behavior needs to be maintained. E.g., I think we'll need to be able to
> split a pud/pmd mapping if mprotect() partially.

I had been thinking of ensuring we never had pud/pmds in CoW mappings,
but using huge_fault() might make my worry go away entirely. I
completely agree we should allow vfio mappings to be mixed size, though:
in case things aren't quite aligned (due to a mprotect split or any
other reason), we can still have a mostly-huge mapping with some ptes
on the end(s).

>
> Thanks,
>
> --
> Peter Xu
>
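
To make the "mostly-huge mapping with some ptes on the end(s)" case
concrete, here is a rough userspace model of how a range might be split
across levels. It is only a sketch, not kernel code and not what the
huge-pfnmap branch does: the names plan_mapping() and summarize() are
hypothetical, and the sizes assume x86-64 with 4 KiB base pages (2 MiB
PMDs, 1 GiB PUDs). The one rule it encodes is that a huge leaf is only
usable when the virtual address and the physical address are both
aligned to that leaf size and the leaf still fits in the range.

```python
# Userspace model only; not kernel code.
# Assumed sizes: x86-64 with 4 KiB base pages.
PAGE_SIZE = 4 << 10   # PTE level, 4 KiB
PMD_SIZE = 2 << 20    # PMD level, 2 MiB
PUD_SIZE = 1 << 30    # PUD level, 1 GiB

# Largest level first, so the greedy loop prefers the biggest leaf.
LEVELS = [("PUD", PUD_SIZE), ("PMD", PMD_SIZE), ("PTE", PAGE_SIZE)]


def plan_mapping(vaddr, paddr, length):
    """Cover [vaddr, vaddr + length) with the largest usable leaf at each step.

    Hypothetical helper: a leaf size is usable when both the virtual and
    the physical address are aligned to it and it does not run past the
    end of the range. Returns a list of (level_name, vaddr, size) tuples.
    """
    for value in (vaddr, paddr, length):
        if value % PAGE_SIZE:
            raise ValueError("addresses and length must be page aligned")

    chunks = []
    end = vaddr + length
    while vaddr < end:
        for name, size in LEVELS:
            if vaddr % size == 0 and paddr % size == 0 and vaddr + size <= end:
                chunks.append((name, vaddr, size))
                vaddr += size
                paddr += size
                break  # the PTE level always matches, so progress is guaranteed
    return chunks


def summarize(chunks):
    """Count how many leaf entries of each level a plan uses."""
    counts = {"PUD": 0, "PMD": 0, "PTE": 0}
    for name, _vaddr, _size in chunks:
        counts[name] += 1
    return counts
```

Under these assumptions, a range that starts two pages below a 2 MiB
boundary and ends one page past one comes out as a few ptes on each end
with PMDs in between. If the physical address is offset by a single page
relative to the virtual address, the whole range falls back to ptes,
which is the same behavior as today.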