Date: Fri, 24 May 2024 19:31:45 -0400
From: Peter Xu
To: Axel Rasmussen
Cc: Jason Gunthorpe, David Hildenbrand, Sean Christopherson, Linux MM, Alex Williamson
Subject: Re: [RFC] Huge remap_pfn_range for vfio-pci

On Fri, May 24, 2024 at 01:54:20PM -0700, Axel Rasmussen wrote:
> Hi,

Hi, Axel,

>
> I'm interested in extending remap_pfn_range to allow it to map the
> range hugely (using PUDs or PMDs). The initial user I have in mind is
> vfio-pci; I'm thinking when we're mapping large ranges for GPUs, we
> can get both a performance and host overhead win by doing this hugely.
>
> Another thing I have in the back of my mind is adding something KVM
> can re-use to simplify its whole host_pfn_mapping_level /
> hva_to_pfn_remapped / get_user_page_fast_only thing.

IIUC KVM should be prepared for it, as host_pfn_mapping_level() can detect
any huge mappings using the *_leaf() APIs.

>
> I know Peter and David are working on some related things (hugetlbfs
> unification and follow_pte et al improvements, respectively). Although
> I have a hacky proof of concept that works, I thought it best to get
> some consensus on the design before I post something, so I don't
> conflict with this existing / upcoming work.

Yes, we're working on that, mostly with Alex.  There's a testing branch,
but it's half-baked so far:

https://github.com/xzpeter/linux/commits/huge-pfnmap/

>
> Changing remap_pfn_range to install PUDs or PMDs is straightforward.
> The hairy part is the fault / follow side of things:

I'm surprised you thought about the fault() path, even though Alex only
officially proposed it yesterday.  Maybe you followed the previous
discussions.  It's here:

https://lore.kernel.org/r/20240523195629.218043-1-alex.williamson@redhat.com

>
> 1. follow_pte clearly doesn't work for this, since the leaf might be a
> PUD or PMD instead. Most callers don't care about the PTE itself, they
> care about the pgprot or flags it has set, so my idea was to add a new
> interface which just yields those bits, instead of the actual PTE.

See:

https://github.com/xzpeter/linux/commit/2cb4702418a1b740129fc7b379b52e16e57032e1

>
> Peter, I think hugetlbfs unification may run into similar issues, do
> you have some plan already to deal with PUD/PMD/PTE being different
> types?

Exactly.  There'll be some shared work between the two projects on fork(),
mprotect(), etc.  And yes, I plan to cover them all, but I'll start with
the pfnmap thing, paving the way for hugetlb, while we have Oscar (from the
SUSE kernel team) working concurrently on other paths of hugetlb.

>
> 2. vfio-pci relies on vm_ops->fault. This is a problem because the
> normal fault handler path doesn't call this until after it has walked
> down to the PTE level, installing PUDs/PMDs along the way. I have only
> gross ideas for how to deal with this:
>
> - Add a VM_HUGEPFNMAP VMA flag indicating vm_ops->fault should be
>   called earlier in __handle_mm_fault
> - Add a vm_ops->hugepfn_fault (name not important) which should be
>   called earlier in __handle_mm_fault
> - Go ahead and let remap_pfn_range overwrite existing PUDs/PMDS

I actually don't know exactly what you meant here, but Alex has already
worked on that with huge_fault().  See:

https://github.com/awilliam/linux-vfio/commit/ec6c970f8374f91df0ebfe180cd388ba31187942

So far I don't understand why we need a new VMA flag.

>
> I wonder which of these folks find least offensive? Or is there a
> better way I haven't thought of?
>
> 3. That's also an issue for CoW faults, but I don't know of any real
> use case for CoW huge pfn mappings, so I thought we can just keep the
> existing small mapping behavior for CoW VMAs. Any objections?

I think we should keep the PUD/PMD mappings transparent, so the old PTE
behavior needs to be maintained.  E.g., I think we'll need to be able to
split a PUD/PMD mapping if mprotect() is applied to only part of it.

Thanks,

-- 
Peter Xu