Date: Wed, 18 Jun 2025 12:56:01 -0400
From: Peter Xu
To: Jason Gunthorpe
Cc: "Liam R. Howlett", Lorenzo Stoakes, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, kvm@vger.kernel.org, Andrew Morton,
	Alex Williamson, Zi Yan, Alex Mastro, David Hildenbrand, Nico Pache
Subject: Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings

On Tue, Jun 17, 2025 at 07:36:08PM -0400, Peter Xu wrote:
> On Tue, Jun 17, 2025 at 08:18:07PM -0300, Jason Gunthorpe wrote:
> > On Tue, Jun 17, 2025 at 04:56:13PM -0400, Peter Xu wrote:
> > > On Mon, Jun 16, 2025 at 08:00:11PM -0300, Jason Gunthorpe wrote:
> > > > On Mon, Jun 16, 2025 at 06:06:23PM -0400, Peter Xu wrote:
> > > >
> > > > > Can I understand it as a suggestion to pass in a bitmask into the core mm
> > > > > API (e.g. keep the name of mm_get_unmapped_area_aligned()), instead of a
> > > > > constant "align", so that core mm would try to allocate from the largest
> > > > > size to smaller until it finds some working VA to use?
> > > >
> > > > I don't think you need a bitmask.
> > > >
> > > > Split the concerns, the caller knows what is inside it's FD. It only
> > > > needs to provide the highest pgoff aligned folio/pfn within the FD.
> > >
> > > Ultimately I even dropped this hint.  I found that it's not really
> > > get_unmapped_area()'s job to detect over-sized pgoffs.  It's mmap()'s job.
> > > So I decided to avoid this parameter as of now.
> >
> > Well, the point of the pgoff is only what you said earlier, to adjust
> > the starting alignment so the pgoff aligned high order folios/pfns
> > line up properly.
>
> I meant "highest pgoff" that I dropped.
>
> We definitely need the pgoff to make it work.
> So here I dropped "highest pgoff" passed from the caller because I decided
> to leave such check to the mmap() hook later.
>
> >
> > > > The mm knows what leaf page tables options exist. It should try to
> > > > align to the closest leaf page table size that is <= the FD's max
> > > > aligned folio.
> > >
> > > So again IMHO this is also not per-FD information, but needs to be passed
> > > over from the driver for each call.
> >
> > It is per-FD in the sense that each FD is unique and each range of
> > pgoff could have a unique maximum.
> >
> > > Likely the "order" parameter appeared in other discussions to imply a
> > > maximum supported size from the driver side (or, for a folio, but that is
> > > definitely another user after this series can land).
> >
> > Yes, it is the only information the driver can actually provide and
> > comes directly from what it will install in the VMA.
> >
> > > So far I didn't yet add the "order", because currently VFIO definitely
> > > supports all max orders the system supports.  Maybe we can add the order
> > > when there's a real need, but maybe it won't happen in the near
> > > future?
> >
> > The purpose of the order is to prevent over alignment and waste of
> > VMA. Your technique to use the length to limit alignment instead is
> > good enough for VFIO but not very general.
>
> Yes that's also something I didn't like.  I think I'll just go ahead and
> add the order parameter, then use it in previous patch too.

So I changed my mind, slightly.  I can still have the "order" parameter to
make the API cleaner (even if it'll be pure overhead, because all existing
callers will pass in PUD_SIZE as of now), but I think I'll still stick with
the ifdef in patch 4, as I mentioned here:

https://lore.kernel.org/all/aFGMG3763eSv9l8b@x1.local/

The problem is I just noticed yet again that exporting
huge_mapping_get_va_aligned() for all configs doesn't make sense.
At least it'll need something like this to make !MMU compile for VFIO,
which is definitely some ugliness I also want to avoid:

===8<===
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 59fdafb1034b..f40a8fb64eaa 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -548,7 +548,11 @@ static inline unsigned long huge_mapping_get_va_aligned(struct file *filp,
 		unsigned long addr, unsigned long len,
 		unsigned long pgoff, unsigned long flags)
 {
+#ifdef CONFIG_MMU
 	return mm_get_unmapped_area(current->mm, filp, addr, len, pgoff, flags);
+#else
+	return 0;
+#endif
 }
 
 static inline bool
===8<===

The issue is still that mm_get_unmapped_area() is only exported on
CONFIG_MMU, so we need to special-case that for
huge_mapping_get_va_aligned(), here for !THP && !MMU.

Besides the ugliness, it's also about how to choose a default value to
return when mm_get_unmapped_area() isn't available.  I gave it a default
value (0) as an example, but I don't even think that 0 makes sense.  It
would (if ever triggerable from any caller on !MMU) mean it will return 0
directly to __get_unmapped_area(), and then do_mmap() (of the !MMU code,
which comes down from ksys_mmap_pgoff() in nommu.c) will take that addr=0
as the address to mmap, which sounds wrong.  There's just no way to provide
a sane default value for !MMU.

So going one step back: huge_mapping_get_va_aligned() (or whatever name we
prefer) doesn't make sense to be exported always, but only when
CONFIG_MMU.  It should follow the same way we treat mm_get_unmapped_area().

Here it also goes back to the question of why !MMU even supports mmap():

https://www.kernel.org/doc/Documentation/nommu-mmap.txt

So, for the case of the v4l driver (v4l2_m2m_get_unmapped_area that I used
to quote, which is only defined for !MMU and which I used to misread), for
example, it's really minimal mmap() support on uClinux and that's all
there is to it.
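
To illustrate the point above about 0 not being a sane default: the sketch
below is a userspace model, not kernel code.  It assumes callers tell errors
from addresses with an IS_ERR_VALUE()-style range check (the top 4095 values
encode -errno), as the kernel's mmap paths do.  All names here
(sketch_is_err_value, stub_returning_zero, stub_returning_errno,
caller_accepts) are hypothetical stand-ins invented for the illustration.

```c
#include <errno.h>
#include <stdbool.h>

/* Userspace stand-in modelled on the kernel's MAX_ERRNO (assumption: 4095). */
#define SKETCH_MAX_ERRNO 4095UL

/* Stand-in for IS_ERR_VALUE(): the top 4095 values encode -errno. */
static bool sketch_is_err_value(unsigned long v)
{
	return v >= (unsigned long)-SKETCH_MAX_ERRNO;
}

/* Hypothetical !MMU stub returning 0, as in the example diff above. */
static unsigned long stub_returning_zero(void)
{
	return 0;
}

/*
 * Hypothetical stub returning an encoded errno.  Shown only for contrast,
 * not as a proposal for what the helper should return.
 */
static unsigned long stub_returning_errno(void)
{
	return (unsigned long)-ENOSYS;
}

/* True if a caller would go on to treat the value as a usable address. */
static bool caller_accepts(unsigned long addr)
{
	return !sketch_is_err_value(addr);
}
```

In this model caller_accepts(stub_returning_zero()) is true: nothing in the
error check flags 0, so a caller would carry on with addr=0 as if it were a
real result.  Only a value in the errno range is rejected.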

My gut feeling is the noMMU use case more or less abused the current
get_unmapped_area() hook to provide the physical addresses, so as to make
mmap() work even on uClinux.  It's for sure not a proof that we should have
huge_mapping_get_va_aligned() or mm_get_unmapped_area() available even for
!MMU.  Those are all about VAs, which do not exist in !MMU as a concept.

Thanks,

-- 
Peter Xu