Date: Fri, 26 Apr 2024 17:20:23 -0400
From: Peter Xu
To: David Hildenbrand
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andrew Morton,
 Mike Rapoport, Jason Gunthorpe, John Hubbard,
 linux-arm-kernel@lists.infradead.org, loongarch@lists.linux.dev,
 linux-mips@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
 linux-s390@vger.kernel.org, linux-sh@vger.kernel.org,
 linux-perf-users@vger.kernel.org, linux-fsdevel@vger.kernel.org,
 linux-riscv@lists.infradead.org, x86@kernel.org
Subject: Re: [PATCH v1 1/3] mm/gup: consistently name GUP-fast functions
References: <20240402125516.223131-1-david@redhat.com>
 <20240402125516.223131-2-david@redhat.com>
 <8b42a24d-caf0-46ef-9e15-0f88d47d2f21@redhat.com>
In-Reply-To: <8b42a24d-caf0-46ef-9e15-0f88d47d2f21@redhat.com>

On Fri, Apr 26, 2024 at 07:28:31PM +0200, David Hildenbrand wrote:
> On 26.04.24 18:12, Peter Xu wrote:
> > On Fri, Apr 26, 2024 at 09:44:58AM -0400, Peter Xu wrote:
> > > On Fri, Apr 26, 2024 at 09:17:47AM +0200, David Hildenbrand wrote:
> > > > On 02.04.24 14:55, David Hildenbrand wrote:
> > > > > Let's consistently call the "fast-only" part of GUP "GUP-fast" and rename
> > > > > all relevant internal functions to start with "gup_fast", to make it
> > > > > clearer that this is not ordinary GUP. The current mixture of
> > > > > "lockless", "gup" and "gup_fast" is confusing.
> > > > >
> > > > > Further, avoid the term "huge" when talking about a "leaf" -- for
> > > > > example, we nowadays check pmd_leaf() because pmd_huge() is gone. For the
> > > > > "hugepd"/"hugepte" stuff, it's part of the name ("is_hugepd"), so that
> > > > > stays.
> > > > >
> > > > > What remains is the "external" interface:
> > > > > * get_user_pages_fast_only()
> > > > > * get_user_pages_fast()
> > > > > * pin_user_pages_fast()
> > > > >
> > > > > The high-level internal functions for GUP-fast (+slow fallback) are now:
> > > > > * internal_get_user_pages_fast() -> gup_fast_fallback()
> > > > > * lockless_pages_from_mm() -> gup_fast()
> > > > >
> > > > > The basic GUP-fast walker functions:
> > > > > * gup_pgd_range() -> gup_fast_pgd_range()
> > > > > * gup_p4d_range() -> gup_fast_p4d_range()
> > > > > * gup_pud_range() -> gup_fast_pud_range()
> > > > > * gup_pmd_range() -> gup_fast_pmd_range()
> > > > > * gup_pte_range() -> gup_fast_pte_range()
> > > > > * gup_huge_pgd() -> gup_fast_pgd_leaf()
> > > > > * gup_huge_pud() -> gup_fast_pud_leaf()
> > > > > * gup_huge_pmd() -> gup_fast_pmd_leaf()
> > > > >
> > > > > The weird hugepd stuff:
> > > > > * gup_huge_pd() -> gup_fast_hugepd()
> > > > > * gup_hugepte() -> gup_fast_hugepte()
> > > >
> > > > I just realized that we end up calling these from follow_hugepd() as well.
> > > > And something seems to be off, because gup_fast_hugepd() won't have the VMA
> > > > even in the slow-GUP case to pass it to gup_must_unshare().
> > > >
> > > > So these are GUP-fast functions and the terminology seems correct.
But the
> > > > usage from follow_hugepd() is questionable,
> > > >
> > > > commit a12083d721d703f985f4403d6b333cc449f838f6
> > > > Author: Peter Xu
> > > > Date:   Wed Mar 27 11:23:31 2024 -0400
> > > >
> > > >     mm/gup: handle hugepd for follow_page()
> > > >
> > > >
> > > > states "With previous refactors on fast-gup gup_huge_pd(), most of the code
> > > > can be leveraged", which doesn't look quite true just staring at the
> > > > gup_must_unshare() call where we don't pass the VMA. Also,
> > > > "unlikely(pte_val(pte) != pte_val(ptep_get(ptep)" doesn't make any sense for
> > > > slow GUP ...
> > >
> > > Yes it's not needed, just doesn't look worthwhile to put another helper on
> > > top just for this. I mentioned this in the commit message here:
> > >
> > >   There's something not needed for follow page, for example, gup_hugepte()
> > >   tries to detect pgtable entry change which will never happen with slow
> > >   gup (which has the pgtable lock held), but that's not a problem to check.
> > >
> > > >
> > > > @Peter, any insights?
> > >
> > > However I think we should pass vma in for sure, I guess I overlooked that,
> > > and it didn't expose in my tests too as I probably missed ./cow.
> > >
> > > I'll prepare a separate patch on top of this series and the gup-fast rename
> > > patches (I saw this one just reached mm-stable), and I'll see whether I can
> > > test it too if I can find a Power system fast enough. I'll probably drop
> > > the "fast" in the hugepd function names too.
> >
>
> For the missing VMA parameter, the cow.c test might not trigger it. We never need the VMA to make
> a pinning decision for anonymous memory. We'll trigger an unsharing fault, get an exclusive anonymous page
> and can continue.
>
> We need the VMA in gup_must_unshare(), when long-term pinning a file hugetlb page. I *think*
> the gup_longterm.c selftest should trigger that, especially:
>
> # [RUN] R/O longterm GUP-fast pin in MAP_SHARED file mapping ...
with memfd hugetlb (2048 kB)
> ...
> # [RUN] R/O longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd hugetlb (1048576 kB)
>
>
> We need a MAP_SHARED page where the PTE is R/O that we want to long-term pin R/O.
> I don't remember off the top of my head if the test here might have a R/W-mapped
> folio. If so, we could extend it to cover that.

Let me try both then.

>
> > Hmm, so when I enable 2M hugetlb I found ./cow is even failing on x86.
> >
> >    # ./cow | grep -B1 "not ok"
> >    # [RUN] vmsplice() + unmap in child ... with hugetlb (2048 kB)
> >    not ok 161 No leak from parent into child
> >    --
> >    # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with hugetlb (2048 kB)
> >    not ok 215 No leak from parent into child
> >    --
> >    # [RUN] vmsplice() before fork(), unmap in parent after fork() ... with hugetlb (2048 kB)
> >    not ok 269 No leak from child into parent
> >    --
> >    # [RUN] vmsplice() + unmap in parent after fork() ... with hugetlb (2048 kB)
> >    not ok 323 No leak from child into parent
> >
> > And it looks like it was always failing.. perhaps since the start? We
>
> Yes!
>
> commit 7dad331be7816103eba8c12caeb88fbd3599c0b9
> Author: David Hildenbrand
> Date:   Tue Sep 27 13:01:17 2022 +0200
>
>     selftests/vm: anon_cow: hugetlb tests
>     Let's run all existing test cases with all hugetlb sizes we're able to
>     detect.
>     Note that some tests cases still fail. This will, for example, be fixed
>     once vmsplice properly uses FOLL_PIN instead of FOLL_GET for pinning.
>     With 2 MiB and 1 GiB hugetlb on x86_64, the expected failures are:
>      # [RUN] vmsplice() + unmap in child ... with hugetlb (2048 kB)
>      not ok 23 No leak from parent into child
>      # [RUN] vmsplice() + unmap in child ... with hugetlb (1048576 kB)
>      not ok 24 No leak from parent into child
>      # [RUN] vmsplice() before fork(), unmap in parent after fork() ...
with hugetlb (2048 kB)
>      not ok 35 No leak from child into parent
>      # [RUN] vmsplice() before fork(), unmap in parent after fork() ... with hugetlb (1048576 kB)
>      not ok 36 No leak from child into parent
>      # [RUN] vmsplice() + unmap in parent after fork() ... with hugetlb (2048 kB)
>      not ok 47 No leak from child into parent
>      # [RUN] vmsplice() + unmap in parent after fork() ... with hugetlb (1048576 kB)
>      not ok 48 No leak from child into parent
>
> As it keeps confusing people (until somebody cares enough to fix vmsplice), I already
> thought about just disabling the test and adding a comment why it happens and
> why nobody cares.

I think we should, and when doing so maybe add a rich comment in
hugetlb_wp() too explaining everything?

>
> > didn't do the same on hugetlb vs. normal anon from that regard on the
> > vmsplice() fix.
> >
> > I drafted a patch to allow refcount>1 detection as the same, then all tests
> > pass for me, as below.
> >
> > David, I'd like to double check with you before I post anything: is that
> > your intention to do so when working on the R/O pinning or not?
>
> Here certainly the "if it's easy it would already have been done" principle applies. :)
>
> The issue is the following: hugetlb pages are scarce resources that cannot usually
> be overcommitted. For ordinary memory, we don't care if we COW in some corner case
> because there is an unexpected reference. You temporarily consume an additional page
> that gets freed as soon as the unexpected reference is dropped.
>
> For hugetlb, it is problematic. Assume you have reserved a single 1 GiB hugetlb page
> and your process uses that in a MAP_PRIVATE mapping. Then it calls fork() and the
> child quits immediately.
>
> If you decide to COW, you would need a second hugetlb page, which we don't have, so
> you have to crash the program.
>
> And in hugetlb it's extremely easy to not get folio_ref_count() == 1:
>
> hugetlb_fault() will do a folio_get(folio) before calling hugetlb_wp()!
>
> ...
so you essentially always copy.

Hmm yes there's one extra refcount. I think this is all fine, we can simply
take all of them into account when making a CoW decision. However crashing
a userspace program can be a problem for sure.

>
>
> At that point I walked away from that, letting vmsplice() be fixed at some point. Dave
> Howells was close at some point IIRC ...
>
> I had some ideas about retrying until the other reference is gone (which cannot be a
> longterm GUP pin), but as vmsplice essentially does without FOLL_PIN|FOLL_LONGTERM,
> it's quite hopeless to resolve that as long as vmsplice holds longterm references the wrong
> way.
>
> ---
>
> One could argue that fork() with hugetlb and MAP_PRIVATE is stupid and fragile: assume
> your child MM is torn down deferred, and will unmap the hugetlb page deferred. Or assume
> you access the page concurrently with fork(). You'd have to COW and crash the program.
> BUT, there is a horribly ugly hack in hugetlb COW code where you *steal* the page from
> the child program and crash your child. I'm not making that up, it's horrible.

I didn't notice that code before; doesn't sound like a very responsible
parent..

Looks like either there comes a hugetlb guru who can make a decision to break
hugetlb ABI at some point, knowing that nobody will really get affected by
it, or that's the uncharted area for whoever needs to introduce hugetlb v2.

Thanks,

-- 
Peter Xu
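
The reuse-or-copy reasoning discussed in this thread can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not kernel code: toy_folio, toy_expected_refs() and toy_wp_decide() are hypothetical names invented for illustration. The only facts taken from the discussion are that hugetlb_fault() holds one extra reference by the time hugetlb_wp() runs, and that a copy needs a second, free hugetlb page.

```c
#include <assert.h>

/* Toy userspace model. All names here are hypothetical, not kernel APIs. */
enum toy_wp_action { TOY_WP_REUSE, TOY_WP_COPY, TOY_WP_FAIL };

struct toy_folio {
	int refcount;	/* total references currently held on the page */
};

/*
 * References the fault path can account for: one for the page table
 * mapping, plus whatever the fault handler itself holds temporarily
 * (the thread notes one such reference on the hugetlb path).
 */
static int toy_expected_refs(int handler_refs)
{
	assert(handler_refs >= 0);
	return 1 + handler_refs;
}

static enum toy_wp_action toy_wp_decide(const struct toy_folio *folio,
					int handler_refs,
					int free_huge_pages)
{
	/* Only accounted-for references: safe to reuse the page in place. */
	if (folio->refcount == toy_expected_refs(handler_refs))
		return TOY_WP_REUSE;

	/* An unexpected reference exists: copying needs a spare huge page. */
	if (free_huge_pages > 0)
		return TOY_WP_COPY;

	/* No spare page to copy into: the fault cannot be resolved. */
	return TOY_WP_FAIL;
}
```

In this model, a page with refcount 2 (mapping plus the handler's own reference) is reused once the handler reference is counted as expected (handler_refs = 1), while a strict "== 1" style check (handler_refs = 0) falls through to copy, or to failure when no spare huge page exists.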