From: Peter Xu <peterx@redhat.com>
Date: Mon, 2 Jun 2025 17:30:19 -0400
To: Oscar Salvador
Cc: Andrew Morton, Muchun Song, David Hildenbrand, James Houghton, Gavin Guo, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 1/3] mm, hugetlb: Clean up locking in hugetlb_fault and hugetlb_wp
References: <20250602141610.173698-1-osalvador@suse.de> <20250602141610.173698-2-osalvador@suse.de>
On Mon, Jun 02, 2025 at 10:47:04PM +0200, Oscar Salvador wrote:
> On Mon, Jun 02, 2025 at 11:14:26AM -0400, Peter Xu wrote:
> > It makes me feel nervous when we move more things over to the fault
> > mutex - I have a feeling it's abused.
> >
> > IIRC the fault mutex was initially introduced only to solve one problem
> > on concurrent allocations.  I confess I didn't follow or yet dig into
> > all the history, though.  From that POV, mfill_atomic_hugetlb() is
> > indeed reasonable to still take it because it's user-guided fault
> > injection.  I'm not yet sure about other places, e.g., for truncations.
>
> Hi Peter,
>
> taking the mutex for truncation/hole punching was added back in:
>
> commit b5cec28d36f5ee6b4e6f68a0a40aa1e4045d6d99
> Author: Mike Kravetz
> Date:   Tue Sep 8 15:01:41 2015 -0700
>
>     hugetlbfs: truncate_hugepages() takes a range of pages
>
> when the fallocate functionality and hole-punching mode were introduced.
> E.g.:
>
> commit c672c7f29f2fdb73e1f72911bf499675c81fcdbb
> Author: Mike Kravetz
> Date:   Tue Sep 8 15:01:35 2015 -0700
>
>     mm/hugetlb: expose hugetlb fault mutex for use by fallocate
>
>     hugetlb page faults are currently synchronized by the table of
>     mutexes (htlb_fault_mutex_table).  fallocate code will need to
>     synchronize with the page fault code when it allocates or deletes
>     pages.
>     Expose interfaces so that fallocate operations can be synchronized
>     with page faults.  Minor name changes to be more consistent with
>     other global hugetlb symbols.
>
> Sadly, I cannot really see why it was added.
> It was introduced from RFC v2 -> RFC v3:
>
> RFC v3:
>   Folded in patch for alloc_huge_page/hugetlb_reserve_pages race
>     in current code
>   fallocate allocation and hole punch is synchronized with page
>     faults via existing mutex table
>   hole punch uses existing hugetlb_vmtruncate_list instead of more
>     generic unmap_mapping_range for unmapping
>   Error handling for the case when region_del() fails
>
> But RFC v2 shows no discussions on that matter whatsoever, so go figure
> [1].
> I read that some tests were written, so I am not sure whether any of
> those tests caught a race.
>
> Looking at remove_inode_single_folio, there is a case though:
>
> 	/*
> 	 * If folio is mapped, it was faulted in after being
> 	 * unmapped in caller.  Unmap (again) while holding
> 	 * the fault mutex.  The mutex will prevent faults
> 	 * until we finish removing the folio.
> 	 */
> 	if (unlikely(folio_mapped(folio)))
> 		hugetlb_unmap_file_folio(h, mapping, folio, index);
>
> So, while the folio_lock that is taken further down guards us against
> removing the page from the pagecache, I guess that we need to take the
> mutex to guard us against parallel faults on the page.
>
> Looking back at earlier code, we did not handle that race.
>
> It was reworked in
>
> commit 4aae8d1c051ea00b456da6811bc36d1f69de5445
> Author: Mike Kravetz
> Date:   Fri Jan 15 16:57:40 2016 -0800
>
>     mm/hugetlbfs: unmap pages if page fault raced with hole punch
>
> and the comment looked like:
>
> 	/*
> 	 * If page is mapped, it was faulted in after being
> 	 * unmapped in caller.  Unmap (again) now after taking
> 	 * the fault mutex.  The mutex will prevent faults
> 	 * until we finish removing the page.
> 	 *
> 	 * This race can only happen in the hole punch case.
> 	 * Getting here in a truncate operation is a bug.
> 	 */
>
> So, it was placed to close the race.
> Now, I do not know whether we could close that race somewhat
> differently, but it seems risky, and having the mutex here seems like a
> good trade-off.

Right, and thanks for the git digging as usual.  I would agree hugetlb is
more challenging than many other modules when it comes to git
archaeology. :)

Even though I mentioned the invalidate_lock, I don't think I thought
deeper than that.  I just wish that, whenever possible, we still move
hugetlb code closer to generic code, so if that's the goal we may still
want to one day have a closer look at whether hugetlb can also use the
invalidate_lock.  Maybe it isn't worthwhile in the end: invalidate_lock
is currently a rwsem, which normally at least allows concurrent faults,
but that's currently what isn't allowed in hugetlb anyway.

If we start to remove finer-grained locks, that work will be even harder,
and removing the folio lock from the fault path in this case also moves
hugetlbfs even further from other file systems.  That might be slightly
against what we used to wish to do, which is to make it closer to the
others.  Meanwhile, I'm also not yet sure of the benefit of not taking
the folio lock all across; e.g., I don't expect perf would change at all
even if the lock is avoided.  We may want to think about that too when
doing so.

> > Meanwhile, IIUC this patch can at least be split into more than one, in
> > which case the 1st patch should only change the behavior of
> > pagecache_folio, rather than the rest - if we really want to move more
> > things over to the fault mutex, we can do that in the 2nd+ patches.
>
> Fair enough. This RFC was mainly for kicking off a discussion and
> deciding the direction we want to take.

> > I'd suggest we stick with fixing the pagecache_folio issue first, which
> > can be much easier and looks like a lock ordering issue only (while we
> > can still resolve it by removing one lock, but only on pagecache_folio,
> > not the rest yet).
>
> Yes, I think that should be enough to determine if we already mapped and
> cowed the folio privately.

> > > +
> > >  /*
> > > - * hugetlb_wp() should be called with page lock of the original hugepage held.
> > >   * Called with hugetlb_fault_mutex_table held and pte_page locked so we
> > >   * cannot race with other handlers or page migration.
> > >   * Keep the pte_same checks anyway to make transition from the mutex easier.
> > >   */
> > > -static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
> > > -		struct vm_fault *vmf)
> > > +static vm_fault_t hugetlb_wp(struct vm_fault *vmf, enum cow_context context)
> > >  {
> > >  	struct vm_area_struct *vma = vmf->vma;
> > >  	struct mm_struct *mm = vma->vm_mm;
> > >  	const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE;
> > >  	pte_t pte = huge_ptep_get(mm, vmf->address, vmf->pte);
> > >  	struct hstate *h = hstate_vma(vma);
> > > -	struct folio *old_folio;
> > > -	struct folio *new_folio;
> > >  	bool cow_from_owner = 0;
> > >  	vm_fault_t ret = 0;
> > >  	struct mmu_notifier_range range;
> > > +	struct folio *old_folio, *new_folio, *pagecache_folio;
> > > +	struct address_space *mapping = vma->vm_file->f_mapping;
> > >
> > >  	/*
> > >  	 * Never handle CoW for uffd-wp protected pages.  It should be only
> > > @@ -6198,7 +6201,11 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
> > >  	 * we run out of free hugetlb folios: we would have to kill processes
> > >  	 * in scenarios that used to work.  As a side effect, there can still
> > >  	 * be leaks between processes, for example, with FOLL_GET users.
> > > +	 *
> > > +	 * We need to take the lock here because the folio might be undergoing a
> > > +	 * migration. e.g: see folio_try_share_anon_rmap_pte.
> > >  	 */
> >
> > I agree we'd better keep the old_folio locked as of now, to be further
> > discussed, but I'm not 100% sure about the comment - doesn't the
> > migration path still need the pgtable lock to modify mapcounts?  I
> > think we hold the pgtable lock here.
>
> Uhm, yes, looking closer I think you are right.
> In hugetlb_fault, prior to calling into hugetlb_wp we take the ptl
> spinlock, and the migration path also takes it (via
> page_vma_mapped_walk), so we should not be seeing any spurious data from
> hugetlb_remove_rmap, as we are serialized.

> > > +	/*
> > > +	 * We can be called from two different contexts: hugetlb_no_page or
> > > +	 * hugetlb_fault.
> > > +	 * When called from the latter, we try to find the original folio in
> > > +	 * the pagecache and compare it against the current folio we have
> > > +	 * mapped in the pagetables.  If it differs, it means that this
> > > +	 * process already CoWed and mapped the folio privately, so we know
> > > +	 * that a reservation was already consumed.
> > > +	 * This will later be used to determine whether we need to unmap the
> > > +	 * folio from other processes in case we fail to allocate a new folio.
> > > +	 */
> > > +	if (context == HUGETLB_FAULT_CONTEXT) {
> > > +		pagecache_folio = filemap_get_hugetlb_folio(h, mapping, vmf->pgoff);
> > > +		if (IS_ERR(pagecache_folio))
> > > +			pagecache_folio = NULL;
> > > +		else
> > > +			folio_put(pagecache_folio);
> >
> > So here we released the refcount but then we're referencing the pointer
> > below.  I don't know whether this is wise; it looks prone to races.  If
> > we want, we can compare the folios before releasing the refcount.  Said
> > that,...
>
> We are holding the mutex so the folio cannot really go anywhere, but we
> could folio_put after the check as well.

> > > -	if (is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
> > > -	    old_folio != pagecache_folio)
> > > +	if (context == HUGETLB_FAULT_CONTEXT &&
> > > +	    is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
> > > +	    old_folio != pagecache_folio)
> >
> > .. here I am actually thinking whether we can use folio_test_anon() and
> > completely avoid looking up the page cache.  Here, the ultimate goal is
> > trying to know whether this is a CoW on top of a private page.
> > Then IIUC folio_test_anon(old_folio) is enough.
>
> I have yet to fully check, but this sounds reasonable to me.

> > >  	new_folio = false;
> > > -	folio = filemap_lock_hugetlb_folio(h, mapping, vmf->pgoff);
> > > +	folio = filemap_get_hugetlb_folio(h, mapping, vmf->pgoff);
> >
> > So this is part of the changes that I mentioned previously, that we
> > should avoid doing in one patch; actually I really think we should
> > think twice about using the fault mutex explicitly for more things.
> > Actually, I tend to go the other way round: I used to wonder whether we
> > can avoid the fault mutex in some paths.  E.g., for UFFDIO_COPY it
> > still makes sense to take the fault mutex because it faces a similar
> > challenge of concurrent allocations.  However, I'm not sure about the
> > truncate/hole-punch paths.  IIUC most file folios use invalidate_lock
> > for that, and I'd think of going the other way, using the same lock in
> > hugetlb code and keeping the fault mutex as minimally used as possible;
> > then the semantics of the lock and what it protects are much clearer.
>
> I hear you, but as explained above, I am not entirely sure we can get
> rid of the mutex in the truncation/punch-hole path, and if that is the
> case, having both the lock and the mutex is definitely overkill.
>
> I can split up the changes, I have no problem with that.

Thanks!  I hope that'll also help whatever patch to land sooner, after it
can be verified to fix the issue.

-- 
Peter Xu