Date: Tue, 3 Jun 2025 10:57:21 -0400
From: Peter Xu <peterx@redhat.com>
To: Oscar Salvador
Cc: Andrew Morton, Muchun Song, David Hildenbrand, James Houghton, Gavin Guo, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 1/3] mm, hugetlb: Clean up locking in hugetlb_fault and hugetlb_wp
References: <20250602141610.173698-1-osalvador@suse.de> <20250602141610.173698-2-osalvador@suse.de>
Content-Type: text/plain; charset=utf-8

On Tue, Jun 03, 2025 at 03:50:54PM +0200, Oscar Salvador wrote:
> On Mon, Jun 02, 2025 at 05:30:19PM -0400, Peter Xu wrote:
> > Right, and thanks for the git digging as usual.  I would agree hugetlb
> > is more challenging than many other modules for git archaeology. :)
> > 
> > Even if I mentioned the invalidate_lock, I don't think I thought deeper
> > than that.  I just wished that, whenever possible, we still move
> > hugetlb code closer to generic code, so if that's the goal we may still
> > want to have a closer look one day at whether hugetlb can also use
> > invalidate_lock.  Maybe it isn't worthwhile in the end: invalidate_lock
> > is currently a rwsem, which normally at least allows concurrent faults,
> > but that's currently what isn't allowed in hugetlb anyway..
> > 
> > If we start to remove finer-grained locks that work will be even
> > harder, and removing the folio lock from the fault path in this case
> > also brings hugetlbfs even further from other file systems.  That might
> > be slightly against what we used to wish to do, which is to make it
> > closer to the others.  Meanwhile I'm also not yet sure about the
> > benefit of not taking the folio lock across the whole fault, e.g. I
> > don't expect perf would change at all even if the lock is avoided.  We
> > may want to think about that too when doing so.
> 
> Ok, I have to confess I was not looking at things from this perspective,
> but now that I do, yes, you are right: we should strive to find
> replacements for hugetlb-specific code wherever we can.
> 
> I do not know about this case though; I'm not sure what other options we
> have for shutting out concurrent faults while doing other operations.
> But it is something we should definitely look at.
> 
> Wrt. the lock:
> There were two locks, the old_folio one (taken in hugetlb_fault) and the
> pagecache_folio one.

There are actually three places this patch touched; the 3rd one is
hugetlb_no_page(), in which case I also think we should lock it, not only
because file folios normally do it (see do_fault(), for example), but also
because that's exactly the possible race James mentioned, I believe, of a
!uptodate hugetlb folio being injected by UFFDIO_CONTINUE, along the lines
of:

  folio = alloc_hugetlb_folio(vma, vmf->address, false);
  ...
  folio_zero_user(folio, vmf->real_address);
  __folio_mark_uptodate(folio);

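(For illustration only -- a rough sketch of what "locking it" there could
mean; this is my own sketch rather than code from the patch, and the exact
lock placement is an assumption:)

  folio = alloc_hugetlb_folio(vma, vmf->address, false);
  folio_lock(folio);           /* hypothetical: hold across initialization */
  folio_zero_user(folio, vmf->real_address);
  __folio_mark_uptodate(folio);
  /* ... add to the page cache / map it while still holding the lock ... */
  folio_unlock(folio);

The point being that anything which finds the folio through the page cache
and respects the folio lock cannot map it before __folio_mark_uptodate()
has run.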

> The thing was not about worrying how much perf we leave on the table
> because of these locks, as I am pretty sure it is next to 0; my drive
> was to understand what the protections are and why, because, as the
> discussion showed, none of us really had a good idea about it, and it
> turns out that this goes back more than ~20 years.
> 
> Another topic for the lock (old_folio, so the one we copy from): when we
> compare it to generic code, we do not take the lock there.
> Looking at do_wp_page(), we do __get__ a reference on the folio we copy
> from, but not the lock, so AFAIU the lock seems only to please

Yes, this is a good point; for the CoW path alone maybe we don't need to
lock old_folio.

> folio_move_anon_rmap() from hugetlb_wp.
> 
> Taking a look at do_wp_page()->wp_can_reuse_anon_folio(), which also
> calls folio_move_anon_rmap() in case we can re-use the folio, it only
> takes the lock before the call to folio_move_anon_rmap(), and then
> unlocks it.

IMHO, do_wp_page() took the folio lock not for folio_move_anon_rmap(), but
for checking the swapcache/ksm stuff, which needs to be serialized with
the folio lock.

So I'm not 100% confident on the folio_move_anon_rmap(), but I _think_ it
deserves a data_race(), and IIUC it only works not because of the folio
lock, but because of how anon_vma is managed as a tree as of now, so that
as long as it is a WRITE_ONCE() even a race is benign (the rmap walker
will either see a complete old anon_vma that includes the parent
process's anon_vma, or the child's).  What really protects the anon_vma
should be the anon_vma lock..

That can definitely be a separate topic.  I'm not sure whether you'd like
to dig this part out, but if you do I'd also be more than happy to know
whether my understanding needs correction here.. :)

In general, I still agree with you that if the hugetlb CoW path can look
closer to do_wp_page() then that's great.

> 
> Which, I think, hugetlb should also do.
> 
> do_wp_page
>   wp_can_reuse_anon_folio ?
>   : yes: folio_lock ; folio_move_anon_rmap ; folio_unlock
>          bail out
>   : no:  get a reference on the folio and call wp_page_copy
> 
> So, this should be the lead that hugetlb follows.
> As I said, it is not about the performance, and I agree that relying on
> finer-granularity locks is the way to go, but we need to understand
> where and why, and with the current code from upstream, that is not
> clear at all.
> 
> That is why I wanted to reduce the scope of old_folio to what is
> actually needed, which is this snippet:
> 
> 	if (folio_mapcount(old_folio) == 1 && folio_test_anon(old_folio)) {
> 		if (!PageAnonExclusive(&old_folio->page)) {
> 			folio_move_anon_rmap(old_folio, vma);
> 			SetPageAnonExclusive(&old_folio->page);
> 		}
> 		if (likely(!unshare))
> 			set_huge_ptep_maybe_writable(vma, vmf->address,
> 						     vmf->pte);
> 
> 		delayacct_wpcopy_end();
> 		return 0;
> 	}
> 
> I think it is important to 1) reduce it to wrap what actually needs to
> be within the lock and 2) document why, so no one has to put on the
> gloves and start digging through the history again.
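For what it's worth, a rough sketch of what that reduced scope could look
like -- this is only my illustration of the idea and not the actual patch;
in particular, taking the lock here (instead of in hugetlb_fault) is an
assumption, and do_wp_page() actually uses a trylock and falls back to the
copy path when it cannot get the lock:

	if (folio_mapcount(old_folio) == 1 && folio_test_anon(old_folio)) {
		/* hypothetical: lock only around the anon-rmap handover */
		folio_lock(old_folio);
		if (!PageAnonExclusive(&old_folio->page)) {
			folio_move_anon_rmap(old_folio, vma);
			SetPageAnonExclusive(&old_folio->page);
		}
		folio_unlock(old_folio);

		if (likely(!unshare))
			set_huge_ptep_maybe_writable(vma, vmf->address,
						     vmf->pte);

		delayacct_wpcopy_end();
		return 0;
	}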

> 
> > Thanks!  I hope that'll also help whatever patch land sooner, after it
> > can be verified to fix the issue.
> 
> So, my plan is:
> 
> 1) Fix the pagecache folio issue in one patch (test for anon, still need
>    to check but it should work)
> 2) Implement the 'filemap_get_hugetlb_folio' thing to get a reference
>    and not lock it
> 3) Reduce the scope of old_folio
> 
> I want to make it clear that, while I still want to add
> filemap_get_hugetlb_folio and stop using the lock version, the reason is
> not to give more power to the mutex, but to bring it closer to what
> do_wp_page() does.
> 
> What do you think about it?

So in this case, as long as (2) won't change the folio lock for no_page,
then I agree.

Thanks,

-- 
Peter Xu