From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
To: Sumanth Korikkar <sumanthk@linux.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Jonathan Corbet <corbet@lwn.net>,
Matthew Wilcox <willy@infradead.org>, Guo Ren <guoren@kernel.org>,
Thomas Bogendoerfer <tsbogend@alpha.franken.de>,
Heiko Carstens <hca@linux.ibm.com>,
Vasily Gorbik <gor@linux.ibm.com>,
Alexander Gordeev <agordeev@linux.ibm.com>,
Christian Borntraeger <borntraeger@linux.ibm.com>,
Sven Schnelle <svens@linux.ibm.com>,
"David S . Miller" <davem@davemloft.net>,
Andreas Larsson <andreas@gaisler.com>,
Arnd Bergmann <arnd@arndb.de>,
Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
Dan Williams <dan.j.williams@intel.com>,
Vishal Verma <vishal.l.verma@intel.com>,
Dave Jiang <dave.jiang@intel.com>,
Nicolas Pitre <nico@fluxnic.net>,
Muchun Song <muchun.song@linux.dev>,
Oscar Salvador <osalvador@suse.de>,
David Hildenbrand <david@redhat.com>,
Konstantin Komarov <almaz.alexandrovich@paragon-software.com>,
Baoquan He <bhe@redhat.com>, Vivek Goyal <vgoyal@redhat.com>,
Dave Young <dyoung@redhat.com>, Tony Luck <tony.luck@intel.com>,
Reinette Chatre <reinette.chatre@intel.com>,
Dave Martin <Dave.Martin@arm.com>,
James Morse <james.morse@arm.com>,
Alexander Viro <viro@zeniv.linux.org.uk>,
Christian Brauner <brauner@kernel.org>, Jan Kara <jack@suse.cz>,
"Liam R . Howlett" <Liam.Howlett@oracle.com>,
Vlastimil Babka <vbabka@suse.cz>, Mike Rapoport <rppt@kernel.org>,
Suren Baghdasaryan <surenb@google.com>,
Michal Hocko <mhocko@suse.com>, Hugh Dickins <hughd@google.com>,
Baolin Wang <baolin.wang@linux.alibaba.com>,
Uladzislau Rezki <urezki@gmail.com>,
Dmitry Vyukov <dvyukov@google.com>,
Andrey Konovalov <andreyknvl@gmail.com>,
Jann Horn <jannh@google.com>, Pedro Falcato <pfalcato@suse.de>,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-fsdevel@vger.kernel.org, linux-csky@vger.kernel.org,
linux-mips@vger.kernel.org, linux-s390@vger.kernel.org,
sparclinux@vger.kernel.org, nvdimm@lists.linux.dev,
linux-cxl@vger.kernel.org, linux-mm@kvack.org,
ntfs3@lists.linux.dev, kexec@lists.infradead.org,
kasan-dev@googlegroups.com, Jason Gunthorpe <jgg@nvidia.com>,
iommu@lists.linux.dev, Kevin Tian <kevin.tian@intel.com>,
Will Deacon <will@kernel.org>,
Robin Murphy <robin.murphy@arm.com>
Subject: Re: [PATCH v4 11/14] mm/hugetlbfs: update hugetlbfs to use mmap_prepare
Date: Mon, 20 Oct 2025 11:58:25 +0100
Message-ID: <2e65cc96-5fb8-4197-b4c2-188c4378c417@lucifer.local>
In-Reply-To: <aNKJ6b7kmT_u0A4c@li-2b55cdcc-350b-11b2-a85c-a78bff51fc11.ibm.com>
On Tue, Sep 23, 2025 at 01:52:09PM +0200, Sumanth Korikkar wrote:
> Hi Lorenzo,
>
> The following test causes the kernel to enter a blocked state,
> suggesting an issue related to locking order. I was able to reproduce
> this behavior in certain test runs.
>
> Test case:
> git clone https://github.com/libhugetlbfs/libhugetlbfs.git
> cd libhugetlbfs ; ./configure
> make -j32
> cd tests
> echo 100 > /proc/sys/vm/nr_hugepages
> mkdir -p /test-hugepages && mount -t hugetlbfs nodev /test-hugepages
> ./run_tests.py <in a loop>
> ...
> shm-fork 10 100 (1024K: 64): PASS
> set shmmax limit to 104857600
> shm-getraw 100 /dev/full (1024K: 32):
> shm-getraw 100 /dev/full (1024K: 64): PASS
> fallocate_stress.sh (1024K: 64): <blocked>
>
> Blocked task state below:
>
> task:fallocate_stres state:D stack:0 pid:5106 tgid:5106 ppid:5103
> task_flags:0x400000 flags:0x00000001
> Call Trace:
> [<00000255adc646f0>] __schedule+0x370/0x7f0
> [<00000255adc64bb0>] schedule+0x40/0xc0
> [<00000255adc64d32>] schedule_preempt_disabled+0x22/0x30
> [<00000255adc68492>] rwsem_down_write_slowpath+0x232/0x610
> [<00000255adc68922>] down_write_killable+0x52/0x80
> [<00000255ad12c980>] vm_mmap_pgoff+0xc0/0x1f0
> [<00000255ad164bbe>] ksys_mmap_pgoff+0x17e/0x220
> [<00000255ad164d3c>] __s390x_sys_old_mmap+0x7c/0xa0
> [<00000255adc60e4e>] __do_syscall+0x12e/0x350
> [<00000255adc6cfee>] system_call+0x6e/0x90
> task:fallocate_stres state:D stack:0 pid:5109 tgid:5106 ppid:5103
> task_flags:0x400040 flags:0x00000001
> Call Trace:
> [<00000255adc646f0>] __schedule+0x370/0x7f0
> [<00000255adc64bb0>] schedule+0x40/0xc0
> [<00000255adc64d32>] schedule_preempt_disabled+0x22/0x30
> [<00000255adc68492>] rwsem_down_write_slowpath+0x232/0x610
> [<00000255adc688be>] down_write+0x4e/0x60
> [<00000255ad1c11ec>] __hugetlb_zap_begin+0x3c/0x70
> [<00000255ad158b9c>] unmap_vmas+0x10c/0x1a0
> [<00000255ad180844>] vms_complete_munmap_vmas+0x134/0x2e0
> [<00000255ad1811be>] do_vmi_align_munmap+0x13e/0x170
> [<00000255ad1812ae>] do_vmi_munmap+0xbe/0x140
> [<00000255ad183f86>] __vm_munmap+0xe6/0x190
> [<00000255ad166832>] __s390x_sys_munmap+0x32/0x40
> [<00000255adc60e4e>] __do_syscall+0x12e/0x350
> [<00000255adc6cfee>] system_call+0x6e/0x90
>
>
> Thanks,
> Sumanth
(I've been on holiday for a couple of weeks, and last week was a catch-up! :)

So having looked into this, the issue is that hugetlbfs exposes a per-VMA
hugetlbfs lock which can be taken via the rmap.

So, while faults are disallowed until the VMA is fully set up, rmap access is
not, and therefore there's a race between setting up the hugetlbfs lock and the
rmap trying to take/release it.

It's a real edge case, as it's unusual to need this during initial custom mmap
setup, but to account for this and for any other users which might require it,
I have resolved this by introducing the ability to hold on to the rmap lock
until the VMA is fully set up.

The window is very small, but obviously it's one we have to account for :)

I think this is the most correct solution, as it prevents any confusion as to
the state of the lock: rmap users simply cannot access the VMA until it is
established.
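
To illustrate the idea only, here is a minimal userspace sketch of the ordering
described above. It is not the kernel change and not the code in the respin.
Every name in it (toy_vma, toy_rmap_lock, toy_mmap_link() and so on) is a
hypothetical stand-in invented for this example, and plain pthread mutexes are
assumed in place of the real rmap lock and the per-VMA hugetlbfs lock. The
point it shows is that if the mapping side keeps holding the rmap lock from the
moment the VMA becomes reachable until its lock is set up, a walker can never
observe a half-initialised VMA.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* All names here are hypothetical stand-ins, not kernel identifiers. */
struct toy_vma {
	bool linked;               /* reachable by the rmap-side walker */
	pthread_mutex_t *vma_lock; /* per-VMA lock, only set up late in mmap */
};

enum toy_walk {
	TOY_WALK_BLOCKED,    /* rmap lock is held, VMA cannot be reached */
	TOY_WALK_EMPTY,      /* nothing linked yet */
	TOY_WALK_OK,         /* took and released the per-VMA lock */
	TOY_WALK_HALF_SETUP, /* linked VMA without a lock: the bug */
};

static pthread_mutex_t toy_rmap_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t toy_vma_lock_storage = PTHREAD_MUTEX_INITIALIZER;

/* Step 1 of the mapping: link the VMA, but keep holding the rmap lock. */
void toy_mmap_link(struct toy_vma *vma)
{
	pthread_mutex_lock(&toy_rmap_lock);
	vma->linked = true;
}

/* Step 2: finish setup, and only then drop the rmap lock. */
void toy_mmap_complete(struct toy_vma *vma)
{
	assert(vma->linked); /* must follow toy_mmap_link() */
	vma->vma_lock = &toy_vma_lock_storage;
	pthread_mutex_unlock(&toy_rmap_lock);
}

/* rmap-side walker. It uses trylock so that this demo never blocks. */
enum toy_walk toy_rmap_walk(struct toy_vma *vma)
{
	enum toy_walk ret = TOY_WALK_EMPTY;

	if (pthread_mutex_trylock(&toy_rmap_lock) != 0)
		return TOY_WALK_BLOCKED;

	if (vma->linked) {
		if (vma->vma_lock == NULL) {
			ret = TOY_WALK_HALF_SETUP;
		} else {
			pthread_mutex_lock(vma->vma_lock);
			pthread_mutex_unlock(vma->vma_lock);
			ret = TOY_WALK_OK;
		}
	}

	pthread_mutex_unlock(&toy_rmap_lock);
	return ret;
}
```

In this sketch, a walk attempted between toy_mmap_link() and
toy_mmap_complete() reports TOY_WALK_BLOCKED rather than TOY_WALK_HALF_SETUP,
and a walk after toy_mmap_complete() reports TOY_WALK_OK. Dropping the lock at
the end of toy_mmap_link() instead would reopen the window in which the walker
sees a linked VMA with no lock.
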
I am putting the finishing touches to a respin with this fix included, and will
cc you on it.

Cheers, Lorenzo