* [bug report?] unintuitive behavior when mapping over hugepage-backed PROT_NONE regions
@ 2025-02-06 6:18 Uday Shankar
2025-02-06 9:01 ` Oscar Salvador
2025-02-07 13:12 ` Lorenzo Stoakes
0 siblings, 2 replies; 16+ messages in thread
From: Uday Shankar @ 2025-02-06 6:18 UTC (permalink / raw)
To: Muchun Song, Andrew Morton, Joern Engel; +Cc: linux-mm
I was debugging an issue with a malloc implementation when I noticed
some unintuitive behavior that happens when someone attempts to
overwrite part of a hugepage-backed PROT_NONE mapping with another
mapping. I've isolated the issue and reproduced it with the following
program:
[root@localhost ~]# cat test.c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>
#define MMAP_FLAGS_COMMON (MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB)
int main()
{
size_t len = 2ULL << 30;
void *a = mmap(
(void *)0x7c8000000000, len, PROT_NONE,
MMAP_FLAGS_COMMON | MAP_FIXED_NOREPLACE | MAP_NORESERVE, -1, 0);
printf("a=%p errno %d %m\n", a, errno);
errno = 0;
char buf[128];
sprintf(buf, "cp /proc/%d/smaps smaps1", getpid());
assert(system(buf) == 0);
len = 4096;
void *b = mmap(
a, len, PROT_READ | PROT_WRITE,
MMAP_FLAGS_COMMON | MAP_POPULATE | MAP_FIXED, -1, 0);
printf("b=%p errno %d %m\n", b, errno);
errno = 0;
sprintf(buf, "cp /proc/%d/smaps smaps2", getpid());
assert(system(buf) == 0);
return 0;
}
[root@localhost ~]# gcc -o test test.c && ./test
a=0x7c8000000000 errno 0 Success
b=0xffffffffffffffff errno 12 Cannot allocate memory
[root@localhost ~]# diff smaps1 smaps2
157,158c157,158
< 7c8000000000-7c8080000000 ---p 00000000 00:10 7332 /anon_hugepage (deleted)
< Size: 2097152 kB
---
> 7c8000200000-7c8080000000 ---p 00200000 00:10 7332 /anon_hugepage (deleted)
> Size: 2095104 kB
First, we map a 2G PROT_NONE region using hugepages. This succeeds. Then
we try to map a 4096-length PROT_READ | PROT_WRITE region at the
beginning of the PROT_NONE region, still using hugepages. This fails, as
expected, because 4096 is much smaller than the hugepage size configured
on the system (this is x86 with a default hugepage size of 2M). The
surprising thing is the difference in /proc/pid/smaps before and after
the failed mmap. Even though the mmap failed, the value in
/proc/pid/smaps changed, with a 2M-sized bite being taken out the front
of the mapping. This feels unintuitive to me, as I'd expect a failed
mmap to have no effect on the virtual memory mappings of the calling
process whatsoever.
I initially saw this on an ancient redhat kernel, but I was able to
reproduce it on 6.13 as well. So I assume this behavior still exists and
has been around forever.

* Re: [bug report?] unintuitive behavior when mapping over hugepage-backed PROT_NONE regions
From: Oscar Salvador @ 2025-02-06  9:01 UTC (permalink / raw)
  To: Uday Shankar; +Cc: Muchun Song, Andrew Morton, Joern Engel, linux-mm

On Wed, Feb 05, 2025 at 11:18:34PM -0700, Uday Shankar wrote:
> I was debugging an issue with a malloc implementation when I noticed
> some unintuitive behavior that happens when someone attempts to
> overwrite part of a hugepage-backed PROT_NONE mapping with another
> mapping. I've isolated the issue and reproduced it with the following
> program:
...
> First, we map a 2G PROT_NONE region using hugepages. This succeeds. Then
> we try to map a 4096-length PROT_READ | PROT_WRITE region at the
> beginning of the PROT_NONE region, still using hugepages. This fails, as
> expected, because 4096 is much smaller than the hugepage size configured
> on the system (this is x86 with a default hugepage size of 2M). The

Not really, see how ksys_mmap_pgoff() aligns len to huge_page_size if we
set MAP_HUGETLB.
It fails with ENOMEM because likely you did not preallocate any hugetlb
pages, so by the time we do hugetlbfs_file_mmap()->hugetlb_reserve_pages(),
it sees that we do not have enough hugetlb pages in the pool to be reserved,
so it bails out.

> surprising thing is the difference in /proc/pid/smaps before and after
> the failed mmap. Even though the mmap failed, the value in
> /proc/pid/smaps changed, with a 2M-sized bite being taken out the front
> of the mapping. This feels unintuitive to me, as I'd expect a failed
> mmap to have no effect on the virtual memory mappings of the calling
> process whatsoever.

That is because the above happens after __mmap_prepare(), which is
responsible of unmapping any overlapping areas, is executed.
I guess this is done this way because rolling back at this point would be
quite tricky.

--
Oscar Salvador
SUSE Labs
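
As a concrete illustration of the reservation failure Oscar describes, here
is a minimal sketch of a userspace-side check before mapping without
MAP_NORESERVE. The helper is hypothetical and not part of the thread's test
program; the sysfs path assumes the default 2M huge page size on x86-64, and
the check is inherently racy, as noted later in the thread.

#include <stdio.h>
#include <sys/mman.h>

/* Read the number of free pages in the default (2M) hugetlb pool. */
static long free_huge_pages(void)
{
	long n = -1;
	FILE *f = fopen("/sys/kernel/mm/hugepages/hugepages-2048kB/free_hugepages", "r");

	if (!f)
		return -1;
	if (fscanf(f, "%ld", &n) != 1)
		n = -1;
	fclose(f);
	return n;
}

int main(void)
{
	size_t len = 2UL << 20;	/* one 2M huge page */

	if (free_huge_pages() < 1) {
		/* Preallocate as root, e.g.: echo 16 > /proc/sys/vm/nr_hugepages */
		fprintf(stderr, "no free hugetlb pages; reservation will fail with ENOMEM\n");
		return 1;
	}

	/* Without MAP_NORESERVE, the reservation is checked at mmap() time. */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	printf("mapped %zu bytes at %p\n", len, p);
	return 0;
}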

* Re: [bug report?] unintuitive behavior when mapping over hugepage-backed PROT_NONE regions
From: Jörn Engel @ 2025-02-06 18:11 UTC (permalink / raw)
  To: Oscar Salvador; +Cc: Uday Shankar, Muchun Song, Andrew Morton, linux-mm

On Thu, Feb 06, 2025 at 10:01:05AM +0100, Oscar Salvador wrote:
>
> That is because the above happens after __mmap_prepare(), which is
> responsible of unmapping any overlapping areas, is executed.
> I guess this is done this way because rolling back at this point would be
> quite tricky.

The big question (to me at least) is whether the current behavior is
correct or not. I cannot find any documentation to that end, so maybe
this is a new question we have to answer for the first time. So:

In case of failure, should munmap() change the process address space?

As a user I would like the answer to be "no". Partially because I was
personally surprised to see a change and surprises often result in bugs.
Partially because the specific change isn't even well-defined. The size
of the unmapped region depends on the kernel configuration, you might
unmap a 2M-aligned chunk or a 1G-aligned chunk.

Are there contrary opinions out there? Would it ever be useful to have
a failed mmap or munmap make changes to the process address space?

Jörn

--
I don't care what anything was designed to do,
I care about what it can do.
-- Gene Kranz

* Re: [bug report?] unintuitive behavior when mapping over hugepage-backed PROT_NONE regions
From: Oscar Salvador @ 2025-02-06 18:54 UTC (permalink / raw)
  To: Jörn Engel
  Cc: Uday Shankar, Muchun Song, Andrew Morton, linux-mm, lorenzo.stoakes

On Thu, Feb 06, 2025 at 10:11:30AM -0800, Jörn Engel wrote:
> On Thu, Feb 06, 2025 at 10:01:05AM +0100, Oscar Salvador wrote:
> >
> > That is because the above happens after __mmap_prepare(), which is
> > responsible of unmapping any overlapping areas, is executed.
> > I guess this is done this way because rolling back at this point would be
> > quite tricky.

Let me add Lorenzo

> The big question (to me at least) is whether the current behavior is
> correct or not. I cannot find any documentation to that end, so maybe
> this is a new question we have to answer for the first time. So:
>
> In case of failure, should munmap() change the process address space?
>
> As a user I would like the answer to be "no". Partially because I was
> personally surprised to see a change and surprises often result in bugs.
> Partially because the specific change isn't even well-defined. The size
> of the unmapped region depends on the kernel configuration, you might
> unmap a 2M-aligned chunk or a 1G-aligned chunk.
>
> Are there contrary opinions out there? Would it ever be useful to have
> a failed mmap or munmap make changes to the process address space?

AFAIK we try to rollback as much as possible (vms_abort_munmap_vmas()),
but sometimes it is not possible.
For the problem here at hand, we could poke hugetlb to check whether it
has enough hugetlb pages, but that would be 1) racy and 2) we do not
want to special case hugetlb even more.

Hopefully Lorenzo can shed some light here.

--
Oscar Salvador
SUSE Labs

* Re: [bug report?] unintuitive behavior when mapping over hugepage-backed PROT_NONE regions
From: Lorenzo Stoakes @ 2025-02-07 10:29 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Jörn Engel, Uday Shankar, Muchun Song, Andrew Morton, linux-mm

On Thu, Feb 06, 2025 at 07:54:29PM +0100, Oscar Salvador wrote:
> AFAIK we try to rollback as much as possible (vms_abort_munmap_vmas()),
> but sometimes it is not possible.
> For the problem here at hand, we could poke hugetlb to check whether it
> has enough hugetlb pages, but that would be 1) racy and 2) we do not
> want to special case hugetlb even more.
>
> Hopefully Lorenzo can shed some light here.

Thanks, let me look at the OP...

* Re: [bug report?] unintuitive behavior when mapping over hugepage-backed PROT_NONE regions
From: Vlastimil Babka @ 2025-02-07 10:49 UTC (permalink / raw)
  To: Jörn Engel, Oscar Salvador, Lorenzo Stoakes
  Cc: Uday Shankar, Muchun Song, Andrew Morton, linux-mm

On 2/6/25 19:11, Jörn Engel wrote:
> Are there contrary opinions out there? Would it ever be useful to have
> a failed mmap or munmap make changes to the process address space?

Incidentally, a similar thing may apply to existing userspace application
that learned to expect the unmap on failure, and if we now fix the kernel
to stop doing that, that application might break. However in this case
it's probably very unlikely such application exists, so we might try and
see if anyone complains...

* Re: [bug report?] unintuitive behavior when mapping over hugepage-backed PROT_NONE regions
From: Lorenzo Stoakes @ 2025-02-07 12:33 UTC (permalink / raw)
  To: Jörn Engel
  Cc: Oscar Salvador, Uday Shankar, Muchun Song, Andrew Morton, linux-mm

On Thu, Feb 06, 2025 at 10:11:30AM -0800, Jörn Engel wrote:
> Are there contrary opinions out there? Would it ever be useful to have
> a failed mmap or munmap make changes to the process address space?

Yes :) I will reply on the top thread with everyone cc'd though.

* Re: [bug report?] unintuitive behavior when mapping over hugepage-backed PROT_NONE regions
From: Uday Shankar @ 2025-02-06 19:44 UTC (permalink / raw)
  To: Oscar Salvador; +Cc: Muchun Song, Andrew Morton, Joern Engel, linux-mm

On Thu, Feb 06, 2025 at 10:01:05AM +0100, Oscar Salvador wrote:
> Not really, see how ksys_mmap_pgoff() aligns len to huge_page_size if we
> set MAP_HUGETLB.
> It fails with ENOMEM because likely you did not preallocate any hugetlb
> pages, so by the time we do hugetlbfs_file_mmap()->hugetlb_reserve_pages(),
> it sees that we do not have enough hugetlb pages in the pool to be reserved,
> so it bails out.

Yeah, you're right, this is a system without any hugepages preallocated,
and that makes the error code make a lot more sense. Sorry for any
confusion. This is why I should report issues when I discover them and
not 2 months later :)

* Re: [bug report?] unintuitive behavior when mapping over hugepage-backed PROT_NONE regions
From: Lorenzo Stoakes @ 2025-02-07 13:12 UTC (permalink / raw)
  To: Uday Shankar
  Cc: Muchun Song, Andrew Morton, Joern Engel, linux-mm, Liam R. Howlett,
      Vlastimil Babka, Jann Horn, Oscar Salvador

Thanks very much for your report! +cc other mmap maintainers/reviewers +
Oscar (thank you for pinging me on this Oscar!)

(you weren't to know as a bug reporter, so no problem, but the people
listed in the MEMORY MAPPING section in MAINTAINERS will always deal with
anything to do with mmap() and friends swiftly so it's a good resource
going forward if you find anything else :)

On Wed, Feb 05, 2025 at 11:18:34PM -0700, Uday Shankar wrote:
> I was debugging an issue with a malloc implementation when I noticed
> some unintuitive behavior that happens when someone attempts to
> overwrite part of a hugepage-backed PROT_NONE mapping with another
> mapping. I've isolated the issue and reproduced it with the following
> program:
...
> 	size_t len = 2ULL << 30;
> 	void *a = mmap(
> 		(void *)0x7c8000000000, len, PROT_NONE,
> 		MMAP_FLAGS_COMMON | MAP_FIXED_NOREPLACE | MAP_NORESERVE, -1, 0);

The MAP_NORESERVE is key here. This tells hugetlb that you do not want to
reserve pages, but rather to try at fault time, so it will make this
succeed even if you have no hugetlb pages. If you try to then fault
anywhere in the range, it'll segfault.

This check is done in hugetlb_reserve_pages() which is called by
hugetlbfs_file_mmap() which is an f_op->mmap() callback (the MAP_HUGETLB
'cheats in' an fd so makes it a file-backed mapping despite appearances):

	/*
	 * Only apply hugepage reservation if asked. At fault time, an
	 * attempt will be made for VM_NORESERVE to allocate a page
	 * without using reserves
	 */
	if (vm_flags & VM_NORESERVE)
		return true;

Actually though this is irrelevant - you will see the _same_ behaviour
regardless of how you map this.

> 	len = 4096;
> 	void *b = mmap(
> 		a, len, PROT_READ | PROT_WRITE,
> 		MMAP_FLAGS_COMMON | MAP_POPULATE | MAP_FIXED, -1, 0);

Here is the more interesting bit. Let me analyse what's happened here
below.

> First, we map a 2G PROT_NONE region using hugepages. This succeeds. Then
> we try to map a 4096-length PROT_READ | PROT_WRITE region at the
> beginning of the PROT_NONE region, still using hugepages. This fails, as
> expected, because 4096 is much smaller than the hugepage size configured
> on the system (this is x86 with a default hugepage size of 2M). The
> surprising thing is the difference in /proc/pid/smaps before and after
> the failed mmap. Even though the mmap failed, the value in
> /proc/pid/smaps changed, with a 2M-sized bite being taken out the front
> of the mapping. This feels unintuitive to me, as I'd expect a failed
> mmap to have no effect on the virtual memory mappings of the calling
> process whatsoever.
>
> I initially saw this on an ancient redhat kernel, but I was able to
> reproduce it on 6.13 as well. So I assume this behavior still exists and
> has been around forever.

As Oscar pointed out, this is not really hugely _surprising_ in that you
may be requesting a hugetlb mapping of 4096 bytes for case 'b' but by
setting MAP_HUGETLB you are asking for it to be aligned.

So in effect you are doing:

1. Perform a 2 GiB mapping (it actually doesn't matter if you make it
   MAP_HUGETLB with MAP_NORESERVE to make that succeed or not, or even the
   size, as long as it is larger than the mapping you make afterwards for
   the purposes of the discussion :)

2. Map something over this _partially_ where the mapping fails.

So the crux of the point is something that Joern alluded to in his reply -
'should a failed memory mapping operation cause visible changes to the
memory mapping'.

The key point here is that the operation you are performing is
_aggregate_. You're not asking for one thing to happen, you're actually
asking for 2 - unmap the start of range a, THEN perform a new mapping in
its place.

So the question boils down to 'does linux guarantee that aggregate
operations, on partial failure, unwind those partial operations which
succeed?' - and emphatically the answer to that is no.

If one of those successful operations is 'unmap this mapping', as it is
here - you have to ask yourself - what would undoing that look like?
Because if there's anon backing, then the data underlying that mapping is
gone forever, so we can't restore the mapping as it was.

And at this point it really makes no sense to special case things like
PROT_NONE, because we simply do not guarantee that we do anything good in
this scenario.

Attempting to 'do everything up front' would be prohibitively expensive,
imagine the overhead in tracking that etc. etc.

If you get an error performing an operation, you must interpret this as
'the operation either partially or entirely failed' and act accordingly.

This may be surprising, but as you can imagine, correctly performing
unwinding of partial operations is... difficult and in some cases would
have significant performance penalties.

An important point here esp. as to Joern's question re: an unmapping
partially failing - since now we gather up and do unmappings on MAP_FIXED
in aggregate _internal to the kernel_ - if the failure had occurred there
(it didn't, it occurred in the mapping bit) - we WOULD HAVE undone the
partial operation.

The problem here is that we have SUCCESSFULLY unmapped part of a mapping,
then FAILED to map something in that gap.

So TL;DR is - aggregate operations failing means any or all of the
operation failed, you can no longer rely on the mapping state being what
you expected.

~~ Beyond here lies extraneous details for the curious... ~~

HOWEVER, interestingly, there is a little more to this story in this
particular case - Liam recently significantly altered how MAP_FIXED
behaves (actually interestingly ultimately to allow for far more efficient
interaction with /proc/$pid/maps and friends, but that's another story).

And he _tried_ very hard to implement the ability to partially unwind
these unmap operations, and this was part of 6.12... but in this instance
there's just nothing we can reasonably do, as discussed above.

Now - _had the failure happened on the unmap bit_ - we WOULD be able to
unwind, by simply reattaching the existing mappings (that is - internal to
the kernel - VMAs).

HOWEVER, the failure here happens _after_ the unmap SUCCESSFULLY
completes.

We track the unmapping operation with a 'vms' object, and key to this is a
field 'clear_ptes' which tracks whether we've cleaned up both page table
mappings and actually executed on the unmapping. If true, we haven't done
this yet, if false we have.

We set this false after succeeding at the unmap here:

vms_complete_munmap_vmas()
-> vms_clear_ptes()
-> [ clears down any actual page table mappings ]
-> [ sets vms->clear_ptes = false ]

And then on the new mapping we do:

__mmap_region()
-> __mmap_new_vma()
-> [ error handling ]
-> vms_abort_munmap_vmas()
-> [ sees vms->clear_ptes is false ]
-> clears the range and gives up.

Looking at the code:

static void vms_abort_munmap_vmas(struct vma_munmap_struct *vms,
		struct ma_state *mas_detach)
{
	struct ma_state *mas = &vms->vmi->mas;

	if (!vms->nr_pages)
		return;

	if (vms->clear_ptes)
		return reattach_vmas(mas_detach);

^--- here is where, if the operation failed during unmap, we'd be able to
     reattach the VMAs and go about our business. But this is not the case.

	/*
	 * Aborting cannot just call the vm_ops open() because they are often
	 * not symmetrical and state data has been lost. Resort to the old
	 * failure method of leaving a gap where the MAP_FIXED mapping failed.
	 */
	mas_set_range(mas, vms->start, vms->end - 1);

^--- This comment is the pertinent one - we have no choice but to do so.
     And this is what is happening now.

	mas_store_gfp(mas, NULL, GFP_KERNEL|__GFP_NOFAIL);
	/* Clean up the insertion of the unfortunate gap */
	vms_complete_munmap_vmas(vms, mas_detach);
}

Prior to Liam's changes, there was no attempt to reattach _at all_ and we
just always added this gap. Now we attempt to reattach _if it makes sense
to do so_, but the case you've encountered both previously and currently
will result in the behaviour you've observed.

Hopefully this clarifies things somewhat... :) it's a (surprisingly)
tricky area when it comes to corner case stuff like this.
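
Nothing in the thread prescribes this, but one way a userspace allocator can
cope with the gap described above is to re-establish the PROT_NONE
placeholder itself after a failed MAP_FIXED attempt. A minimal sketch under
stated assumptions: the helper name is hypothetical, the 2M constant matches
the x86-64 default huge page size, and addr is assumed huge-page aligned
(which it is when it comes from a hugetlb reservation).

#include <stdio.h>
#include <sys/mman.h>

#define HPAGE_SIZE (2UL << 20)	/* assumed 2M default huge page size */

/*
 * Try to commit a hugetlb R/W mapping over part of a PROT_NONE
 * reservation. If the MAP_FIXED attempt fails, the kernel may already
 * have punched a hugepage-sized hole in the reservation, so plug it
 * again with an ordinary PROT_NONE MAP_FIXED mapping to keep the
 * address range owned by the allocator.
 */
static void *commit_huge(void *addr, size_t len)
{
	void *p = mmap(addr, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
		       MAP_POPULATE | MAP_FIXED, -1, 0);
	if (p != MAP_FAILED)
		return p;

	/*
	 * Failure: assume nothing about what survived. Re-reserve the
	 * huge-page-aligned range so nothing else can land in the gap.
	 */
	size_t aligned = (len + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1);
	if (mmap(addr, aligned, PROT_NONE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) == MAP_FAILED)
		perror("re-reserving placeholder");
	return MAP_FAILED;
}

int main(void)
{
	size_t reserve = 1UL << 30;

	/* PROT_NONE placeholder reservation, as in the test program above. */
	void *a = mmap(NULL, reserve, PROT_NONE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
		       MAP_NORESERVE, -1, 0);
	if (a == MAP_FAILED) {
		perror("reserve");
		return 1;
	}

	if (commit_huge(a, 4096) == MAP_FAILED)
		fprintf(stderr, "commit failed; placeholder re-established\n");
	return 0;
}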

* Re: [bug report?] unintuitive behavior when mapping over hugepage-backed PROT_NONE regions
From: Jörn Engel @ 2025-02-07 19:35 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Uday Shankar, Muchun Song, Andrew Morton, linux-mm, Liam R. Howlett,
      Vlastimil Babka, Jann Horn, Oscar Salvador

On Fri, Feb 07, 2025 at 01:12:33PM +0000, Lorenzo Stoakes wrote:
>
> So TL;DR is - aggregate operations failing means any or all of the
> operation failed, you can no longer rely on the mapping state being what
> you expected.

Coming back to the "what should the interface be?" question, I can see
three reasonable answers:
1. Failure should result in no change. We have a bug and will fix it.
2. Failure should result in no change. But fixing things is exceedingly
   hard and we may have to live with current reality for a long time.
3. Failure should result in undefined behavior.

I think you convincingly argue against the first answer. It might still
be useful to also argue against the third answer.

For background, I wrote a somewhat weird memory allocator in 2017,
called "big_allocate". Underlying problem is that your favorite malloc
tends to do a reasonable job for small to medium objects, but eventually
gives up and calls mmap()/munmap() for large objects. With a heavily
threaded process, the combination of mmap_sem and TLB shootdown via IPI
is a big performance-killer. Solution is a specialized allocator for
large objects instead of mmap()/munmap().

The original (and still current) design of big_allocate has a mapping
structure somewhat similar to "struct page" in the kernel. It relies on
having a large chunk of virtual memory space that it directly controls,
so that it can have a simple 1:1 mapping between virtual memory and
"struct page".

To get a large chunk of virtual memory space, big_allocate does a
MAP_NONE mmap(). It then later does the MAP_RW mmap() to allocate
memory. Often combined with MAP_HUGETLB, for obvious performance
reasons. (Side note: I wish MAP_RW existed in the headers.)

If memory serves, big_allocate resulted in a 2-3% macrobenchmark
improvement.

Current big_allocate has a number of ugly warts I rather dislike. One
of those warts is that you now have existing users that rely on mmap()
over existing MAP_NONE mappings working. At least with the special set
of conditions we care about.

I have some plans to rewrite big_allocate with a different design. But
for now we have existing code that may make your life harder than you
wished for.

Jörn

--
Without congressional action or a strong judicial precedent, I would
_strongly_ recommend against anyone trusting their private data to a
company with physical ties to the United States.
-- Ladar Levison

* Re: [bug report?] unintuitive behavior when mapping over hugepage-backed PROT_NONE regions
From: Lorenzo Stoakes @ 2025-02-08 16:02 UTC (permalink / raw)
  To: Jörn Engel
  Cc: Uday Shankar, Muchun Song, Andrew Morton, linux-mm, Liam R. Howlett,
      Vlastimil Babka, Jann Horn, Oscar Salvador

On Fri, Feb 07, 2025 at 11:35:40AM -0800, Jörn Engel wrote:
> Coming back to the "what should the interface be?" question, I can see
> three reasonable answers:
> 1. Failure should result in no change. We have a bug and will fix it.
> 2. Failure should result in no change. But fixing things is exceedingly
>    hard and we may have to live with current reality for a long time.
> 3. Failure should result in undefined behavior.
>
> I think you convincingly argue against the first answer. It might still
> be useful to also argue against the third answer.

To be clear, you won't get any kind of undefined behaviour (what that means
wrt the kernel is not entirely clear - but if it were to mean as equivalent
to the compiler sort 'anything might happen' - then no) or incomplete
state.

You simply cannot differentiate (without, at least, further investigation)
between partial success and partial failure of an aggregate operation
vs. total failure of an aggregate operation based on an error code.

> To get a large chunk of virtual memory space, big_allocate does a
> MAP_NONE mmap(). It then later does the MAP_RW mmap() to allocate
> memory. Often combined with MAP_HUGETLB, for obvious performance
> reasons. (Side note: I wish MAP_RW existed in the headers.)
...
> Current big_allocate has a number of ugly warts I rather dislike. One
> of those warts is that you now have existing users that rely on mmap()
> over existing MAP_NONE mappings working. At least with the special set
> of conditions we care about.

I guess you mean PROT_NONE? :) For the case in the thread you would have to
have mapped a hugetlb area over the PROT_NONE one without MAP_NORESERVE and
with insufficiently reserved hugetlb pages, a combination which should be
expected to possibly fail.

If you perform an mprotect() to R/W the range, you will end up with a 'one
and done' operation.

I'd also suggest that hugetlb doesn't seem to fit a malloc library like to
me, as you rely on reserved pages, rather wouldn't it make more sense to
try to allocate memory that gets THP pages? You could MADV_COLLAPSE to try
to make sure...

However, if aligned correctly, we should automagically give you those.
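
A minimal sketch of the mprotect()-based reserve/commit pattern suggested
above, using plain anonymous memory. With MAP_HUGETLB, the reservation and
each committed chunk would additionally need to be multiples of the huge
page size; this is an illustration of the pattern, not how big_allocate
actually works.

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t reserve = 1UL << 30;	/* 1G of address space, no backing yet */

	/* Reserve address space only; PROT_NONE + MAP_NORESERVE commits nothing. */
	char *pool = mmap(NULL, reserve, PROT_NONE,
			  MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
	if (pool == MAP_FAILED) {
		perror("reserve");
		return 1;
	}

	/*
	 * Commit a chunk by changing protections in place. mprotect() never
	 * unmaps anything on failure (though POSIX allows it to have changed
	 * protections on part of the range), unlike a MAP_FIXED mmap() over
	 * the top, which may first unmap the range.
	 */
	size_t chunk = 16UL << 20;
	if (mprotect(pool, chunk, PROT_READ | PROT_WRITE) != 0) {
		perror("commit");
		return 1;
	}

	pool[0] = 1;	/* now usable */
	return 0;
}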

* Re: [bug report?] unintuitive behavior when mapping over hugepage-backed PROT_NONE regions
From: Jörn Engel @ 2025-02-08 17:37 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Uday Shankar, Muchun Song, Andrew Morton, linux-mm, Liam R. Howlett,
      Vlastimil Babka, Jann Horn, Oscar Salvador

On Sat, Feb 08, 2025 at 04:02:47PM +0000, Lorenzo Stoakes wrote:
>
> To be clear, you won't get any kind of undefined behaviour (what that means
> wrt the kernel is not entirely clear - but if it were to mean as equivalent
> to the compiler sort 'anything might happen' - then no) or incomplete state.

Going off on that tangent, I think compiler folks completely abuse
undefined behavior. Historically it simply meant that you might get two
or more different (and well-defined) results, depending on which
specific hardware you ran on. Somehow that was reinterpreted as "I have
license to do whatever I want".

> I guess you mean PROT_NONE? :) For the case in the thread you would have to
> have mapped a hugetlb area over the PROT_NONE one without MAP_NORESERVE and
> with insufficiently reserved hugetlb pages, a combination which should be
> expected to possibly fail.
>
> If you perform an mprotect() to R/W the range, you will end up with a 'one
> and done' operation.
>
> I'd also suggest that hugetlb doesn't seem to fit a malloc library like to
> me, as you rely on reserved pages, rather wouldn't it make more sense to
> try to allocate memory that gets THP pages? You could MADV_COLLAPSE to try
> to make sure...
>
> However, if aligned correctly, we should automagically give you those.

We tried THP around 2012 and rejected it. The latency tail became a lot
longer and fatter. Various things have changed that might make THP less
bad today, but I am not aware of anyone reevaluating it.

I think the problem with THP was the mmap_sem. Given a heavily threaded
process, the mmap_sem tends to be the one dominant lock in the kernel.

A secondary problem might have been the background thread scanning for
pages that can be merged. Not sure about this part. We just disabled
THP and moved on to other problems.

And yes, PROT_NONE/PROT_RW. Sorry!

Jörn

--
Every hour workers spend doing something else is an hour that they
aren't doing their jobs.
-- unknown

* Re: [bug report?] unintuitive behavior when mapping over hugepage-backed PROT_NONE regions
From: Lorenzo Stoakes @ 2025-02-08 17:40 UTC (permalink / raw)
  To: Jörn Engel
  Cc: Uday Shankar, Muchun Song, Andrew Morton, linux-mm, Liam R. Howlett,
      Vlastimil Babka, Jann Horn, Oscar Salvador

On Sat, Feb 08, 2025 at 09:37:34AM -0800, Jörn Engel wrote:
> We tried THP around 2012 and rejected it. The latency tail became a lot
> longer and fatter. Various things have changed that might make THP less
> bad today, but I am not aware of anyone reevaluating it.

A _lot_ has changed. Try it again :)

> I think the problem with THP was the mmap_sem. Given a heavily threaded
> process, the mmap_sem tends to be the one dominant lock in the kernel.

A lot of work has been done on reducing mmap_sem contention. Again, worth
another shot ;)

> A secondary problem might have been the background thread scanning for
> pages that can be merged. Not sure about this part. We just disabled
> THP and moved on to other problems.

This sounds like really ancient problems that no longer exist (as well as
a lot of FUD about THP at the time actually). We tend to proactively go
THP if we can these days.

> And yes, PROT_NONE/PROT_RW. Sorry!

Haha no problem, just to clarify I understood you!

* Re: [bug report?] unintuitive behavior when mapping over hugepage-backed PROT_NONE regions
From: Jörn Engel @ 2025-02-08 17:53 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Uday Shankar, Muchun Song, Andrew Morton, linux-mm, Liam R. Howlett,
      Vlastimil Babka, Jann Horn, Oscar Salvador

On Sat, Feb 08, 2025 at 05:40:41PM +0000, Lorenzo Stoakes wrote:
> > I think the problem with THP was the mmap_sem. Given a heavily threaded
> > process, the mmap_sem tends to be the one dominant lock in the kernel.
>
> A lot of work has been done on reducing mmap_sem contention. Again, worth
> another shot ;)

Probably worth it, agreed. But unlikely to happen in the near term for
various reasons.

One of the biggest improvements was actually userspace. Glibc malloc is
very eager to call mprotect for reasons that imo never made any sense.
Replacing glibc malloc with any sane allocator will significantly reduce
the number of mmap_sem taking system calls.

Another hard-to-avoid problem is all the variations of "ps" that take
the mmap_sem for every process in the system.

Jörn

--
You can't do much carpentry with your bare hands and you can't do much
thinking with your bare brain.
-- unknown

* Re: [bug report?] unintuitive behavior when mapping over hugepage-backed PROT_NONE regions
From: Lorenzo Stoakes @ 2025-02-08 18:00 UTC (permalink / raw)
  To: Jörn Engel
  Cc: Uday Shankar, Muchun Song, Andrew Morton, linux-mm, Liam R. Howlett,
      Vlastimil Babka, Jann Horn, Oscar Salvador

On Sat, Feb 08, 2025 at 09:53:27AM -0800, Jörn Engel wrote:
> One of the biggest improvements was actually userspace. Glibc malloc is
> very eager to call mprotect for reasons that imo never made any sense.
> Replacing glibc malloc with any sane allocator will significantly reduce
> the number of mmap_sem taking system calls.
>
> Another hard-to-avoid problem is all the variations of "ps" that take
> the mmap_sem for every process in the system.

Funny you should mention that ;) we will soon no longer have this problem,
and MAJOR efforts have gone into making this possible under RCU. The 6.12
release was one long battle to get the key series in for this :P

Which is in fact again Liam's aforementioned, a prerequisite for consistent
observation of VMAs in /proc/$pid/[s]maps[_rollup].

* Re: [bug report?] unintuitive behavior when mapping over hugepage-backed PROT_NONE regions
From: Jörn Engel @ 2025-02-08 21:16 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Uday Shankar, Muchun Song, Andrew Morton, linux-mm, Liam R. Howlett,
      Vlastimil Babka, Jann Horn, Oscar Salvador

On Sat, Feb 08, 2025 at 06:00:39PM +0000, Lorenzo Stoakes wrote:
> Funny you should mention that ;) we will soon no longer have this problem,
> and MAJOR efforts have gone into making this possible under RCU. The 6.12
> release was one long battle to get the key series in for this :P
>
> Which is in fact again Liam's aforementioned, a prerequisite for consistent
> observation of VMAs in /proc/$pid/[s]maps[_rollup].

You are my hero, Liam! I looked at the issue some time ago and decided
it was too hard to sort out the mess.

Jörn

--
It's not whether you win or lose, it's how you place the blame.
-- unknown