linux-mm.kvack.org archive mirror
* Possible regression with file madvise(MADV_COLLAPSE)
@ 2024-10-09 15:54 Avi Kivity
  2024-10-11  6:32 ` Gavin Shan
  2024-10-11 22:29 ` Yang Shi
  0 siblings, 2 replies; 10+ messages in thread
From: Avi Kivity @ 2024-10-09 15:54 UTC (permalink / raw)
  To: linux-mm

On Linux 6.10.10 with CONFIG_READ_ONLY_THP_FOR_FS=y,
madvise(MADV_COLLAPSE) on program text fails with EINVAL.

To reproduce, compile the reproducer with

clang -g -o text-hugepage text-hugepage.c \
        -fuse-ld=lld \
        -Wl,-zcommon-page-size=2097152 -Wl,-zmax-page-size=2097152 \
        -Wl,-z,separate-loadable-segments

and run:

$ strace -e trace=madvise ./text-hugepage
madvise(0x400000, 2097152, MADV_HUGEPAGE) = 0
madvise(0x400000, 2097152, MADV_POPULATE_READ) = 0
madvise(0x400000, 2097152, MADV_COLLAPSE) = -1 EINVAL (Invalid argument)

(the funky linker options are needed to make sure the .text vma spans a
hugepage).


I say "possible regression" since I haven't tried it with an older
kernel, but I believe it worked at some point or other seeing that
others managed to get it to work.

==== text-hugepage.c ====
#include <stdlib.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#include <sys/mman.h>

static
void
try_remap_text_segment() {
    FILE *fp = fopen("/proc/self/maps", "r");
    if (!fp) {
        return;
    }
    char *buf = NULL;
    size_t n;
    while (getline(&buf, &n, fp) >= 0) {
        char *lstart = buf;
        char *lmid = strchr(lstart, '-');
        if (!lmid) {
            continue;
        }
        *lmid++ = '\0';
        char *lend = strchr(lmid, ' ');
        if (!lend) {
            continue;
        }
        *lend = '\0';
        
        size_t start = strtoul(lstart, NULL, 16);
        size_t end = strtoul(lmid, NULL, 16);
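        uintptr_t some_text_addr = (uintptr_t)&try_remap_text_segment;
        // Only act on the mapping that contains this function, i.e. .text
        if (some_text_addr >= start && some_text_addr < end) {
            // Trim the end down to a 2 MiB boundary; start is assumed to be
            // aligned already thanks to the linker options
            end &= ~(uintptr_t)0x1fffff;
            madvise((void*)start, end - start, MADV_HUGEPAGE);
            madvise((void*)start, end - start, MADV_POPULATE_READ);
            madvise((void*)start, end - start, MADV_COLLAPSE);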
            break;
        }
    }
    free(buf);
    fclose(fp);
}

void
huge_function() {
    // Make sure .text has a huge page full of stuff
    asm volatile (".fill 4000000, 1, 0x90");
}

int
main() {
    try_remap_text_segment();
}
==== end text-hugepage.c ====
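
The reproducer above rounds only the end of the mapping down to a 2 MiB
boundary and relies on the linker options to place the start on one. A
minimal sketch of aligning both ends, assuming a 2 MiB PMD size (HPAGE_SIZE
and the helper names are illustrative and not part of the reproducer):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: PMD-sized huge pages are 2 MiB (x86-64, arm64 with 4 KiB pages). */
#define HPAGE_SIZE ((uintptr_t)2 << 20)

/* Round addr up to the next multiple of HPAGE_SIZE (no overflow check). */
static uintptr_t hpage_align_up(uintptr_t addr) {
    return (addr + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1);
}

/* Round addr down to a multiple of HPAGE_SIZE. */
static uintptr_t hpage_align_down(uintptr_t addr) {
    return addr & ~(HPAGE_SIZE - 1);
}

/*
 * Shrink [start, end) to its largest huge-page-aligned subrange.
 * Stores the subrange start in *aligned_start and returns its length
 * in bytes, or 0 if no full huge page fits.
 */
static size_t hpage_subrange(uintptr_t start, uintptr_t end,
                             uintptr_t *aligned_start) {
    assert(start <= end);
    uintptr_t s = hpage_align_up(start);
    uintptr_t e = hpage_align_down(end);
    *aligned_start = s;
    return e > s ? (size_t)(e - s) : 0;
}
```

With this, the madvise() calls would be issued on the returned subrange only
when its length is non-zero, so a text mapping that does not start on a
huge page boundary is skipped or trimmed instead of being passed through as-is.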



* Re: Possible regression with file madvise(MADV_COLLAPSE)
  2024-10-09 15:54 Possible regression with file madvise(MADV_COLLAPSE) Avi Kivity
@ 2024-10-11  6:32 ` Gavin Shan
  2024-10-11 22:29 ` Yang Shi
  1 sibling, 0 replies; 10+ messages in thread
From: Gavin Shan @ 2024-10-11  6:32 UTC (permalink / raw)
  To: Avi Kivity, linux-mm

Hi Avi,

On 10/10/24 1:54 AM, Avi Kivity wrote:
> On Linux 6.10.10 with CONFIG_READ_ONLY_THP_FOR_FS=y,
> madvise(MADV_COLLAPSE) on  program text fails with EINVAL.
> 
> To reproduce, compile the reproducer with
> 
> clang -g -o text-hugepage  text-hugepage.c \
> 	-fuse-ld=lld \
> 	-Wl,-zcommon-page-size=2097152 -Wl,-zmax-page-size=2097152 \
>          -Wl,-z,separate-loadable-segments
> 
> and run:
> 
> $ strace -e trace=madvise ./text-hugepage
> madvise(0x400000, 2097152, MADV_HUGEPAGE) = 0
> madvise(0x400000, 2097152, MADV_POPULATE_READ) = 0
> madvise(0x400000, 2097152, MADV_COLLAPSE) = -1 EINVAL (Invalid
> argument)
> 
> (the funky linker options are needed to make sure the .text vma spans a
> hugepage).
> 
> 
> I say "possible regression" since I haven't tried it with an older
> kernel, but I believe it worked at some point or other seeing that
> others managed to get it to work.
> 
> ==== text-hugepage.c ====
> #include <stdlib.h>
> #include <stdint.h>
> #include <stdio.h>
> #include <string.h>
> 
> #include <sys/mman.h>
> 
> static
> void
> try_remap_text_segment() {
>      FILE *fp = fopen("/proc/self/maps", "r");
>      if (!fp) {
>          return;
>      }
>      char *buf = NULL;
>      size_t n;
>      while (getline(&buf, &n, fp) >= 0) {
>          char *lstart = buf;
>          char *lmid = strchr(lstart, '-');
>          if (!lmid) {
>              continue;
>          }
>          *lmid++ = '\0';
>          char *lend = strchr(lmid, ' ');
>          if (!lend) {
>              continue;
>          }
>          *lend = '\0';
>          
>          size_t start = strtoul(lstart, NULL, 16);
>          size_t end = strtoul(lmid, NULL, 16);
>          uintptr_t some_text_addr = (uintptr_t)&try_remap_text_segment;
>          if (some_text_addr >= start && some_text_addr < end) {
>              end &= ~(uintptr_t)0x1fffff;
>              madvise((void*)start, end - start, MADV_HUGEPAGE);
>              madvise((void*)start, end - start, MADV_POPULATE_READ);
>              madvise((void*)start, end - start, MADV_COLLAPSE);
>              break;
>          }
>      }
>      free(buf);
>      fclose(fp);
> }
> 
> void
> huge_function() {
>      // Make sure .text is has a huge page full of stuff
>      asm volatile (".fill 4000000, 1, 0x90");
> }
> 
> int
> main() {
>      try_remap_text_segment();
> }
> ==== end text-hugepage.c ====
> 

I can reproduce the issue with the upstream kernel (v6.12-rc2) on ARM64 with a
4KB base page size. I looked into it because of commit d659b715e94a
("mm/huge_memory: avoid PMD-size page cache if needed"), which enforces -EINVAL
for madvise(MADV_COLLAPSE) on ARM64 when the base page size is 64KB.

To reproduce the issue, I have to drop the clean page cache and recompile
the test program every time.

[root@dhcp-10-26-1-237 issue]# cat Makefile
default:
	@echo 1 > /proc/sys/vm/drop_caches
	@gcc test.c -o test
	./test
[root@dhcp-10-26-1-237 issue]# make
./test
test: test.c:54: try_remap_text_segment: Assertion `ret == 0' failed.      <<< Error from madvise(MADV_COLLAPSE)
make: *** [Makefile:4: default] Aborted (core dumped)

I traced it a bit and found that SCAN_FAIL is returned, as the following call
trace indicates. However, the program ("test") is opened read-only, so I don't
understand how PG_dirty gets set.

Backtrace
=========
sys_madvise
   do_madvise
     madvise_behavior_valid
     madvise_walk_vmas
       madvise_vma_behavior
         can_modify_vma_madv
         madvise_collapse
           thp_vma_allowable_order
           hpage_collapse_scan_file
             collapse_file
               folio_test_dirty          # SCAN_FAIL returned here

Snapshot of /proc/`pidof test`/smaps before calling madvise(MADV_COLLAPSE):

[root@dhcp-10-26-1-237 issue]# cat /proc/`pidof test`/smaps | head -n 25
00400000-00600000 r-xp 00000000 fd:05 101812754                          /home/gavin/sandbox/issue/test
Size:               2048 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                2048 kB
Pss:                2048 kB
Pss_Dirty:             0 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:      2048 kB
Private_Dirty:         0 kB
Referenced:         2048 kB
Anonymous:             0 kB
KSM:                   0 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:           1
VmFlags: rd ex mr mw me hg
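
In the snapshot above FilePmdMapped is 0 kB. A successful collapse of the
file-backed text should show up as a non-zero value there, so the field can
serve as a quick check. A minimal sketch of extracting one smaps field from a
line (smaps_field_kb is a hypothetical helper name, not an existing API):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse one smaps line such as "FilePmdMapped:      2048 kB".
 * Returns the value in kB, or -1 if the line holds a different field
 * or cannot be parsed.
 */
static long smaps_field_kb(const char *line, const char *field) {
    assert(line != NULL && field != NULL);
    size_t n = strlen(field);
    long kb;

    /* Require an exact field name followed by ':' ("Pss" must not match "Pss_Dirty"). */
    if (strncmp(line, field, n) != 0 || line[n] != ':')
        return -1;
    if (sscanf(line + n + 1, "%ld", &kb) != 1)
        return -1;
    return kb;
}
```

A caller would read /proc/self/smaps line by line (for example with getline())
after the madvise() calls and pass each line of the text mapping to this helper.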

Thanks,
Gavin




* Re: Possible regression with file madvise(MADV_COLLAPSE)
  2024-10-09 15:54 Possible regression with file madvise(MADV_COLLAPSE) Avi Kivity
  2024-10-11  6:32 ` Gavin Shan
@ 2024-10-11 22:29 ` Yang Shi
  2024-10-12 15:38   ` Avi Kivity
  1 sibling, 1 reply; 10+ messages in thread
From: Yang Shi @ 2024-10-11 22:29 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linux-mm

On Wed, Oct 9, 2024 at 9:04 AM Avi Kivity <avi@scylladb.com> wrote:
>
> On Linux 6.10.10 with CONFIG_READ_ONLY_THP_FOR_FS=y,
> madvise(MADV_COLLAPSE) on  program text fails with EINVAL.
>
> To reproduce, compile the reproducer with
>
> clang -g -o text-hugepage  text-hugepage.c \
>         -fuse-ld=lld \
>         -Wl,-zcommon-page-size=2097152 -Wl,-zmax-page-size=2097152 \
>         -Wl,-z,separate-loadable-segments
>
> and run:

Didn't clang make the page cache dirty?

Having sync between clang and the execution made the problem go away for me.
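
If dirty page cache left behind by the compiler is the trigger, flushing just
the freshly built binary should be enough and avoids a system-wide sync. A
minimal sketch, assuming fsync() on a read-only descriptor is acceptable on
the filesystem in use (flush_file is a hypothetical helper name):

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Write back the dirty page cache pages of a single file.
 * Returns 0 on success, -1 on failure with errno set.
 */
static int flush_file(const char *path) {
    assert(path != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    int ret = fsync(fd);
    int saved_errno = errno;   /* close() may overwrite errno */
    close(fd);
    errno = saved_errno;
    return ret;
}
```

Calling flush_file("text-hugepage") between the build and the run plays the
same role as the sync above, limited to that one file.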

>
> $ strace -e trace=madvise ./text-hugepage
> madvise(0x400000, 2097152, MADV_HUGEPAGE) = 0
> madvise(0x400000, 2097152, MADV_POPULATE_READ) = 0
> madvise(0x400000, 2097152, MADV_COLLAPSE) = -1 EINVAL (Invalid
> argument)
>
> (the funky linker options are needed to make sure the .text vma spans a
> hugepage).
>
>
> I say "possible regression" since I haven't tried it with an older
> kernel, but I believe it worked at some point or other seeing that
> others managed to get it to work.
>
> ==== text-hugepage.c ====
> #include <stdlib.h>
> #include <stdint.h>
> #include <stdio.h>
> #include <string.h>
>
> #include <sys/mman.h>
>
> static
> void
> try_remap_text_segment() {
>     FILE *fp = fopen("/proc/self/maps", "r");
>     if (!fp) {
>         return;
>     }
>     char *buf = NULL;
>     size_t n;
>     while (getline(&buf, &n, fp) >= 0) {
>         char *lstart = buf;
>         char *lmid = strchr(lstart, '-');
>         if (!lmid) {
>             continue;
>         }
>         *lmid++ = '\0';
>         char *lend = strchr(lmid, ' ');
>         if (!lend) {
>             continue;
>         }
>         *lend = '\0';
>
>         size_t start = strtoul(lstart, NULL, 16);
>         size_t end = strtoul(lmid, NULL, 16);
>         uintptr_t some_text_addr = (uintptr_t)&try_remap_text_segment;
>         if (some_text_addr >= start && some_text_addr < end) {
>             end &= ~(uintptr_t)0x1fffff;
>             madvise((void*)start, end - start, MADV_HUGEPAGE);
>             madvise((void*)start, end - start, MADV_POPULATE_READ);
>             madvise((void*)start, end - start, MADV_COLLAPSE);
>             break;
>         }
>     }
>     free(buf);
>     fclose(fp);
> }
>
> void
> huge_function() {
>     // Make sure .text is has a huge page full of stuff
>     asm volatile (".fill 4000000, 1, 0x90");
> }
>
> int
> main() {
>     try_remap_text_segment();
> }
> ==== end text-hugepage.c ====
>



* Re: Possible regression with file madvise(MADV_COLLAPSE)
  2024-10-11 22:29 ` Yang Shi
@ 2024-10-12 15:38   ` Avi Kivity
  2024-10-12 20:05     ` Yang Shi
  0 siblings, 1 reply; 10+ messages in thread
From: Avi Kivity @ 2024-10-12 15:38 UTC (permalink / raw)
  To: Yang Shi; +Cc: linux-mm

On Fri, 2024-10-11 at 15:29 -0700, Yang Shi wrote:
> On Wed, Oct 9, 2024 at 9:04 AM Avi Kivity <avi@scylladb.com> wrote:
> > 
> > On Linux 6.10.10 with CONFIG_READ_ONLY_THP_FOR_FS=y,
> > madvise(MADV_COLLAPSE) on  program text fails with EINVAL.
> > 
> > To reproduce, compile the reproducer with
> > 
> > clang -g -o text-hugepage  text-hugepage.c \
> >         -fuse-ld=lld \
> >         -Wl,-zcommon-page-size=2097152 -Wl,-zmax-page-size=2097152
> > \
> >         -Wl,-z,separate-loadable-segments
> > 
> > and run:
> 
> Didn't clang make the page cache dirty?
> 
> Having sync between clang and the execution made the problem go away
> for me.
> 

I see it even with sync (and msync just before the madvise calls).


Tracing shows this (last lines before syscall exit):

|          hpage_collapse_scan_file() {
|            __rcu_read_lock();
|            __rcu_read_unlock();
|          }


so, it's not clear what the root cause is.




* Re: Possible regression with file madvise(MADV_COLLAPSE)
  2024-10-12 15:38   ` Avi Kivity
@ 2024-10-12 20:05     ` Yang Shi
  2024-10-12 20:24       ` Avi Kivity
  0 siblings, 1 reply; 10+ messages in thread
From: Yang Shi @ 2024-10-12 20:05 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linux-mm

On Sat, Oct 12, 2024 at 8:38 AM Avi Kivity <avi@scylladb.com> wrote:
>
> On Fri, 2024-10-11 at 15:29 -0700, Yang Shi wrote:
> > On Wed, Oct 9, 2024 at 9:04 AM Avi Kivity <avi@scylladb.com> wrote:
> > >
> > > On Linux 6.10.10 with CONFIG_READ_ONLY_THP_FOR_FS=y,
> > > madvise(MADV_COLLAPSE) on  program text fails with EINVAL.
> > >
> > > To reproduce, compile the reproducer with
> > >
> > > clang -g -o text-hugepage  text-hugepage.c \
> > >         -fuse-ld=lld \
> > >         -Wl,-zcommon-page-size=2097152 -Wl,-zmax-page-size=2097152
> > > \
> > >         -Wl,-z,separate-loadable-segments
> > >
> > > and run:
> >
> > Didn't clang make the page cache dirty?
> >
> > Having sync between clang and the execution made the problem go away
> > for me.
> >
>
> I see it even with sync (and msync just before the madvise calls).

Did you stop khugepaged? It may race with MADV_COLLAPSE. If it failed
due to a race with khugepaged, you should see -EAGAIN instead of
-EINVAL.
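
Since a race would surface as EAGAIN, a caller can retry on that error only
and treat EINVAL as a hard failure. A minimal sketch; the function-pointer
type, the helper name and the flaky_advise test double are all hypothetical
and exist only so the retry logic can be exercised without a kernel that
supports MADV_COLLAPSE:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Same shape as madvise(2), so madvise itself or a test double can be passed. */
typedef int (*advise_fn)(void *addr, size_t len, int advice);

/*
 * Call fn up to max_tries times, retrying only while it fails with EAGAIN.
 * Returns 0 on success, or -1 with errno as left by the last attempt.
 */
static int advise_retry_eagain(advise_fn fn, void *addr, size_t len,
                               int advice, int max_tries) {
    assert(fn != NULL && max_tries > 0);
    int ret = -1;
    for (int i = 0; i < max_tries; i++) {
        ret = fn(addr, len, advice);
        if (ret == 0 || errno != EAGAIN)
            break;              /* success, or an error not worth retrying */
    }
    return ret;
}

/* Test double: fails with EAGAIN while eagain_budget is positive. */
static int eagain_budget = 0;

static int flaky_advise(void *addr, size_t len, int advice) {
    (void)addr; (void)len; (void)advice;
    if (eagain_budget > 0) {
        eagain_budget--;
        errno = EAGAIN;
        return -1;
    }
    return 0;
}
```

In the reproducer this would be used as
advise_retry_eagain(madvise, (void*)start, end - start, MADV_COLLAPSE, 3),
with <sys/mman.h> included; the retry count of 3 is arbitrary.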

I ran the commands below in a loop 1000 times and it never failed (I
modified the test program slightly to print a message if MADV_COLLAPSE
fails). I had khugepaged stopped and ran the test on a v6.12-rc1 kernel
on my AmpereOne machine.

rm text-hugepage
clang -g -o text-hugepage text-hugepage.c -fuse-ld=lld \
        -Wl,-zcommon-page-size=2097152 -Wl,-zmax-page-size=2097152 \
        -Wl,-z,separate-loadable-segments
sync
./text-hugepage

>
>
> Tracing shows this (last lines before syscall exit):
>
> |          hpage_collapse_scan_file() {
> |            __rcu_read_lock();
> |            __rcu_read_unlock();
> |          }

That means collapse_file() was not called at all;
hpage_collapse_scan_file() itself failed. Several things can make it fail,
for example an unexpected refcount or a page that is not on the LRU. You can
trace huge_memory:mm_khugepaged_scan_file to get more information about the
failure.

>
>
> so, it's not clear what the root cause is.
>



* Re: Possible regression with file madvise(MADV_COLLAPSE)
  2024-10-12 20:05     ` Yang Shi
@ 2024-10-12 20:24       ` Avi Kivity
  2024-10-12 23:50         ` Yang Shi
  0 siblings, 1 reply; 10+ messages in thread
From: Avi Kivity @ 2024-10-12 20:24 UTC (permalink / raw)
  To: Yang Shi; +Cc: linux-mm

On Sat, 2024-10-12 at 13:05 -0700, Yang Shi wrote:
> On Sat, Oct 12, 2024 at 8:38 AM Avi Kivity <avi@scylladb.com> wrote:
> > 
> > On Fri, 2024-10-11 at 15:29 -0700, Yang Shi wrote:
> > > On Wed, Oct 9, 2024 at 9:04 AM Avi Kivity <avi@scylladb.com>
> > > wrote:
> > > > 
> > > > On Linux 6.10.10 with CONFIG_READ_ONLY_THP_FOR_FS=y,
> > > > madvise(MADV_COLLAPSE) on  program text fails with EINVAL.
> > > > 
> > > > To reproduce, compile the reproducer with
> > > > 
> > > > clang -g -o text-hugepage  text-hugepage.c \
> > > >         -fuse-ld=lld \
> > > >         -Wl,-zcommon-page-size=2097152 -Wl,-zmax-page-
> > > > size=2097152
> > > > \
> > > >         -Wl,-z,separate-loadable-segments
> > > > 
> > > > and run:
> > > 
> > > Didn't clang make the page cache dirty?
> > > 
> > > Having sync between clang and the execution made the problem go
> > > away
> > > for me.
> > > 
> > 
> > I see it even with sync (and msync just before the madvise calls).
> 
> Did you stop khugepaged? It may race with MADV_COLLAPSE. If it failed
> due to race with khugepaged, you should see -EAGAIN instead of
> -EINVAL.


I did not, but I don't imagine I hit the race in all my attempts.

> 
> I did the below commands in a loop for 1000 times, it never failed (I
> modified the test program a little bit to print out failure if
> MADV_COLLAPSE returns failure). I had khugepaged stopped and ran the
> test on v6.12-rc1 kernel on my AmpereOne machine.
> 
> rm text-hugepage
> clang -g -o text-hugepage  text-hugepage.c -fuse-ld=lld
> -Wl,-zcommon-page-size=2097152 -Wl,-zmax-page-size=2097152
> -Wl,-z,separate-loadable-segments
> sync
> ./text-hugepage
> 
> > 
> > 
> > Tracing shows this (last lines before syscall exit):
> > 
> > >          hpage_collapse_scan_file() {
> > >            __rcu_read_lock();
> > >            __rcu_read_unlock();
> > >          }
> 
> It meant collapse_file() was not called at all.
> hpage_collapse_scan_file() failed. A couple of reasons may fail it,
> for example, refcount is not expected, not on lru, etc. You can trace
> huge_memory:mm_khugepaged_scan_file to get more information about the
> failure.


   text-hugepage-689146 [023] 200457.073794: mm_khugepaged_scan_file: mm=0xffff92fc512aac00, scan_pfn=0x5a4310, filename=text-hugepage, present=0, swap=0, result=page_compound
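
The interesting part of that event is the result= field. When collecting many
such events it can help to pull the field out programmatically. A minimal
sketch, assuming the key=value, comma-separated layout shown above
(trace_field is a hypothetical helper name):

```c
#include <assert.h>
#include <string.h>

/*
 * Copy the value of "key=" from a trace line into out (NUL-terminated).
 * A value ends at ',', whitespace, or the end of the string.
 * Returns 0 on success, -1 if the key is absent or out is too small.
 */
static int trace_field(const char *line, const char *key,
                       char *out, size_t outsz) {
    assert(line != NULL && key != NULL && out != NULL);
    size_t klen = strlen(key);
    const char *p = line;

    /* Find key at a field boundary and directly followed by '='. */
    while ((p = strstr(p, key)) != NULL) {
        int at_boundary = (p == line) || p[-1] == ' ' || p[-1] == ',';
        if (at_boundary && p[klen] == '=')
            break;
        p++;
    }
    if (p == NULL)
        return -1;

    p += klen + 1;                       /* skip "key=" */
    size_t n = strcspn(p, ", \t\n");     /* length of the value */
    if (n + 1 > outsz)
        return -1;
    memcpy(out, p, n);
    out[n] = '\0';
    return 0;
}
```

Applied to the line above with key "result", this yields "page_compound".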





* Re: Possible regression with file madvise(MADV_COLLAPSE)
  2024-10-12 20:24       ` Avi Kivity
@ 2024-10-12 23:50         ` Yang Shi
  2024-10-13 11:04           ` Avi Kivity
  0 siblings, 1 reply; 10+ messages in thread
From: Yang Shi @ 2024-10-12 23:50 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linux-mm, Baolin Wang

On Sat, Oct 12, 2024 at 1:24 PM Avi Kivity <avi@scylladb.com> wrote:
>
> On Sat, 2024-10-12 at 13:05 -0700, Yang Shi wrote:
> > On Sat, Oct 12, 2024 at 8:38 AM Avi Kivity <avi@scylladb.com> wrote:
> > >
> > > On Fri, 2024-10-11 at 15:29 -0700, Yang Shi wrote:
> > > > On Wed, Oct 9, 2024 at 9:04 AM Avi Kivity <avi@scylladb.com>
> > > > wrote:
> > > > >
> > > > > On Linux 6.10.10 with CONFIG_READ_ONLY_THP_FOR_FS=y,
> > > > > madvise(MADV_COLLAPSE) on  program text fails with EINVAL.
> > > > >
> > > > > To reproduce, compile the reproducer with
> > > > >
> > > > > clang -g -o text-hugepage  text-hugepage.c \
> > > > >         -fuse-ld=lld \
> > > > >         -Wl,-zcommon-page-size=2097152 -Wl,-zmax-page-
> > > > > size=2097152
> > > > > \
> > > > >         -Wl,-z,separate-loadable-segments
> > > > >
> > > > > and run:
> > > >
> > > > Didn't clang make the page cache dirty?
> > > >
> > > > Having sync between clang and the execution made the problem go
> > > > away
> > > > for me.
> > > >
> > >
> > > I see it even with sync (and msync just before the madvise calls).
> >
> > Did you stop khugepaged? It may race with MADV_COLLAPSE. If it failed
> > due to race with khugepaged, you should see -EAGAIN instead of
> > -EINVAL.
>
>
> I did not, but I don't imagine I hit the race in all my attempts.
>
> >
> > I did the below commands in a loop for 1000 times, it never failed (I
> > modified the test program a little bit to print out failure if
> > MADV_COLLAPSE returns failure). I had khugepaged stopped and ran the
> > test on v6.12-rc1 kernel on my AmpereOne machine.
> >
> > rm text-hugepage
> > clang -g -o text-hugepage  text-hugepage.c -fuse-ld=lld
> > -Wl,-zcommon-page-size=2097152 -Wl,-zmax-page-size=2097152
> > -Wl,-z,separate-loadable-segments
> > sync
> > ./text-hugepage
> >
> > >
> > >
> > > Tracing shows this (last lines before syscall exit):
> > >
> > > >          hpage_collapse_scan_file() {
> > > >            __rcu_read_lock();
> > > >            __rcu_read_unlock();
> > > >          }
> >
> > It meant collapse_file() was not called at all.
> > hpage_collapse_scan_file() failed. A couple of reasons may fail it,
> > for example, refcount is not expected, not on lru, etc. You can trace
> > huge_memory:mm_khugepaged_scan_file to get more information about the
> > failure.
>
>
>    text-hugepage-689146 [023] 200457.073794: mm_khugepaged_scan_file:
> mm=0xffff92fc512aac00, scan_pfn=0x5a4310, filename=text-hugepage,
> present=0, swap=0, result=page_compound

Aha, it is because v6.10 doesn't support collapsing non-PMD-order large
folios. This has been fixed in v6.12-rc1. The patch series is:
https://lore.kernel.org/all/cover.1724140601.git.baolin.wang@linux.alibaba.com/

The subject says "shmem", but it actually works for regular files too.
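
A program that wants to skip the MADV_COLLAPSE attempt on kernels without this
series could compare the running release against 6.12. This is only a
heuristic sketch: distribution kernels may backport the series, and
release_at_least is a hypothetical helper name.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Compare a kernel release string such as "6.10.10" or "6.12.0-rc2"
 * against major.minor.
 * Returns 1 if release >= major.minor, 0 if older, -1 if unparsable.
 */
static int release_at_least(const char *release, int major, int minor) {
    assert(release != NULL);
    int maj, min;

    if (sscanf(release, "%d.%d", &maj, &min) != 2)
        return -1;
    if (maj != major)
        return maj > major;
    return min >= minor;
}
```

The release string would typically come from the release member filled in by
uname(2) from <sys/utsname.h>.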

>
>



* Re: Possible regression with file madvise(MADV_COLLAPSE)
  2024-10-12 23:50         ` Yang Shi
@ 2024-10-13 11:04           ` Avi Kivity
  2024-10-13 13:25             ` Avi Kivity
  0 siblings, 1 reply; 10+ messages in thread
From: Avi Kivity @ 2024-10-13 11:04 UTC (permalink / raw)
  To: Yang Shi; +Cc: linux-mm, Baolin Wang

On Sat, 2024-10-12 at 16:50 -0700, Yang Shi wrote:
> On Sat, Oct 12, 2024 at 1:24 PM Avi Kivity <avi@scylladb.com> wrote:
> > 
> > On Sat, 2024-10-12 at 13:05 -0700, Yang Shi wrote:
> > > On Sat, Oct 12, 2024 at 8:38 AM Avi Kivity <avi@scylladb.com>
> > > wrote:
> > > > 
> > > > On Fri, 2024-10-11 at 15:29 -0700, Yang Shi wrote:
> > > > > On Wed, Oct 9, 2024 at 9:04 AM Avi Kivity <avi@scylladb.com>
> > > > > wrote:
> > > > > > 
> > > > > > On Linux 6.10.10 with CONFIG_READ_ONLY_THP_FOR_FS=y,
> > > > > > madvise(MADV_COLLAPSE) on  program text fails with EINVAL.
> > > > > > 
> > > > > > To reproduce, compile the reproducer with
> > > > > > 
> > > > > > clang -g -o text-hugepage  text-hugepage.c \
> > > > > >         -fuse-ld=lld \
> > > > > >         -Wl,-zcommon-page-size=2097152 -Wl,-zmax-page-
> > > > > > size=2097152
> > > > > > \
> > > > > >         -Wl,-z,separate-loadable-segments
> > > > > > 
> > > > > > and run:
> > > > > 
> > > > > Didn't clang make the page cache dirty?
> > > > > 
> > > > > Having sync between clang and the execution made the problem
> > > > > go
> > > > > away
> > > > > for me.
> > > > > 
> > > > 
> > > > I see it even with sync (and msync just before the madvise
> > > > calls).
> > > 
> > > Did you stop khugepaged? It may race with MADV_COLLAPSE. If it
> > > failed
> > > due to race with khugepaged, you should see -EAGAIN instead of
> > > -EINVAL.
> > 
> > 
> > I did not, but I don't imagine I hit the race in all my attempts.
> > 
> > > 
> > > I did the below commands in a loop for 1000 times, it never
> > > failed (I
> > > modified the test program a little bit to print out failure if
> > > MADV_COLLAPSE returns failure). I had khugepaged stopped and ran
> > > the
> > > test on v6.12-rc1 kernel on my AmpereOne machine.
> > > 
> > > rm text-hugepage
> > > clang -g -o text-hugepage  text-hugepage.c -fuse-ld=lld
> > > -Wl,-zcommon-page-size=2097152 -Wl,-zmax-page-size=2097152
> > > -Wl,-z,separate-loadable-segments
> > > sync
> > > ./text-hugepage
> > > 
> > > > 
> > > > 
> > > > Tracing shows this (last lines before syscall exit):
> > > > 
> > > > >          hpage_collapse_scan_file() {
> > > > >            __rcu_read_lock();
> > > > >            __rcu_read_unlock();
> > > > >          }
> > > 
> > > It meant collapse_file() was not called at all.
> > > hpage_collapse_scan_file() failed. A couple of reasons may fail
> > > it,
> > > for example, refcount is not expected, not on lru, etc. You can
> > > trace
> > > huge_memory:mm_khugepaged_scan_file to get more information about
> > > the
> > > failure.
> > 
> > 
> >    text-hugepage-689146 [023] 200457.073794:
> > mm_khugepaged_scan_file:
> > mm=0xffff92fc512aac00, scan_pfn=0x5a4310, filename=text-hugepage,
> > present=0, swap=0, result=page_compound
> 
> Aha, it is because v6.10 doesn't support collapse non-PMD order large
> folios. It has been fixed in v6.12-rc1. The patch series is:
> https://lore.kernel.org/all/cover.1724140601.git.baolin.wang@linux.al
> ibaba.com/
> 
> The subject says "shmem", but it actually works for regular files
> too.



Thanks a lot. I will retest when 6.12 reaches Fedora testing.




* Re: Possible regression with file madvise(MADV_COLLAPSE)
  2024-10-13 11:04           ` Avi Kivity
@ 2024-10-13 13:25             ` Avi Kivity
  2024-10-14 22:06               ` Yang Shi
  0 siblings, 1 reply; 10+ messages in thread
From: Avi Kivity @ 2024-10-13 13:25 UTC (permalink / raw)
  To: Yang Shi; +Cc: linux-mm, Baolin Wang

On Sun, 2024-10-13 at 14:04 +0300, Avi Kivity wrote:
> > > 
> > >    text-hugepage-689146 [023] 200457.073794:
> > > mm_khugepaged_scan_file:
> > > mm=0xffff92fc512aac00, scan_pfn=0x5a4310, filename=text-hugepage,
> > > present=0, swap=0, result=page_compound
> > 
> > Aha, it is because v6.10 doesn't support collapse non-PMD order
> > large
> > folios. It has been fixed in v6.12-rc1. The patch series is:
> > https://lore.kernel.org/all/cover.1724140601.git.baolin.wang@linux.
> > al
> > ibaba.com/
> > 
> > The subject says "shmem", but it actually works for regular files
> > too.
> 
> 
> 
> Thanks a lot. I will retest when 6.12 reaches Fedora testing.
> 

It is available, so I retested it (6.12-rc2). I confirm it works (and
delivers a nice performance improvement).




* Re: Possible regression with file madvise(MADV_COLLAPSE)
  2024-10-13 13:25             ` Avi Kivity
@ 2024-10-14 22:06               ` Yang Shi
  0 siblings, 0 replies; 10+ messages in thread
From: Yang Shi @ 2024-10-14 22:06 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linux-mm, Baolin Wang

> >
> >
> >
> > Thanks a lot. I will retest when 6.12 reaches Fedora testing.
> >
>
> It is available, so I retested it (6.12-rc2). I confirm it works (and
> delivers a nice performance improvement).

Sounds perfect.

>


