linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* BUG: MADV_COLLAPSE doesn't work for XFS files
@ 2023-09-28  9:55 Ryan Roberts
  2023-09-28 10:54 ` Bagas Sanjaya
  0 siblings, 1 reply; 7+ messages in thread
From: Ryan Roberts @ 2023-09-28  9:55 UTC (permalink / raw)
  To: Hugh Dickins, David Hildenbrand, Matthew Wilcox, zokeefe; +Cc: Linux-MM

Hi all,

I've just noticed that when applied to a file mapping for a file on xfs, MADV_COLLAPSE returns EINVAL. The same test case works fine if the file is on ext4. 

I think the root cause is that the implementation bails out if it finds a (non-PMD-sized) large folio in the page cache for any part of the file covered by the region. XFS does readahead into large folios so we hit this issue. See khugepaged.h:collapse_file():

		if (PageTransCompound(page)) {
			struct page *head = compound_head(page);

			result = compound_order(head) == HPAGE_PMD_ORDER &&
					head->index == start
					/* Maybe PMD-mapped */
					? SCAN_PTE_MAPPED_HUGEPAGE
					: SCAN_PAGE_COMPOUND;
			goto out_unlock;
		}

I'm not sure if this is already a known issue? I don't have time to work on a fix for this right now, so thought I would highlight it at least. I might get around to it at some point in the future if nobody else tackles it.

Thanks,
Ryan


Test case I've been using:

-->8--

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE		25
#endif

#define handle_error(msg) 	do { perror(msg); exit(EXIT_FAILURE); } while (0)

#define SZ_1K			1024
#define SZ_1M			(SZ_1K * SZ_1K)
#define ALIGN(val, align)	(((val) + ((align) - 1)) & ~((align) - 1))

#if 1
// ext4
#define DATA_FILE		"/home/ubuntu/data.txt"
#else
// xfs
#define DATA_FILE		"/boot/data.txt"
#endif

int main(void)
{
	int fd;
	char *mem;
	int ret;

	fd = open(DATA_FILE, O_RDONLY);
	if (fd == -1)
        	handle_error("open");

	mem = mmap(NULL, SZ_1M * 4, PROT_READ | PROT_EXEC, MAP_PRIVATE, fd, 0);
	close(fd);
	if (mem == MAP_FAILED)
        	handle_error("mmap");

	printf("1: pid=%d, mem=%p\n", getpid(), mem);
	getchar();

	mem = (char *)ALIGN((unsigned long)mem, SZ_1M * 2);
	ret = madvise(mem, SZ_1M * 2, MADV_COLLAPSE);
	if (ret)
		handle_error("madvise");

	printf("2: pid=%d, mem=%p\n", getpid(), mem);
	getchar();

	return 0;
}

-->8--


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: BUG: MADV_COLLAPSE doesn't work for XFS files
  2023-09-28  9:55 BUG: MADV_COLLAPSE doesn't work for XFS files Ryan Roberts
@ 2023-09-28 10:54 ` Bagas Sanjaya
  2023-09-28 11:59   ` Ryan Roberts
  0 siblings, 1 reply; 7+ messages in thread
From: Bagas Sanjaya @ 2023-09-28 10:54 UTC (permalink / raw)
  To: Ryan Roberts, Hugh Dickins, David Hildenbrand, Matthew Wilcox,
	zokeefe, Chandan Babu R, Darrick J. Wong
  Cc: Linux Memory Management List, Linux XFS, Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 2504 bytes --]

On Thu, Sep 28, 2023 at 10:55:17AM +0100, Ryan Roberts wrote:
> Hi all,
> 
> I've just noticed that when applied to a file mapping for a file on xfs, MADV_COLLAPSE returns EINVAL. The same test case works fine if the file is on ext4. 
> 
> I think the root cause is that the implementation bails out if it finds a (non-PMD-sized) large folio in the page cache for any part of the file covered by the region. XFS does readahead into large folios so we hit this issue. See khugepaged.h:collapse_file():
> 
> 		if (PageTransCompound(page)) {
> 			struct page *head = compound_head(page);
> 
> 			result = compound_order(head) == HPAGE_PMD_ORDER &&
> 					head->index == start
> 					/* Maybe PMD-mapped */
> 					? SCAN_PTE_MAPPED_HUGEPAGE
> 					: SCAN_PAGE_COMPOUND;
> 			goto out_unlock;
> 		}

I don't see any hint to -EINVAL above. Am I missing something?

> 
> I'm not sure if this is already a known issue? I don't have time to work on a fix for this right now, so thought I would highlight it at least. I might get around to it at some point in the future if nobody else tackles it.
> 
> Thanks,
> Ryan
> 
> 
> Test case I've been using:
> 
> -->8--
> 
> #include <stdio.h>
> #include <stdlib.h>
> #include <sys/mman.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <unistd.h>
> 
> #ifndef MADV_COLLAPSE
> #define MADV_COLLAPSE		25
> #endif
> 
> #define handle_error(msg) 	do { perror(msg); exit(EXIT_FAILURE); } while (0)
> 
> #define SZ_1K			1024
> #define SZ_1M			(SZ_1K * SZ_1K)
> #define ALIGN(val, align)	(((val) + ((align) - 1)) & ~((align) - 1))
> 
> #if 1
> // ext4
> #define DATA_FILE		"/home/ubuntu/data.txt"
> #else
> // xfs
> #define DATA_FILE		"/boot/data.txt"
> #endif
> 
> int main(void)
> {
> 	int fd;
> 	char *mem;
> 	int ret;
> 
> 	fd = open(DATA_FILE, O_RDONLY);
> 	if (fd == -1)
>         	handle_error("open");
> 
> 	mem = mmap(NULL, SZ_1M * 4, PROT_READ | PROT_EXEC, MAP_PRIVATE, fd, 0);
> 	close(fd);
> 	if (mem == MAP_FAILED)
>         	handle_error("mmap");
> 
> 	printf("1: pid=%d, mem=%p\n", getpid(), mem);
> 	getchar();
> 
> 	mem = (char *)ALIGN((unsigned long)mem, SZ_1M * 2);
> 	ret = madvise(mem, SZ_1M * 2, MADV_COLLAPSE);
> 	if (ret)
> 		handle_error("madvise");
> 
> 	printf("2: pid=%d, mem=%p\n", getpid(), mem);
> 	getchar();
> 
> 	return 0;
> }
> 
> -->8--
> 

Confused...

-- 
An old man doll... just what I always wanted! - Clara

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: BUG: MADV_COLLAPSE doesn't work for XFS files
  2023-09-28 10:54 ` Bagas Sanjaya
@ 2023-09-28 11:59   ` Ryan Roberts
  2023-09-28 19:43     ` Zach O'Keefe
  0 siblings, 1 reply; 7+ messages in thread
From: Ryan Roberts @ 2023-09-28 11:59 UTC (permalink / raw)
  To: Bagas Sanjaya, Hugh Dickins, David Hildenbrand, Matthew Wilcox,
	zokeefe, Chandan Babu R, Darrick J. Wong
  Cc: Linux Memory Management List, Linux XFS, Linux Kernel Mailing List

On 28/09/2023 11:54, Bagas Sanjaya wrote:
> On Thu, Sep 28, 2023 at 10:55:17AM +0100, Ryan Roberts wrote:
>> Hi all,
>>
>> I've just noticed that when applied to a file mapping for a file on xfs, MADV_COLLAPSE returns EINVAL. The same test case works fine if the file is on ext4. 
>>
>> I think the root cause is that the implementation bails out if it finds a (non-PMD-sized) large folio in the page cache for any part of the file covered by the region. XFS does readahead into large folios so we hit this issue. See khugepaged.h:collapse_file():
>>
>> 		if (PageTransCompound(page)) {
>> 			struct page *head = compound_head(page);
>>
>> 			result = compound_order(head) == HPAGE_PMD_ORDER &&
>> 					head->index == start
>> 					/* Maybe PMD-mapped */
>> 					? SCAN_PTE_MAPPED_HUGEPAGE
>> 					: SCAN_PAGE_COMPOUND;
>> 			goto out_unlock;
>> 		}
> 
> I don't see any hint to -EINVAL above. Am I missing something?

The SCAN_PAGE_COMPOUND result ends up back at madvise_collapse() where it
eventually gets converted to -EINVAL by madvise_collapse_errno().

> 
>>
>> I'm not sure if this is already a known issue? I don't have time to work on a fix for this right now, so thought I would highlight it at least. I might get around to it at some point in the future if nobody else tackles it.
>>
>> Thanks,
>> Ryan
>>
>>
>> Test case I've been using:
>>
>> -->8--
>>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <sys/mman.h>
>> #include <sys/types.h>
>> #include <sys/stat.h>
>> #include <fcntl.h>
>> #include <unistd.h>
>>
>> #ifndef MADV_COLLAPSE
>> #define MADV_COLLAPSE		25
>> #endif
>>
>> #define handle_error(msg) 	do { perror(msg); exit(EXIT_FAILURE); } while (0)
>>
>> #define SZ_1K			1024
>> #define SZ_1M			(SZ_1K * SZ_1K)
>> #define ALIGN(val, align)	(((val) + ((align) - 1)) & ~((align) - 1))
>>
>> #if 1
>> // ext4
>> #define DATA_FILE		"/home/ubuntu/data.txt"
>> #else
>> // xfs
>> #define DATA_FILE		"/boot/data.txt"
>> #endif
>>
>> int main(void)
>> {
>> 	int fd;
>> 	char *mem;
>> 	int ret;
>>
>> 	fd = open(DATA_FILE, O_RDONLY);
>> 	if (fd == -1)
>>         	handle_error("open");
>>
>> 	mem = mmap(NULL, SZ_1M * 4, PROT_READ | PROT_EXEC, MAP_PRIVATE, fd, 0);
>> 	close(fd);
>> 	if (mem == MAP_FAILED)
>>         	handle_error("mmap");
>>
>> 	printf("1: pid=%d, mem=%p\n", getpid(), mem);
>> 	getchar();
>>
>> 	mem = (char *)ALIGN((unsigned long)mem, SZ_1M * 2);
>> 	ret = madvise(mem, SZ_1M * 2, MADV_COLLAPSE);
>> 	if (ret)
>> 		handle_error("madvise");
>>
>> 	printf("2: pid=%d, mem=%p\n", getpid(), mem);
>> 	getchar();
>>
>> 	return 0;
>> }
>>
>> -->8--
>>
> 
> Confused...

This is a user space test case that shows the problem; data.txt needs to be at
least 4MB and on a mounted ext4 and xfs filesystem. By toggling the '#if 1' to
0, you can see the different behaviours for ext4 and xfs -
handle_error("madvise") fires with EINVAL in the xfs case. The getchar()s are
leftovers from me looking at the smaps file.

> 



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: BUG: MADV_COLLAPSE doesn't work for XFS files
  2023-09-28 11:59   ` Ryan Roberts
@ 2023-09-28 19:43     ` Zach O'Keefe
  2023-09-28 21:04       ` BUG: MADV_COLLAPSE doesn't work for XFS files] Darrick J. Wong
  2023-09-29 12:33       ` BUG: MADV_COLLAPSE doesn't work for XFS files Ryan Roberts
  0 siblings, 2 replies; 7+ messages in thread
From: Zach O'Keefe @ 2023-09-28 19:43 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Bagas Sanjaya, Hugh Dickins, David Hildenbrand, Matthew Wilcox,
	Chandan Babu R, Darrick J. Wong, Linux Memory Management List,
	Linux XFS, Linux Kernel Mailing List, Yu Zhao

Hey Ryan,

Thanks for bringing this up.

On Thu, Sep 28, 2023 at 4:59 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 28/09/2023 11:54, Bagas Sanjaya wrote:
> > On Thu, Sep 28, 2023 at 10:55:17AM +0100, Ryan Roberts wrote:
> >> Hi all,
> >>
> >> I've just noticed that when applied to a file mapping for a file on xfs, MADV_COLLAPSE returns EINVAL. The same test case works fine if the file is on ext4.
> >>
> >> I think the root cause is that the implementation bails out if it finds a (non-PMD-sized) large folio in the page cache for any part of the file covered by the region. XFS does readahead into large folios so we hit this issue. See khugepaged.h:collapse_file():
> >>
> >>              if (PageTransCompound(page)) {
> >>                      struct page *head = compound_head(page);
> >>
> >>                      result = compound_order(head) == HPAGE_PMD_ORDER &&
> >>                                      head->index == start
> >>                                      /* Maybe PMD-mapped */
> >>                                      ? SCAN_PTE_MAPPED_HUGEPAGE
> >>                                      : SCAN_PAGE_COMPOUND;
> >>                      goto out_unlock;
> >>              }
> >

Ya, non-PMD-sized THPs were just barely visible in my peripherals when
writing this, and I'm still woefully behind on your work on them now
(sorry!).

I'd like to eventually make collapse (not just MADV_COLLAPSE, but
khugepaged too) support arbitrary-sized large folios in general, but
I'm very pressed for time right now. I think M. Wilcox is also
interested in this, given he left the TODO to support it :P

Thank you for the reproducer though! I haven't run it, but I'll
probably come back here to steal it when the time comes.

> > I don't see any hint to -EINVAL above. Am I missing something?
>
> The SCAN_PAGE_COMPOUND result ends up back at madvise_collapse() where it
> eventually gets converted to -EINVAL by madvise_collapse_errno().
>
> >
> >>
> >> I'm not sure if this is already a known issue? I don't have time to work on a fix for this right now, so thought I would highlight it at least. I might get around to it at some point in the future if nobody else tackles it.

My guess is Q1 2024 is when I'd be able to look into this, at the
current level of urgency. It doesn't sound like it's blocking anything
for your work right now -- lmk if that changes though!

Thanks,
Zach



> >>
> >> Thanks,
> >> Ryan
> >>
> >>
> >> Test case I've been using:
> >>
> >> -->8--
> >>
> >> #include <stdio.h>
> >> #include <stdlib.h>
> >> #include <sys/mman.h>
> >> #include <sys/types.h>
> >> #include <sys/stat.h>
> >> #include <fcntl.h>
> >> #include <unistd.h>
> >>
> >> #ifndef MADV_COLLAPSE
> >> #define MADV_COLLAPSE                25
> >> #endif
> >>
> >> #define handle_error(msg)    do { perror(msg); exit(EXIT_FAILURE); } while (0)
> >>
> >> #define SZ_1K                        1024
> >> #define SZ_1M                        (SZ_1K * SZ_1K)
> >> #define ALIGN(val, align)    (((val) + ((align) - 1)) & ~((align) - 1))
> >>
> >> #if 1
> >> // ext4
> >> #define DATA_FILE            "/home/ubuntu/data.txt"
> >> #else
> >> // xfs
> >> #define DATA_FILE            "/boot/data.txt"
> >> #endif
> >>
> >> int main(void)
> >> {
> >>      int fd;
> >>      char *mem;
> >>      int ret;
> >>
> >>      fd = open(DATA_FILE, O_RDONLY);
> >>      if (fd == -1)
> >>              handle_error("open");
> >>
> >>      mem = mmap(NULL, SZ_1M * 4, PROT_READ | PROT_EXEC, MAP_PRIVATE, fd, 0);
> >>      close(fd);
> >>      if (mem == MAP_FAILED)
> >>              handle_error("mmap");
> >>
> >>      printf("1: pid=%d, mem=%p\n", getpid(), mem);
> >>      getchar();
> >>
> >>      mem = (char *)ALIGN((unsigned long)mem, SZ_1M * 2);
> >>      ret = madvise(mem, SZ_1M * 2, MADV_COLLAPSE);
> >>      if (ret)
> >>              handle_error("madvise");
> >>
> >>      printf("2: pid=%d, mem=%p\n", getpid(), mem);
> >>      getchar();
> >>
> >>      return 0;
> >> }
> >>
> >> -->8--
> >>
> >
> > Confused...
>
> This is a user space test case that shows the problem; data.txt needs to be at
> least 4MB and on a mounted ext4 and xfs filesystem. By toggling the '#if 1' to
> 0, you can see the different behaviours for ext4 and xfs -
> handle_error("madvise") fires with EINVAL in the xfs case. The getchar()s are
> leftovers from me looking at the smaps file.
>


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: BUG: MADV_COLLAPSE doesn't work for XFS files]
  2023-09-28 19:43     ` Zach O'Keefe
@ 2023-09-28 21:04       ` Darrick J. Wong
  2023-09-28 22:48         ` Zach O'Keefe
  2023-09-29 12:33       ` BUG: MADV_COLLAPSE doesn't work for XFS files Ryan Roberts
  1 sibling, 1 reply; 7+ messages in thread
From: Darrick J. Wong @ 2023-09-28 21:04 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Ryan Roberts, Bagas Sanjaya, Hugh Dickins, David Hildenbrand,
	Matthew Wilcox, Chandan Babu R, Linux Memory Management List,
	Linux XFS, Linux Kernel Mailing List, Yu Zhao

On Thu, Sep 28, 2023 at 12:43:57PM -0700, Zach O'Keefe wrote:
> Hey Ryan,
> 
> Thanks for bringing this up.
> 
> On Thu, Sep 28, 2023 at 4:59 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >
> > On 28/09/2023 11:54, Bagas Sanjaya wrote:
> > > On Thu, Sep 28, 2023 at 10:55:17AM +0100, Ryan Roberts wrote:
> > >> Hi all,
> > >>
> > >> I've just noticed that when applied to a file mapping for a file on xfs, MADV_COLLAPSE returns EINVAL. The same test case works fine if the file is on ext4.
> > >>
> > >> I think the root cause is that the implementation bails out if it finds a (non-PMD-sized) large folio in the page cache for any part of the file covered by the region. XFS does readahead into large folios so we hit this issue. See khugepaged.h:collapse_file():
> > >>
> > >>              if (PageTransCompound(page)) {
> > >>                      struct page *head = compound_head(page);
> > >>
> > >>                      result = compound_order(head) == HPAGE_PMD_ORDER &&
> > >>                                      head->index == start
> > >>                                      /* Maybe PMD-mapped */
> > >>                                      ? SCAN_PTE_MAPPED_HUGEPAGE
> > >>                                      : SCAN_PAGE_COMPOUND;
> > >>                      goto out_unlock;
> > >>              }
> > >
> 
> Ya, non-PMD-sized THPs were just barely visible in my peripherals when
> writing this, and I'm still woefully behind on your work on them now
> (sorry!).
> 
> I'd like to eventually make collapse (not just MADV_COLLAPSE, but
> khugepaged too) support arbitrary-sized large folios in general, but
> I'm very pressed for time right now. I think M. Wilcox is also
> interested in this, given he left the TODO to support it :P

Is the point of MADV_COLLAPSE to replace base pages with PMD-sized pages
in the pagecache for faster lookups?  Or merely to replace them with
something larger, even if it's not PMD sized?

As of 6.6, XFS asks for folios of size min(read/readahead/write_len,
ondisk_mapping_length), so in theory the folio size should roughly
follow the access patterns.  If the goal is merely larger folios, then
we are done here and can move on to some other part of the collapse.

OTOH if the goal is TLB savings, then I suppose you'd actually /want/ to
select a large (but not PMD) folio for collapsing into a PMD sized
folio, right?

e.g.

	if (PageTransCompound(page)) {
		struct page *head = compound_head(page);

		if (head->index != start) {
			/* not sure what _COMPOUND means here... */
			result = SCAN_PAGE_COMPOUND;
			goto out_unlock;
		}

		if (compound_order(head) == HPAGE_PMD_ORDER) {
			result = SCAN_PTE_MAPPED_HUGEPAGE;
			goto out_unlock;
		}

		/* result is still SCAN_SUCCEED, keep going */
	}

I /think/ that would work?  If the largefolio is dirty or not fully
uptodate then collapse won't touch it; and I think fs/iomap handles this
in a compatible way because it won't mark the folio uptodate until all
the blocks have been read, and it marks the folio dirty if any of the
blocks are dirty.

(says me, who doesn't really understand this part of the code.)

--D

> Thank you for the reproducer though! I haven't run it, but I'll
> probably come back here to steal it when the time comes.
> 
> > > I don't see any hint to -EINVAL above. Am I missing something?
> >
> > The SCAN_PAGE_COMPOUND result ends up back at madvise_collapse() where it
> > eventually gets converted to -EINVAL by madvise_collapse_errno().
> >
> > >
> > >>
> > >> I'm not sure if this is already a known issue? I don't have time to work on a fix for this right now, so thought I would highlight it at least. I might get around to it at some point in the future if nobody else tackles it.
> 
> My guess is Q1 2024 is when I'd be able to look into this, at the
> current level of urgency. It doesn't sound like it's blocking anything
> for your work right now -- lmk if that changes though!
> 
> Thanks,
> Zach
> 
> 
> 
> > >>
> > >> Thanks,
> > >> Ryan
> > >>
> > >>
> > >> Test case I've been using:
> > >>
> > >> -->8--
> > >>
> > >> #include <stdio.h>
> > >> #include <stdlib.h>
> > >> #include <sys/mman.h>
> > >> #include <sys/types.h>
> > >> #include <sys/stat.h>
> > >> #include <fcntl.h>
> > >> #include <unistd.h>
> > >>
> > >> #ifndef MADV_COLLAPSE
> > >> #define MADV_COLLAPSE                25
> > >> #endif
> > >>
> > >> #define handle_error(msg)    do { perror(msg); exit(EXIT_FAILURE); } while (0)
> > >>
> > >> #define SZ_1K                        1024
> > >> #define SZ_1M                        (SZ_1K * SZ_1K)
> > >> #define ALIGN(val, align)    (((val) + ((align) - 1)) & ~((align) - 1))
> > >>
> > >> #if 1
> > >> // ext4
> > >> #define DATA_FILE            "/home/ubuntu/data.txt"
> > >> #else
> > >> // xfs
> > >> #define DATA_FILE            "/boot/data.txt"
> > >> #endif
> > >>
> > >> int main(void)
> > >> {
> > >>      int fd;
> > >>      char *mem;
> > >>      int ret;
> > >>
> > >>      fd = open(DATA_FILE, O_RDONLY);
> > >>      if (fd == -1)
> > >>              handle_error("open");
> > >>
> > >>      mem = mmap(NULL, SZ_1M * 4, PROT_READ | PROT_EXEC, MAP_PRIVATE, fd, 0);
> > >>      close(fd);
> > >>      if (mem == MAP_FAILED)
> > >>              handle_error("mmap");
> > >>
> > >>      printf("1: pid=%d, mem=%p\n", getpid(), mem);
> > >>      getchar();
> > >>
> > >>      mem = (char *)ALIGN((unsigned long)mem, SZ_1M * 2);
> > >>      ret = madvise(mem, SZ_1M * 2, MADV_COLLAPSE);
> > >>      if (ret)
> > >>              handle_error("madvise");
> > >>
> > >>      printf("2: pid=%d, mem=%p\n", getpid(), mem);
> > >>      getchar();
> > >>
> > >>      return 0;
> > >> }
> > >>
> > >> -->8--
> > >>
> > >
> > > Confused...
> >
> > This is a user space test case that shows the problem; data.txt needs to be at
> > least 4MB and on a mounted ext4 and xfs filesystem. By toggling the '#if 1' to
> > 0, you can see the different behaviours for ext4 and xfs -
> > handle_error("madvise") fires with EINVAL in the xfs case. The getchar()s are
> > leftovers from me looking at the smaps file.
> >


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: BUG: MADV_COLLAPSE doesn't work for XFS files]
  2023-09-28 21:04       ` BUG: MADV_COLLAPSE doesn't work for XFS files] Darrick J. Wong
@ 2023-09-28 22:48         ` Zach O'Keefe
  0 siblings, 0 replies; 7+ messages in thread
From: Zach O'Keefe @ 2023-09-28 22:48 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Ryan Roberts, Bagas Sanjaya, Hugh Dickins, David Hildenbrand,
	Matthew Wilcox, Chandan Babu R, Linux Memory Management List,
	Linux XFS, Linux Kernel Mailing List, Yu Zhao

On Thu, Sep 28, 2023 at 2:04 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Thu, Sep 28, 2023 at 12:43:57PM -0700, Zach O'Keefe wrote:
> > Hey Ryan,
> >
> > Thanks for bringing this up.
> >
> > On Thu, Sep 28, 2023 at 4:59 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> > >
> > > On 28/09/2023 11:54, Bagas Sanjaya wrote:
> > > > On Thu, Sep 28, 2023 at 10:55:17AM +0100, Ryan Roberts wrote:
> > > >> Hi all,
> > > >>
> > > >> I've just noticed that when applied to a file mapping for a file on xfs, MADV_COLLAPSE returns EINVAL. The same test case works fine if the file is on ext4.
> > > >>
> > > >> I think the root cause is that the implementation bails out if it finds a (non-PMD-sized) large folio in the page cache for any part of the file covered by the region. XFS does readahead into large folios so we hit this issue. See khugepaged.h:collapse_file():
> > > >>
> > > >>              if (PageTransCompound(page)) {
> > > >>                      struct page *head = compound_head(page);
> > > >>
> > > >>                      result = compound_order(head) == HPAGE_PMD_ORDER &&
> > > >>                                      head->index == start
> > > >>                                      /* Maybe PMD-mapped */
> > > >>                                      ? SCAN_PTE_MAPPED_HUGEPAGE
> > > >>                                      : SCAN_PAGE_COMPOUND;
> > > >>                      goto out_unlock;
> > > >>              }
> > > >
> >
> > Ya, non-PMD-sized THPs were just barely visible in my peripherals when
> > writing this, and I'm still woefully behind on your work on them now
> > (sorry!).
> >
> > I'd like to eventually make collapse (not just MADV_COLLAPSE, but
> > khugepaged too) support arbitrary-sized large folios in general, but
> > I'm very pressed for time right now. I think M. Wilcox is also
> > interested in this, given he left the TODO to support it :P
>
> Is the point of MADV_COLLAPSE to replace base pages with PMD-sized pages
> in the pagecache for faster lookups?  Or merely to replace them with
> something larger, even if it's not PMD sized?

Might depend on who you ask, but IMHO, the principle purpose of
collapse is saving TLB entries, with TLB coalescing complicating
things a little in terms of PMD-sized things or not. M. Wilcox's work
with descriptor-izing folios might make a nice case for memory savings
as well, down the line.

> As of 6.6, XFS asks for folios of size min(read/readahead/write_len,
> ondisk_mapping_length), so in theory the folio size should roughly
> follow the access patterns.  If the goal is merely larger folios, then
> we are done here and can move on to some other part of the collapse.
>
> OTOH if the goal is TLB savings, then I suppose you'd actually /want/ to
> select a large (but not PMD) folio for collapsing into a PMD sized
> folio, right?

I suppose it might make some operations easier / faster during
collapse if we have less folios to process.

> e.g.
>
>         if (PageTransCompound(page)) {
>                 struct page *head = compound_head(page);
>
>                 if (head->index != start) {
>                         /* not sure what _COMPOUND means here... */
>                         result = SCAN_PAGE_COMPOUND;
>                         goto out_unlock;
>                 }
>
>                 if (compound_order(head) == HPAGE_PMD_ORDER) {
>                         result = SCAN_PTE_MAPPED_HUGEPAGE;
>                         goto out_unlock;
>                 }
>
>                 /* result is still SCAN_SUCCEED, keep going */
>         }
>
> I /think/ that would work?  If the largefolio is dirty or not fully
> uptodate then collapse won't touch it; and I think fs/iomap handles this
> in a compatible way because it won't mark the folio uptodate until all
> the blocks have been read, and it marks the folio dirty if any of the
> blocks are dirty.
>
> (says me, who doesn't really understand this part of the code.)

I think there's a couple issues with this -- for example, the
head->index != start case is going to be the common-case for
non-PMD-sized large folios. Regardless, there is some more code in
hpage_collapse_scan_file() and her in collapse_file() that would need
to be updated. I'm taking a cursory look, and naively it doesn't look
too bad -- most things "should just work" in file/shmem collapse path.
ac492b9c70cac ("mm/khugepaged: skip shmem with userfaultfd" was merged
since the last I looked carefully at this path, so I would need to
spend more time understanding some changes there. So, from correctness
POV, maybe there's not anything drastic to be done for file/shmem.
Maybe this is a good place to start.

For anon, things are different, as we are coming at the pages from a
different angle, and are operating over the pmd directly. I'm not
immediately sure if it makes things easier or harder. Probably harder.
Can we even get non-PMD-sized large anon folios right now, without
Ryan's work?

From a khugepaged policy POV, there are some questions to be
answered.. but I think these mostly boil down to: scale the
present/swap/none checks by (1 << order).

Anyways, this isn't to be taken with much weight as a thorough audit
is required to understand any subtleties lurking around.

Thanks,
Zach

> --D
>
> > Thank you for the reproducer though! I haven't run it, but I'll
> > probably come back here to steal it when the time comes.
> >
> > > > I don't see any hint to -EINVAL above. Am I missing something?
> > >
> > > The SCAN_PAGE_COMPOUND result ends up back at madvise_collapse() where it
> > > eventually gets converted to -EINVAL by madvise_collapse_errno().
> > >
> > > >
> > > >>
> > > >> I'm not sure if this is already a known issue? I don't have time to work on a fix for this right now, so thought I would highlight it at least. I might get around to it at some point in the future if nobody else tackles it.
> >
> > My guess is Q1 2024 is when I'd be able to look into this, at the
> > current level of urgency. It doesn't sound like it's blocking anything
> > for your work right now -- lmk if that changes though!
> >
> > Thanks,
> > Zach
> >
> >
> >
> > > >>
> > > >> Thanks,
> > > >> Ryan
> > > >>
> > > >>
> > > >> Test case I've been using:
> > > >>
> > > >> -->8--
> > > >>
> > > >> #include <stdio.h>
> > > >> #include <stdlib.h>
> > > >> #include <sys/mman.h>
> > > >> #include <sys/types.h>
> > > >> #include <sys/stat.h>
> > > >> #include <fcntl.h>
> > > >> #include <unistd.h>
> > > >>
> > > >> #ifndef MADV_COLLAPSE
> > > >> #define MADV_COLLAPSE                25
> > > >> #endif
> > > >>
> > > >> #define handle_error(msg)    do { perror(msg); exit(EXIT_FAILURE); } while (0)
> > > >>
> > > >> #define SZ_1K                        1024
> > > >> #define SZ_1M                        (SZ_1K * SZ_1K)
> > > >> #define ALIGN(val, align)    (((val) + ((align) - 1)) & ~((align) - 1))
> > > >>
> > > >> #if 1
> > > >> // ext4
> > > >> #define DATA_FILE            "/home/ubuntu/data.txt"
> > > >> #else
> > > >> // xfs
> > > >> #define DATA_FILE            "/boot/data.txt"
> > > >> #endif
> > > >>
> > > >> int main(void)
> > > >> {
> > > >>      int fd;
> > > >>      char *mem;
> > > >>      int ret;
> > > >>
> > > >>      fd = open(DATA_FILE, O_RDONLY);
> > > >>      if (fd == -1)
> > > >>              handle_error("open");
> > > >>
> > > >>      mem = mmap(NULL, SZ_1M * 4, PROT_READ | PROT_EXEC, MAP_PRIVATE, fd, 0);
> > > >>      close(fd);
> > > >>      if (mem == MAP_FAILED)
> > > >>              handle_error("mmap");
> > > >>
> > > >>      printf("1: pid=%d, mem=%p\n", getpid(), mem);
> > > >>      getchar();
> > > >>
> > > >>      mem = (char *)ALIGN((unsigned long)mem, SZ_1M * 2);
> > > >>      ret = madvise(mem, SZ_1M * 2, MADV_COLLAPSE);
> > > >>      if (ret)
> > > >>              handle_error("madvise");
> > > >>
> > > >>      printf("2: pid=%d, mem=%p\n", getpid(), mem);
> > > >>      getchar();
> > > >>
> > > >>      return 0;
> > > >> }
> > > >>
> > > >> -->8--
> > > >>
> > > >
> > > > Confused...
> > >
> > > This is a user space test case that shows the problem; data.txt needs to be at
> > > least 4MB and on a mounted ext4 and xfs filesystem. By toggling the '#if 1' to
> > > 0, you can see the different behaviours for ext4 and xfs -
> > > handle_error("madvise") fires with EINVAL in the xfs case. The getchar()s are
> > > leftovers from me looking at the smaps file.
> > >


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: BUG: MADV_COLLAPSE doesn't work for XFS files
  2023-09-28 19:43     ` Zach O'Keefe
  2023-09-28 21:04       ` BUG: MADV_COLLAPSE doesn't work for XFS files] Darrick J. Wong
@ 2023-09-29 12:33       ` Ryan Roberts
  1 sibling, 0 replies; 7+ messages in thread
From: Ryan Roberts @ 2023-09-29 12:33 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Bagas Sanjaya, Hugh Dickins, David Hildenbrand, Matthew Wilcox,
	Chandan Babu R, Darrick J. Wong, Linux Memory Management List,
	Linux XFS, Linux Kernel Mailing List, Yu Zhao

On 28/09/2023 20:43, Zach O'Keefe wrote:
> Hey Ryan,
> 
> Thanks for bringing this up.
> 
> On Thu, Sep 28, 2023 at 4:59 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 28/09/2023 11:54, Bagas Sanjaya wrote:
>>> On Thu, Sep 28, 2023 at 10:55:17AM +0100, Ryan Roberts wrote:
>>>> Hi all,
>>>>
>>>> I've just noticed that when applied to a file mapping for a file on xfs, MADV_COLLAPSE returns EINVAL. The same test case works fine if the file is on ext4.
>>>>
>>>> I think the root cause is that the implementation bails out if it finds a (non-PMD-sized) large folio in the page cache for any part of the file covered by the region. XFS does readahead into large folios so we hit this issue. See khugepaged.h:collapse_file():
>>>>
>>>>              if (PageTransCompound(page)) {
>>>>                      struct page *head = compound_head(page);
>>>>
>>>>                      result = compound_order(head) == HPAGE_PMD_ORDER &&
>>>>                                      head->index == start
>>>>                                      /* Maybe PMD-mapped */
>>>>                                      ? SCAN_PTE_MAPPED_HUGEPAGE
>>>>                                      : SCAN_PAGE_COMPOUND;
>>>>                      goto out_unlock;
>>>>              }
>>>
> 
> Ya, non-PMD-sized THPs were just barely visible in my peripherals when
> writing this, and I'm still woefully behind on your work on them now
> (sorry!).

Nothing to apologise for!

Although, this issue has no relation to the work I've been doing for anonymous large folios; It shows up for large _file_ folios. And it looks like the kernel was capable of doing large file folios for XFS before the collapse implementation went in, so I guess this behavior has always been the case:

git rev-list --no-walk=sorted --pretty=oneline \
	793917d997df2e432f3e9ac126e4482d68256d01 \
	6795801366da0cd3d99e27c37f020a8f16714886 \
	8549a26308f945bddb39391643eb102da026f0ef \
	e6687b89225ee9c817e6dcbadc873f6a4691e5c2 \
	7d8faaf155454f8798ec56404faca29a82689c77

7d8faaf155454f8798ec56404faca29a82689c77 mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse
793917d997df2e432f3e9ac126e4482d68256d01 mm/readahead: Add large folio readahead
6795801366da0cd3d99e27c37f020a8f16714886 xfs: Support large folios

So first, XFS supported it, then readahead actually started allocating large folios, then MADV_COLLAPSE came along.


> 
> I'd like to eventually make collapse (not just MADV_COLLAPSE, but
> khugepaged too) support arbitrary-sized large folios in general, but
> I'm very pressed for time right now. I think M. Wilcox is also
> interested in this, given he left the TODO to support it :P

Yes, I think this could be a useful capability. I'm currently investigating use of MADV_COLLAPSE as a work-around to get executable sections into large folios for file systems that don't natively support them (ext4 mainly). On arm64, having executable memory in 64K folios means we can make better use of the iTLB and improve performance.

> 
> Thank you for the reproducer though! I haven't run it, but I'll
> probably come back here to steal it when the time comes.
> 
>>> I don't see any hint to -EINVAL above. Am I missing something?
>>
>> The SCAN_PAGE_COMPOUND result ends up back at madvise_collapse() where it
>> eventually gets converted to -EINVAL by madvise_collapse_errno().
>>
>>>
>>>>
>>>> I'm not sure if this is already a known issue? I don't have time to work on a fix for this right now, so thought I would highlight it at least. I might get around to it at some point in the future if nobody else tackles it.
> 
> My guess is Q1 2024 is when I'd be able to look into this, at the
> current level of urgency. It doesn't sound like it's blocking anything
> for your work right now -- lmk if that changes though!

No - its not a blocker for me. I just wanted to highlight the issue.

> 
> Thanks,
> Zach
> 
> 
> 
>>>>
>>>> Thanks,
>>>> Ryan
>>>>
>>>>
>>>> Test case I've been using:
>>>>
>>>> -->8--
>>>>
>>>> #include <stdio.h>
>>>> #include <stdlib.h>
>>>> #include <sys/mman.h>
>>>> #include <sys/types.h>
>>>> #include <sys/stat.h>
>>>> #include <fcntl.h>
>>>> #include <unistd.h>
>>>>
>>>> #ifndef MADV_COLLAPSE
>>>> #define MADV_COLLAPSE                25
>>>> #endif
>>>>
>>>> #define handle_error(msg)    do { perror(msg); exit(EXIT_FAILURE); } while (0)
>>>>
>>>> #define SZ_1K                        1024
>>>> #define SZ_1M                        (SZ_1K * SZ_1K)
>>>> #define ALIGN(val, align)    (((val) + ((align) - 1)) & ~((align) - 1))
>>>>
>>>> #if 1
>>>> // ext4
>>>> #define DATA_FILE            "/home/ubuntu/data.txt"
>>>> #else
>>>> // xfs
>>>> #define DATA_FILE            "/boot/data.txt"
>>>> #endif
>>>>
>>>> int main(void)
>>>> {
>>>>      int fd;
>>>>      char *mem;
>>>>      int ret;
>>>>
>>>>      fd = open(DATA_FILE, O_RDONLY);
>>>>      if (fd == -1)
>>>>              handle_error("open");
>>>>
>>>>      mem = mmap(NULL, SZ_1M * 4, PROT_READ | PROT_EXEC, MAP_PRIVATE, fd, 0);
>>>>      close(fd);
>>>>      if (mem == MAP_FAILED)
>>>>              handle_error("mmap");
>>>>
>>>>      printf("1: pid=%d, mem=%p\n", getpid(), mem);
>>>>      getchar();
>>>>
>>>>      mem = (char *)ALIGN((unsigned long)mem, SZ_1M * 2);
>>>>      ret = madvise(mem, SZ_1M * 2, MADV_COLLAPSE);
>>>>      if (ret)
>>>>              handle_error("madvise");
>>>>
>>>>      printf("2: pid=%d, mem=%p\n", getpid(), mem);
>>>>      getchar();
>>>>
>>>>      return 0;
>>>> }
>>>>
>>>> -->8--
>>>>
>>>
>>> Confused...
>>
>> This is a user space test case that shows the problem; data.txt needs to be at
>> least 4MB and on a mounted ext4 and xfs filesystem. By toggling the '#if 1' to
>> 0, you can see the different behaviours for ext4 and xfs -
>> handle_error("madvise") fires with EINVAL in the xfs case. The getchar()s are
>> leftovers from me looking at the smaps file.
>>



^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2023-09-29 12:33 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-09-28  9:55 BUG: MADV_COLLAPSE doesn't work for XFS files Ryan Roberts
2023-09-28 10:54 ` Bagas Sanjaya
2023-09-28 11:59   ` Ryan Roberts
2023-09-28 19:43     ` Zach O'Keefe
2023-09-28 21:04       ` BUG: MADV_COLLAPSE doesn't work for XFS files] Darrick J. Wong
2023-09-28 22:48         ` Zach O'Keefe
2023-09-29 12:33       ` BUG: MADV_COLLAPSE doesn't work for XFS files Ryan Roberts

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox