Re: [PATCH] mm/filemap: Align last_index to folio size

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Klara Modin <klarasmodin@gmail.com>
To: Youling Tang <youling.tang@linux.dev>
Cc: Matthew Wilcox <willy@infradead.org>,
	 Andrew Morton <akpm@linux-foundation.org>,
	Jan Kara <jack@suse.cz>,
	linux-fsdevel@vger.kernel.org,  linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, chizhiling@163.com,
	 Youling Tang <tangyouling@kylinos.cn>,
	Chi Zhiling <chizhiling@kylinos.cn>
Subject: Re: [PATCH] mm/filemap: Align last_index to folio size
Date: Tue, 15 Jul 2025 00:34:12 +0200	[thread overview]
Message-ID: <yru7qf5gvyzccq5ohhpylvxug5lr5tf54omspbjh4sm6pcdb2r@fpjgj2pxw7va> (raw)
In-Reply-To: <20250711055509.91587-1-youling.tang@linux.dev>

Hi,

On 2025-07-11 13:55:09 +0800, Youling Tang wrote:
> From: Youling Tang <tangyouling@kylinos.cn>
> 
> On XFS systems with pagesize=4K, blocksize=16K, and CONFIG_TRANSPARENT_HUGEPAGE
> enabled, We observed the following readahead behaviors:
>  # echo 3 > /proc/sys/vm/drop_caches
>  # dd if=test of=/dev/null bs=64k count=1
>  # ./tools/mm/page-types -r -L -f  /mnt/xfs/test
>  foffset	offset	flags
>  0	136d4c	__RU_l_________H______t_________________F_1
>  1	136d4d	__RU_l__________T_____t_________________F_1
>  2	136d4e	__RU_l__________T_____t_________________F_1
>  3	136d4f	__RU_l__________T_____t_________________F_1
>  ...
>  c	136bb8	__RU_l_________H______t_________________F_1
>  d	136bb9	__RU_l__________T_____t_________________F_1
>  e	136bba	__RU_l__________T_____t_________________F_1
>  f	136bbb	__RU_l__________T_____t_________________F_1   <-- first read
>  10	13c2cc	___U_l_________H______t______________I__F_1   <-- readahead flag
>  11	13c2cd	___U_l__________T_____t______________I__F_1
>  12	13c2ce	___U_l__________T_____t______________I__F_1
>  13	13c2cf	___U_l__________T_____t______________I__F_1
>  ...
>  1c	1405d4	___U_l_________H______t_________________F_1
>  1d	1405d5	___U_l__________T_____t_________________F_1
>  1e	1405d6	___U_l__________T_____t_________________F_1
>  1f	1405d7	___U_l__________T_____t_________________F_1
>  [ra_size = 32, req_count = 16, async_size = 16]
> 
>  # echo 3 > /proc/sys/vm/drop_caches
>  # dd if=test of=/dev/null bs=60k count=1
>  # ./page-types -r -L -f  /mnt/xfs/test
>  foffset	offset	flags
>  0	136048	__RU_l_________H______t_________________F_1
>  ...
>  c	110a40	__RU_l_________H______t_________________F_1
>  d	110a41	__RU_l__________T_____t_________________F_1
>  e	110a42	__RU_l__________T_____t_________________F_1   <-- first read
>  f	110a43	__RU_l__________T_____t_________________F_1   <-- first readahead flag
>  10	13e7a8	___U_l_________H______t_________________F_1
>  ...
>  20	137a00	___U_l_________H______t_______P______I__F_1   <-- second readahead flag (20 - 2f)
>  21	137a01	___U_l__________T_____t_______P______I__F_1
>  ...
>  3f	10d4af	___U_l__________T_____t_______P_________F_1
>  [first readahead: ra_size = 32, req_count = 15, async_size = 17]
> 
> When reading 64k data (same for 61-63k range, where last_index is page-aligned
> in filemap_get_pages()), 128k readahead is triggered via page_cache_sync_ra()
> and the PG_readahead flag is set on the next folio (the one containing 0x10 page).
> 
> When reading 60k data, 128k readahead is also triggered via page_cache_sync_ra().
> However, in this case the readahead flag is set on the 0xf page. Although the
> requested read size (req_count) is 60k, the actual read will be aligned to
> folio size (64k), which triggers the readahead flag and initiates asynchronous
> readahead via page_cache_async_ra(). This results in two readahead operations
> totaling 256k.
> 
> The root cause is that when the requested size is smaller than the actual read
> size (due to folio alignment), it triggers asynchronous readahead. By changing
> last_index alignment from page size to folio size, we ensure the requested size
> matches the actual read size, preventing the case where a single read operation
> triggers two readahead operations.
> 
> After applying the patch:
>  # echo 3 > /proc/sys/vm/drop_caches
>  # dd if=test of=/dev/null bs=60k count=1
>  # ./page-types -r -L -f  /mnt/xfs/test
>  foffset	offset	flags
>  0	136d4c	__RU_l_________H______t_________________F_1
>  1	136d4d	__RU_l__________T_____t_________________F_1
>  2	136d4e	__RU_l__________T_____t_________________F_1
>  3	136d4f	__RU_l__________T_____t_________________F_1
>  ...
>  c	136bb8	__RU_l_________H______t_________________F_1
>  d	136bb9	__RU_l__________T_____t_________________F_1
>  e	136bba	__RU_l__________T_____t_________________F_1   <-- first read
>  f	136bbb	__RU_l__________T_____t_________________F_1
>  10	13c2cc	___U_l_________H______t______________I__F_1   <-- readahead flag
>  11	13c2cd	___U_l__________T_____t______________I__F_1
>  12	13c2ce	___U_l__________T_____t______________I__F_1
>  13	13c2cf	___U_l__________T_____t______________I__F_1
>  ...
>  1c	1405d4	___U_l_________H______t_________________F_1
>  1d	1405d5	___U_l__________T_____t_________________F_1
>  1e	1405d6	___U_l__________T_____t_________________F_1
>  1f	1405d7	___U_l__________T_____t_________________F_1
>  [ra_size = 32, req_count = 16, async_size = 16]
> 
> The same phenomenon will occur when reading from 49k to 64k. Set the readahead
> flag to the next folio.
> 
> Because the minimum order of folio in address_space equals the block size (at
> least in xfs and bcachefs that already support bs > ps), having request_count
> aligned to block size will not cause overread.
> 
> Co-developed-by: Chi Zhiling <chizhiling@kylinos.cn>
> Signed-off-by: Chi Zhiling <chizhiling@kylinos.cn>
> Signed-off-by: Youling Tang <tangyouling@kylinos.cn>

I bisected boot failures on two of my 32-bit systems to this patch.

> ---
>  include/linux/pagemap.h | 6 ++++++
>  mm/filemap.c            | 5 +++--
>  2 files changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index e63fbfbd5b0f..447bb264fd94 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -480,6 +480,12 @@ mapping_min_folio_nrpages(struct address_space *mapping)
>  	return 1UL << mapping_min_folio_order(mapping);
>  }
>  
> +static inline unsigned long
> +mapping_min_folio_nrbytes(struct address_space *mapping)
> +{
> +	return mapping_min_folio_nrpages(mapping) << PAGE_SHIFT;
> +}
> +
>  /**
>   * mapping_align_index() - Align index for this mapping.
>   * @mapping: The address_space.
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 765dc5ef6d5a..56a8656b6f86 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2584,8 +2584,9 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
>  	unsigned int flags;
>  	int err = 0;
>  
> -	/* "last_index" is the index of the page beyond the end of the read */
> -	last_index = DIV_ROUND_UP(iocb->ki_pos + count, PAGE_SIZE);
> +	/* "last_index" is the index of the folio beyond the end of the read */

> +	last_index = round_up(iocb->ki_pos + count, mapping_min_folio_nrbytes(mapping));

iocb->ki_pos is loff_t (long long) while pgoff_t is unsigned long and
this overflow seems to happen in practice, resulting in last_index being
before index.

The following diff resolves the issue for me:

diff --git a/mm/filemap.c b/mm/filemap.c
index 3c071307f40e..d2902be0b845 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2585,8 +2585,8 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
 	int err = 0;
 
 	/* "last_index" is the index of the folio beyond the end of the read */
-	last_index = round_up(iocb->ki_pos + count, mapping_min_folio_nrbytes(mapping));
-	last_index >>= PAGE_SHIFT;
+	last_index = round_up(iocb->ki_pos + count,
+			mapping_min_folio_nrbytes(mapping)) >> PAGE_SHIFT;
 retry:
 	if (fatal_signal_pending(current))
 		return -EINTR;

Regards,
Klara Modin

> +	last_index >>= PAGE_SHIFT;
>  retry:
>  	if (fatal_signal_pending(current))
>  		return -EINTR;
> -- 
> 2.34.1
>

next prev parent reply	other threads:[~2025-07-14 22:34 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-07-11  5:55 Youling Tang
2025-07-11 16:08 ` David Hildenbrand
2025-07-14  8:16   ` Ryan Roberts
2025-07-14  9:33 ` Jan Kara
2025-08-12  9:08   ` Youling Tang
2025-09-03 22:50     ` Andrew Morton
2025-09-04  2:04       ` Youling Tang
2025-09-05 12:50       ` Jan Kara
2025-09-09  6:15         ` Youling Tang
2025-09-09  6:41           ` Andrew Morton
2025-07-14 22:34 ` Klara Modin [this message]
2025-07-14 22:43   ` Andrew Morton
2025-07-14 22:48     ` Klara Modin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=yru7qf5gvyzccq5ohhpylvxug5lr5tf54omspbjh4sm6pcdb2r@fpjgj2pxw7va \
    --to=klarasmodin@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=chizhiling@163.com \
    --cc=chizhiling@kylinos.cn \
    --cc=jack@suse.cz \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=tangyouling@kylinos.cn \
    --cc=willy@infradead.org \
    --cc=youling.tang@linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox