Re: [PATCH 02/11] MM: document and polish read-ahead code.

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Jan Kara <jack@suse.cz>
To: NeilBrown <neilb@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Jan Kara <jack@suse.cz>, Wu Fengguang <fengguang.wu@intel.com>,
	Jaegeuk Kim <jaegeuk@kernel.org>, Chao Yu <chao@kernel.org>,
	Jeff Layton <jlayton@kernel.org>,
	Ilya Dryomov <idryomov@gmail.com>,
	Miklos Szeredi <miklos@szeredi.hu>,
	Trond Myklebust <trond.myklebust@hammerspace.com>,
	Anna Schumaker <anna.schumaker@netapp.com>,
	Ryusuke Konishi <konishi.ryusuke@gmail.com>,
	"Darrick J. Wong" <djwong@kernel.org>,
	Philipp Reisner <philipp.reisner@linbit.com>,
	Lars Ellenberg <lars.ellenberg@linbit.com>,
	Paolo Valente <paolo.valente@linaro.org>,
	Jens Axboe <axboe@kernel.dk>,
	linux-doc@vger.kernel.org, linux-mm@kvack.org,
	linux-nilfs@vger.kernel.org, linux-nfs@vger.kernel.org,
	linux-fsdevel@vger.kernel.org,
	linux-f2fs-devel@lists.sourceforge.net,
	linux-ext4@vger.kernel.org, ceph-devel@vger.kernel.org,
	drbd-dev@lists.linbit.com, linux-kernel@vger.kernel.org,
	linux-block@vger.kernel.org
Subject: Re: [PATCH 02/11] MM: document and polish read-ahead code.
Date: Thu, 10 Feb 2022 13:24:40 +0100	[thread overview]
Message-ID: <20220210122440.vqth5mwsqtv6vjpq@quack3.lan> (raw)
In-Reply-To: <164447147257.23354.2801426518649016278.stgit@noble.brown>

Hi Neil!

On Thu 10-02-22 16:37:52, NeilBrown wrote:
> Add some "big-picture" documentation for read-ahead and polish the code
> to make it fit this documentation.
> 
> The meaning of ->async_size is clarified to match its name.
> i.e. Any request to ->readahead() has a sync part and an async part.
> The caller will wait for the sync pages to complete, but will not wait
> for the async pages.  The first async page is still marked PG_readahead

So I don't think this is how the code was meant. My understanding of
readahead comes from a comment:

/*
 * On-demand readahead design.
 *
....

in mm/readahead.c. The ra->size is how many pages should be read.
ra->async_size is the "lookahead size" meaning that we should place a
marker (PageReadahead) at "ra->size - ra->async_size" to trigger next
readahead.

> 
> - in try_context_readahead(), the async_sync is set correctly rather
>   than being set to 1.  Prior to Commit 2cad40180197 ("readahead: make
>   context readahead more conservative") it was set to ra->size which
>   is not correct (that implies no sync component).  As this was too
>   high and caused problems it was reduced to 1, again incorrect but less
>   problematic.  The setting provided with this patch does not restore
>   those problems, and is now not arbitrary.

I agree the 1 there looks strange as it effectively discards all the logic
handling the lookahead size. I agree with the tweak there but I would do
this behavioral change as a separate commit since it can have performance
implications.

> - The calculation of ->async_size in the initial_readahead section of
>   ondemand_readahead() now makes sense - it is zero if the chosen
>   size does not exceed the requested size.  This means that we will not
>   set the PG_readahead flag in this case, but as the requested size
>   has not been satisfied we can expect a subsequent read ahead request
>   any way.

So I agree that setting of ->async_size to ->size in initial_readahead
section does not make great sence but if you look a bit below into readit
section, you will notice the ->async_size is overwritten there to something
meaninful. So I think the code actually does something sensible, maybe it
could be written in a more readable way.
 
> Note that the current function names page_cache_sync_ra() and
> page_cache_async_ra() are misleading.  All ra request are partly sync
> and partly async, so either part can be empty.

The meaning of these names IMO is:
page_cache_sync_ra() - tell readahead that we currently need a page
ractl->_index and would prefer req_count pages fetched ahead.

page_cache_async_ra() - called when we hit the lookahead marker to give
opportunity to readahead code to prefetch more pages.

> A page_cache_sync_ra() request will usually set ->async_size non-zero,
> implying it is not all synchronous.
> When a non-zero req_count is passed to page_cache_async_ra(), the
> implication is that some prefix of the request is synchronous, though
> the calculation made there is incorrect - I haven't tried to fix it.
> 
> Signed-off-by: NeilBrown <neilb@suse.de>

								Honza


> ---
>  Documentation/core-api/mm-api.rst |   19 ++++++-
>  Documentation/filesystems/vfs.rst |   16 ++++--
>  include/linux/fs.h                |    9 +++
>  mm/readahead.c                    |  103 ++++++++++++++++++++++++++++++++++++-
>  4 files changed, 135 insertions(+), 12 deletions(-)
> 
> diff --git a/Documentation/core-api/mm-api.rst b/Documentation/core-api/mm-api.rst
> index 395835f9289f..f5b2f92822c8 100644
> --- a/Documentation/core-api/mm-api.rst
> +++ b/Documentation/core-api/mm-api.rst
> @@ -58,15 +58,30 @@ Virtually Contiguous Mappings
>  File Mapping and Page Cache
>  ===========================
>  
> -.. kernel-doc:: mm/readahead.c
> -   :export:
> +Filemap
> +-------
>  
>  .. kernel-doc:: mm/filemap.c
>     :export:
>  
> +Readahead
> +---------
> +
> +.. kernel-doc:: mm/readahead.c
> +   :doc: Readahead Overview
> +
> +.. kernel-doc:: mm/readahead.c
> +   :export:
> +
> +Writeback
> +---------
> +
>  .. kernel-doc:: mm/page-writeback.c
>     :export:
>  
> +Truncate
> +--------
> +
>  .. kernel-doc:: mm/truncate.c
>     :export:
>  
> diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
> index bf5c48066fac..b4a0baa46dcc 100644
> --- a/Documentation/filesystems/vfs.rst
> +++ b/Documentation/filesystems/vfs.rst
> @@ -806,12 +806,16 @@ cache in your filesystem.  The following members are defined:
>  	object.  The pages are consecutive in the page cache and are
>  	locked.  The implementation should decrement the page refcount
>  	after starting I/O on each page.  Usually the page will be
> -	unlocked by the I/O completion handler.  If the filesystem decides
> -	to stop attempting I/O before reaching the end of the readahead
> -	window, it can simply return.  The caller will decrement the page
> -	refcount and unlock the remaining pages for you.  Set PageUptodate
> -	if the I/O completes successfully.  Setting PageError on any page
> -	will be ignored; simply unlock the page if an I/O error occurs.
> +	unlocked by the I/O completion handler.  The set of pages are
> +	divided into some sync pages followed by some async pages,
> +	rac->ra->async_size gives the number of async pages.  The
> +	filesystem should attempt to read all sync pages but may decide
> +	to stop once it reaches the async pages.  If it does decide to
> +	stop attempting I/O, it can simply return.  The caller will
> +	remove the remaining pages from the address space, unlock them
> +	and decrement the page refcount.  Set PageUptodate if the I/O
> +	completes successfully.  Setting PageError on any page will be
> +	ignored; simply unlock the page if an I/O error occurs.
>  
>  ``readpages``
>  	called by the VM to read pages associated with the address_space
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index e2d892b201b0..8b5c486bd4a2 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -930,10 +930,15 @@ struct fown_struct {
>   * struct file_ra_state - Track a file's readahead state.
>   * @start: Where the most recent readahead started.
>   * @size: Number of pages read in the most recent readahead.
> - * @async_size: Start next readahead when this many pages are left.
> - * @ra_pages: Maximum size of a readahead request.
> + * @async_size: Numer of pages that were/are not needed immediately
> + *      and so were/are genuinely "ahead".  Start next readahead when
> + *      the first of these pages is accessed.
> + * @ra_pages: Maximum size of a readahead request, copied from the bdi.
>   * @mmap_miss: How many mmap accesses missed in the page cache.
>   * @prev_pos: The last byte in the most recent read request.
> + *
> + * When this structure is passed to ->readahead(), the "most recent"
> + * readahead means the current readahead.
>   */
>  struct file_ra_state {
>  	pgoff_t start;
> diff --git a/mm/readahead.c b/mm/readahead.c
> index cf0dcf89eb69..c44b2957f59f 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -8,6 +8,105 @@
>   *		Initial version.
>   */
>  
> +/**
> + * DOC: Readahead Overview
> + *
> + * Readahead is used to read content into the page cache before it is
> + * explicitly requested by the application.  Readahead only ever
> + * attempts to read pages that are not yet in the page cache.  If a
> + * page is present but not up-to-date, readahead will not try to read
> + * it. In that case a simple ->readpage() will be requested.
> + *
> + * Readahead is triggered when an application read request (whether a
> + * systemcall or a page fault) finds that the requested page is not in
> + * the page cache, or that it is in the page cache and has the
> + * %PG_readahead flag set.  This flag indicates that the page was loaded
> + * as part of a previous read-ahead request and now that it has been
> + * accessed, it is time for the next read-ahead.
> + *
> + * Each readahead request is partly synchronous read, and partly async
> + * read-ahead.  This is reflected in the struct file_ra_state which
> + * contains ->size being to total number of pages, and ->async_size
> + * which is the number of pages in the async section.  The first page in
> + * this async section will have %PG_readahead set as a trigger for a
> + * subsequent read ahead.  Once a series of sequential reads has been
> + * established, there should be no need for a synchronous component and
> + * all read ahead request will be fully asynchronous.
> + *
> + * When either of the triggers causes a readahead, three numbers need to
> + * be determined: the start of the region, the size of the region, and
> + * the size of the async tail.
> + *
> + * The start of the region is simply the first page address at or after
> + * the accessed address, which is not currently populated in the page
> + * cache.  This is found with a simple search in the page cache.
> + *
> + * The size of the async tail is determined by subtracting the size that
> + * was explicitly requested from the determined request size, unless
> + * this would be less than zero - then zero is used.  NOTE THIS
> + * CALCULATION IS WRONG WHEN THE START OF THE REGION IS NOT THE ACCESSED
> + * PAGE.
> + *
> + * The size of the region is normally determined from the size of the
> + * previous readahead which loaded the preceding pages.  This may be
> + * discovered from the struct file_ra_state for simple sequential reads,
> + * or from examining the state of the page cache when multiple
> + * sequential reads are interleaved.  Specifically: where the readahead
> + * was triggered by the %PG_readahead flag, the size of the previous
> + * readahead is assumed to be the number of pages from the triggering
> + * page to the start of the new readahead.  In these cases, the size of
> + * the previous readahead is scaled, often doubled, for the new
> + * readahead, though see get_next_ra_size() for details.
> + *
> + * If the size of the previous read cannot be determined, the number of
> + * preceding pages in the page cache is used to estimate the size of
> + * a previous read.  This estimate could easily be misled by random
> + * reads being coincidentally adjacent, so it is ignored unless it is
> + * larger than the current request, and it is not scaled up, unless it
> + * is at the start of file.
> + *
> + * In general read ahead is accelerated at the start of the file, as
> + * reads from there are often sequential.  There are other minor
> + * adjustments to the read ahead size in various special cases and these
> + * are best discovered by reading the code.
> + *
> + * The above calculation determines the readahead, to which any requested
> + * read size may be added.
> + *
> + * Readahead requests are sent to the filesystem using the ->readahead()
> + * address space operation, for which mpage_readahead() is a canonical
> + * implementation.  ->readahead() should normally initiate reads on all
> + * pages, but may fail to read any or all pages without causing an IO
> + * error.  The page cache reading code will issue a ->readpage() request
> + * for any page which ->readahead() does not provided, and only an error
> + * from this will be final.
> + *
> + * ->readahead() will generally call readahead_page() repeatedly to get
> + * each page from those prepared for read ahead.  It may fail to read a
> + * page by:
> + *
> + * * not calling readahead_page() sufficiently many times, effectively
> + *   ignoring some pages, as might be appropriate if the path to
> + *   storage is congested.
> + *
> + * * failing to actually submit a read request for a given page,
> + *   possibly due to insufficient resources, or
> + *
> + * * getting an error during subsequent processing of a request.
> + *
> + * In the last two cases, the page should be unlocked to indicate that
> + * the read attempt has failed.  In the first case the page will be
> + * unlocked by the caller.
> + *
> + * Those pages not in the final ``async_size`` of the request should be
> + * considered to be important and ->readahead() should not fail them due
> + * to congestion or temporary resource unavailability, but should wait
> + * for necessary resources (e.g.  memory or indexing information) to
> + * become available.  Pages in the final ``async_size`` may be
> + * considered less urgent and failure to read them is more acceptable.
> + * They will eventually be read individually using ->readpage().
> + */
> +
>  #include <linux/kernel.h>
>  #include <linux/dax.h>
>  #include <linux/gfp.h>
> @@ -426,7 +525,7 @@ static int try_context_readahead(struct address_space *mapping,
>  
>  	ra->start = index;
>  	ra->size = min(size + req_size, max);
> -	ra->async_size = 1;
> +	ra->async_size = ra->size - req_size;
>  
>  	return 1;
>  }
> @@ -527,7 +626,7 @@ static void ondemand_readahead(struct readahead_control *ractl,
>  initial_readahead:
>  	ra->start = index;
>  	ra->size = get_init_ra_size(req_size, max_pages);
> -	ra->async_size = ra->size > req_size ? ra->size - req_size : ra->size;
> +	ra->async_size = ra->size > req_size ? ra->size - req_size : 0;
>  
>  readit:
>  	/*
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

next prev parent reply	other threads:[~2022-02-10 12:24 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-02-10  5:37 [PATCH 00/11] Remove remaining parts of congestion tracking code NeilBrown
2022-02-10  5:37 ` [PATCH 03/11] MM: improve cleanup when ->readpages doesn't process all pages NeilBrown
2022-02-10  5:37 ` [PATCH 08/11] Remove bdi_congested() and wb_congested() and related functions NeilBrown
2022-02-24 18:10   ` Ryusuke Konishi
2022-02-10  5:37 ` [PATCH 09/11] f2fs: replace congestion_wait() calls with io_schedule_timeout() NeilBrown
2022-02-10  5:37 ` [PATCH 02/11] MM: document and polish read-ahead code NeilBrown
2022-02-10 12:24   ` Jan Kara [this message]
2022-02-10 23:35     ` NeilBrown
2022-02-24 18:26       ` Jan Kara
2022-02-28  4:28         ` NeilBrown
2022-02-28  4:47           ` Andrew Morton
2022-02-28  5:19             ` NeilBrown
2022-02-10  5:37 ` [PATCH 01/11] DOC: convert 'subsection' to 'section' in gfp.h NeilBrown
2022-02-10  5:37 ` [PATCH 10/11] block/bfq-iosched.c: use "false" rather than "BLK_RW_ASYNC" NeilBrown
2022-02-10 10:02   ` Jan Kara
2022-02-10  5:37 ` [PATCH 11/11] Remove congestion tracking framework NeilBrown
2022-02-10  5:37 ` [PATCH 07/11] Remove inode_congested() NeilBrown
2022-02-10  5:37 ` [PATCH 05/11] nfs: remove reliance on bdi congestion NeilBrown
2022-02-10  5:37 ` [PATCH 06/11] ceph: " NeilBrown
2022-02-10  5:37 ` [PATCH 04/11] fuse: " NeilBrown
2022-02-22  3:17 [PATCH 00/11] Remove remaining parts of congestion tracking code NeilBrown
2022-02-22  3:17 ` [PATCH 02/11] MM: document and polish read-ahead code NeilBrown

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220210122440.vqth5mwsqtv6vjpq@quack3.lan \
    --to=jack@suse.cz \
    --cc=akpm@linux-foundation.org \
    --cc=anna.schumaker@netapp.com \
    --cc=axboe@kernel.dk \
    --cc=ceph-devel@vger.kernel.org \
    --cc=chao@kernel.org \
    --cc=djwong@kernel.org \
    --cc=drbd-dev@lists.linbit.com \
    --cc=fengguang.wu@intel.com \
    --cc=idryomov@gmail.com \
    --cc=jaegeuk@kernel.org \
    --cc=jlayton@kernel.org \
    --cc=konishi.ryusuke@gmail.com \
    --cc=lars.ellenberg@linbit.com \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-f2fs-devel@lists.sourceforge.net \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-nfs@vger.kernel.org \
    --cc=linux-nilfs@vger.kernel.org \
    --cc=miklos@szeredi.hu \
    --cc=neilb@suse.de \
    --cc=paolo.valente@linaro.org \
    --cc=philipp.reisner@linbit.com \
    --cc=trond.myklebust@hammerspace.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox