Re: [PATCH] mm/madvise: set ra_pages as device max request size during ADV_POPULATE_READ

From: Ming Lei <ming.lei@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	David Hildenbrand <david@redhat.com>,
	Matthew Wilcox <willy@infradead.org>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Christian Brauner <brauner@kernel.org>,
	Don Dutile <ddutile@redhat.com>,
	Rafael Aquini <raquini@redhat.com>,
	Mike Snitzer <snitzer@kernel.org>
Subject: Re: [PATCH] mm/madvise: set ra_pages as device max request size during ADV_POPULATE_READ
Date: Mon, 5 Feb 2024 17:53:45 +0800
Message-ID: <ZcCwKc1k/W5xSsGK@fedora>
In-Reply-To: <ZcAfF18OM2kqKsBe@dread.disaster.area>

On Mon, Feb 05, 2024 at 10:34:47AM +1100, Dave Chinner wrote:
> On Fri, Feb 02, 2024 at 10:20:29AM +0800, Ming Lei wrote:
> > madvise(MADV_POPULATE_READ) tries to populate all page tables in the
> > specified range, so it is usually sequential IO if the VMA is backed
> > by a file.
> > 
> > Set ra_pages to the device max request size for the readahead involved
> > in MADV_POPULATE_READ. This reduces the latency of
> > madvise(MADV_POPULATE_READ) to 1/10 when it is run over one 1GB file
> > with the usual (default) 128KB of read_ahead_kb.
> > 
> > Cc: David Hildenbrand <david@redhat.com>
> > Cc: Matthew Wilcox <willy@infradead.org>
> > Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> > Cc: Christian Brauner <brauner@kernel.org>
> > Cc: Don Dutile <ddutile@redhat.com>
> > Cc: Rafael Aquini <raquini@redhat.com>
> > Cc: Dave Chinner <david@fromorbit.com>
> > Cc: Mike Snitzer <snitzer@kernel.org>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  mm/madvise.c | 52 +++++++++++++++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 51 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 912155a94ed5..db5452c8abdd 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -900,6 +900,37 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
> >  		return -EINVAL;
> >  }
> >  
> > +static void madvise_restore_ra_win(struct file **file, unsigned int ra_pages)
> > +{
> > +	if (*file) {
> > +		struct file *f = *file;
> > +
> > +		f->f_ra.ra_pages = ra_pages;
> > +		fput(f);
> > +		*file = NULL;
> > +	}
> > +}
> > +
> > +static struct file *madvise_override_ra_win(struct file *f,
> > +		unsigned long start, unsigned long end,
> > +		unsigned int *old_ra_pages)
> > +{
> > +	unsigned int io_pages;
> > +
> > +	if (!f || !f->f_mapping || !f->f_mapping->host)
> > +		return NULL;
> > +
> > +	io_pages = inode_to_bdi(f->f_mapping->host)->io_pages;
> > +	if (((end - start) >> PAGE_SHIFT) < io_pages)
> > +		return NULL;
> > +
> > +	f = get_file(f);
> > +	*old_ra_pages = f->f_ra.ra_pages;
> > +	f->f_ra.ra_pages = io_pages;
> > +
> > +	return f;
> > +}
> 
> This won't do what you think if the file has been marked
> FMODE_RANDOM before this populate call.

Yeah.

But madvise(MADV_POPULATE_READ) is actually a one-shot action, so
userspace can call fadvise(POSIX_FADV_NORMAL) or
fadvise(POSIX_FADV_SEQUENTIAL) before madvise(MADV_POPULATE_READ), and
set the RANDOM advice back after madvise(MADV_POPULATE_READ) returns,
so this does not look like a big issue in practice.

> 
> IOWs, I don't think madvise should be digging in the struct file
> readahead stuff here. It should call vfs_fadvise(FADV_SEQUENTIAL) to
> set the readahead mode, rather than try to duplicate
> FADV_SEQUENTIAL (badly).  We already do this for WILLNEED to make it
> do the right thing, we should be doing the same thing here.

FADV_SEQUENTIAL doubles the current readahead window, which is far from
enough to get top performance. For example, the latency with the doubled
(default) ra window is still 2X that of setting the ra window to
bdi->io_pages.

If the application sets a small 'bdi/read_ahead_kb', as in this report,
the gap can be very big.
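
As a rough illustration of the window sizes involved, the sketch below
counts how many readahead windows are needed to cover a 1GB file. The
helper ra_windows() is hypothetical, and the 1280KB figure used for
bdi->io_pages is an assumption for the example, not a value stated in
this thread. The window count is only a crude proxy and does not map
linearly to the measured latency.

```c
/*
 * Hypothetical helper: number of readahead windows of window_kb KiB
 * needed to cover len_bytes, rounded up.
 */
static unsigned long ra_windows(unsigned long long len_bytes,
				unsigned long window_kb)
{
	unsigned long long window_bytes = (unsigned long long)window_kb * 1024;

	return (unsigned long)((len_bytes + window_bytes - 1) / window_bytes);
}

/*
 * For a 1GB (1 << 30 bytes) file:
 *   128KB  default read_ahead_kb              -> 8192 windows
 *   256KB  after FADV_SEQUENTIAL doubles it   -> 4096 windows
 *   1280KB assumed io_pages equivalent        ->  820 windows
 */
```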

Or can we add an API/helper in fs code to set the file's readahead
ra_pages for this use case?

> 
> Also, AFAICT, there is no need for get_file()/fput() here - the vma
> already has a reference to the struct file, and the vma should not
> be going away whilst the madvise() operation is in progress.

You are right; get_file() is only needed if the mm lock is dropped.


Thanks,
Ming