From: "Hillf Danton" <hillf.zj@alibaba-inc.com>
To: 'Jens Axboe' <axboe@kernel.dk>,
'Andrew Morton' <akpm@linux-foundation.org>,
'Johannes Weiner' <hannes@cmpxchg.org>,
linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
'Linus Torvalds' <torvalds@linux-foundation.org>
Subject: Re: [PATCH v4] mm: don't cap request size based on read-ahead setting
Date: Fri, 18 Nov 2016 13:58:33 +0800 [thread overview]
Message-ID: <007401d24160$cc2442c0$646cc840$@alibaba-inc.com> (raw)
In-Reply-To: <e4271a04-35cf-b082-34ea-92649f5111be@kernel.dk>
On Friday, November 18, 2016 5:23 AM Jens Axboe wrote:
>
> We ran into a funky issue, where someone doing 256K buffered reads saw
> 128K requests at the device level. Turns out it is read-ahead capping
> the request size, since we use 128K as the default setting. This doesn't
> make a lot of sense - if someone is issuing 256K reads, they should see
> 256K reads, regardless of the read-ahead setting, if the underlying
> device can support a 256K read in a single command.
>
> To make matters more confusing, there's an odd interaction with the
> fadvise hint setting. If we tell the kernel we're doing sequential IO on
> this file descriptor, we can get twice the read-ahead size. But if we
> tell the kernel that we are doing random IO, hence disabling read-ahead,
> we do get nice 256K requests at the lower level. This is because
> ondemand and forced read-ahead behave differently, with the latter doing
> the right thing.
As far as I read, forced RA is innocent but it is corrected below.
And with RA disabled, we should drop care of ondemand.
I'm scratching.
> An application developer will be, rightfully,
> scratching his head at this point, wondering wtf is going on. A good one
> will dive into the kernel source, and silently weep.
>
> This patch introduces a bdi hint, io_pages. This is the soft max IO size
> for the lower level, I've hooked it up to the bdev settings here.
> Read-ahead is modified to issue the maximum of the user request size,
> and the read-ahead max size, but capped to the max request size on the
> device side. The latter is done to avoid reading ahead too much, if the
> application asks for a huge read. With this patch, the kernel behaves
> like the application expects.
>
> Signed-off-by: Jens Axboe <axboe@fb.com>
>
> ---
>
> Changes since v3:
>
> - Went over it with Johannes, cleaned up the the logic as a result
>
> Changes since v2:
>
> - Fix up the last minute typo on io_pages (Johannes/Hillf)
> - Apply the same limit to force_page_cache_readahead().
>
>
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index f679ae1..65f16cf 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -249,6 +249,7 @@ void blk_queue_max_hw_sectors(struct request_queue
> *q, unsigned int max_hw_secto
> max_sectors = min_not_zero(max_hw_sectors, limits->max_dev_sectors);
> max_sectors = min_t(unsigned int, max_sectors, BLK_DEF_MAX_SECTORS);
> limits->max_sectors = max_sectors;
> + q->backing_dev_info.io_pages = max_sectors >> (PAGE_SHIFT - 9);
> }
> EXPORT_SYMBOL(blk_queue_max_hw_sectors);
>
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 9cc8d7c..ea374e8 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -212,6 +212,7 @@ queue_max_sectors_store(struct request_queue *q,
> const char *page, size_t count)
>
> spin_lock_irq(q->queue_lock);
> q->limits.max_sectors = max_sectors_kb << 1;
> + q->backing_dev_info.io_pages = max_sectors_kb >> (PAGE_SHIFT - 10);
> spin_unlock_irq(q->queue_lock);
>
> return ret;
> diff --git a/include/linux/backing-dev-defs.h
> b/include/linux/backing-dev-defs.h
> index c357f27..b8144b2 100644
> --- a/include/linux/backing-dev-defs.h
> +++ b/include/linux/backing-dev-defs.h
> @@ -136,6 +136,7 @@ struct bdi_writeback {
> struct backing_dev_info {
> struct list_head bdi_list;
> unsigned long ra_pages; /* max readahead in PAGE_SIZE units */
> + unsigned long io_pages; /* max allowed IO size */
> unsigned int capabilities; /* Device capabilities */
> congested_fn *congested_fn; /* Function pointer if device is md/dm */
> void *congested_data; /* Pointer to aux data for congested func */
> diff --git a/mm/readahead.c b/mm/readahead.c
> index c8a955b..344c1da 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -207,12 +207,17 @@ int __do_page_cache_readahead(struct address_space
> *mapping, struct file *filp,
> * memory at once.
> */
> int force_page_cache_readahead(struct address_space *mapping, struct
> file *filp,
> - pgoff_t offset, unsigned long nr_to_read)
> + pgoff_t offset, unsigned long nr_to_read)
> {
> + struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
> + struct file_ra_state *ra = &filp->f_ra;
> + unsigned long max_pages;
> +
> if (unlikely(!mapping->a_ops->readpage && !mapping->a_ops->readpages))
> return -EINVAL;
>
> - nr_to_read = min(nr_to_read, inode_to_bdi(mapping->host)->ra_pages);
> + max_pages = max_t(unsigned long, bdi->io_pages, ra->ra_pages);
> + nr_to_read = min(nr_to_read, max_pages);
> while (nr_to_read) {
> int err;
>
> @@ -369,10 +374,18 @@ ondemand_readahead(struct address_space *mapping,
> bool hit_readahead_marker, pgoff_t offset,
> unsigned long req_size)
> {
> - unsigned long max = ra->ra_pages;
> + struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
> + unsigned long max_pages = ra->ra_pages;
> pgoff_t prev_offset;
>
> /*
> + * If the request exceeds the readahead window, allow the read to
> + * be up to the optimal hardware IO size
> + */
> + if (req_size > max_pages && bdi->io_pages > max_pages)
> + max_pages = min(req_size, bdi->io_pages);
> +
> + /*
> * start of file
> */
> if (!offset)
> @@ -385,7 +398,7 @@ ondemand_readahead(struct address_space *mapping,
> if ((offset == (ra->start + ra->size - ra->async_size) ||
> offset == (ra->start + ra->size))) {
> ra->start += ra->size;
> - ra->size = get_next_ra_size(ra, max);
> + ra->size = get_next_ra_size(ra, max_pages);
> ra->async_size = ra->size;
> goto readit;
> }
> @@ -400,16 +413,16 @@ ondemand_readahead(struct address_space *mapping,
> pgoff_t start;
>
> rcu_read_lock();
> - start = page_cache_next_hole(mapping, offset + 1, max);
> + start = page_cache_next_hole(mapping, offset + 1, max_pages);
> rcu_read_unlock();
>
> - if (!start || start - offset > max)
> + if (!start || start - offset > max_pages)
> return 0;
>
> ra->start = start;
> ra->size = start - offset; /* old async_size */
> ra->size += req_size;
> - ra->size = get_next_ra_size(ra, max);
> + ra->size = get_next_ra_size(ra, max_pages);
> ra->async_size = ra->size;
> goto readit;
> }
> @@ -417,7 +430,7 @@ ondemand_readahead(struct address_space *mapping,
> /*
> * oversize read
> */
> - if (req_size > max)
> + if (req_size > max_pages)
> goto initial_readahead;
>
> /*
> @@ -433,7 +446,7 @@ ondemand_readahead(struct address_space *mapping,
> * Query the page cache and look for the traces(cached history pages)
> * that a sequential stream would leave behind.
> */
> - if (try_context_readahead(mapping, ra, offset, req_size, max))
> + if (try_context_readahead(mapping, ra, offset, req_size, max_pages))
> goto readit;
>
> /*
> @@ -444,7 +457,7 @@ ondemand_readahead(struct address_space *mapping,
>
> initial_readahead:
> ra->start = offset;
> - ra->size = get_init_ra_size(req_size, max);
> + ra->size = get_init_ra_size(req_size, max_pages);
> ra->async_size = ra->size > req_size ? ra->size - req_size : ra->size;
>
> readit:
> @@ -454,7 +467,7 @@ ondemand_readahead(struct address_space *mapping,
> * the resulted next readahead window into the current one.
> */
> if (offset == ra->start && ra->size == ra->async_size) {
> - ra->async_size = get_next_ra_size(ra, max);
> + ra->async_size = get_next_ra_size(ra, max_pages);
> ra->size += ra->async_size;
> }
>
>
>
> --
> Jens Axboe
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2016-11-18 5:59 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-11-17 21:23 Jens Axboe
2016-11-18 5:58 ` Hillf Danton [this message]
2016-11-18 15:09 ` Jens Axboe
2016-11-18 18:02 ` Johannes Weiner
2016-11-18 19:34 ` Jens Axboe
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to='007401d24160$cc2442c0$646cc840$@alibaba-inc.com' \
--to=hillf.zj@alibaba-inc.com \
--cc=akpm@linux-foundation.org \
--cc=axboe@kernel.dk \
--cc=hannes@cmpxchg.org \
--cc=linux-block@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=torvalds@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox