From: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
To: Jens Axboe <axboe@kernel.dk>,
linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org
Cc: Michal Hocko <mhocko@suse.com>, Mel Gorman <mgorman@suse.de>,
Johannes Weiner <hannes@cmpxchg.org>, Tejun Heo <tj@kernel.org>,
Andrew Morton <akpm@linux-foundation.org>,
Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [PATCH RFC] mm: implement write-behind policy for sequential file writes
Date: Tue, 3 Oct 2017 00:50:31 +0300
Message-ID: <3f67ed30-4a2e-09d2-3663-8be423dbbdac@yandex-team.ru>
In-Reply-To: <eb9447b7-9fca-5883-8f04-1fdc7db31c20@kernel.dk>

On 02.10.2017 23:00, Jens Axboe wrote:
> On 10/02/2017 03:54 AM, Konstantin Khlebnikov wrote:
>> Traditional writeback tries to accumulate as much dirty data as possible.
>> This is a worthwhile strategy for extremely short-lived files and for
>> batching writes to save battery power. But for workloads where disk latency
>> is important, this policy generates periodic disk load spikes, which
>> increase latency for concurrent operations.
>>
>> The present writeback engine allows tuning only the dirty data size or
>> expiration time. Such tuning cannot eliminate spikes; it just lowers and
>> multiplies them. The other option is switching to sync mode, which flushes
>> written data right after each write; obviously this has a significant
>> performance impact. Such tuning is system-wide and also affects
>> memory-mapped and randomly written files, which flusher threads handle
>> much better.
>>
>> This patch implements a write-behind policy which tracks sequential writes
>> and starts background writeback when there are enough dirty pages in a row.
>
> This is a great idea in general. My only concerns would be around cases
> where we don't expect the writes to ever make it to media. It's not an
> uncommon use case - app dirties some memory in a file, and expects
> to truncate/unlink it before it makes it to disk. We don't want to trigger
> writeback for those. Arguably that should be app hinted.
Yes, this is the case where serious degradation might happen.
The 256k threshold keeps small files from being written out.
Big temporary files have a good chance of being pushed to disk
anyway by memory pressure or the flusher thread.
>
>> Write-behind tracks the current writing position and looks into two windows
>> behind it: the first represents unwritten pages, the second async writeback.
>>
>> The next write starts background writeback when the first window exceeds the
>> threshold and waits for pages falling behind the async writeback window.
>> This allows combining small writes into bigger requests and maintaining
>> optimal io-depth.
>>
>> This affects only writes via syscalls, memory mapped writes are unchanged.
>> Also write-behind doesn't affect files with fadvise POSIX_FADV_RANDOM.
>>
>> If the async window is set to 0 then write-behind skips dirty pages for a
>> congested disk and never waits for writeback. This is used for files with
>> O_NONBLOCK.
>>
>> Also for files with fadvise POSIX_FADV_NOREUSE write-behind automatically
>> evicts completely written pages from cache. This is perfect for writing
>> verbose logs without pushing more important data out of cache.
>>
>> As a bonus, write-behind makes blkio throttling much smoother for most
>> bulk file operations like copying or downloading, which write sequentially.
>>
>> The size of the minimal write-behind request is set in:
>> /sys/block/$DISK/bdi/min_write_behind_kb
>> Default is 256KB; 0 disables write-behind for this disk.
>>
>> The size of the async window is set in:
>> /sys/block/$DISK/bdi/async_write_behind_kb
>> Default is 1024KB; 0 disables sync write-behind.
>
> Should we expose these, or just make them a function of the IO limitations
> exposed by the device? Something like 2x max request size, or similar.
The window depends on the IO latency expectations for the parallel
workload and on concurrency at all levels.
It also seems that RAIDs need special treatment.
For now I think this is the minimal possible interface.
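
To illustrate how small that interface is, here is a rough sketch of a helper that sets both knobs for one disk. The two file names are the ones from the patch description above and exist only on a kernel with this RFC applied; the function names and the `sysfs_root` parameter are hypothetical, added just so the sketch can be tried against a scratch directory.

```python
import os


def write_behind_paths(disk, sysfs_root="/sys/block"):
    """Build the paths of the two tunables named in the patch description."""
    bdi = os.path.join(sysfs_root, disk, "bdi")
    return {
        "min": os.path.join(bdi, "min_write_behind_kb"),
        "async": os.path.join(bdi, "async_write_behind_kb"),
    }


def set_write_behind(disk, min_kb=256, async_kb=1024, sysfs_root="/sys/block"):
    """Write both values (in KB); the defaults mirror the ones quoted above.

    Hypothetical helper: on a kernel without the RFC patch these files
    do not exist and open() will raise FileNotFoundError.
    """
    paths = write_behind_paths(disk, sysfs_root)
    for key, value in (("min", min_kb), ("async", async_kb)):
        with open(paths[key], "w") as f:
            f.write(f"{value}\n")
    return paths
```

Calling `set_write_behind("sda", min_kb=0)` would, per the description, disable write-behind for that disk.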
>
> Finally, do you have any test results?
>
Nothing in particular yet.
For example:
$ fio --name=test --rw=write --filesize=1G --ioengine=sync --blocksize=4k --end_fsync=1
finishes earlier with the patch:
9.0s -> 8.2s for HDD
5.4s -> 4.7s for SSD
because writeback starts earlier. Both use the old sq/cfq.
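
For readers who want to see why writeback starts earlier, the two-window logic from the patch description can be modelled in user space roughly as below. This is only a sketch, not the kernel code: the class and field names are made up, the async window is assumed to be measured back from the start of the unwritten window, and a non-sequential write is assumed to simply restart tracking.

```python
from dataclasses import dataclass

KIB = 1024


@dataclass
class WriteBehindModel:
    """Toy model of the two windows behind the write position (all in bytes)."""
    min_bytes: int = 256 * KIB     # default min_write_behind_kb from the description
    async_bytes: int = 1024 * KIB  # default async_write_behind_kb from the description
    pos: int = 0                   # end of the last sequential write
    dirty_start: int = 0           # start of dirty data not yet submitted
    waited_end: int = 0            # data before this offset has been waited for

    def write(self, offset, length):
        """Account for one write and return the writeback actions it triggers."""
        actions = []
        if self.min_bytes == 0:
            return actions  # write-behind disabled for this disk
        if offset != self.pos:
            # Assumption: a non-sequential write restarts tracking.
            self.dirty_start = offset
            self.waited_end = offset
        self.pos = offset + length
        # First window: enough unwritten data in a row starts async writeback.
        if self.pos - self.dirty_start >= self.min_bytes:
            actions.append(("submit", self.dirty_start, self.pos))
            self.dirty_start = self.pos
        # Second window: wait for data that fell behind the async window.
        # A zero async window means never wait.
        if self.async_bytes > 0:
            wait_to = self.dirty_start - self.async_bytes
            if wait_to > self.waited_end:
                actions.append(("wait", self.waited_end, wait_to))
                self.waited_end = wait_to
        return actions
```

Feeding this model sequential 4k writes, as in the fio run above, gives one "submit" per 256 KiB written, and the first "wait" appears only once more than 1 MiB has been submitted.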