Re: [Lsf] Postgresql performance problems with IO latency, especially during fsync()

From: Andy Lutomirski <luto@amacapital.net>
To: Andres Freund <andres@2ndquadrant.com>
Cc: "linux-mm@kvack.org" <linux-mm@kvack.org>,
	Linux FS Devel <linux-fsdevel@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	lsf@lists.linux-foundation.org,
	Wu Fengguang <fengguang.wu@intel.com>,
	rhaas@anarazel.de
Subject: Re: [Lsf] Postgresql performance problems with IO latency, especially during fsync()
Date: Wed, 26 Mar 2014 15:26:19 -0700
Message-ID: <CALCETrVEjpFpKhY6=CEG-9Prm=uBDLS936imb=+hyWN4fXPjtg@mail.gmail.com>
In-Reply-To: <20140326215518.GH9066@alap3.anarazel.de>

On Wed, Mar 26, 2014 at 2:55 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-03-26 14:41:31 -0700, Andy Lutomirski wrote:
>> On Wed, Mar 26, 2014 at 12:11 PM, Andres Freund <andres@anarazel.de> wrote:
>> > Hi,
>> >
>> > At LSF/MM there was a slot about postgres' problems with the kernel. Our
>> > top#1 concern is frequent slow read()s that happen while another process
>> > calls fsync(), even though we'd be perfectly fine if that fsync() took
>> > ages.
>> > The "conclusion" of that part was that it'd be very useful to have a
>> > demonstration of the problem without needing a full blown postgres
>> > setup. I've quickly hacked something together, that seems to show the
>> > problem nicely.
>> >
>> > For a bit of context: lwn.net/SubscriberLink/591723/940134eb57fcc0b8/
>> > and the "IO Scheduling" bit in
>> > http://archives.postgresql.org/message-id/20140310101537.GC10663%40suse.de
>> >
>>
>> For your amusement: running this program in KVM on a 2GB disk image
>> failed, but it caused the *host* to go out to lunch for several
>> seconds while failing.  In fact, it seems to have caused the host to
>> fall over so badly that the guest decided that the disk controller was
>> timing out.  The host is btrfs, and I think that btrfs is *really* bad
>> at this kind of workload.
>
> Also, unless you changed the parameters, it's a) using a 48GB disk file,
> and b) writing really rather fast ;)
>
>> Even using ext4 is no good.  I think that dm-crypt is dying under the
>> load.  So I won't test your program for real :/
>
> Try to reduce data_size to RAM * 2, NUM_RANDOM_READERS to something
> smaller. If it still doesn't work consider increasing the two nsleep()s...
>
> I didn't have a good idea how to scale those to the current machine in a
> halfway automatic fashion.

OK, I think I'm getting reasonably bad behavior with these qemu options:

-smp 2 -cpu host -m 600 -drive file=/var/lutotmp/test.img,cache=none

and a 2GB test partition.

>
>> > Possible solutions:
>> > * Add a fadvise(UNDIRTY), that doesn't stall on a full IO queue like
>> >   sync_file_range() does.
>> > * Make IO triggered by writeback regard IO priorities and add it to
>> >   schedulers other than CFQ
>> > * Add a tunable that allows limiting the amount of dirty memory before
>> >   writeback on a per process basis.
>> > * ...?
>>
>> I thought the problem wasn't so much that priorities weren't respected
>> but that the fsync call fills up the queue, so everything starts
>> contending for the right to enqueue a new request.
>
> I think it's both actually. If I understand correctly there's not even a
> correct association to the originator anymore during a fsync triggered
> flush?
>
>> Since fsync blocks until all of its IO finishes anyway, what if it
>> could just limit itself to a much smaller number of outstanding
>> requests?
>
> Yea, that could already help. If you remove the fsync()s, the problem
> will periodically appear anyway, because writeback is triggered with
> vengeance. That'd need to be fixed in a similar way.
>
>> I'm not sure I understand the request queue stuff, but here's an idea.
>>  The block core contains this little bit of code:
>
> I haven't read enough of the code yet, to comment intelligently ;)

My little patch doesn't seem to help.  I'm either changing the wrong
piece of code entirely or I'm penalizing readers and writers too much.

Hopefully some real block layer people can comment as to whether a
refinement of this idea could work.  The behavior I want is for
writeback to be limited to using a smallish fraction of the total
request queue size -- I think that writeback should be able to enqueue
enough requests to get decent sorting performance but not enough
requests to prevent the io scheduler from doing a good job on
non-writeback I/O.

As an even more radical idea, what if there was a way to submit truly
enormous numbers of lightweight requests, such that the queue will
give the requester some kind of callback when the request is nearly
ready for submission so the requester can finish filling in the
request?  This would allow things like dm-crypt to get the benefit of
sorting without needing to encrypt hundreds of MB of data in advance
of having that data actually be written to the backing device.  It might also
allow writeback to submit multiple gigabytes of writes, in arbitrarily
large pieces, but not to need to pin pages or do whatever expensive
things are needed until the IO actually happens.

For reference, here's my patch that doesn't work well:

diff --git a/block/blk-core.c b/block/blk-core.c
index 4cd5ffc..c0dedc3 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -941,11 +941,11 @@ static struct request *__get_request(struct request_list *
        }

        /*
-        * Only allow batching queuers to allocate up to 50% over the defined
-        * limit of requests, otherwise we could have thousands of requests
-        * allocated with any setting of ->nr_requests
+        * Only allow batching queuers to allocate up to 25% of the
+        * defined limit of requests, so that non-batching queuers can
+        * get into the queue and thus be scheduled properly.
         */
-       if (rl->count[is_sync] >= (3 * q->nr_requests / 2))
+       if (rl->count[is_sync] >= (q->nr_requests + 3) / 4)
                return NULL;

        q->nr_rqs[is_sync]++;


