Hi,

At LSF/MM there was a slot about postgres' problems with the kernel.
Our #1 concern is frequent slow read()s that happen while another
process calls fsync(), even though we'd be perfectly fine if that
fsync() took ages.

The "conclusion" of that part was that it'd be very useful to have a
demonstration of the problem without needing a full-blown postgres
setup. I've quickly hacked something together that seems to show the
problem nicely.

For a bit of context:
lwn.net/SubscriberLink/591723/940134eb57fcc0b8/
and the "IO Scheduling" bit in
http://archives.postgresql.org/message-id/20140310101537.GC10663%40suse.de

The tool's output looks like this:

gcc -std=c99 -Wall -ggdb ~/tmp/ioperf.c -o ioperf && ./ioperf
...
wal[12155]: avg: 0.0 msec; max: 0.0 msec
commit[12155]: avg: 0.2 msec; max: 15.4 msec
wal[12155]: avg: 0.0 msec; max: 0.0 msec
read[12157]: avg: 0.2 msec; max: 9.4 msec
...
read[12165]: avg: 0.2 msec; max: 9.4 msec
wal[12155]: avg: 0.0 msec; max: 0.0 msec
starting fsync() of files
finished fsync() of files
read[12162]: avg: 0.6 msec; max: 2765.5 msec

So the average read time is below one millisecond (SSD, roughly 50%
cached workload). But once another backend does the fsync(), read
latency skyrockets.

A concurrent iostat shows the problem pretty clearly:

Device: rrqm/s wrqm/s     r/s      w/s  rMB/s  wMB/s avgrq-sz avgqu-sz await r_await w_await svctm  %util
sda       1.00   0.00 6322.00   337.00  51.73   4.38    17.26     2.09  0.32    0.19    2.59  0.14  90.00
sda       0.00   0.00 6016.00   303.00  47.18   3.95    16.57     2.30  0.36    0.23    3.12  0.15  94.40
sda       0.00   0.00 6236.00  1059.00  49.52  12.88    17.52     5.91  0.64    0.20    3.23  0.12  88.40
sda       0.00   0.00  105.00 26173.00   0.89 311.39    24.34   142.37  5.42   27.73    5.33  0.04 100.00
sda       0.00   0.00   78.00 27199.00   0.87 324.06    24.40   142.30  5.25   11.08    5.23  0.04 100.00
sda       0.00   0.00   10.00 33488.00   0.11 399.05    24.40   136.41  4.07  100.40    4.04  0.03 100.00
sda       0.00   0.00 3819.00 10096.00  31.14 120.47    22.31    42.80  3.10    0.32    4.15  0.07  96.00
sda       0.00   0.00 6482.00   346.00  52.98   4.53    17.25     1.93  0.28    0.20    1.80  0.14  93.20

While the fsync() is going on (or whenever the kernel decides to start
writing out aggressively for some other reason), the amount of writes
hitting the disk increases by two orders of magnitude, with,
unsurprisingly, disastrous consequences for read() performance. We
really want a way to pace the writes issued to the disk more evenly.

The attached program can right now only be configured by changing some
details in the code itself, but I guess that's not a problem. It
allocates two files upfront and then starts testing; if the files
already exist, it reuses them. (A condensed sketch of what it does is
appended as a PS below.)

Possible solutions:
* Add a fadvise(UNDIRTY) that doesn't stall on a full IO queue the way
  sync_file_range() does.
* Make IO triggered by writeback respect IO priorities, and add IO
  priority support to schedulers other than CFQ.
* Add a tunable that limits the amount of dirty memory accumulated
  before writeback starts, on a per-process basis.
* ...?

If somebody familiar with buffered IO writeback is around at LSF/MM
(or rather, at the collab summit): Robert and I will be around for the
next few days.

Greetings,

Andres Freund
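
PS: For those who'd rather not dig through the attachment, here's a
condensed sketch of what the test program does. This is not the
attached ioperf.c itself: file names and sizes are placeholders, error
handling is minimal, and the WAL/commit writers that produce the
wal[]/commit[] lines above are omitted. One child dirties a large file
through the page cache and then fsync()s it in one go, like a
checkpoint; several children measure random read() latencies against a
second file.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

#define FILE_SIZE (1UL * 1024 * 1024 * 1024)	/* placeholder; tune to exceed RAM */
#define BLCKSZ 8192

static double now_msec(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000.0 + ts.tv_nsec / 1000000.0;
}

/* allocate the file upfront, or reuse it if it already exists */
static int open_file(const char *path)
{
	int fd = open(path, O_RDWR | O_CREAT, 0600);

	if (fd < 0 || ftruncate(fd, FILE_SIZE) < 0)
	{
		perror(path);
		exit(1);
	}
	return fd;
}

/* dirty lots of data through the page cache, then fsync() it in one
 * go, which is roughly what a postgres checkpoint ends up doing */
static void checkpointer(void)
{
	int fd = open_file("data.0");
	char buf[BLCKSZ];

	memset(buf, 'x', sizeof(buf));
	for (;;)
	{
		for (off_t off = 0; off < (off_t) FILE_SIZE; off += BLCKSZ)
			if (pwrite(fd, buf, BLCKSZ, off) != BLCKSZ)
				exit(1);
		fprintf(stderr, "starting fsync() of files\n");
		fsync(fd);
		fprintf(stderr, "finished fsync() of files\n");
		sleep(10);
	}
}

/* random 8kb reads against the second file, reporting avg/max latency */
static void reader(void)
{
	int fd = open_file("read.0");
	char buf[BLCKSZ];
	double sum = 0, max = 0;
	int n = 0;

	srandom(getpid());
	for (;;)
	{
		off_t off = (random() % (FILE_SIZE / BLCKSZ)) * BLCKSZ;
		double start = now_msec();
		double d;

		if (pread(fd, buf, BLCKSZ, off) != BLCKSZ)
			exit(1);
		d = now_msec() - start;
		sum += d;
		if (d > max)
			max = d;
		if (++n == 10000)
		{
			fprintf(stderr, "read[%d]: avg: %.1f msec; max: %.1f msec\n",
					(int) getpid(), sum / n, max);
			n = 0;
			sum = max = 0;
		}
	}
}

int main(void)
{
	if (fork() == 0)
		checkpointer();
	for (int i = 0; i < 4; i++)
		if (fork() == 0)
			reader();
	wait(NULL);			/* children run until interrupted */
	return 0;
}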
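
PPS: To make the sync_file_range() point above concrete: what we'd
like to do after dirtying a batch of buffers is roughly the
hypothetical pace_writes() below, i.e. kick off writeback for just
that range without waiting for it. But SYNC_FILE_RANGE_WRITE blocks as
soon as the device's request queue is full, so the process trying to
pace its writes ends up stalled anyway; that's the gap a nonblocking
fadvise(UNDIRTY) would fill.

#define _GNU_SOURCE
#include <fcntl.h>

/* Hypothetical pacing helper: ask the kernel to start writeback of the
 * range we just dirtied, without waiting for completion.  In practice
 * this still blocks once the request queue is full. */
static void pace_writes(int fd, off_t offset, off_t nbytes)
{
	sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
}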