From: yebin <yebin10@huawei.com>
To: Jan Kara <jack@suse.cz>, Zhengyuan Liu <liuzhengyuang521@gmail.com>
Cc: viro@zeniv.linux.org.uk,
"Andrew Morton" <akpm@linux-foundation.org>,
tytso@mit.edu, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
linux-ext4@vger.kernel.org, 刘云 <liuyun01@kylinos.cn>,
"Zhengyuan Liu" <liuzhengyuan@kylinos.cn>
Subject: Re: Problem with direct IO
Date: Thu, 21 Oct 2021 16:55:44 +0800 [thread overview]
Message-ID: <61712B10.2060408@huawei.com> (raw)
In-Reply-To: <20211021080304.GB5784@quack2.suse.cz>
On 2021/10/21 16:03, Jan Kara wrote:
> On Thu 21-10-21 10:21:55, Zhengyuan Liu wrote:
>> On Thu, Oct 21, 2021 at 1:37 AM Jan Kara <jack@suse.cz> wrote:
>>> On Wed 13-10-21 09:46:46, Zhengyuan Liu wrote:
>>>> we are encountering the following MySQL crash while importing tables:
>>>>
>>>> 2021-09-26T11:22:17.825250Z 0 [ERROR] [MY-013622] [InnoDB] [FATAL]
>>>> fsync() returned EIO, aborting.
>>>> 2021-09-26T11:22:17.825315Z 0 [ERROR] [MY-013183] [InnoDB]
>>>> Assertion failure: ut0ut.cc:555 thread 281472996733168
>>>>
>>>> At the same time, we found dmesg had the following message:
>>>>
>>>> [ 4328.838972] Page cache invalidation failure on direct I/O.
>>>> Possible data corruption due to collision with buffered I/O!
>>>> [ 4328.850234] File: /data/mysql/data/sysbench/sbtest53.ibd PID:
>>>> 625 Comm: kworker/42:1
>>>>
>>>> At first we suspected MySQL was operating on the file with direct IO
>>>> and buffered IO interleaved, but after some checking we found it only
>>>> does direct IO, using aio. The problem comes from the direct-IO path
>>>> (__generic_file_write_iter) itself.
>>>>
>>>> ssize_t __generic_file_write_iter()
>>>> {
>>>>         ...
>>>>         if (iocb->ki_flags & IOCB_DIRECT) {
>>>>                 loff_t pos, endbyte;
>>>>
>>>>                 written = generic_file_direct_write(iocb, from);
>>>>                 /*
>>>>                  * If the write stopped short of completing, fall back to
>>>>                  * buffered writes. Some filesystems do this for writes to
>>>>                  * holes, for example. For DAX files, a buffered write will
>>>>                  * not succeed (even if it did, DAX does not handle dirty
>>>>                  * page-cache pages correctly).
>>>>                  */
>>>>                 if (written < 0 || !iov_iter_count(from) || IS_DAX(inode))
>>>>                         goto out;
>>>>
>>>>                 status = generic_perform_write(file, from, pos = iocb->ki_pos);
>>>>         ...
>>>> }
>>>>
>>>> From the above code snippet we can see that direct IO can fall back to
>>>> buffered IO under certain conditions, so even though MySQL only did
>>>> direct IO, it could interleave with buffered IO when the fallback
>>>> occurred. I have no idea why the FS (ext3) failed the direct IO here,
>>>> but it is strange that __generic_file_write_iter makes direct IO fall
>>>> back to buffered IO; it seems to break the semantics of direct IO.
>>>>
>>>> The reproduced environment is:
>>>> Platform: Kunpeng 920 (arm64)
>>>> Kernel: V5.15-rc
>>>> PAGESIZE: 64K
>>>> Mysql: V8.0
>>>> Innodb_page_size: default(16K)
>>> Thanks for the report. I agree this should not happen. How hard is this
>>> to reproduce? Any idea whether the fallback to buffered IO happens
>>> because iomap_dio_rw() returns -ENOTBLK or because it returns a short
>>> write?
>> It is easy to reproduce in my test environment; as I said in the previous
>> email replying to Andrew, this problem is related to the kernel page size.
> Ok, can you share a reproducer?
>
>>> Can you post output of "dumpe2fs -h <device>" for the filesystem where the
>>> problem happens? Thanks!
>> Sure, the output is:
>>
>> # dumpe2fs -h /dev/sda3
>> dumpe2fs 1.45.3 (14-Jul-2019)
>> Filesystem volume name: <none>
>> Last mounted on: /data
>> Filesystem UUID: 09a51146-b325-48bb-be63-c9df539a90a1
>> Filesystem magic number: 0xEF53
>> Filesystem revision #: 1 (dynamic)
>> Filesystem features: has_journal ext_attr resize_inode dir_index
>> filetype needs_recovery sparse_super large_file
> Thanks for the data. OK, a filesystem without extents. Does your test by
> any chance try to do direct IO to a hole in a file? Because that is not
> (and never was) supported without extents. Also the fact that you don't see
> the problem with ext4 (which means extents support) would be pointing in
> that direction.
>
> Honza
I tried to reproduce this issue as follows; maybe this is just one
scenario:
step1: Modify the kernel: add a delay in dio_complete() and dio_bio_end_aio()
step2: mkfs.ext4 /dev/sda
step3: mount /dev/sda /home/test
step4: fio -filename=/home/test/test -direct=0 -buffered=1 -iodepth 1
-thread -rw=write -rwmixread=70 -ioengine=libaio -bs=4k -size=16k
-numjobs=1 -name=test_r_w
step5: fio -filename=/home/test/test -direct=1 -buffered=0 -iodepth 1
-thread -rw=write -rwmixread=70 -ioengine=libaio -bs=4k -size=4k
-numjobs=1 -name=test_r_w &
step6: fio -filename=/home/test/test -direct=0 -buffered=1 -iodepth 1
-thread -rw=write -rwmixread=70 -ioengine=libaio -bs=4k -size=16k
-numjobs=1 -name=test_r_w
Kernel modification, based on v4.19:

diff --git a/fs/direct-io.c b/fs/direct-io.c
+#include <linux/delay.h>

@@ -305,6 +307,8 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, unsigned int flags)
 	 * end_io() when necessary, otherwise a racing buffer read would cache
 	 * zeros from unwritten extents.
 	 */
+	printk("%s: before invalidate page \n", __func__);
+	mdelay(10 * 1000);
 	if (flags & DIO_COMPLETE_INVALIDATE &&
 	    ret > 0 && dio->op == REQ_OP_WRITE &&
 	    dio->inode->i_mapping->nrpages) {
@@ -314,6 +318,7 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, unsigned int flags)
 		if (err)
 			dio_warn_stale_pagecache(dio->iocb->ki_filp);
 	}
+	printk("%s: end invalidate page \n", __func__);
 	inode_dio_end(dio->inode);
@@ -371,10 +378,17 @@ static void dio_bio_end_aio(struct bio *bio)
 	 * went in between AIO submission and completion into the
 	 * same region.
 	 */
+	printk("%s:start to sleep 5s.... \n", __func__);
+	mdelay(10 * 1000);
+	printk("%s: endto sleep, defer_completion=%d op=%d nrpages=%d\n",
+	       __func__, dio->defer_completion, dio->op,
+	       dio->inode->i_mapping->nrpages);
+
 	if (dio->result)
 		defer_completion = dio->defer_completion ||
 				   (dio->op == REQ_OP_WRITE &&
 				    dio->inode->i_mapping->nrpages);
+	printk("%s: dio=%px defer_completion=%d\n",
+	       __func__, dio, defer_completion);
 	if (defer_completion) {
 		INIT_WORK(&dio->complete_work, dio_aio_complete_work);
 		queue_work(dio->inode->i_sb->s_dio_done_wq,
Reproduction result:
# fio -filename=/home/test/test -direct=0 -buffered=1 -iodepth 1 -thread
-rw=write -rwmixread=70 -ioengine=libaio -bs=4k -size=16k -numjobs=1
-name=test_r_w
test_r_w: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T)
4096B-4096B, ioengine=libaio, iodepth=1
fio-3.7
Starting 1 thread
test_r_w: (groupid=0, jobs=1): err= 0: pid=2083: Sun Sep 26 03:18:06 2021
write: IOPS=444, BW=1778KiB/s (1820kB/s)(16.0KiB/9msec)
slat (usec): min=1294, max=3789, avg=2007.88, stdev=1191.07
clat (nsec): min=2304, max=28645, avg=9918.75, stdev=12574.80
lat (usec): min=1298, max=3824, avg=2020.36, stdev=1206.00
clat percentiles (nsec):
| 1.00th=[ 2320], 5.00th=[ 2320], 10.00th=[ 2320], 20.00th=[ 2320],
| 30.00th=[ 2960], 40.00th=[ 2960], 50.00th=[ 2960], 60.00th=[ 5792],
| 70.00th=[ 5792], 80.00th=[28544], 90.00th=[28544], 95.00th=[28544],
| 99.00th=[28544], 99.50th=[28544], 99.90th=[28544], 99.95th=[28544],
| 99.99th=[28544]
lat (usec) : 4=50.00%, 10=25.00%, 50=25.00%
cpu : usr=0.00%, sys=87.50%, ctx=0, majf=0, minf=1
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
issued rwts: total=0,4,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=1778KiB/s (1820kB/s), 1778KiB/s-1778KiB/s
(1820kB/s-1820kB/s), io=16.0KiB (16.4kB), run=9-9msec
Disk stats (read/write):
sda: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
# fio -filename=/home/test/test -direct=1 -buffered=0 -iodepth 1 -thread
-rw=write -rwmixread=70 -ioengine=libaio -bs=4k -size=4k -numjobs=1
-name=test_r_w &
[1] 2087
[root@localhost home]# test_r_w: (g=0): rw=write, bs=(R) 4096B-4096B,
(W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.7
Starting 1 thread
[root@localhost home]# [ 1475.286928] ext4_file_write_iter: start lock inode
[ 1475.288807] ext4_file_write_iter: unlock inode
[ 1475.290402] dio_bio_end_aio:start to sleep 5s....
# fio -filename=/home/test/test -direct=0 -buffered=1 -iodepth 1 -thread
-rw=write -rwmixread=70 -ioengine=libaio -bs=4k -size=16k -numjobs=1
-name=test_r_w
test_r_w: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T)
4096B-4096B, ioengine=libaio, iodepth=1
fio-3.7
Starting 1 thread
Jobs: 1 (f=0): [f(1)][-.-%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 00m:00s]
Jobs: 1 (f=0): [f(1)][-.-%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 00m:00s]
[ 1485.292939] dio_bio_end_aio: endto sleep, defer_completion=0 op=1
nrpages=4
[ 1485.293888] dio_bio_end_aio: dio=ffff88839aff5200 defer_completion=1
[ 1485.294821] dio_aio_complete_work: dio=ffff88839aff5200...
Jobs: 1 (f=1): [W(1)][-.-%][r=0KiB/s,w=0KiB/s]
[ 1485.296539] dio_complete: before invalidate page
[ 1495.298472] Page cache invalidation failure on direct I/O. Possible
data corruption due to collision with buffered I/O!
    --> buffered IO and direct IO conflict detected here
[ 1495.302746] File: /home/test/test PID: 493 Comm: kworker/3:2
[ 1495.303508] CPU: 3 PID: 493 Comm: kworker/3:2 Not tainted
4.19.90-dirty #96
[ 1495.304386] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31
04/01/2014
[ 1495.306050] Workqueue: dio/sda dio_aio_complete_work
[ 1495.316879] dio_complete: end invalidate page
test_r_w: (groupid=0, jobs=1): err= 0: pid=2092: Sun Sep 26 03:18:37 2021
write: IOPS=266, BW=1067KiB/s (1092kB/s)(16.0KiB/15msec)
slat (usec): min=2001, max=3101, avg=2691.97, stdev=480.07
clat (nsec): min=2058, max=12337, avg=5999.50, stdev=4456.54
lat (usec): min=2004, max=3116, avg=2700.11, stdev=484.22
clat percentiles (nsec):
| 1.00th=[ 2064], 5.00th=[ 2064], 10.00th=[ 2064], 20.00th=[ 2064],
| 30.00th=[ 4080], 40.00th=[ 4080], 50.00th=[ 4080], 60.00th=[ 5536],
| 70.00th=[ 5536], 80.00th=[12352], 90.00th=[12352], 95.00th=[12352],
| 99.00th=[12352], 99.50th=[12352], 99.90th=[12352], 99.95th=[12352],
| 99.99th=[12352]
lat (usec) : 4=25.00%, 10=50.00%, 20=25.00%
cpu : usr=0.00%, sys=100.00%, ctx=1, majf=0, minf=1
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
issued rwts: total=0,4,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=1067KiB/s (1092kB/s), 1067KiB/s-1067KiB/s
(1092kB/s-1092kB/s), io=16.0KiB (16.4kB), run=15-15msec
Disk stats (read/write):
sda: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
test_r_w: (groupid=0, jobs=1): err= 0: pid=2089: Sun Sep 26 03:18:37 2021
write: IOPS=0, BW=204B/s (204B/s)(4096B/20033msec)
slat (nsec): min=3358.7k, max=3358.7k, avg=3358748.00, stdev= 0.00
clat (nsec): min=20027M, max=20027M, avg=20027265355.00, stdev= 0.00
lat (nsec): min=20031M, max=20031M, avg=20030627283.00, stdev= 0.00
clat percentiles (msec):
| 1.00th=[17113], 5.00th=[17113], 10.00th=[17113], 20.00th=[17113],
| 30.00th=[17113], 40.00th=[17113], 50.00th=[17113], 60.00th=[17113],
| 70.00th=[17113], 80.00th=[17113], 90.00th=[17113], 95.00th=[17113],
| 99.00th=[17113], 99.50th=[17113], 99.90th=[17113], 99.95th=[17113],
| 99.99th=[17113]
cpu : usr=0.00%, sys=0.02%, ctx=2, majf=0, minf=1
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
issued rwts: total=0,1,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=204B/s (204B/s), 204B/s-204B/s (204B/s-204B/s), io=4096B
(4096B), run=20033-20033msec
Disk stats (read/write):
sda: ios=0/5, merge=0/1, ticks=0/10018, in_queue=10015, util=36.98%