On 2021/10/21 20:20, Zhengyuan Liu wrote:
> On Thu, Oct 21, 2021 at 4:55 PM yebin wrote:
>>
>>
>> On 2021/10/21 16:03, Jan Kara wrote:
>>
>> On Thu 21-10-21 10:21:55, Zhengyuan Liu wrote:
>>
>> On Thu, Oct 21, 2021 at 1:37 AM Jan Kara wrote:
>>
>> On Wed 13-10-21 09:46:46, Zhengyuan Liu wrote:
>>
>> We are encountering the following MySQL crash problem while importing tables:
>>
>> 2021-09-26T11:22:17.825250Z 0 [ERROR] [MY-013622] [InnoDB] [FATAL]
>> fsync() returned EIO, aborting.
>> 2021-09-26T11:22:17.825315Z 0 [ERROR] [MY-013183] [InnoDB]
>> Assertion failure: ut0ut.cc:555 thread 281472996733168
>>
>> At the same time, we found dmesg had the following message:
>>
>> [ 4328.838972] Page cache invalidation failure on direct I/O.
>> Possible data corruption due to collision with buffered I/O!
>> [ 4328.850234] File: /data/mysql/data/sysbench/sbtest53.ibd PID:
>> 625 Comm: kworker/42:1
>>
>> At first we suspected MySQL was operating on the file with direct IO and
>> buffered IO interleaved, but after some checking we found it only does
>> direct IO using aio. The problem actually comes from the direct-IO path
>> (__generic_file_write_iter) itself:
>>
>> ssize_t __generic_file_write_iter()
>> {
>>         ...
>>         if (iocb->ki_flags & IOCB_DIRECT) {
>>                 loff_t pos, endbyte;
>>
>>                 written = generic_file_direct_write(iocb, from);
>>                 /*
>>                  * If the write stopped short of completing, fall back to
>>                  * buffered writes.  Some filesystems do this for writes to
>>                  * holes, for example.  For DAX files, a buffered write will
>>                  * not succeed (even if it did, DAX does not handle dirty
>>                  * page-cache pages correctly).
>>                  */
>>                 if (written < 0 || !iov_iter_count(from) || IS_DAX(inode))
>>                         goto out;
>>
>>                 status = generic_perform_write(file, from, pos = iocb->ki_pos);
>>         ...
>> }
>>
>> From the code snippet above we can see that direct IO can fall back to
>> buffered IO under certain conditions, so even though MySQL only issued
>> direct IO, it could interleave with buffered IO when the fallback occurred.
>> I have no idea yet why the FS (ext3) failed the direct IO, but it is strange
>> that __generic_file_write_iter makes direct IO fall back to buffered IO; it
>> seems to break the semantics of direct IO.
>>
>> The reproduction environment is:
>> Platform: Kunpeng 920 (arm64)
>> Kernel: v5.15-rc
>> PAGESIZE: 64K
>> MySQL: V8.0
>> Innodb_page_size: default (16K)
>>
>> Thanks for the report. I agree this should not happen. How hard is this to
>> reproduce? Any idea whether the fallback to buffered IO happens because
>> iomap_dio_rw() returns -ENOTBLK or because it returns a short write?
>>
>> It is easy to reproduce in my test environment; as I said in the previous
>> email replying to Andrew, this problem is related to the kernel page size.
>>
>> OK, can you share a reproducer?
>>
>> Can you post the output of "dumpe2fs -h " for the filesystem where the
>> problem happens? Thanks!
>>
>> Sure, the output is:
>>
>> # dumpe2fs -h /dev/sda3
>> dumpe2fs 1.45.3 (14-Jul-2019)
>> Filesystem volume name:
>> Last mounted on:          /data
>> Filesystem UUID:          09a51146-b325-48bb-be63-c9df539a90a1
>> Filesystem magic number:  0xEF53
>> Filesystem revision #:    1 (dynamic)
>> Filesystem features:      has_journal ext_attr resize_inode dir_index
>> filetype needs_recovery sparse_super large_file
>>
>> Thanks for the data. OK, a filesystem without extents. Does your test by
>> any chance try to do direct IO to a hole in a file? Because that is not
>> (and never was) supported without extents. Also the fact that you don't
>> see the problem with ext4 (which means extents support) would be pointing
>> in that direction.
>>
>>                                                                 Honza
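To illustrate the case Jan is asking about, here is a minimal user-space sketch (not taken from any of the mails above) of an O_DIRECT write into a file hole. On ext3/non-extent ext4 this is the unsupported case, so the generic code may complete it as a buffered write instead. The path, size, and 4K alignment are assumptions for illustration only.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
        const size_t blk = 4096;            /* assumed DIO alignment */
        const off_t file_size = 8 * blk;    /* small sparse file */
        void *buf;
        int fd;

        /* Hypothetical test path, not taken from the thread. */
        fd = open("/data/dio-hole-test", O_RDWR | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) {
                perror("open(O_DIRECT)");
                return 1;
        }

        /* Extending an empty file with ftruncate() leaves it as one big hole. */
        if (ftruncate(fd, file_size) < 0) {
                perror("ftruncate");
                return 1;
        }

        if (posix_memalign(&buf, blk, blk)) {
                fprintf(stderr, "posix_memalign failed\n");
                return 1;
        }
        memset(buf, 0xab, blk);

        /* Direct write into the middle of the hole: without extents the
         * filesystem refuses this as direct IO and __generic_file_write_iter
         * may fall back to a buffered write. */
        if (pwrite(fd, buf, blk, 4 * blk) != (ssize_t)blk)
                perror("pwrite(O_DIRECT into hole)");

        /* Once a stale-pagecache collision has been recorded on the inode,
         * fsync() reports it as EIO -- the error InnoDB aborted on. */
        if (fsync(fd) < 0)
                perror("fsync");

        free(buf);
        close(fd);
        return 0;
}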
>>
>> I tried to reproduce this issue as follows; maybe this is just one scenario:
>> step1: Modify kernel code, add delay in dio_complete and dio_bio_end_aio
>> step2: mkfs.ext4 /dev/sda
>> step3: mount /dev/sda /home/test
>> step4: fio -filename=/home/test/test -direct=0 -buffered=1 -iodepth 1 -thread -rw=write -rwmixread=70 -ioengine=libaio -bs=4k -size=16k -numjobs=1 -name=test_r_w
>> step5: fio -filename=/home/test/test -direct=1 -buffered=0 -iodepth 1 -thread -rw=write -rwmixread=70 -ioengine=libaio -bs=4k -size=4k -numjobs=1 -name=test_r_w &
>> step6: fio -filename=/home/test/test -direct=0 -buffered=1 -iodepth 1 -thread -rw=write -rwmixread=70 -ioengine=libaio -bs=4k -size=16k -numjobs=1 -name=test_r_w
>
> Hi, yebin
>
> Thanks for your test case, but it is not the same problem. The scenario you
> describe here is something the kernel is designed to handle, since you are
> intentionally interleaving buffered IO and direct IO; please look at these
> two patches:
>
> 332391a ("fs: Fix page cache inconsistency when mixing buffered and AIO DIO")
> 5a9d929 ("iomap: report collisions between directio and buffered writes to userspace")

Maybe we can reproduce your issue as follows:
1. mkfs.ext4 -b 1024 /dev/sdb (PAGE_SIZE=4K)
2. mount /dev/sdb /home/test
3. fio -filename=/home/test/test -direct=0 -buffered=1 -iodepth 1 -thread -rw=write -rwmixread=70 -ioengine=libaio -bs=1k -size=4k -numjobs=1 -name=test_r_w
4. fio -filename=/home/test/test -direct=1 -buffered=0 -iodepth 1 -thread -rw=write -rwmixread=70 -ioengine=libaio -bs=1k -size=1k -numjobs=1 -name=test_r_w &
5. fio -filename=/home/test/test -direct=0 -buffered=1 -iodepth 1 -thread -rw=write -rwmixread=70 -ioengine=libaio -bs=1k -size=5k -offset=1024 -numjobs=1 -name=test_r_w

As the filesystem block size is 1K, there are 4 buffer_heads per page, so
direct IO and buffered IO may operate on the same page even though they do
not operate on the same logical block.
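To make that concrete, here is a small user-space sketch (not from the original mails) of the same page-index arithmetic that dio_complete() uses when it invalidates the page cache; the 0 and 1024 offsets mirror the fio jobs above, the rest is an assumption for illustration.

#include <stdio.h>

int main(void)
{
        const unsigned long page_shift = 12;   /* 4K pages */
        const unsigned long dio_off = 0;       /* 1K direct write at block 0 */
        const unsigned long dio_len = 1024;
        const unsigned long buf_off = 1024;    /* buffered write at block 1 */

        /* Same arithmetic as dio_complete():
         *   invalidate_inode_pages2_range(mapping,
         *           offset >> PAGE_SHIFT, (offset + ret - 1) >> PAGE_SHIFT);
         */
        printf("direct IO invalidates pages %lu..%lu\n",
               dio_off >> page_shift, (dio_off + dio_len - 1) >> page_shift);
        printf("buffered IO dirties page    %lu\n", buf_off >> page_shift);

        /* Both print page 0: the 1K direct write's invalidation hits the page
         * the buffered writer has dirtied, so invalidation fails and
         * dio_warn_stale_pagecache() fires even without a block-level overlap. */
        return 0;
}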
I add some debug information into the kernel as follows:

diff --git a/fs/direct-io.c b/fs/direct-io.c
index ec7c7d1c0329..aebc3e76f8ca 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -38,6 +38,8 @@
 #include 
 #include 
+#include 
+
 /*
  * How many user pages to map in one call to get_user_pages(). This determines
  * the size of a structure in the slab cache
@@ -224,20 +226,20 @@ static inline struct page *dio_get_page(struct dio *dio,
  */
 void dio_warn_stale_pagecache(struct file *filp)
 {
-	static DEFINE_RATELIMIT_STATE(_rs, 86400 * HZ, DEFAULT_RATELIMIT_BURST);
+//	static DEFINE_RATELIMIT_STATE(_rs, 86400 * HZ, DEFAULT_RATELIMIT_BURST);
 	char pathname[128];
 	struct inode *inode = file_inode(filp);
 	char *path;
 
 	errseq_set(&inode->i_mapping->wb_err, -EIO);
-	if (__ratelimit(&_rs)) {
+//	if (__ratelimit(&_rs)) {
 		path = file_path(filp, pathname, sizeof(pathname));
 		if (IS_ERR(path))
 			path = "(unknown)";
 		pr_crit("Page cache invalidation failure on direct I/O. Possible data corruption due to collision with buffered I/O!\n");
 		pr_crit("File: %s PID: %d Comm: %.20s\n", path, current->pid, current->comm);
-	}
+//	}
 }
 
 /**
@@ -305,15 +307,20 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, unsigned int flags)
 	 * end_io() when necessary, otherwise a racing buffer read would cache
 	 * zeros from unwritten extents.
 	 */
+	printk("%s: before invalidate page \n", __func__);
+	mdelay(10 * 1000);
 	if (flags & DIO_COMPLETE_INVALIDATE &&
 	    ret > 0 && dio->op == REQ_OP_WRITE &&
 	    dio->inode->i_mapping->nrpages) {
+		printk("%s: start to call invalidate_inode_pages2_range nrpags=%lu\n",
+			__func__, dio->inode->i_mapping->nrpages);
 		err = invalidate_inode_pages2_range(dio->inode->i_mapping,
 					offset >> PAGE_SHIFT,
 					(offset + ret - 1) >> PAGE_SHIFT);
 		if (err)
 			dio_warn_stale_pagecache(dio->iocb->ki_filp);
 	}
+	printk("%s: end invalidate page \n", __func__);
 
 	inode_dio_end(dio->inode);
 
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 21b0fc0d6ffe..35d1a330437b 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -241,6 +241,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		inode_lock(inode);
 	}
 
+	printk("%s: start lock inode offset=%llu \n", __func__, iocb->ki_pos);
 	ret = ext4_write_checks(iocb, from);
 	if (ret <= 0)
 		goto out;
@@ -278,7 +279,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (ret == -EIOCBQUEUED && unaligned_aio)
 		ext4_unwritten_wait(inode);
 	inode_unlock(inode);
-
+	printk("%s: unlock inode\n", __func__);
 	if (ret > 0)
 		ret = generic_write_sync(iocb, ret);
 
diff --git a/mm/truncate.c b/mm/truncate.c
index 43c73db17a0a..9b1ffedf831b 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -677,6 +677,7 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 	int ret2 = 0;
 	int did_range_unmap = 0;
 
+	printk("%s: start=%lu end=%lu pages=%lu\n", __func__, start, end, mapping->nrpages);
 	if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
 		goto out;
 
@@ -690,6 +691,7 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 
 		/* We rely upon deletion not changing page->index */
 		index = indices[i];
+		printk("%s: index=%lu\n", __func__, index);
 		if (index > end)
 			break;

>> Kernel modification based on v4.19:
>> diff --git a/fs/direct-io.c b/fs/direct-io.c
>> +#include 
>>
>>  /**
>> @@ -305,6 +307,8 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, unsigned int flags)
>>  	 * end_io() when necessary, otherwise a racing buffer read would cache
>>  	 * zeros from unwritten extents.
>>  	 */
>> +	printk("%s: before invalidate page \n", __func__);
>> +	mdelay(10 * 1000);
>>  	if (flags & DIO_COMPLETE_INVALIDATE &&
>>  	    ret > 0 && dio->op == REQ_OP_WRITE &&
>>  	    dio->inode->i_mapping->nrpages) {
>> @@ -314,6 +318,7 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, unsigned int flags)
>>  		if (err)
>>  			dio_warn_stale_pagecache(dio->iocb->ki_filp);
>>  	}
>> +	printk("%s: end invalidate page \n", __func__);
>>
>>  	inode_dio_end(dio->inode);
>>
>> @@ -371,10 +378,17 @@ static void dio_bio_end_aio(struct bio *bio)
>>  	 * went in between AIO submission and completion into the
>>  	 * same region.
>>  	 */
>> +	printk("%s:start to sleep 5s.... \n", __func__);
\n", __func__); >> + mdelay(10 * 1000); >> + printk("%s: endto sleep, defer_completion=%d op=%d nrpages=%d\n", >> + __func__,dio->defer_completion, dio->op, dio->inode->i_mapping->nrpages); >> + >> if (dio->result) >> defer_completion = dio->defer_completion || >> (dio->op == REQ_OP_WRITE && >> dio->inode->i_mapping->nrpages); >> + printk("%s: dio=%px defer_completion=%d\n", >> + __func__, dio, defer_completion); >> if (defer_completion) { >> INIT_WORK(&dio->complete_work, dio_aio_complete_work); >> queue_work(dio->inode->i_sb->s_dio_done_wq, >> >> >> Reproduce result : >> s=1 -name=test_r_w=write -rwmixread=70 -ioengine=libaio -bs=4k -size=16k -numjob >> test_r_w: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1 >> fio-3.7 >> Starting 1 thread >> test_r_w: (groupid=0, jobs=1): err= 0: pid=2083: Sun Sep 26 03:18:06 2021 >> write: IOPS=444, BW=1778KiB/s (1820kB/s)(16.0KiB/9msec) >> slat (usec): min=1294, max=3789, avg=2007.88, stdev=1191.07 >> clat (nsec): min=2304, max=28645, avg=9918.75, stdev=12574.80 >> lat (usec): min=1298, max=3824, avg=2020.36, stdev=1206.00 >> clat percentiles (nsec): >> | 1.00th=[ 2320], 5.00th=[ 2320], 10.00th=[ 2320], 20.00th=[ 2320], >> | 30.00th=[ 2960], 40.00th=[ 2960], 50.00th=[ 2960], 60.00th=[ 5792], >> | 70.00th=[ 5792], 80.00th=[28544], 90.00th=[28544], 95.00th=[28544], >> | 99.00th=[28544], 99.50th=[28544], 99.90th=[28544], 99.95th=[28544], >> | 99.99th=[28544] >> lat (usec) : 4=50.00%, 10=25.00%, 50=25.00% >> cpu : usr=0.00%, sys=87.50%, ctx=0, majf=0, minf=1 >> IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% >> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% >> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% >> issued rwts: total=0,4,0,0 short=0,0,0,0 dropped=0,0,0,0 >> latency : target=0, window=0, percentile=100.00%, depth=1 >> >> Run status group 0 (all jobs): >> WRITE: bw=1778KiB/s (1820kB/s), 1778KiB/s-1778KiB/s (1820kB/s-1820kB/s), io=16.0KiB (16.4kB), run=9-9msec >> >> Disk stats (read/write): >> sda: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00% >> =1 -name=test_r_w &write -rwmixread=70 -ioengine=libaio -bs=4k -size=4k -numjobs >> [1] 2087 >> [root@localhost home]# test_r_w: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1 >> fio-3.7 >> Starting 1 thread >> >> [root@localhost home]# [ 1475.286928] ext4_file_write_iter: start lock inode >> [ 1475.288807] ext4_file_write_iter: unlock inode >> [ 1475.290402] dio_bio_end_aio:start to sleep 5s.... >> >> s=1 -name=test_r_w=write -rwmixread=70 -ioengine=libaio -bs=4k -size=16k -numjob >> test_r_w: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1 >> fio-3.7 >> Starting 1 thread >> Jobs: 1 (f=0): [f(1)][-.-%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 00m:00s] >> Jobs: 1 (f=0): [f(1)][-.-%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 00m:00s] >> [ 1485.292939] dio_bio_end_aio: endto sleep, defer_completion=0 op=1 nrpages=4 >> [ 1485.293888] dio_bio_end_aio: dio=ffff88839aff5200 defer_completion=1 >> [ 1485.294821] dio_aio_complete_work: dio=ffff88839aff5200... >> Jobs: 1 (f=1): [W(1)][-.-%][r=0KiB/s,w=0KiB/s][r[ 1485.296539] dio_complete: before invalidate page >> [ 1495.298472] Page cache invalidation failure on direct I/O. Possible data corruption due to collision with buffered I/O! -->detect buffer io and direct io conflict. 
>> [ 1495.302746] File: /home/test/test PID: 493 Comm: kworker/3:2
>> [ 1495.303508] CPU: 3 PID: 493 Comm: kworker/3:2 Not tainted 4.19.90-dirty #96
>> [ 1495.304386] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
>> Jobs: 1 (f=1): [[ 1495.306050] Workqueue: dio/sda dio_aio_complete_work
>> W(1)][-.-%][r=0K
>> [ 1495.316879] dio_complete: end invalidate page
>> test_r_w: (groupid=0, jobs=1): err= 0: pid=2092: Sun Sep 26 03:18:37 2021
>>   write: IOPS=266, BW=1067KiB/s (1092kB/s)(16.0KiB/15msec)
>>     slat (usec): min=2001, max=3101, avg=2691.97, stdev=480.07
>>     clat (nsec): min=2058, max=12337, avg=5999.50, stdev=4456.54
>>      lat (usec): min=2004, max=3116, avg=2700.11, stdev=484.22
>>     clat percentiles (nsec):
>>      |  1.00th=[ 2064],  5.00th=[ 2064], 10.00th=[ 2064], 20.00th=[ 2064],
>>      | 30.00th=[ 4080], 40.00th=[ 4080], 50.00th=[ 4080], 60.00th=[ 5536],
>>      | 70.00th=[ 5536], 80.00th=[12352], 90.00th=[12352], 95.00th=[12352],
>>      | 99.00th=[12352], 99.50th=[12352], 99.90th=[12352], 99.95th=[12352],
>>      | 99.99th=[12352]
>>   lat (usec)   : 4=25.00%, 10=50.00%, 20=25.00%
>>   cpu          : usr=0.00%, sys=100.00%, ctx=1, majf=0, minf=1
>>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      issued rwts: total=0,4,0,0 short=0,0,0,0 dropped=0,0,0,0
>>      latency   : target=0, window=0, percentile=100.00%, depth=1
>>
>> Run status group 0 (all jobs):
>>   WRITE: bw=1067KiB/s (1092kB/s), 1067KiB/s-1067KiB/s (1092kB/s-1092kB/s), io=16.0KiB (16.4kB), run=15-15msec
>>
>> Disk stats (read/write):
>>   sda: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>>
>> test_r_w: (groupid=0, jobs=1): err= 0: pid=2089: Sun Sep 26 03:18:37 2021
>>   write: IOPS=0, BW=204B/s (204B/s)(4096B/20033msec)
>>     slat (nsec): min=3358.7k, max=3358.7k, avg=3358748.00, stdev= 0.00
>>     clat (nsec): min=20027M, max=20027M, avg=20027265355.00, stdev= 0.00
>>      lat (nsec): min=20031M, max=20031M, avg=20030627283.00, stdev= 0.00
>>     clat percentiles (msec):
>>      |  1.00th=[17113],  5.00th=[17113], 10.00th=[17113], 20.00th=[17113],
>>      | 30.00th=[17113], 40.00th=[17113], 50.00th=[17113], 60.00th=[17113],
>>      | 70.00th=[17113], 80.00th=[17113], 90.00th=[17113], 95.00th=[17113],
>>      | 99.00th=[17113], 99.50th=[17113], 99.90th=[17113], 99.95th=[17113],
>>      | 99.99th=[17113]
>>   cpu          : usr=0.00%, sys=0.02%, ctx=2, majf=0, minf=1
>>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      issued rwts: total=0,1,0,0 short=0,0,0,0 dropped=0,0,0,0
>>      latency   : target=0, window=0, percentile=100.00%, depth=1
>>
>> Run status group 0 (all jobs):
>>   WRITE: bw=204B/s (204B/s), 204B/s-204B/s (204B/s-204B/s), io=4096B (4096B), run=20033-20033msec
>>
>> Disk stats (read/write):
>>   sda: ios=0/5, merge=0/1, ticks=0/10018, in_queue=10015, util=36.98%
>>
>>
>>
> .
>
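All of the reproductions above rely on libaio with O_DIRECT, because the path being instrumented is the AIO completion path (dio_bio_end_aio() deferring to dio_aio_complete_work() and dio_complete()). For reference, here is a minimal sketch of that submission pattern, roughly what fio's ioengine=libaio and InnoDB's native AIO do; the file path and sizes are assumptions, build with -laio.

#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        io_context_t ctx = 0;
        struct iocb cb, *list[1] = { &cb };
        struct io_event ev;
        void *buf;
        int fd;

        /* Hypothetical test file; 4K-aligned buffer required for O_DIRECT. */
        fd = open("/home/test/test", O_RDWR | O_CREAT | O_DIRECT, 0644);
        if (fd < 0 || posix_memalign(&buf, 4096, 4096)) {
                perror("setup");
                return 1;
        }
        memset(buf, 0x5a, 4096);

        if (io_setup(1, &ctx) < 0) {
                fprintf(stderr, "io_setup failed\n");
                return 1;
        }

        /* Async direct write of one 4K block at offset 0; its completion is
         * what ends up in dio_bio_end_aio() on the kernel side. */
        io_prep_pwrite(&cb, fd, buf, 4096, 0);
        if (io_submit(ctx, 1, list) != 1) {
                fprintf(stderr, "io_submit failed\n");
                return 1;
        }

        if (io_getevents(ctx, 1, 1, &ev, NULL) == 1)
                printf("aio write completed, res=%ld\n", (long)ev.res);

        io_destroy(ctx);
        free(buf);
        close(fd);
        return 0;
}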