From: yebin <yebin10@huawei.com>
To: Zhengyuan Liu
CC: Jan Kara, Andrew Morton, 刘云, Zhengyuan Liu
Subject: Re: Problem with direct IO
Date: Fri, 29 Oct 2021 16:16:20 +0800
Message-ID: <617BADD4.7030009@huawei.com>
References: <20211020173729.GF16460@quack2.suse.cz> <20211021080304.GB5784@quack2.suse.cz> <61712B10.2060408@huawei.com>

On 2021/10/21 20:20, Zhengyuan Liu wrote:
> On Thu, Oct 21, 2021 at 4:55 PM yebin wrote:
>>
>>
>> On 2021/10/21 16:03, Jan Kara wrote:
>>
>> On Thu 21-10-21 10:21:55, Zhengyuan Liu wrote:
>>
>> On Thu, Oct 21, 2021 at 1:37 AM Jan Kara wrote:
>>
>> On Wed 13-10-21 09:46:46, Zhengyuan Liu wrote:
>>
>> We are encountering the following MySQL crash while importing tables:
>>
>>     2021-09-26T11:22:17.825250Z 0 [ERROR] [MY-013622] [InnoDB] [FATAL]
>>     fsync() returned
EIO, aborting.
>>     2021-09-26T11:22:17.825315Z 0 [ERROR] [MY-013183] [InnoDB]
>>     Assertion failure: ut0ut.cc:555 thread 281472996733168
>>
>> At the same time, we found dmesg had the following message:
>>
>>     [ 4328.838972] Page cache invalidation failure on direct I/O.
>>     Possible data corruption due to collision with buffered I/O!
>>     [ 4328.850234] File: /data/mysql/data/sysbench/sbtest53.ibd PID:
>>     625 Comm: kworker/42:1
>>
>> At first we doubted that MySQL was operating on the file with direct IO
>> and buffered IO interleaved, but after some checking we found it only
>> does direct IO, using AIO. The problem comes from the direct-IO
>> path (__generic_file_write_iter) itself.
>>
>> ssize_t __generic_file_write_iter()
>> {
>> ...
>>         if (iocb->ki_flags & IOCB_DIRECT) {
>>                 loff_t pos, endbyte;
>>
>>                 written = generic_file_direct_write(iocb, from);
>>                 /*
>>                  * If the write stopped short of completing, fall back to
>>                  * buffered writes.  Some filesystems do this for writes to
>>                  * holes, for example.  For DAX files, a buffered write will
>>                  * not succeed (even if it did, DAX does not handle dirty
>>                  * page-cache pages correctly).
>>                  */
>>                 if (written < 0 || !iov_iter_count(from) || IS_DAX(inode))
>>                         goto out;
>>
>>                 status = generic_perform_write(file, from, pos = iocb->ki_pos);
>> ...
>> }
>>
>> From the above code snippet we can see that direct IO can fall back to
>> buffered IO under certain conditions, so even though MySQL only did
>> direct IO, it could interleave with buffered IO when the fallback
>> occurred. I have no idea why the FS (ext3) failed the direct IO, but it
>> is strange that __generic_file_write_iter makes direct IO fall back to
>> buffered IO; it seems to break the semantics of direct IO.
>>
>> The reproduction environment is:
>> Platform:  Kunpeng 920 (arm64)
>> Kernel: v5.15-rc
>> PAGESIZE: 64K
>> Mysql:  V8.0
>> Innodb_page_size: default (16K)
>>
>> Thanks for the report. I agree this should not happen. How hard is this
>> to reproduce? Any idea whether the fallback to buffered IO happens
>> because iomap_dio_rw() returns -ENOTBLK or because it returns a short
>> write?
>>
>> It is easy to reproduce in my test environment; as I said in the
>> previous email replying to Andrew, this problem is related to the
>> kernel page size.
>>
>> Ok, can you share a reproducer?
>>
>> Can you post the output of "dumpe2fs -h <device>" for the filesystem
>> where the problem happens? Thanks!
>>
>> Sure, the output is:
>>
>> # dumpe2fs -h /dev/sda3
>> dumpe2fs 1.45.3 (14-Jul-2019)
>> Filesystem volume name:   <none>
>> Last mounted on:          /data
>> Filesystem UUID:          09a51146-b325-48bb-be63-c9df539a90a1
>> Filesystem magic number:  0xEF53
>> Filesystem revision #:    1 (dynamic)
>> Filesystem features:      has_journal ext_attr resize_inode dir_index
>> filetype needs_recovery sparse_super large_file
>>
>> Thanks for the data. OK, a filesystem without extents. Does your test by
>> any chance try to do direct IO to a hole in a file? Because that is not
>> (and never was) supported without extents. Also the fact that you don't
>> see the problem with ext4 (which means extents support) would be
>> pointing in that direction.
>>
>>                                                                 Honza
>>
>>
>> I try to reproduce this issue as follows; maybe this is just one scenario:
>> step1:  Modify kernel code, add delay in dio_complete and dio_bio_end_aio
>> step2:  mkfs.ext4 /dev/sda
>> step3:  mount /dev/sda /home/test
>> step4:  fio -filename=/home/test/test -direct=0 -buffered=1 -iodepth 1 -thread -rw=write -rwmixread=70 -ioengine=libaio -bs=4k -size=16k -numjobs=1 -name=test_r_w
>> step5:  fio -filename=/home/test/test -direct=1 -buffered=0 -iodepth 1 -thread -rw=write -rwmixread=70 -ioengine=libaio -bs=4k -size=4k -numjobs=1 -name=test_r_w &
>> step6:  fio -filename=/home/test/test -direct=0 -buffered=1 -iodepth 1 -thread -rw=write -rwmixread=70 -ioengine=libaio -bs=4k -size=16k -numjobs=1 -name=test_r_w
> Hi, yebin
>
> Thanks for your test case, but it is not the same problem. The scenario
> you described here is something the kernel is designed for, since you are
> intending to interleave buffered IO and direct IO; please look at these
> two patches:
>
> 332391a ("fs: Fix page cache inconsistency when mixing buffered and AIO DIO")
> 5a9d929 ("iomap: report collisions between directio and buffered
> writes to userspace")

Maybe we can reproduce your issue as follows:
1. mkfs.ext4 -b 1024 /dev/sdb        (PAGE_SIZE=4K)
2. mount /dev/sdb /home/test
3. fio -filename=/home/test/test -direct=0 -buffered=1 -iodepth 1 -thread -rw=write -rwmixread=70 -ioengine=libaio -bs=1k -size=4k -numjobs=1 -name=test_r_w
4. fio -filename=/home/test/test -direct=1 -buffered=0 -iodepth 1 -thread -rw=write -rwmixread=70 -ioengine=libaio -bs=1k -size=1k -numjobs=1 -name=test_r_w &
5. fio -filename=/home/test/test -direct=0 -buffered=1 -iodepth 1 -thread -rw=write -rwmixread=70 -ioengine=libaio -bs=1k -size=5k -offset=1024 -numjobs=1 -name=test_r_w

As the file system block size is 1K, there are 4 buffer_heads per page, so
direct IO and buffered IO may operate on the same page even though they do
not operate on the same logical block.

I added some debug information into the kernel as follows:

diff --git a/fs/direct-io.c b/fs/direct-io.c
index ec7c7d1c0329..aebc3e76f8ca 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -38,6 +38,8 @@
 #include <linux/atomic.h>
 #include <linux/prefetch.h>

+#include <linux/delay.h>
+
 /*
  * How many user pages to map in one call to get_user_pages().  This determines
  * the size of a structure in the slab cache
@@ -224,20 +226,20 @@ static inline struct page *dio_get_page(struct dio *dio,
  */
 void dio_warn_stale_pagecache(struct file *filp)
 {
-       static DEFINE_RATELIMIT_STATE(_rs, 86400 * HZ, DEFAULT_RATELIMIT_BURST);
+//     static DEFINE_RATELIMIT_STATE(_rs, 86400 * HZ, DEFAULT_RATELIMIT_BURST);
        char pathname[128];
        struct inode *inode = file_inode(filp);
        char *path;

        errseq_set(&inode->i_mapping->wb_err, -EIO);
-       if (__ratelimit(&_rs)) {
+//     if (__ratelimit(&_rs)) {
                path = file_path(filp, pathname, sizeof(pathname));
                if (IS_ERR(path))
                        path = "(unknown)";
                pr_crit("Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!\n");
                pr_crit("File: %s PID: %d Comm: %.20s\n", path, current->pid,
                        current->comm);
-       }
+//     }
 }

 /**
@@ -305,15 +307,20 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, unsigned int flags)
         * end_io() when necessary, otherwise a racing buffer read would cache
         * zeros from unwritten extents.
         */
+       printk("%s: before invalidate page \n", __func__);
+       mdelay(10 * 1000);
        if (flags & DIO_COMPLETE_INVALIDATE &&
            ret > 0 && dio->op == REQ_OP_WRITE &&
            dio->inode->i_mapping->nrpages) {
+               printk("%s: start to call invalidate_inode_pages2_range nrpags=%lu\n",
+                       __func__, dio->inode->i_mapping->nrpages);
                err = invalidate_inode_pages2_range(dio->inode->i_mapping,
                                        offset >> PAGE_SHIFT,
                                        (offset + ret - 1) >> PAGE_SHIFT);
                if (err)
                        dio_warn_stale_pagecache(dio->iocb->ki_filp);
        }
+       printk("%s: end invalidate page \n", __func__);

        inode_dio_end(dio->inode);

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 21b0fc0d6ffe..35d1a330437b 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -241,6 +241,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
                inode_lock(inode);
        }

+       printk("%s: start lock inode offset=%llu \n", __func__, iocb->ki_pos);
        ret = ext4_write_checks(iocb, from);
        if (ret <= 0)
                goto out;
@@ -278,7 +279,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
        if (ret == -EIOCBQUEUED && unaligned_aio)
                ext4_unwritten_wait(inode);
        inode_unlock(inode);
-
+       printk("%s: unlock inode\n", __func__);
        if (ret > 0)
                ret = generic_write_sync(iocb, ret);

diff --git a/mm/truncate.c b/mm/truncate.c
index 43c73db17a0a..9b1ffedf831b 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -677,6 +677,7 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
        int ret2 = 0;
        int did_range_unmap = 0;

+       printk("%s: start=%lu end=%lu pages=%lu\n", __func__, start, end, mapping->nrpages);
        if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
                goto out;
@@ -690,6 +691,7 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
                        /* We rely upon deletion not changing page->index */
                        index = indices[i];
+                       printk("%s: index=%lu\n", __func__, index);
                        if (index > end)
                                break;

>> Kernel modification based on v4.19:
>> diff --git a/fs/direct-io.c b/fs/direct-io.c
>> +#include <linux/delay.h>
>>
>>  /**
>> @@ -305,6 +307,8 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, unsigned int flags)
>>          * end_io() when necessary, otherwise a racing buffer read would cache
>>          * zeros from unwritten extents.
>>          */
>> +       printk("%s: before invalidate page \n", __func__);
>> +       mdelay(10 * 1000);
>>         if (flags & DIO_COMPLETE_INVALIDATE &&
>>             ret > 0 && dio->op == REQ_OP_WRITE &&
>>             dio->inode->i_mapping->nrpages) {
>> @@ -314,6 +318,7 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, unsigned int flags)
>>                 if (err)
>>                         dio_warn_stale_pagecache(dio->iocb->ki_filp);
>>         }
>> +       printk("%s: end invalidate page \n", __func__);
>>
>>         inode_dio_end(dio->inode);
>>
>> @@ -371,10 +378,17 @@ static void dio_bio_end_aio(struct bio *bio)
>>          * went in between AIO submission and completion into the
>>          * same region.
>>          */
>> +       printk("%s:start to sleep 5s.... \n", __func__);
>> +       mdelay(10 * 1000);
>> +       printk("%s: endto sleep, defer_completion=%d op=%d nrpages=%d\n",
>> +               __func__, dio->defer_completion, dio->op, dio->inode->i_mapping->nrpages);
>> +
>>         if (dio->result)
>>                 defer_completion = dio->defer_completion ||
>>                                    (dio->op == REQ_OP_WRITE &&
>>                                     dio->inode->i_mapping->nrpages);
>> +       printk("%s: dio=%px defer_completion=%d\n",
>> +               __func__, dio, defer_completion);
>>         if (defer_completion) {
>>                 INIT_WORK(&dio->complete_work, dio_aio_complete_work);
>>                 queue_work(dio->inode->i_sb->s_dio_done_wq,
>>
>>
>> Reproduce result:
>> s=1 -name=test_r_w=write -rwmixread=70 -ioengine=libaio -bs=4k -size=16k -numjob
>> test_r_w: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
>> fio-3.7
>> Starting 1 thread
>> test_r_w: (groupid=0, jobs=1): err= 0: pid=2083: Sun Sep 26 03:18:06 2021
>>   write: IOPS=444, BW=1778KiB/s (1820kB/s)(16.0KiB/9msec)
>>     slat (usec): min=1294, max=3789, avg=2007.88, stdev=1191.07
>>     clat (nsec): min=2304, max=28645, avg=9918.75, stdev=12574.80
>>      lat (usec): min=1298, max=3824, avg=2020.36, stdev=1206.00
>>     clat percentiles (nsec):
>>      |  1.00th=[ 2320],  5.00th=[ 2320], 10.00th=[ 2320], 20.00th=[ 2320],
>>      | 30.00th=[ 2960], 40.00th=[ 2960], 50.00th=[ 2960], 60.00th=[ 5792],
>>      | 70.00th=[ 5792], 80.00th=[28544], 90.00th=[28544], 95.00th=[28544],
>>      | 99.00th=[28544], 99.50th=[28544], 99.90th=[28544], 99.95th=[28544],
>>      | 99.99th=[28544]
>>    lat (usec)   : 4=50.00%, 10=25.00%, 50=25.00%
>>   cpu          : usr=0.00%, sys=87.50%, ctx=0, majf=0, minf=1
>>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      issued rwts: total=0,4,0,0 short=0,0,0,0 dropped=0,0,0,0
>>      latency   : target=0, window=0, percentile=100.00%, depth=1
>>
>> Run status group 0 (all jobs):
>>   WRITE: bw=1778KiB/s (1820kB/s), 1778KiB/s-1778KiB/s (1820kB/s-1820kB/s), io=16.0KiB (16.4kB), run=9-9msec
>>
>> Disk stats (read/write):
>>   sda: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>> =1 -name=test_r_w &write -rwmixread=70 -ioengine=libaio -bs=4k -size=4k -numjobs
>> [1] 2087
>> [root@localhost home]# test_r_w: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
>> fio-3.7
>> Starting 1 thread
>>
>> [root@localhost home]# [ 1475.286928] ext4_file_write_iter: start lock inode
>> [ 1475.288807] ext4_file_write_iter: unlock inode
>> [ 1475.290402] dio_bio_end_aio:start to sleep 5s....
>>
>> s=1 -name=test_r_w=write -rwmixread=70 -ioengine=libaio -bs=4k -size=16k -numjob
>> test_r_w: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
>> fio-3.7
>> Starting 1 thread
>> Jobs: 1 (f=0): [f(1)][-.-%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 00m:00s]
>> Jobs: 1 (f=0): [f(1)][-.-%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 00m:00s]
>> [ 1485.292939] dio_bio_end_aio: endto sleep, defer_completion=0 op=1 nrpages=4
>> [ 1485.293888] dio_bio_end_aio: dio=ffff88839aff5200 defer_completion=1
>> [ 1485.294821] dio_aio_complete_work: dio=ffff88839aff5200...
>> Jobs: 1 (f=1): [W(1)][-.-%][r=0KiB/s,w=0KiB/s][r[ 1485.296539] dio_complete: before invalidate page
>> [ 1495.298472] Page cache invalidation failure on direct I/O. Possible data corruption due to collision with buffered I/O!      --> detects buffered IO and direct IO conflict.
>> [ 1495.302746] File: /home/test/test PID: 493 Comm: kworker/3:2
>> [ 1495.303508] CPU: 3 PID: 493 Comm: kworker/3:2 Not tainted 4.19.90-dirty #96
>> [ 1495.304386] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
>> Jobs: 1 (f=1): [[ 1495.306050] Workqueue: dio/sda dio_aio_complete_work
>> W(1)][-.-%][r=0K
>> [ 1495.316879] dio_complete: end invalidate page
>> test_r_w: (groupid=0, jobs=1): err= 0: pid=2092: Sun Sep 26 03:18:37 2021
>>   write: IOPS=266, BW=1067KiB/s (1092kB/s)(16.0KiB/15msec)
>>     slat (usec): min=2001, max=3101, avg=2691.97, stdev=480.07
>>     clat (nsec): min=2058, max=12337, avg=5999.50, stdev=4456.54
>>      lat (usec): min=2004, max=3116, avg=2700.11, stdev=484.22
>>     clat percentiles (nsec):
>>      |  1.00th=[ 2064],  5.00th=[ 2064], 10.00th=[ 2064], 20.00th=[ 2064],
>>      | 30.00th=[ 4080], 40.00th=[ 4080], 50.00th=[ 4080], 60.00th=[ 5536],
>>      | 70.00th=[ 5536], 80.00th=[12352], 90.00th=[12352], 95.00th=[12352],
>>      | 99.00th=[12352], 99.50th=[12352], 99.90th=[12352], 99.95th=[12352],
>>      | 99.99th=[12352]
>>    lat (usec)   : 4=25.00%, 10=50.00%, 20=25.00%
>>   cpu          : usr=0.00%, sys=100.00%, ctx=1, majf=0, minf=1
>>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      issued rwts: total=0,4,0,0 short=0,0,0,0 dropped=0,0,0,0
>>      latency   : target=0, window=0, percentile=100.00%, depth=1
>>
>> Run status group 0 (all jobs):
>>   WRITE: bw=1067KiB/s (1092kB/s), 1067KiB/s-1067KiB/s (1092kB/s-1092kB/s), io=16.0KiB (16.4kB), run=15-15msec
>>
>> Disk stats (read/write):
>>   sda: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>>
>> test_r_w: (groupid=0, jobs=1): err= 0: pid=2089: Sun Sep 26 03:18:37 2021
>>   write: IOPS=0, BW=204B/s (204B/s)(4096B/20033msec)
>>     slat (nsec): min=3358.7k, max=3358.7k, avg=3358748.00, stdev= 0.00
>>     clat (nsec): min=20027M, max=20027M, avg=20027265355.00, stdev= 0.00
>>      lat (nsec): min=20031M, max=20031M, avg=20030627283.00, stdev= 0.00
>>     clat percentiles (msec):
>>      |  1.00th=[17113],  5.00th=[17113], 10.00th=[17113], 20.00th=[17113],
>>      | 30.00th=[17113], 40.00th=[17113], 50.00th=[17113], 60.00th=[17113],
>>      | 70.00th=[17113], 80.00th=[17113], 90.00th=[17113], 95.00th=[17113],
>>      | 99.00th=[17113], 99.50th=[17113], 99.90th=[17113], 99.95th=[17113],
>>      | 99.99th=[17113]
>>   cpu          : usr=0.00%, sys=0.02%, ctx=2, majf=0, minf=1
>>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      issued rwts: total=0,1,0,0 short=0,0,0,0 dropped=0,0,0,0
>>      latency   : target=0, window=0, percentile=100.00%, depth=1
>>
>> Run status group 0 (all jobs):
>>   WRITE: bw=204B/s (204B/s), 204B/s-204B/s (204B/s-204B/s), io=4096B (4096B), run=20033-20033msec
>>
>> Disk stats (read/write):
>>   sda: ios=0/5, merge=0/1, ticks=0/10018, in_queue=10015, util=36.98%
On 2021/10/21 20:20, Zhengyuan Liu wrote:
On Thu, Oct 21, 2021 at 4:55 PM yebin <yebin10@huawe=
i.com> wrote:


On 2021/10/21 16:03, Jan Kara wrote:

On Thu 21-10-21 10:21:55, Zhengyuan Liu wrote:

On Thu, Oct 21, 2021 at 1:37 AM Jan Kara <jack@suse.cz> wrote:

On Wed 13-10-21 09:46:46, Zhengyuan Liu wrote:

we are encounting following Mysql crash problem while importing tables :

    2021-09-26T11:22:17.825250Z 0 [ERROR] [MY-013622] [InnoDB] [FATAL]
    fsync() returned EIO, aborting.
    2021-09-26T11:22:17.825315Z 0 [ERROR] [MY-013183] [InnoDB]
    Assertion failure: ut0ut.cc:555 thread 281472996733168

At the same time , we found dmesg had following message:

    [ 4328.838972] Page cache invalidation failure on direct I/O.
    Possible data corruption due to collision with buffered I/O!
    [ 4328.850234] File: /data/mysql/data/sysbench/sbtest53.ibd PID:
    625 Comm: kworker/42:1

Firstly, we doubled Mysql has operating the file with direct IO and
buffered IO interlaced, but after some checking we found it did only
do direct IO using aio. The problem is exactly from direct-io
interface (__generic_file_write_iter) itself.

ssize_t __generic_file_write_iter()
{
...
        if (iocb->ki_flags & IOCB_DIRECT) {
                loff_t pos, endbyte;

                written =3D generic_file_direct_write(iocb, from);
                /*
                 * If the write stopped short of completing, fall back to
                 * buffered writes.  Some filesystems do this for writes =
to
                 * holes, for example.  For DAX files, a buffered write w=
ill
                 * not succeed (even if it did, DAX does not handle dirty
                 * page-cache pages correctly).
                 */
                if (written < 0 || !iov_iter_count(from) || IS_DAX(ino=
de))
                        goto out;

                status =3D generic_perform_write(file, from, pos =3D iocb=
->ki_pos);
...
}

>From above code snippet we can see that direct io could fall back to
buffered IO under certain conditions, so even Mysql only did direct IO
it could interleave with buffered IO when fall back occurred. I have
no idea why FS(ext3) failed the direct IO currently, but it is strange
__generic_file_write_iter make direct IO fall back to buffered IO, it
seems  breaking the semantics of direct IO.

The reproduced  environment is:
Platform:  Kunpeng 920 (arm64)
Kernel: V5.15-rc
PAGESIZE: 64K
Mysql:  V8.0
Innodb_page_size: default(16K)

Thanks for report. I agree this should not happen. How hard is this to
reproduce? Any idea whether the fallback to buffered IO happens because
iomap_dio_rw() returns -ENOTBLK or because it returns short write?

It is easy to reproduce in my test environment, as I said in the previous
email replied to Andrew this problem is related to kernel page size.

Ok, can you share a reproducer?

Can you post output of "dumpe2fs -h <device>" for the filesystem wh=
ere the
problem happens? Thanks!

Sure, the output is:

# dumpe2fs -h /dev/sda3
dumpe2fs 1.45.3 (14-Jul-2019)
Filesystem volume name:   <none>
Last mounted on:          /data
Filesystem UUID:          09a51146-b325-48bb-be63-c9df539a90a1
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index
filetype needs_recovery sparse_super large_file

Thanks for the data. OK, a filesystem without extents. Does your test by
any chance try to do direct IO to a hole in a file? Because that is not
(and never was) supported without extents. Also the fact that you don't s=
ee
the problem with ext4 (which means extents support) would be pointing in
that direction.

Honza


I try to reprodeuce this issue as follows, maybe this is just one scenari=
o :
step1:  Modify  kernel code, add delay in dio_complete and dio_bio_end_ai=
o
step2:  mkfs.ext4 /dev/sda
step3:  mount /dev/sda  /home/test
step4:  fio -filename=3D/home/test/test  -direct=3D0 -buffered=3D1 -iodep=
th 1 -thread -rw=3Dwrite -rwmixread=3D70 -ioengine=3Dlibaio -bs=3D4k -siz=
e=3D16k -numjobs=3D1 -name=3Dtest_r_w
step5:  fio -filename=3D/home/test/test  -direct=3D1 -buffered=3D0 -iodep=
th 1 -thread -rw=3Dwrite -rwmixread=3D70 -ioengine=3Dlibaio -bs=3D4k -siz=
e=3D4k -numjobs=3D1 -name=3Dtest_r_w &
step6:  fio -filename=3D/home/test/test  -direct=3D0 -buffered=3D1 -iodep=
th 1 -thread -rw=3Dwrite -rwmixread=3D70 -ioengine=3Dlibaio -bs=3D4k -siz=
e=3D16k -numjobs=3D1 -name=3Dtest_r_w
Hi, yebin

Thanks for your test case,  but it is not the same problem. The scenario =
you
described here is some thing the kernel design for, since you are intendi=
ng to
interleave buffered IO and direct IO, please look at this two patch:

332391a (=E2=80=9Cfs: Fix page cache inconsistency when mixing buffered a=
nd AIO DIO=E2=80=9D)
5a9d929 (=E2=80=9Ciomap: report collisions between directio and buffered
writes to userspace=E2=80=9D)
=C2=A0Maybe we can reproduce your issue as follows:
1. mkfs.ext4=C2=A0 -b 1024 /dev/sdb=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0 =EF=BC=88PAGE_SIZE=3D4K=EF=BC=89
2. mount /dev/sdb /home/test
3. f= io -filename=3D/home/test/test=C2=A0 -direct=3D0 -buffered=3D1 -iodept= h 1 -thread -rw=3Dwrite -rwmixread=3D70 -ioengine=3Dlibaio -bs=3D1k -si= ze=3D4k -numjobs=3D1 -name=3Dtest_r_w
4.=C2=A0
f= io -filename=3D/home/test/test=C2=A0 -direct=3D1 -buffered=3D0 -iodept= h 1 -thread -rw=3Dwrite -rwmixread=3D70 -ioengine=3Dlibaio -bs=3D1k -si= ze=3D1k -numjobs=3D1 -name=3Dtest_r_w &
5.=C2=A0
f= io -filename=3D/home/test/test=C2=A0 -direct=3D0 -buffered=3D1 -iodept= h 1 -thread -rw=3Dwrite -rwmixread=3D70 -ioengine=3Dlibaio -bs=3D1k -si= ze=3D5k -offset=3D1024 -numjobs=3D1 -name=3Dtest_r_w

As file system block size is 1K,=C2=A0 there are 4=C2=A0 buffer_hea= d per=C2=A0 page.=C2=A0 So direct IO and buffer IO maybe=C2=A0 operate on the s= ame page,
although it does not operate on the same logical block.

I add some debug information=C2=A0 into=C2=A0 kernel=C2=A0 as follo= ws:
diff --git a/fs/direct-io.c b/fs/direct-io.c
index ec7c7d1c0329..aebc3e76f8ca 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -38,6 +38,8 @@
=C2=A0#include <linux/atomic.h>
=C2=A0#include <linux/prefetch.h>

+#include <linux/delay.h>
+
=C2=A0/*
=C2=A0 * How many user pages to map in one call to get_user_pages()= .=C2=A0 This determines
=C2=A0 * the size of a structure in the slab cache
@@ -224,20 +226,20 @@ static inline struct page *dio_get_page(struct dio *dio,
=C2=A0 */
=C2=A0void dio_warn_stale_pagecache(struct file *filp)
=C2=A0{
-=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 static DEFINE_RATELIMIT_STATE= (_rs, 86400 * HZ, DEFAULT_RATELIMIT_BURST);
+//=C2=A0=C2=A0=C2=A0=C2=A0 static DEFINE_RATELIMIT_STATE(_rs, 8640= 0 * HZ, DEFAULT_RATELIMIT_BURST);
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 char pathname[128];
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 struct inode *inode =3D = file_inode(filp);
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 char *path;

=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 errseq_set(&inode-&g= t;i_mapping->wb_err, -EIO);
-=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if (__ratelimit(&_rs)) {<= br> +//=C2=A0=C2=A0=C2=A0=C2=A0 if (__ratelimit(&_rs)) {
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0 path =3D file_path(filp, pathname, sizeof(pathname));
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0 if (IS_ERR(path))
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 pat= h =3D "(unknown)";
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0 pr_crit("Page cache invalidation failure on direct I/O.=C2=A0 Possible data corruption due to collision with buffered I/O!\n");
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0 pr_crit("File: %s PID: %d Comm: %.20s\n", path, current->pid,
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 cur= rent->comm);
-=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 }
+//=C2=A0=C2=A0=C2=A0=C2=A0 }
=C2=A0}

=C2=A0/**
@@ -305,15 +307,20 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, unsigned int flags)
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 * end_io() when ne= cessary, otherwise a racing buffer read would cache
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 * zeros from unwri= tten extents.
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 */
+=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 printk("%s: before invalidate= page \n", __func__);
+=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 mdelay(10 * 1000);
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if (flags & DIO_COMP= LETE_INVALIDATE &&
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 = ret > 0 && dio->op =3D=3D REQ_OP_WRITE &&
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 = dio->inode->i_mapping->nrpages) {
+=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0 printk("%s: start to call invalidate_inode_pages2_range nrpags=3D%lu\n",
+=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 __func= __, dio->inode->i_mapping->nrpages);
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0 err =3D invalidate_inode_pages2_range(dio->inode->i_mapping,
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0 offset >> PAGE_SHIFT,
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0 (offset + ret - 1) >> PAGE_SHIFT);
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0 if (err)
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 dio_warn_stale_pagecache(dio->iocb->ki_filp);
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 }
+=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 printk("%s: end invalidate pa= ge \n", __func__);

=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 inode_dio_end(dio->in= ode);

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 21b0fc0d6ffe..35d1a330437b 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -241,6 +241,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
                inode_lock(inode);
        }

+       printk("%s: start lock inode offset=%llu \n", __func__, iocb->ki_pos);
        ret = ext4_write_checks(iocb, from);
        if (ret <= 0)
                goto out;
@@ -278,7 +279,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
        if (ret == -EIOCBQUEUED && unaligned_aio)
                ext4_unwritten_wait(inode);
        inode_unlock(inode);
-
+       printk("%s: unlock inode\n", __func__);
        if (ret > 0)
                ret = generic_write_sync(iocb, ret);

diff --git a/mm/truncate.c b/mm/truncate.c
index 43c73db17a0a..9b1ffedf831b 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -677,6 +677,7 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
        int ret2 = 0;
        int did_range_unmap = 0;

+       printk("%s: start=%lu end=%lu pages=%lu\n", __func__, start, end, mapping->nrpages);
        if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
                goto out;

@@ -690,6 +691,7 @@ int invalidate_inode_pages2_range(struct address_space *mapping,

                        /* We rely upon deletion not changing page->index */
                        index = indices[i];
+                       printk("%s: index=%lu\n", __func__, index);
                        if (index > end)
                                break;


Kernel modification based on v4.19:
diff --git a/fs/direct-io.c b/fs/direct-io.c
+#include <linux/delay.h>

/**
@@ -305,6 +307,8 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, unsigned int flags)
         * end_io() when necessary, otherwise a racing buffer read would cache
         * zeros from unwritten extents.
         */
+       printk("%s: before invalidate page \n", __func__);
+       mdelay(10 * 1000);
        if (flags & DIO_COMPLETE_INVALIDATE &&
            ret > 0 && dio->op == REQ_OP_WRITE &&
            dio->inode->i_mapping->nrpages) {
@@ -314,6 +318,7 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, unsigned int flags)
                if (err)
                        dio_warn_stale_pagecache(dio->iocb->ki_filp);
        }
+       printk("%s: end invalidate page \n", __func__);

        inode_dio_end(dio->inode);

@@ -371,10 +378,17 @@ static void dio_bio_end_aio(struct bio *bio)
                 * went in between AIO submission and completion into the
                 * same region.
                 */
+               printk("%s:start to  sleep 5s.... \n", __func__);
+               mdelay(10 * 1000);
+               printk("%s: endto  sleep, defer_completion=%d op=%d nrpages=%d\n",
+                       __func__, dio->defer_completion, dio->op, dio->inode->i_mapping->nrpages);
+
                if (dio->result)
                        defer_completion = dio->defer_completion ||
                                           (dio->op == REQ_OP_WRITE &&
                                            dio->inode->i_mapping->nrpages);
+               printk("%s: dio=%px defer_completion=%d\n",
+                       __func__, dio, defer_completion);
                if (defer_completion) {
                        INIT_WORK(&dio->complete_work, dio_aio_complete_work);
                        queue_work(dio->inode->i_sb->s_dio_done_wq,


Reproduce result:
s=1 -name=test_r_w=write -rwmixread=70 -ioengine=libaio -bs=4k -size=16k -numjob
test_r_w: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.7
Starting 1 thread
test_r_w: (groupid=0, jobs=1): err= 0: pid=2083: Sun Sep 26 03:18:06 2021
  write: IOPS=444, BW=1778KiB/s (1820kB/s)(16.0KiB/9msec)
    slat (usec): min=1294, max=3789, avg=2007.88, stdev=1191.07
    clat (nsec): min=2304, max=28645, avg=9918.75, stdev=12574.80
     lat (usec): min=1298, max=3824, avg=2020.36, stdev=1206.00
    clat percentiles (nsec):
     |  1.00th=[ 2320],  5.00th=[ 2320], 10.00th=[ 2320], 20.00th=[ 2320],
     | 30.00th=[ 2960], 40.00th=[ 2960], 50.00th=[ 2960], 60.00th=[ 5792],
     | 70.00th=[ 5792], 80.00th=[28544], 90.00th=[28544], 95.00th=[28544],
     | 99.00th=[28544], 99.50th=[28544], 99.90th=[28544], 99.95th=[28544],
     | 99.99th=[28544]
  lat (usec)   : 4=50.00%, 10=25.00%, 50=25.00%
  cpu          : usr=0.00%, sys=87.50%, ctx=0, majf=0, minf=1
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,4,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1778KiB/s (1820kB/s), 1778KiB/s-1778KiB/s (1820kB/s-1820kB/s), io=16.0KiB (16.4kB), run=9-9msec

Disk stats (read/write):
  sda: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
=1 -name=test_r_w &write -rwmixread=70 -ioengine=libaio -bs=4k -size=4k -numjobs
[1] 2087
[root@localhost home]# test_r_w: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.7
Starting 1 thread

[root@localhost home]# [ 1475.286928] ext4_file_write_iter: start lock inode
[ 1475.288807] ext4_file_write_iter: unlock inode
[ 1475.290402] dio_bio_end_aio:start to  sleep 5s....

s=1 -name=test_r_w=write -rwmixread=70 -ioengine=libaio -bs=4k -size=16k -numjob
test_r_w: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.7
Starting 1 thread
Jobs: 1 (f=0): [f(1)][-.-%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 00m:00s]
Jobs: 1 (f=0): [f(1)][-.-%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 00m:00s]
[ 1485.292939] dio_bio_end_aio: endto  sleep, defer_completion=0 op=1 nrpages=4
[ 1485.293888] dio_bio_end_aio: dio=ffff88839aff5200 defer_completion=1
[ 1485.294821] dio_aio_complete_work: dio=ffff88839aff5200...
Jobs: 1 (f=1): [W(1)][-.-%][r=0KiB/s,w=0KiB/s][r[ 1485.296539] dio_complete: before invalidate page
[ 1495.298472] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!     --> here the conflict between buffered I/O and direct I/O is detected.
[ 1495.302746] File: /home/test/test PID: 493 Comm: kworker/3:2
[ 1495.303508] CPU: 3 PID: 493 Comm: kworker/3:2 Not tainted 4.19.90-dirty #96
[ 1495.304386] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
Jobs: 1 (f=1): [[ 1495.306050] Workqueue: dio/sda dio_aio_complete_work
W(1)][-.-%][r=0K
[ 1495.316879] dio_complete: end invalidate page
test_r_w: (groupid=0, jobs=1): err= 0: pid=2092: Sun Sep 26 03:18:37 2021
  write: IOPS=266, BW=1067KiB/s (1092kB/s)(16.0KiB/15msec)
    slat (usec): min=2001, max=3101, avg=2691.97, stdev=480.07
    clat (nsec): min=2058, max=12337, avg=5999.50, stdev=4456.54
     lat (usec): min=2004, max=3116, avg=2700.11, stdev=484.22
    clat percentiles (nsec):
     |  1.00th=[ 2064],  5.00th=[ 2064], 10.00th=[ 2064], 20.00th=[ 2064],
     | 30.00th=[ 4080], 40.00th=[ 4080], 50.00th=[ 4080], 60.00th=[ 5536],
     | 70.00th=[ 5536], 80.00th=[12352], 90.00th=[12352], 95.00th=[12352],
     | 99.00th=[12352], 99.50th=[12352], 99.90th=[12352], 99.95th=[12352],
     | 99.99th=[12352]
  lat (usec)   : 4=25.00%, 10=50.00%, 20=25.00%
  cpu          : usr=0.00%, sys=100.00%, ctx=1, majf=0, minf=1
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,4,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1067KiB/s (1092kB/s), 1067KiB/s-1067KiB/s (1092kB/s-1092kB/s), io=16.0KiB (16.4kB), run=15-15msec

Disk stats (read/write):
  sda: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%

test_r_w: (groupid=0, jobs=1): err= 0: pid=2089: Sun Sep 26 03:18:37 2021
  write: IOPS=0, BW=204B/s (204B/s)(4096B/20033msec)
    slat (nsec): min=3358.7k, max=3358.7k, avg=3358748.00, stdev= 0.00
    clat (nsec): min=20027M, max=20027M, avg=20027265355.00, stdev= 0.00
     lat (nsec): min=20031M, max=20031M, avg=20030627283.00, stdev= 0.00
    clat percentiles (msec):
     |  1.00th=[17113],  5.00th=[17113], 10.00th=[17113], 20.00th=[17113],
     | 30.00th=[17113], 40.00th=[17113], 50.00th=[17113], 60.00th=[17113],
     | 70.00th=[17113], 80.00th=[17113], 90.00th=[17113], 95.00th=[17113],
     | 99.00th=[17113], 99.50th=[17113], 99.90th=[17113], 99.95th=[17113],
     | 99.99th=[17113]
  cpu          : usr=0.00%, sys=0.02%, ctx=2, majf=0, minf=1
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,1,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=204B/s (204B/s), 204B/s-204B/s (204B/s-204B/s), io=4096B (4096B), run=20033-20033msec

Disk stats (read/write):
  sda: ios=0/5, merge=0/1, ticks=0/10018, in_queue=10015, util=36.98%




