Subject: Re: Problem with direct IO
From: yebin <yebin10@huawei.com>
To: Jan Kara, Zhengyuan Liu
CC: linux-mm@kvack.org, Andrew Morton, 刘云 (Liu Yun), Zhengyuan Liu
Date: Thu, 21 Oct 2021 16:55:44 +0800
Message-ID: <61712B10.2060408@huawei.com>
In-Reply-To: <20211021080304.GB5784@quack2.suse.cz>
References: <20211020173729.GF16460@quack2.suse.cz> <20211021080304.GB5784@quack2.suse.cz>

On 2021/10/21 16:03, Jan Kara wrote:
> On Thu 21-10-21 10:21:55, Zhengyuan Liu wrote:
>> On Thu, Oct 21, 2021 at 1:37 AM Jan Kara wrote:
>>> On Wed 13-10-21 09:46:46, Zhengyuan Liu wrote:
>>>> We are encountering the following MySQL crash while importing tables:
>>>>
>>>>     2021-09-26T11:22:17.825250Z 0 [ERROR] [MY-013622] [InnoDB] [FATAL]
>>>>     fsync() returned EIO, aborting.
>>>>     2021-09-26T11:22:17.825315Z 0 [ERROR] [MY-013183] [InnoDB]
>>>>     Assertion failure: ut0ut.cc:555 thread 281472996733168
>>>>
>>>> At the same time, we found dmesg had the following message:
>>>>
>>>>     [ 4328.838972] Page cache invalidation failure on direct I/O.
>>>>     Possible data corruption due to collision with buffered I/O!
>>>>     [ 4328.850234] File: /data/mysql/data/sysbench/sbtest53.ibd PID:
>>>>     625 Comm: kworker/42:1
>>>>
>>>> At first we suspected that MySQL was operating on the file with direct IO
>>>> and buffered IO interleaved, but after some checking we found it only
>>>> does direct IO, using aio. The problem comes from the direct-IO
>>>> path (__generic_file_write_iter) itself.
>>>>
>>>> ssize_t __generic_file_write_iter()
>>>> {
>>>> ...
>>>>         if (iocb->ki_flags & IOCB_DIRECT) {
>>>>                 loff_t pos, endbyte;
>>>>
>>>>                 written = generic_file_direct_write(iocb, from);
>>>>                 /*
>>>>                  * If the write stopped short of completing, fall back to
>>>>                  * buffered writes.  Some filesystems do this for writes to
>>>>                  * holes, for example.  For DAX files, a buffered write will
>>>>                  * not succeed (even if it did, DAX does not handle dirty
>>>>                  * page-cache pages correctly).
>>>>                  */
>>>>                 if (written < 0 || !iov_iter_count(from) || IS_DAX(inode))
>>>>                         goto out;
>>>>
>>>>                 status = generic_perform_write(file, from, pos = iocb->ki_pos);
>>>> ...
>>>> }
>>>>
>>>> From the above code snippet we can see that direct IO can fall back to
>>>> buffered IO under certain conditions, so even though MySQL only did direct IO,
>>>> it could interleave with buffered IO when the fallback occurred. I have
>>>> no idea why the filesystem (ext3) failed the direct IO here, but it is strange
>>>> that __generic_file_write_iter makes direct IO fall back to buffered IO; it
>>>> seems to break the semantics of direct IO.
>>>>
>>>> The reproduction environment is:
>>>> Platform: Kunpeng 920 (arm64)
>>>> Kernel: v5.15-rc
>>>> PAGESIZE: 64K
>>>> Mysql: V8.0
>>>> Innodb_page_size: default (16K)
>>> Thanks for the report. I agree this should not happen. How hard is this to
>>> reproduce? Any idea whether the fallback to buffered IO happens because
>>> iomap_dio_rw() returns -ENOTBLK or because it returns a short write?
>> It is easy to reproduce in my test environment; as I said in the previous
>> email replied to Andrew, this problem is related to the kernel page size.
> Ok, can you share a reproducer?
>
>>> Can you post the output of "dumpe2fs -h <device>" for the filesystem where the
>>> problem happens? Thanks!
>> Sure, the output is:
>>
>> # dumpe2fs -h /dev/sda3
>> dumpe2fs 1.45.3 (14-Jul-2019)
>> Filesystem volume name:   <none>
>> Last mounted on:          /data
>> Filesystem UUID:          09a51146-b325-48bb-be63-c9df539a90a1
>> Filesystem magic number:  0xEF53
>> Filesystem revision #:    1 (dynamic)
>> Filesystem features:      has_journal ext_attr resize_inode dir_index
>> filetype needs_recovery sparse_super large_file
> Thanks for the data. OK, a filesystem without extents. Does your test by
> any chance try to do direct IO to a hole in a file? Because that is not
> (and never was) supported without extents. Also the fact that you don't see
> the problem with ext4 (which means extents support) would be pointing in
> that direction.
>
> 								Honza

I tried to reproduce this issue as follows; maybe this is just one scenario:
step1: Modify the kernel code to add a delay in dio_complete and dio_bio_end_aio
step2: mkfs.ext4 /dev/sda
step3: mount /dev/sda /home/test
step4: fio -filename=/home/test/test -direct=0 -buffered=1 -iodepth 1 -thread -rw=write -rwmixread=70 -ioengine=libaio -bs=4k -size=16k -numjobs=1 -name=test_r_w
step5: fio -filename=/home/test/test -direct=1 -buffered=0 -iodepth 1 -thread -rw=write -rwmixread=70 -ioengine=libaio -bs=4k -size=4k -numjobs=1 -name=test_r_w &
step6: fio -filename=/home/test/test -direct=0 -buffered=1 -iodepth 1 -thread -rw=write -rwmixread=70 -ioengine=libaio -bs=4k -size=16k -numjobs=1 -name=test_r_w

Kernel modification, based on v4.19:

diff --git a/fs/direct-io.c b/fs/direct-io.c
+#include <linux/delay.h>

/**
@@ -305,6 +307,8 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, unsigned int flags)
 	 * end_io() when necessary, otherwise a racing buffer read would cache
 	 * zeros from unwritten extents.
 	 */
+	printk("%s: before invalidate page \n", __func__);
+	mdelay(10 * 1000);
 	if (flags & DIO_COMPLETE_INVALIDATE &&
 	    ret > 0 && dio->op == REQ_OP_WRITE &&
 	    dio->inode->i_mapping->nrpages) {
@@ -314,6 +318,7 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, unsigned int flags)
 		if (err)
 			dio_warn_stale_pagecache(dio->iocb->ki_filp);
 	}
+	printk("%s: end invalidate page \n", __func__);

 	inode_dio_end(dio->inode);

@@ -371,10 +378,17 @@ static void dio_bio_end_aio(struct bio *bio)
 		 * went in between AIO submission and completion into the
 		 * same region.
 		 */
+		printk("%s: start to sleep 5s.... \n", __func__);
+		mdelay(10 * 1000);
+		printk("%s: endto sleep, defer_completion=%d op=%d nrpages=%d\n",
+			__func__, dio->defer_completion, dio->op, dio->inode->i_mapping->nrpages);
+
 		if (dio->result)
 			defer_completion = dio->defer_completion ||
 					   (dio->op == REQ_OP_WRITE &&
 					    dio->inode->i_mapping->nrpages);
+		printk("%s: dio=%px defer_completion=%d\n",
+			__func__, dio, defer_completion);
 		if (defer_completion) {
 			INIT_WORK(&dio->complete_work, dio_aio_complete_work);
 			queue_work(dio->inode->i_sb->s_dio_done_wq,

Reproduction result:

test_r_w: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.7
Starting 1 thread
test_r_w: (groupid=0, jobs=1): err= 0: pid=2083: Sun Sep 26 03:18:06 2021
  write: IOPS=444, BW=1778KiB/s (1820kB/s)(16.0KiB/9msec)
    slat (usec): min=1294, max=3789, avg=2007.88, stdev=1191.07
    clat (nsec): min=2304, max=28645, avg=9918.75, stdev=12574.80
     lat (usec): min=1298, max=3824, avg=2020.36, stdev=1206.00
    clat percentiles (nsec):
     |  1.00th=[ 2320],  5.00th=[ 2320], 10.00th=[ 2320], 20.00th=[ 2320],
     | 30.00th=[ 2960], 40.00th=[ 2960], 50.00th=[ 2960], 60.00th=[ 5792],
     | 70.00th=[ 5792], 80.00th=[28544], 90.00th=[28544], 95.00th=[28544],
     | 99.00th=[28544], 99.50th=[28544], 99.90th=[28544], 99.95th=[28544],
     | 99.99th=[28544]
  lat (usec)   : 4=50.00%, 10=25.00%, 50=25.00%
  cpu          : usr=0.00%, sys=87.50%, ctx=0, majf=0, minf=1
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,4,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1778KiB/s (1820kB/s), 1778KiB/s-1778KiB/s (1820kB/s-1820kB/s), io=16.0KiB (16.4kB), run=9-9msec

Disk stats (read/write):
  sda: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%

[1] 2087
[root@localhost home]# test_r_w: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.7
Starting 1 thread

[root@localhost home]# [ 1475.286928] ext4_file_write_iter: start lock inode
[ 1475.288807] ext4_file_write_iter: unlock inode
[ 1475.290402] dio_bio_end_aio: start to sleep 5s....

test_r_w: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.7
Starting 1 thread
Jobs: 1 (f=0): [f(1)][-.-%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 00m:00s]
Jobs: 1 (f=0): [f(1)][-.-%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 00m:00s]
[ 1485.292939] dio_bio_end_aio: endto sleep, defer_completion=0 op=1 nrpages=4
[ 1485.293888] dio_bio_end_aio: dio=ffff88839aff5200 defer_completion=1
[ 1485.294821] dio_aio_complete_work: dio=ffff88839aff5200...
Jobs: 1 (f=1): [W(1)][-.-%][r=0KiB/s,w=0KiB/s][r[ 1485.296539] dio_complete: before invalidate page
[ 1495.298472] Page cache invalidation failure on direct I/O. Possible data corruption due to collision with buffered I/O!    --> here it detects the buffered IO / direct IO conflict.
[ 1495.302746] File: /home/test/test PID: 493 Comm: kworker/3:2
[ 1495.303508] CPU: 3 PID: 493 Comm: kworker/3:2 Not tainted 4.19.90-dirty #96
[ 1495.304386] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
Jobs: 1 (f=1): [[ 1495.306050] Workqueue: dio/sda dio_aio_complete_work
W(1)][-.-%][r=0K
[ 1495.316879] dio_complete: end invalidate page
test_r_w: (groupid=0, jobs=1): err= 0: pid=2092: Sun Sep 26 03:18:37 2021
  write: IOPS=266, BW=1067KiB/s (1092kB/s)(16.0KiB/15msec)
    slat (usec): min=2001, max=3101, avg=2691.97, stdev=480.07
    clat (nsec): min=2058, max=12337, avg=5999.50, stdev=4456.54
     lat (usec): min=2004, max=3116, avg=2700.11, stdev=484.22
    clat percentiles (nsec):
     |  1.00th=[ 2064],  5.00th=[ 2064], 10.00th=[ 2064], 20.00th=[ 2064],
     | 30.00th=[ 4080], 40.00th=[ 4080], 50.00th=[ 4080], 60.00th=[ 5536],
     | 70.00th=[ 5536], 80.00th=[12352], 90.00th=[12352], 95.00th=[12352],
     | 99.00th=[12352], 99.50th=[12352], 99.90th=[12352], 99.95th=[12352],
     | 99.99th=[12352]
  lat (usec)   : 4=25.00%, 10=50.00%, 20=25.00%
  cpu          : usr=0.00%, sys=100.00%, ctx=1, majf=0, minf=1
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,4,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1067KiB/s (1092kB/s), 1067KiB/s-1067KiB/s (1092kB/s-1092kB/s), io=16.0KiB (16.4kB), run=15-15msec

Disk stats (read/write):
  sda: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%

test_r_w: (groupid=0, jobs=1): err= 0: pid=2089: Sun Sep 26 03:18:37 2021
  write: IOPS=0, BW=204B/s (204B/s)(4096B/20033msec)
    slat (nsec): min=3358.7k, max=3358.7k, avg=3358748.00, stdev= 0.00
    clat (nsec): min=20027M, max=20027M, avg=20027265355.00, stdev= 0.00
     lat (nsec): min=20031M, max=20031M, avg=20030627283.00, stdev= 0.00
    clat percentiles (msec):
     |  1.00th=[17113],  5.00th=[17113], 10.00th=[17113], 20.00th=[17113],
     | 30.00th=[17113], 40.00th=[17113], 50.00th=[17113], 60.00th=[17113],
     | 70.00th=[17113], 80.00th=[17113], 90.00th=[17113], 95.00th=[17113],
     | 99.00th=[17113], 99.50th=[17113], 99.90th=[17113], 99.95th=[17113],
     | 99.99th=[17113]
  cpu          : usr=0.00%, sys=0.02%, ctx=2, majf=0, minf=1
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,1,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=204B/s (204B/s), 204B/s-204B/s (204B/s-204B/s), io=4096B (4096B), run=20033-20033msec

Disk stats (read/write):
  sda: ios=0/5, merge=0/1, ticks=0/10018, in_queue=10015, util=36.98%