linux-mm.kvack.org archive mirror
From: yebin <yebin10@huawei.com>
To: Jan Kara <jack@suse.cz>, Zhengyuan Liu <liuzhengyuang521@gmail.com>
Cc: viro@zeniv.linux.org.uk,
	"Andrew Morton" <akpm@linux-foundation.org>,
	tytso@mit.edu, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	linux-ext4@vger.kernel.org, 刘云 <liuyun01@kylinos.cn>,
	"Zhengyuan Liu" <liuzhengyuan@kylinos.cn>
Subject: Re: Problem with direct IO
Date: Thu, 21 Oct 2021 16:55:44 +0800	[thread overview]
Message-ID: <61712B10.2060408@huawei.com> (raw)
In-Reply-To: <20211021080304.GB5784@quack2.suse.cz>

On 2021/10/21 16:03, Jan Kara wrote:
> On Thu 21-10-21 10:21:55, Zhengyuan Liu wrote:
>> On Thu, Oct 21, 2021 at 1:37 AM Jan Kara <jack@suse.cz> wrote:
>>> On Wed 13-10-21 09:46:46, Zhengyuan Liu wrote:
>>>> we are encountering the following MySQL crash while importing tables:
>>>>
>>>>      2021-09-26T11:22:17.825250Z 0 [ERROR] [MY-013622] [InnoDB] [FATAL]
>>>>      fsync() returned EIO, aborting.
>>>>      2021-09-26T11:22:17.825315Z 0 [ERROR] [MY-013183] [InnoDB]
>>>>      Assertion failure: ut0ut.cc:555 thread 281472996733168
>>>>
>>>> At the same time, we found dmesg had the following message:
>>>>
>>>>      [ 4328.838972] Page cache invalidation failure on direct I/O.
>>>>      Possible data corruption due to collision with buffered I/O!
>>>>      [ 4328.850234] File: /data/mysql/data/sysbench/sbtest53.ibd PID:
>>>>      625 Comm: kworker/42:1
>>>>
>>>> Firstly, we suspected MySQL had been operating on the file with direct
>>>> IO and buffered IO interleaved, but after some checking we found it only
>>>> did direct IO using AIO. The problem comes from the direct-IO path
>>>> (__generic_file_write_iter) itself.
>>>>
>>>> ssize_t __generic_file_write_iter()
>>>> {
>>>> ...
>>>>          if (iocb->ki_flags & IOCB_DIRECT) {
>>>>                  loff_t pos, endbyte;
>>>>
>>>>                  written = generic_file_direct_write(iocb, from);
>>>>                  /*
>>>>                   * If the write stopped short of completing, fall back to
>>>>                   * buffered writes.  Some filesystems do this for writes to
>>>>                   * holes, for example.  For DAX files, a buffered write will
>>>>                   * not succeed (even if it did, DAX does not handle dirty
>>>>                   * page-cache pages correctly).
>>>>                   */
>>>>                  if (written < 0 || !iov_iter_count(from) || IS_DAX(inode))
>>>>                          goto out;
>>>>
>>>>                  status = generic_perform_write(file, from, pos = iocb->ki_pos);
>>>> ...
>>>> }
>>>>
>>>> From the above code snippet we can see that direct IO can fall back to
>>>> buffered IO under certain conditions, so even though MySQL only did
>>>> direct IO, it could interleave with buffered IO when the fallback
>>>> occurred. I have no idea why the FS (ext3) failed the direct IO here,
>>>> but it is strange that __generic_file_write_iter makes direct IO fall
>>>> back to buffered IO; it seems to break the semantics of direct IO.
>>>>
>>>> The reproduction environment is:
>>>> Platform:  Kunpeng 920 (arm64)
>>>> Kernel: V5.15-rc
>>>> PAGESIZE: 64K
>>>> Mysql:  V8.0
>>>> Innodb_page_size: default(16K)
>>> Thanks for the report. I agree this should not happen. How hard is this to
>>> reproduce? Any idea whether the fallback to buffered IO happens because
>>> iomap_dio_rw() returns -ENOTBLK or because it returns a short write?
>> It is easy to reproduce in my test environment; as I said in the previous
>> email replying to Andrew, this problem is related to the kernel page size.
> Ok, can you share a reproducer?
>
>>> Can you post output of "dumpe2fs -h <device>" for the filesystem where the
>>> problem happens? Thanks!
>> Sure, the output is:
>>
>> # dumpe2fs -h /dev/sda3
>> dumpe2fs 1.45.3 (14-Jul-2019)
>> Filesystem volume name:   <none>
>> Last mounted on:          /data
>> Filesystem UUID:          09a51146-b325-48bb-be63-c9df539a90a1
>> Filesystem magic number:  0xEF53
>> Filesystem revision #:    1 (dynamic)
>> Filesystem features:      has_journal ext_attr resize_inode dir_index
>> filetype needs_recovery sparse_super large_file
> Thanks for the data. OK, a filesystem without extents. Does your test by
> any chance try to do direct IO to a hole in a file? Because that is not
> (and never was) supported without extents. Also the fact that you don't see
> the problem with ext4 (which means extents support) would be pointing in
> that direction.
>
> 								Honza

I tried to reproduce this issue as follows; maybe this is just one
scenario:
step1: Modify kernel code: add delays in dio_complete() and dio_bio_end_aio()
step2: mkfs.ext4 /dev/sda
step3: mount /dev/sda /home/test
step4: fio -filename=/home/test/test -direct=0 -buffered=1 -iodepth 1 -thread -rw=write -rwmixread=70 -ioengine=libaio -bs=4k -size=16k -numjobs=1 -name=test_r_w
step5: fio -filename=/home/test/test -direct=1 -buffered=0 -iodepth 1 -thread -rw=write -rwmixread=70 -ioengine=libaio -bs=4k -size=4k -numjobs=1 -name=test_r_w &
step6: fio -filename=/home/test/test -direct=0 -buffered=1 -iodepth 1 -thread -rw=write -rwmixread=70 -ioengine=libaio -bs=4k -size=16k -numjobs=1 -name=test_r_w

Kernel modification based on v4.19:
diff --git a/fs/direct-io.c b/fs/direct-io.c
+#include <linux/delay.h>

@@ -305,6 +307,8 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, unsigned int flags)
         * end_io() when necessary, otherwise a racing buffer read would cache
         * zeros from unwritten extents.
         */
+       printk("%s: before invalidate page\n", __func__);
+       mdelay(10 * 1000);
        if (flags & DIO_COMPLETE_INVALIDATE &&
            ret > 0 && dio->op == REQ_OP_WRITE &&
            dio->inode->i_mapping->nrpages) {
@@ -314,6 +318,7 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, unsigned int flags)
                if (err)
                        dio_warn_stale_pagecache(dio->iocb->ki_filp);
        }
+       printk("%s: end invalidate page\n", __func__);

        inode_dio_end(dio->inode);

@@ -371,10 +378,17 @@ static void dio_bio_end_aio(struct bio *bio)
                 * went in between AIO submission and completion into the
                 * same region.
                 */
+               printk("%s:start to  sleep 5s.... \n", __func__);
+               mdelay(10 * 1000);
+               printk("%s: endto  sleep, defer_completion=%d op=%d nrpages=%lu\n",
+                       __func__, dio->defer_completion, dio->op,
+                       dio->inode->i_mapping->nrpages);
+
                if (dio->result)
                        defer_completion = dio->defer_completion ||
                                           (dio->op == REQ_OP_WRITE &&
                                            dio->inode->i_mapping->nrpages);
+               printk("%s: dio=%px defer_completion=%d\n",
+                       __func__, dio, defer_completion);
                if (defer_completion) {
                        INIT_WORK(&dio->complete_work, dio_aio_complete_work);
                        queue_work(dio->inode->i_sb->s_dio_done_wq,


Reproduction result:
# fio -filename=/home/test/test -direct=0 -buffered=1 -iodepth 1 -thread -rw=write -rwmixread=70 -ioengine=libaio -bs=4k -size=16k -numjobs=1 -name=test_r_w
test_r_w: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 
4096B-4096B, ioengine=libaio, iodepth=1
fio-3.7
Starting 1 thread
test_r_w: (groupid=0, jobs=1): err= 0: pid=2083: Sun Sep 26 03:18:06 2021
   write: IOPS=444, BW=1778KiB/s (1820kB/s)(16.0KiB/9msec)
     slat (usec): min=1294, max=3789, avg=2007.88, stdev=1191.07
     clat (nsec): min=2304, max=28645, avg=9918.75, stdev=12574.80
      lat (usec): min=1298, max=3824, avg=2020.36, stdev=1206.00
     clat percentiles (nsec):
      |  1.00th=[ 2320],  5.00th=[ 2320], 10.00th=[ 2320], 20.00th=[ 2320],
      | 30.00th=[ 2960], 40.00th=[ 2960], 50.00th=[ 2960], 60.00th=[ 5792],
      | 70.00th=[ 5792], 80.00th=[28544], 90.00th=[28544], 95.00th=[28544],
      | 99.00th=[28544], 99.50th=[28544], 99.90th=[28544], 99.95th=[28544],
      | 99.99th=[28544]
   lat (usec)   : 4=50.00%, 10=25.00%, 50=25.00%
   cpu          : usr=0.00%, sys=87.50%, ctx=0, majf=0, minf=1
   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
 >=64=0.0%
      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.0%
      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.0%
      issued rwts: total=0,4,0,0 short=0,0,0,0 dropped=0,0,0,0
      latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   WRITE: bw=1778KiB/s (1820kB/s), 1778KiB/s-1778KiB/s 
(1820kB/s-1820kB/s), io=16.0KiB (16.4kB), run=9-9msec

Disk stats (read/write):
   sda: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
# fio -filename=/home/test/test -direct=1 -buffered=0 -iodepth 1 -thread -rw=write -rwmixread=70 -ioengine=libaio -bs=4k -size=4k -numjobs=1 -name=test_r_w &
[1] 2087
[root@localhost home]# test_r_w: (g=0): rw=write, bs=(R) 4096B-4096B, 
(W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.7
Starting 1 thread

[root@localhost home]# [ 1475.286928] ext4_file_write_iter: start lock inode
[ 1475.288807] ext4_file_write_iter: unlock inode
[ 1475.290402] dio_bio_end_aio:start to  sleep 5s....

# fio -filename=/home/test/test -direct=0 -buffered=1 -iodepth 1 -thread -rw=write -rwmixread=70 -ioengine=libaio -bs=4k -size=16k -numjobs=1 -name=test_r_w
test_r_w: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 
4096B-4096B, ioengine=libaio, iodepth=1
fio-3.7
Starting 1 thread
Jobs: 1 (f=0): [f(1)][-.-%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 00m:00s]
Jobs: 1 (f=0): [f(1)][-.-%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 00m:00s]
[ 1485.292939] dio_bio_end_aio: endto  sleep, defer_completion=0 op=1 nrpages=4
[ 1485.293888] dio_bio_end_aio: dio=ffff88839aff5200 defer_completion=1
[ 1485.294821] dio_aio_complete_work: dio=ffff88839aff5200...
Jobs: 1 (f=1): [W(1)][-.-%][r=0KiB/s,w=0KiB/s]
[ 1485.296539] dio_complete: before invalidate page
[ 1495.298472] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!    --> the buffered IO / direct IO collision is detected here
[ 1495.302746] File: /home/test/test PID: 493 Comm: kworker/3:2
[ 1495.303508] CPU: 3 PID: 493 Comm: kworker/3:2 Not tainted 4.19.90-dirty #96
[ 1495.304386] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
[ 1495.306050] Workqueue: dio/sda dio_aio_complete_work
[ 1495.316879] dio_complete: end invalidate page
test_r_w: (groupid=0, jobs=1): err= 0: pid=2092: Sun Sep 26 03:18:37 2021
   write: IOPS=266, BW=1067KiB/s (1092kB/s)(16.0KiB/15msec)
     slat (usec): min=2001, max=3101, avg=2691.97, stdev=480.07
     clat (nsec): min=2058, max=12337, avg=5999.50, stdev=4456.54
      lat (usec): min=2004, max=3116, avg=2700.11, stdev=484.22
     clat percentiles (nsec):
      |  1.00th=[ 2064],  5.00th=[ 2064], 10.00th=[ 2064], 20.00th=[ 2064],
      | 30.00th=[ 4080], 40.00th=[ 4080], 50.00th=[ 4080], 60.00th=[ 5536],
      | 70.00th=[ 5536], 80.00th=[12352], 90.00th=[12352], 95.00th=[12352],
      | 99.00th=[12352], 99.50th=[12352], 99.90th=[12352], 99.95th=[12352],
      | 99.99th=[12352]
   lat (usec)   : 4=25.00%, 10=50.00%, 20=25.00%
   cpu          : usr=0.00%, sys=100.00%, ctx=1, majf=0, minf=1
   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
 >=64=0.0%
      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.0%
      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.0%
      issued rwts: total=0,4,0,0 short=0,0,0,0 dropped=0,0,0,0
      latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   WRITE: bw=1067KiB/s (1092kB/s), 1067KiB/s-1067KiB/s 
(1092kB/s-1092kB/s), io=16.0KiB (16.4kB), run=15-15msec

Disk stats (read/write):
   sda: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%

test_r_w: (groupid=0, jobs=1): err= 0: pid=2089: Sun Sep 26 03:18:37 2021
   write: IOPS=0, BW=204B/s (204B/s)(4096B/20033msec)
     slat (nsec): min=3358.7k, max=3358.7k, avg=3358748.00, stdev= 0.00
     clat (nsec): min=20027M, max=20027M, avg=20027265355.00, stdev= 0.00
      lat (nsec): min=20031M, max=20031M, avg=20030627283.00, stdev= 0.00
     clat percentiles (msec):
      |  1.00th=[17113],  5.00th=[17113], 10.00th=[17113], 20.00th=[17113],
      | 30.00th=[17113], 40.00th=[17113], 50.00th=[17113], 60.00th=[17113],
      | 70.00th=[17113], 80.00th=[17113], 90.00th=[17113], 95.00th=[17113],
      | 99.00th=[17113], 99.50th=[17113], 99.90th=[17113], 99.95th=[17113],
      | 99.99th=[17113]
   cpu          : usr=0.00%, sys=0.02%, ctx=2, majf=0, minf=1
   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
 >=64=0.0%
      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.0%
      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.0%
      issued rwts: total=0,1,0,0 short=0,0,0,0 dropped=0,0,0,0
      latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   WRITE: bw=204B/s (204B/s), 204B/s-204B/s (204B/s-204B/s), io=4096B 
(4096B), run=20033-20033msec

Disk stats (read/write):
   sda: ios=0/5, merge=0/1, ticks=0/10018, in_queue=10015, util=36.98%





Thread overview: 15+ messages
2021-10-13  1:46 Zhengyuan Liu
2021-10-18  1:09 ` Zhengyuan Liu
2021-10-18 18:43   ` Andrew Morton
2021-10-19  3:39     ` Zhengyuan Liu
2021-10-20 17:37 ` Jan Kara
2021-10-21  2:21   ` Zhengyuan Liu
2021-10-21  8:03     ` Jan Kara
2021-10-21  8:55       ` yebin [this message]
2021-10-21 12:20         ` Zhengyuan Liu
2021-10-29  8:16           ` yebin
2021-10-21 12:11       ` Zhengyuan Liu
2021-10-22  9:31         ` Jan Kara
2021-10-23  2:06           ` Zhengyuan Liu
2021-10-25 15:57             ` Jan Kara
2021-10-28 15:02               ` Zhengyuan Liu
