From: Zhengyuan Liu
Date: Sat, 23 Oct 2021 10:06:24 +0800
Subject: Re: Problem with direct IO
To: Jan Kara
Cc: viro@zeniv.linux.org.uk, Andrew Morton, tytso@mit.edu, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-ext4@vger.kernel.org, =?UTF-8?B?5YiY5LqR?=, Zhengyuan Liu
References: <20211020173729.GF16460@quack2.suse.cz> <20211021080304.GB5784@quack2.suse.cz> <20211022093120.GG1026@quack2.suse.cz>
In-Reply-To: <20211022093120.GG1026@quack2.suse.cz>

On Fri, Oct 22, 2021 at 5:31 PM Jan Kara wrote:
>
> On Thu 21-10-21 20:11:43, Zhengyuan Liu wrote:
> > On Thu, Oct 21, 2021 at 4:03 PM Jan Kara wrote:
> > >
> > > On Thu 21-10-21 10:21:55, Zhengyuan Liu wrote:
> > > > On Thu, Oct 21, 2021 at 1:37 AM Jan Kara wrote:
> > > > > On Wed 13-10-21 09:46:46, Zhengyuan Liu wrote:
> > > > > > We are encountering the following MySQL crash while importing
> > > > > > tables:
> > > > > >
> > > > > > 2021-09-26T11:22:17.825250Z 0 [ERROR] [MY-013622] [InnoDB] [FATAL] fsync() returned EIO, aborting.
> > > > > > 2021-09-26T11:22:17.825315Z 0 [ERROR] [MY-013183] [InnoDB] Assertion failure: ut0ut.cc:555 thread 281472996733168
> > > > > >
> > > > > > At the same time, we found the following message in dmesg:
> > > > > >
> > > > > > [ 4328.838972] Page cache invalidation failure on direct I/O. Possible data corruption due to collision with buffered I/O!
> > > > > > [ 4328.850234] File: /data/mysql/data/sysbench/sbtest53.ibd PID: 625 Comm: kworker/42:1
> > > > > >
> > > > > > At first we suspected that MySQL was operating on the file with
> > > > > > direct IO and buffered IO interleaved, but after some checking we
> > > > > > found that it only does direct IO using aio. The problem comes
> > > > > > from the direct IO interface (__generic_file_write_iter) itself.
> > > > > >
> > > > > > ssize_t __generic_file_write_iter()
> > > > > > {
> > > > > > ...
> > > > > >     if (iocb->ki_flags & IOCB_DIRECT) {
> > > > > >         loff_t pos, endbyte;
> > > > > >
> > > > > >         written = generic_file_direct_write(iocb, from);
> > > > > >         /*
> > > > > >          * If the write stopped short of completing, fall back to
> > > > > >          * buffered writes. Some filesystems do this for writes to
> > > > > >          * holes, for example. For DAX files, a buffered write will
> > > > > >          * not succeed (even if it did, DAX does not handle dirty
> > > > > >          * page-cache pages correctly).
> > > > > >          */
> > > > > >         if (written < 0 || !iov_iter_count(from) || IS_DAX(inode))
> > > > > >             goto out;
> > > > > >
> > > > > >         status = generic_perform_write(file, from, pos = iocb->ki_pos);
> > > > > > ...
> > > > > > }
> > > > > >
> > > > > > From the code snippet above we can see that direct IO can fall
> > > > > > back to buffered IO under certain conditions, so even though MySQL
> > > > > > only does direct IO, it can interleave with buffered IO when the
> > > > > > fallback occurs. I have no idea why the FS (ext3) currently fails
> > > > > > the direct IO, but it is strange that __generic_file_write_iter
> > > > > > makes direct IO fall back to buffered IO; it seems to break the
> > > > > > semantics of direct IO.
> > > > > >
> > > > > > The reproduction environment is:
> > > > > > Platform: Kunpeng 920 (arm64)
> > > > > > Kernel: V5.15-rc
> > > > > > PAGESIZE: 64K
> > > > > > MySQL: V8.0
> > > > > > Innodb_page_size: default(16K)
> > > > >
> > > > > Thanks for the report. I agree this should not happen. How hard is
> > > > > this to reproduce? Any idea whether the fallback to buffered IO
> > > > > happens because iomap_dio_rw() returns -ENOTBLK or because it
> > > > > returns a short write?
> > > >
> > > > It is easy to reproduce in my test environment. As I said in my
> > > > previous reply to Andrew, this problem is related to the kernel page
> > > > size.
> > >
> > > Ok, can you share a reproducer?
> >
> > I don't have a simple test case; the whole procedure, shown below, is
> > somewhat complex.
> >
> > 1. Prepare the MySQL installation environment
> >    a. Prepare an SSD partition (at least 100G) as the MySQL data
> >       partition, format it as Ext3 and mount it on /data
> >       # mkfs.ext3 /dev/sdb1
> >       # mount /dev/sdb1 /data
> >    b. Create the MySQL user and user group
> >       # groupadd mysql
> >       # useradd -g mysql mysql
> >    c. Create the MySQL directories
> >       # mkdir -p /data/mysql
> >       # cd /data/mysql
> >       # mkdir data tmp run log
> >
> > 2. Install MySQL
> >    a. Download mysql-8.0.25-1.el8.aarch64.rpm-bundle.tar from
> >       https://downloads.mysql.com/archives/community/
> >    b. Install MySQL
> >       # tar -xvf mysql-8.0.25-1.el8.aarch64.rpm-bundle.tar
> >       # yum install openssl openssl-devel
> >       # rpm -ivh mysql-community-common-8.0.25-1.el8.aarch64.rpm \
> >           mysql-community-client-plugins-8.0.25-1.el8.aarch64.rpm \
> >           mysql-community-libs-8.0.25-1.el8.aarch64.rpm \
> >           mysql-community-client-8.0.25-1.el8.aarch64.rpm \
> >           mysql-community-server-8.0.25-1.el8.aarch64.rpm \
> >           mysql-community-devel-8.0.25-1.el8.aarch64.rpm
> >
> > 3. Configure MySQL
> >    a. # chown mysql:mysql /etc/my.cnf
> >    b. # vim /etc/my.cnf
> >       innodb_flush_method = O_DIRECT
> >       default-storage-engine=INNODB
> >       datadir=/data/mysql/data
> >       socket=/data/mysql/run/mysql.sock
> >       tmpdir=/data/mysql/tmp
> >       log-error=/data/mysql/log/mysqld.log
> >       pid-file=/data/mysql/run/mysqld.pid
> >       port=3306
> >       user=mysql
> >    c. Initialize MySQL (the problem may reproduce at this stage)
> >       # mysqld --defaults-file=/etc/my.cnf --initialize
> >    d. Start MySQL
> >       # mysqld --defaults-file=/etc/my.cnf &
> >    e. Log in to MySQL
> >       # mysql -uroot -p -S /data/mysql/run/mysql.sock
> >       You can see the temporary password from step 3.c
> >    f. Configure access
> >       mysql> alter user 'root'@'localhost' identified by "123456";
> >       mysql> create user 'root'@'%' identified by '123456';
> >       mysql> grant all privileges on *.* to 'root'@'%'; flush privileges;
> >       mysql> create database sysbench;
> >
> > 4. Use sysbench to test MySQL
> >    a. Install sysbench from
> >       https://github.com/akopytov/sysbench/archive/master.zip
> >    b. Use the following script to reproduce the problem (may need
> >       dozens of minutes)
> >       while true ; do
> >           sysbench /usr/local/share/sysbench/oltp_write_only.lua \
> >               --table-size=1000000 --tables=100 \
> >               --threads=32 --db-driver=mysql --mysql-db=sysbench \
> >               --mysql-host=127.0.0.1 --mysql-port=3306 \
> >               --mysql-user=root --mysql-password=123456 \
> >               --mysql-socket=/var/lib/mysql/mysql.sock prepare
> >
> >           sleep 5
> >           sysbench /usr/local/share/sysbench/oltp_write_only.lua \
> >               --table-size=1000000 --tables=100 \
> >               --threads=32 --db-driver=mysql --mysql-db=sysbench \
> >               --mysql-host=127.0.0.1 --mysql-port=3306 \
> >               --mysql-user=root --mysql-password=123456 \
> >               --mysql-socket=/var/lib/mysql/mysql.sock cleanup
> >
> >           sleep 5
> >       done
> >
> > If you can't reproduce it, we could provide a remote environment for
> > you or connect to your machine to build a reproduction environment.
>
> Ah, not that simple, also it isn't that easy to get arm64 machine for
> experiments for me. Connecting to your environment would be possible but
> let's try remote debugging for a bit more ;)
>
> > > > > Can you post output of "dumpe2fs -h " for the filesystem where the
> > > > > problem happens? Thanks!
> > > >
> > > > Sure, the output is:
> > > >
> > > > # dumpe2fs -h /dev/sda3
> > > > dumpe2fs 1.45.3 (14-Jul-2019)
> > > > Filesystem volume name:
> > > > Last mounted on:          /data
> > > > Filesystem UUID:          09a51146-b325-48bb-be63-c9df539a90a1
> > > > Filesystem magic number:  0xEF53
> > > > Filesystem revision #:    1 (dynamic)
> > > > Filesystem features:      has_journal ext_attr resize_inode dir_index
> > > >                           filetype needs_recovery sparse_super large_file
> > >
> > > Thanks for the data. OK, a filesystem without extents. Does your test by
> > > any chance try to do direct IO to a hole in a file? Because that is not
> > > (and never was) supported without extents. Also the fact that you don't
> > > see the problem with ext4 (which means extents support) would be
> > > pointing in that direction.
> >
> > I am not sure whether it tries to do direct IO to a hole or not. Is there
> > any way to check? If you have a simple test to reproduce, please let me
> > know; we are glad to try.
>
> Can you enable the following tracing?

Sure, but let's confirm one thing before doing that. Ext4 does not seem to
use iomap for direct IO in V4.19, and we can reproduce the problem on that
kernel as well. Is the following tracing still necessary, or should we
adjust it when running V4.19?

> echo 1 >/sys/kernel/debug/tracing/events/ext4/ext4_ind_map_blocks_exit/enable
> echo iomap_dio_rw >/sys/kernel/debug/tracing/set_ftrace_filter
> echo "function_graph" >/sys/kernel/debug/tracing/current_tracer
>
> And then gather output from /sys/kernel/debug/tracing/trace_pipe. Once the
> problem reproduces, you can gather the problematic file name from dmesg,
> find inode number from "stat " and provide that all to me? Thanks!
>
> Honza
> --
> Jan Kara
> SUSE Labs, CR
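
On the question above of whether the data file contains holes, here is a
minimal sketch in C. It assumes a Linux system where lseek() supports
SEEK_DATA and SEEK_HOLE. The helper name count_hole_bytes() is hypothetical
and not from this thread. It only reports holes that exist at the moment it
runs, so it cannot prove that an earlier direct IO write landed in a hole,
and on a filesystem without SEEK_HOLE support the kernel may report the
whole file as data.

```c
/* Hypothetical helper, not taken from this thread. */
#define _GNU_SOURCE            /* exposes SEEK_DATA and SEEK_HOLE */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Linux values, used only if the headers above did not provide them. */
#ifndef SEEK_DATA
#define SEEK_DATA 3
#endif
#ifndef SEEK_HOLE
#define SEEK_HOLE 4
#endif

/*
 * Walk the file at `path` and add up the bytes that lie in holes.
 * Each hole is printed as a half-open byte range.
 * Returns 0 on success and -1 on error (errno is set).
 */
int count_hole_bytes(const char *path, off_t *hole_bytes, off_t *file_size)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    off_t size = lseek(fd, 0, SEEK_END);
    off_t holes = 0, pos = 0;
    int ret = size < 0 ? -1 : 0;

    while (ret == 0 && pos < size) {
        /* Next offset at or after pos that contains data. */
        off_t data = lseek(fd, pos, SEEK_DATA);
        if (data < 0 && errno == ENXIO) {
            data = size;            /* only a hole remains up to EOF */
        } else if (data < 0) {
            ret = -1;
            break;
        }
        if (data > size)
            data = size;            /* file grew while we were scanning */
        if (data > pos) {
            printf("hole: [%lld, %lld)\n", (long long)pos, (long long)data);
            holes += data - pos;
        }
        if (data >= size)
            break;
        /* End of the current data extent, i.e. start of the next hole. */
        pos = lseek(fd, data, SEEK_HOLE);
        if (pos < 0)
            ret = -1;
    }

    int saved_errno = errno;
    close(fd);
    errno = saved_errno;
    if (ret == 0) {
        *hole_bytes = holes;
        *file_size = size;
    }
    return ret;
}
```

A caller would pass the file named in dmesg, for example
/data/mysql/data/sysbench/sbtest53.ibd, and compare the returned hole byte
count with the file size. A nonzero count while the import is running would
be consistent with direct IO writes hitting holes, but it is only a hint.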