From: Baokun Li <libaokun1@huawei.com>
To: Jan Kara
Cc: Baokun Li <libaokun1@huawei.com>
Subject: Re: [PATCH -RFC 0/2] mm/ext4: avoid data corruption when extending DIO write race with buffered read
Date: Tue, 5 Dec 2023 20:50:30 +0800
Message-ID: <70b274c2-c19a-103b-4cf4-b106c698ddcc@huawei.com>
In-Reply-To: <20231204144106.fk4yxc422gppifsz@quack3>
References: <20231202091432.8349-1-libaokun1@huawei.com>
 <20231204121120.mpxntey47rluhcfi@quack3>
 <20231204144106.fk4yxc422gppifsz@quack3>

On 2023/12/4 22:41, Jan Kara wrote:
> On Mon 04-12-23 21:50:18, Baokun Li wrote:
>> On 2023/12/4 20:11, Jan Kara wrote:
>>> On Sat 02-12-23 17:14:30, Baokun Li wrote:
>>>> Recently, while running some pressure tests on MySQL, I noticed that
>>>> occasionally a "corrupted data in log event" error would be reported.
>>>> After analyzing the error, I found that an extending DIO write and a
>>>> buffered read were racing, resulting in part of a zero-filled page
>>>> tail being read. Since the ext4 buffered read path doesn't hold the
>>>> inode lock, and there is no field in the page to indicate the size of
>>>> the valid data in the page, it seems to me that this problem cannot
>>>> be solved perfectly without changing those two things.
>>> Yes, combining buffered reads with direct IO writes is a recipe for
>>> problems and pretty much in the "don't do it" territory. So honestly
>>> I'd consider this a MySQL bug. Were you able to identify why MySQL
>>> uses a buffered read in this case? Is it just something specific to
>>> the test you're doing?
>> The problem occurs with a one-master-two-slave MySQL database spread
>> across three physical machines, with sysbench stress tests running on
>> each of the three machines; it reproduces about once every two to
>> three hours.
>>
>> The problem is with the relay log file: when it occurs, a few dozen
>> bytes in the middle of the file are read back as all zeros, while the
>> data on disk is not zeroed.
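For reference, the simplified scenario boils down to something like the
sketch below (an illustrative rewrite, not our actual script; the file
name, block size, and iteration count are arbitrary): one thread keeps
extending the file with O_DIRECT writes of a non-zero pattern while
another thread does plain buffered preads over the same range. The
symptom we chase is the reader reporting zeros for ranges the writer only
ever filled with 0xaa. Build with `gcc -pthread` and run on an ext4 mount.

```
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLKSZ  4096
#define BLOCKS 1024

static void *dio_writer(void *arg)
{
	void *buf;
	int fd = open("testfile", O_WRONLY | O_DIRECT);

	if (fd < 0 || posix_memalign(&buf, BLKSZ, BLKSZ))
		return NULL;
	memset(buf, 0xaa, BLKSZ);		/* never write zero bytes */
	for (int i = 0; i < BLOCKS; i++)	/* each write extends EOF */
		if (pwrite(fd, buf, BLKSZ, (off_t)i * BLKSZ) != BLKSZ)
			break;
	free(buf);
	close(fd);
	return NULL;
}

int main(void)
{
	char buf[BLKSZ];
	pthread_t t;
	int fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);

	close(fd);
	pthread_create(&t, NULL, dio_writer, NULL);

	fd = open("testfile", O_RDONLY);	/* plain buffered reader */
	for (int i = 0; i < BLOCKS; i++) {
		ssize_t n = pread(fd, buf, BLKSZ, (off_t)i * BLKSZ);

		/*
		 * Bytes inside EOF must be 0xaa; zeros mean the read
		 * returned a page tail the extending DIO write had not
		 * filled in yet.
		 */
		for (ssize_t j = 0; j < n; j++)
			if (buf[j] == 0)
				fprintf(stderr, "zeros at block %d offset %zd\n",
					i, j);
		if (n < BLKSZ)
			i--;			/* not extended yet, retry */
	}
	pthread_join(t, NULL);
	close(fd);
	return 0;
}
```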
>> This is a journal-like file where a writer process gets the data from
>> the master node and writes it locally, and another replay process reads
>> the file and performs the replay operation accordingly (some SQL
>> statements). The problem is that during replay, the process finds that
>> the data it read is corrupted and is not valid SQL, while the data on
>> disk is fine.
>>
>> It's not confirmed that buffered reads racing with direct IO writes are
>> actually causing this issue, but this is the only scenario that we can
>> reproduce with our local simplified scripts. Also, after merging in
>> patch 1, the MySQL pressure test scenario has now been running for 5
>> days without reproducing the problem.
>>
>> I'll double-check the problem scenario, although buffered reads
>> combined with buffered writes don't seem to have this problem.
> Yeah, from what you write it seems that the replay code is using
> buffered reads on the journal file. I guess you could confirm that with
> a bit of kernel tracing but the symptoms look pretty convincing. Did you
> try talking to the MySQL guys about why they are doing this?

The operations performed on the relay log file are buffered reads and
writes, which I confirmed with the following bpftrace script:

```
#include <linux/fs.h>
#include <linux/path.h>
#include <linux/dcache.h>

kprobe:generic_file_buffered_read
/!strncmp(str(((struct kiocb *)arg0)->ki_filp->f_path.dentry->d_name.name), "relay", 5)/
{
    printf("read path: %s\n", str(((struct kiocb *)arg0)->ki_filp->f_path.dentry->d_name.name));
}

kprobe:ext4_buffered_write_iter
/!strncmp(str(((struct kiocb *)arg0)->ki_filp->f_path.dentry->d_name.name), "relay", 5)/
{
    printf("write path: %s\n", str(((struct kiocb *)arg0)->ki_filp->f_path.dentry->d_name.name));
}
```

I suspect there are DIO writes causing the problem, but I haven't caught
any DIO writes to such files via bpftrace.

>>>> In this series, the first patch reads the inode size twice, and takes
>>>> the smaller of the two values as the copyout limit to avoid copying
>>>> data that was not actually read (0-padding) into the user buffer and
>>>> causing data corruption. This greatly reduces the probability of
>>>> problems with a 4k page size. However, the problem is still easily
>>>> triggered with a 64k page size.
>>>>
>>>> The second patch waits for the existing DIO write to complete and
>>>> invalidates the stale page cache before performing a new buffered
>>>> read in ext4, avoiding data corruption caused by copying the stale
>>>> page cache to the user buffer. This makes it much less likely that
>>>> the problem will be triggered with a 64k page size.
>>>>
>>>> Do we have a plan to add a lock to the ext4 buffered read, or a field
>>>> in the page that indicates the size of the valid data in the page? Or
>>>> does anyone have a better idea?
>>> No, there are no plans to address this AFAIK, because such locking
>>> would slow down all the well-behaved applications to fix a corner case
>>> for applications doing unsupported things. Sure, we must not crash the
>>> kernel, corrupt the filesystem, or leak sensitive (e.g. uninitialized)
>>> data if an app combines buffered and direct IO, but returning zeros
>>> instead of valid data is in my opinion fully within the range of
>>> acceptable behavior for such a case.
>> I also feel that a scenario like buffered reads + DIO writes is
>> strange. But theoretically, when read doesn't return an error, the data
>> read shouldn't be wrong. And I tested that XFS guarantees data
>> consistency in this scenario, which is why I thought it might be buggy.
> Yes, XFS has inherited stronger consistency guarantees from IRIX times
> than Linux filesystems traditionally had. We generally don't even
> guarantee buffered read vs buffered write atomicity (i.e., a buffered
> read can see a torn buffered write).
>
> 								Honza

I'm a bit confused here: buffered read vs buffered write uses the same
page and appears to be protected by a memory barrier, so how does the
inconsistency occur?
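To make sure I understand the kind of tearing you mean, is it the
interleaving in the sketch below? This is a hypothetical userspace
example (assuming a 4k page size, names and sizes made up) where a single
8k write() is copied into the page cache one page at a time, so a
concurrent buffered read could return with the first page already updated
while the second page is still old.

```
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define SZ 8192				/* spans two 4k pages */

static void *writer(void *arg)
{
	char buf[SZ];
	int fd = open("tornfile", O_WRONLY);

	/* Alternate between all-'A' and all-'B' single writes forever. */
	for (char c = 'A';; c = (c == 'A') ? 'B' : 'A') {
		memset(buf, c, SZ);
		pwrite(fd, buf, SZ, 0);	/* one write(), two page copies */
	}
	return NULL;
}

int main(void)
{
	char buf[SZ];
	pthread_t t;
	int fd = open("tornfile", O_RDWR | O_CREAT | O_TRUNC, 0644);

	memset(buf, 'A', SZ);
	pwrite(fd, buf, SZ, 0);		/* initialize with one pattern */

	pthread_create(&t, NULL, writer, NULL);
	for (;;) {			/* runs until a torn read is seen */
		pread(fd, buf, SZ, 0);
		if (buf[0] != buf[SZ - 1]) {
			fprintf(stderr, "torn read: %c vs %c\n",
				buf[0], buf[SZ - 1]);
			return 1;
		}
	}
}
```

If a single write() can become visible one page at a time like this, then
I see how a torn buffered write is observable without a read-side lock.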
Thanks!

-- 
With Best Regards,
Baokun Li