Message-ID: <63b1e234-e005-a62b-82c5-fa7acf26d53a@huawei.com>
Date: Thu, 7 Dec 2023 22:15:55 +0800
Subject: Re: [PATCH -RFC 0/2] mm/ext4: avoid data corruption when extending DIO write race with buffered read
To: Jan Kara
CC: Baokun Li
References: <20231202091432.8349-1-libaokun1@huawei.com>
 <20231204121120.mpxntey47rluhcfi@quack3>
 <20231204144106.fk4yxc422gppifsz@quack3>
 <70b274c2-c19a-103b-4cf4-b106c698ddcc@huawei.com>
 <20231206193757.k5cppxqew6zjmbx3@quack3>
From: Baokun Li
In-Reply-To: <20231206193757.k5cppxqew6zjmbx3@quack3>

On 2023/12/7 3:37, Jan Kara wrote:
> On Tue 05-12-23 20:50:30, Baokun Li wrote:
>> On 2023/12/4 22:41, Jan Kara wrote:
>>> On Mon 04-12-23 21:50:18, Baokun Li wrote:
>>>> On 2023/12/4 20:11, Jan Kara wrote:
>>>> The problem is with a one-master-twoslave MYSQL database with three
>>>> physical machines, and using sysbench pressure testing on each of the
>>>> three machines, the problem occurs about once every two to three hours.
>>>>
>>>> The problem is with the relay log file, and when the problem occurs, the
>>>> middle dozens of bytes of the file are read as all zeros, while the data on
>>>> disk is not. This is a journal-like file where a write process gets the data
>>>> from the master node and writes it locally, and another replay process reads
>>>> the file and performs the replay operation accordingly (some SQL statements).
>>>> The problem is that when replaying, it finds that the data read is
>>>> corrupted, not valid SQL data, while the data on disk is normal.
>>>>
>>>> It's not confirmed that buffered reads vs direct IO writes is actually
>>>> causing this issue, but this is the only scenario that we can reproduce
>>>> with our local simplified scripts. Also, after merging in patch 1, the
>>>> MYSQL pressure test scenario has now been tested for 5 days and has not
>>>> been reproduced.
>>>>
>>>> I'll double-check the problem scenario, although buffered reads with
>>>> buffered writes doesn't seem to have this problem.
>>> Yeah, from what you write it seems that the replay code is using buffered
>>> reads on the journal file. I guess you could confirm that with a bit of
>>> kernel tracing but the symptoms look pretty convincing. Did you try talking
>>> to MYSQL guys about why they are doing this?
>> The operations performed on the relay log file are buffered reads and
>> writes, which I confirmed with the following bpftrace script:
>> ```
>> #include
>> #include
>> #include
>>
>> kprobe:generic_file_buffered_read /!strncmp(str(((struct kiocb *)arg0)->ki_filp->f_path.dentry->d_name.name), "relay", 5)/ {
>>     printf("read path: %s\n", str(((struct kiocb *)arg0)->ki_filp->f_path.dentry->d_name.name));
>> }
>>
>> kprobe:ext4_buffered_write_iter /!strncmp(str(((struct kiocb *)arg0)->ki_filp->f_path.dentry->d_name.name), "relay", 5)/ {
>>     printf("write path: %s\n", str(((struct kiocb *)arg0)->ki_filp->f_path.dentry->d_name.name));
>> }
>> ```
>> I suspect there are DIO writes causing the problem, but I haven't caught
>> any DIO writes to such files via bpftrace.
> Interesting. Not sure how your partially zeroed-out buffers could happen
> with fully buffered IO.
>

After going through the code repeatedly, I think the following
interleaving of a buffered write and a buffered read can bypass the
memory barrier:

ext4_buffered_write_iter
 generic_perform_write
  copy_page_from_iter_atomic
  ext4_da_write_end
   ext4_da_do_write_end
    block_write_end
     __block_commit_write
      folio_mark_uptodate
       smp_wmb()
       set_bit(PG_uptodate, folio_flags(folio, 0))
    i_size_write(inode, pos + copied)
    // write isize 2048
    unlock_page(page)

ext4_file_read_iter
 generic_file_read_iter
  filemap_read
   filemap_get_pages
    filemap_get_read_batch
    folio_test_uptodate(folio)
     ret = test_bit(PG_uptodate, folio_flags(folio, 0));
     if (ret)
      smp_rmb();
      // The read barrier here ensures
      // that data 0-2048 in the page is synchronized.
                           ext4_buffered_write_iter
                            generic_perform_write
                             copy_page_from_iter_atomic
                             ext4_da_write_end
                              ext4_da_do_write_end
                               block_write_end
                                __block_commit_write
                                 folio_mark_uptodate
                                  smp_wmb()
                                  set_bit(PG_uptodate, folio_flags(folio, 0))
                               i_size_write(inode, pos + copied)
                               // write isize 4096
                               unlock_page(page)
   // read isize 4096
   isize = i_size_read(inode)
   // But there is no read barrier here,
   // so the data in the 2048-4096 range
   // may not be synchronized yet !!!
   copy_page_to_iter()
   // copyout 4096

In the interleaving above, the reader sees the updated i_size, but no
read barrier orders that load against the page contents, so nothing
guarantees that the data in the page is consistent with the i_size just
read. We may therefore copy out page data that is not yet visible to the
reader.

Is it expected that we can read zero-filled data in this case?

Thanks!
-- 
With Best Regards,
Baokun Li
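
P.S. To make the ordering question concrete, here is a minimal user-space
sketch of the pattern, not kernel code. It assumes C11 atomics as a stand-in
for a barrier pairing on the size field: the writer publishes the new size
with release ordering only after filling the data, and the reader loads the
size with acquire ordering before touching the data. All names (page, isize,
writer, reader, run_demo, PAGE_SZ, CHUNK) are hypothetical and chosen only
for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SZ 4096
#define CHUNK   2048

/* Hypothetical stand-ins for one page-cache page and inode->i_size. */
static unsigned char page[PAGE_SZ];
static atomic_size_t isize;

/* Writer: fill the next chunk, then publish the new size (release). */
static void *writer(void *arg)
{
    (void)arg;
    for (size_t end = CHUNK; end <= PAGE_SZ; end += CHUNK) {
        memset(page + end - CHUNK, 0xAB, CHUNK);
        atomic_store_explicit(&isize, end, memory_order_release);
    }
    return NULL;
}

/* Reader: load the size (acquire), then count zero bytes below it. */
static void *reader(void *arg)
{
    size_t *stale = arg;
    size_t n;

    do {
        n = atomic_load_explicit(&isize, memory_order_acquire);
        assert(n <= PAGE_SZ);
        for (size_t i = 0; i < n; i++)
            if (page[i] == 0)
                (*stale)++;
    } while (n < PAGE_SZ);
    return NULL;
}

/* Run one writer and one reader; return the number of stale bytes seen. */
size_t run_demo(void)
{
    pthread_t w, r;
    size_t stale = 0;

    memset(page, 0, sizeof(page));
    atomic_store(&isize, 0);
    if (pthread_create(&w, NULL, writer, NULL) != 0)
        return (size_t)-1;
    if (pthread_create(&r, NULL, reader, &stale) != 0) {
        pthread_join(w, NULL);
        return (size_t)-1;
    }
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return stale;
}
```

Built with something like `cc -pthread`, run_demo() should return 0, because
the acquire load of the size orders the later reads of the data. If both
accesses were relaxed instead, the language would no longer guarantee that,
which is the analogue of the missing read barrier after i_size_read() in the
trace above.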