From: Baokun Li
To: Jan Kara
CC: Baokun Li
Date: Mon, 4 Dec 2023 21:50:18 +0800
Subject: Re: [PATCH -RFC 0/2] mm/ext4: avoid data corruption when extending DIO write race with buffered read
References: <20231202091432.8349-1-libaokun1@huawei.com> <20231204121120.mpxntey47rluhcfi@quack3>
In-Reply-To: <20231204121120.mpxntey47rluhcfi@quack3>

On 2023/12/4 20:11, Jan Kara wrote:
> Hello!

Thank you for your reply!

> On Sat 02-12-23 17:14:30, Baokun Li wrote:
>> Recently, while running some pressure tests on MYSQL, noticed that
>> occasionally a "corrupted data in log event" error would be reported.
>> After analyzing the error, I found that extending DIO write and buffered
>> read were competing, resulting in some zero-filled page end being read.
>> Since ext4 buffered read doesn't hold an inode lock, and there is no
>> field in the page to indicate the valid data size, it seems to me that
>> it is impossible to solve this problem perfectly without changing these
>> two things.
> Yes, combining buffered reads with direct IO writes is a recipe for
> problems and pretty much in the "don't do it" territory. So honestly I'd
> consider this a MYSQL bug. Were you able to identify why does MYSQL use
> buffered read in this case? It is just something specific to the test
> you're doing?

The setup is a one-master, two-slave MYSQL database on three physical
machines. With sysbench pressure testing running on each of the three
machines, the problem occurs about once every two to three hours.

The affected file is the relay log. When the problem occurs, a few dozen
bytes in the middle of the file are read back as all zeros, while the
data on disk is not zero. This is a journal-like file: a writer process
receives data from the master node and writes it locally, and a separate
replay process reads the file and replays its contents (some SQL
statements). During replay, the reader finds that the data it read is
corrupted and not valid SQL, while the data on disk is intact.

It's not confirmed that buffered reads vs direct IO writes are actually
causing this issue, but this is the only scenario that we can reproduce
with our local simplified scripts. Also, after merging in patch 1, the
MYSQL pressure test has now run for 5 days without the problem
reproducing.

I'll double-check the problem scenario, although buffered reads with
buffered writes don't seem to have this problem.

>> In this series, the first patch reads the inode size twice, and takes the
>> smaller of the two values as the copyout limit to avoid copying data that
>> was not actually read (0-padding) into the user buffer and causing data
>> corruption. This greatly reduces the probability of problems under 4k
>> page.
>> However, the problem is still easily triggered under 64k page.
>>
>> The second patch waits for the existing dio write to complete and
>> invalidate the stale page cache before performing a new buffered read
>> in ext4, avoiding data corruption by copying the stale page cache to
>> the user buffer. This makes it much less likely that the problem will
>> be triggered in a 64k page.
>>
>> Do we have a plan to add a lock to the ext4 buffered read or a field in
>> the page that indicates the size of the valid data in the page? Or does
>> anyone have a better idea?
> No, there are no plans to address this AFAIK. Because such locking will
> slow down all the well behaved applications to fix a corner case for
> application doing unsupported things. Sure we must not crash the kernel,
> corrupt the filesystem or leak sensitive (e.g. uninitialized) data if app
> combines buffered and direct IO but returning zeros instead of valid data
> is in my opinion fully within the range of acceptable behavior for such
> case.
>
> 								Honza

I also feel that a scenario like buffered reads + DIO writes is strange.
But in theory, when read doesn't return an error, the data read
shouldn't be wrong. I also tested that xfs guarantees data consistency
in this scenario, which is why I thought this might be a bug.

Thanks!
-- 
With Best Regards,
Baokun Li
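
P.S. For anyone who doesn't have the series at hand, here is a minimal
userspace sketch of the idea behind the first patch: sample the inode
size twice and trust only the smaller value when deciding how many bytes
to copy out. This is not the patch itself. copyout_limit() and its
parameters are hypothetical names, and sampling the size before and
after the page lookup is my assumption for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Illustrative only: copyout_limit() is a hypothetical helper, not a
 * function from the patch or from the kernel.
 *
 * isize_before: inode size sampled before the pages are looked up (assumed)
 * isize_after:  inode size sampled after the pages are looked up (assumed)
 * pos, want:    file offset and byte count requested by the read
 *
 * Returns the number of bytes that may be copied to the user buffer.
 */
static size_t copyout_limit(long long isize_before, long long isize_after,
                            long long pos, size_t want)
{
    long long end, avail;

    assert(pos >= 0);

    /* Trust only the smaller size: bytes past it may be zero padding. */
    end = isize_before < isize_after ? isize_before : isize_after;
    if (pos >= end)
        return 0;

    avail = end - pos;
    return (unsigned long long)avail < want ? (size_t)avail : want;
}
```

For example, if the size is sampled as 4096 before the lookup and 8192
after it, a read of 8192 bytes at offset 0 copies out only 4096 bytes;
the caller sees a short read instead of the zero-filled tail.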