Date: Wed, 6 Dec 2023 19:35:35 +1100
From: Dave Chinner
To: Baokun Li
Cc: Jan Kara, linux-mm@kvack.org, linux-ext4@vger.kernel.org, tytso@mit.edu, adilger.kernel@dilger.ca, willy@infradead.org, akpm@linux-foundation.org, ritesh.list@gmail.com, linux-kernel@vger.kernel.org, yi.zhang@huawei.com, yangerkun@huawei.com, yukuai3@huawei.com
Subject: Re: [PATCH -RFC 0/2] mm/ext4: avoid data corruption when extending DIO write race with buffered read
References: <20231202091432.8349-1-libaokun1@huawei.com> <20231204121120.mpxntey47rluhcfi@quack3>

On Mon, Dec 04, 2023 at 09:50:18PM +0800, Baokun Li wrote:
> On 2023/12/4 20:11, Jan Kara wrote:
> > Hello!
> Thank you for your reply!
> >
> > On Sat 02-12-23 17:14:30, Baokun Li wrote:
> > > Recently, while running some pressure tests on MYSQL, noticed that
> > > occasionally a "corrupted data in log event" error would be reported.
> > > After analyzing the error, I found that extending DIO write and buffered
> > > read were competing, resulting in some zero-filled page end being read.
> > > Since ext4 buffered read doesn't hold an inode lock, and there is no
> > > field in the page to indicate the valid data size, it seems to me that
> > > it is impossible to solve this problem perfectly without changing these
> > > two things.
> > Yes, combining buffered reads with direct IO writes is a recipe for
> > problems and pretty much in the "don't do it" territory. So honestly I'd
> > consider this a MYSQL bug. Were you able to identify why does MYSQL use
> > buffered read in this case? It is just something specific to the test
> > you're doing?
> The problem is with a one-master-twoslave MYSQL database with three
> physical machines, and using sysbench pressure testing on each of the
> three machines, the problem occurs about once every two to three hours.
>
> The problem is with the relay log file, and when the problem occurs, the
> middle dozens of bytes of the file are read as all zeros, while the data on
> disk is not. This is a journal-like file where a write process gets the data
> from the master node and writes it locally, and another replay process reads
> the file and performs the replay operation accordingly (some SQL statements).
> The problem is that when replaying, it finds that the data read is
> corrupted, not valid SQL data, while the data on disk is normal.
>
> It's not confirmed that buffered reads vs direct IO writes is actually
> causing this issue, but this is the only scenario that we can reproduce with
> our local simplified scripts. Also, after merging in patch 1, the MYSQL
> pressure test scenario has now been tested for 5 days and has not been
> reproduced.

Mixing overlapping buffered reads with direct writes - especially
partial block extending DIO writes - is a recipe for data
corruption. It's not a matter of if, it's a matter of when.

Fundamentally, when you have overlapping write IO involving DIO, the
result of the overlapping IOs is undefined. One cannot control
submission order, the order in which the overlapping IOs hit the
media, or completion ordering that might clear flags like unwritten
extents. The only guarantee that we give in this case is that we
won't expose stale data from the disk to the user read.

As such, it is the responsibility of the application to avoid
overlapping IOs when doing direct IO. The fact that you've observed
data corruption due to overlapping IO ranges from the one
application indicates that this is, indeed, an application-level
problem and can only be solved by fixing the application....

> I'll double-check the problem scenario, although buffered reads with
> buffered writes doesn't seem to have this problem.
> > > In this series, the first patch reads the inode size twice, and takes the
> > > smaller of the two values as the copyout limit to avoid copying data that
> > > was not actually read (0-padding) into the user buffer and causing data
> > > corruption. This greatly reduces the probability of problems under 4k
> > > page. However, the problem is still easily triggered under 64k page.
> > >
> > > The second patch waits for the existing dio write to complete and
> > > invalidate the stale page cache before performing a new buffered read
> > > in ext4, avoiding data corruption by copying the stale page cache to
> > > the user buffer. This makes it much less likely that the problem will
> > > be triggered in a 64k page.
> > >
> > > Do we have a plan to add a lock to the ext4 buffered read or a field in
> > > the page that indicates the size of the valid data in the page? Or does
> > > anyone have a better idea?
> > No, there are no plans to address this AFAIK.
> > Because such locking will
> > slow down all the well behaved applications to fix a corner case for
> > application doing unsupported things. Sure we must not crash the kernel,
> > corrupt the filesystem or leak sensitive (e.g. uninitialized) data if app
> > combines buffered and direct IO but returning zeros instead of valid data
> > is in my opinion fully within the range of acceptable behavior for such
> > case.
> >
> > Honza
> I also feel that a scenario like buffered reads + DIO writes is strange. But
> theoretically when read doesn't return an error, the data read shouldn't be
> wrong.

The data that was read isn't wrong - it just wasn't what the
application expected.

> And I tested that xfs guarantees data consistency in this scenario, which is
> why I thought it might be buggy.

XFS certainly does not guarantee data consistency between buffered
reads and DIO writes, especially for overlapping IO ranges using DIO
(see above).

IOWs, the fact that the data inconsistency doesn't reproduce on XFS
doesn't mean that XFS is providing some guarantee of consistency for
this IO pattern. All it means is that ext4 and XFS behave
differently in a situation where the operational result is
indeterminate and all we guarantee is that we won't expose stale
data read from disk....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
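
As a rough illustration of what "avoid overlapping IOs" can look like
at the application level, the sketch below has the writer publish a
committed length only after its append has completed, and the reader
clamp every read to that length, so a read never touches the range an
extending write is still filling in. This is an assumption about one
possible application-side fix, not something taken from MySQL or from
the patches in this thread. `CommittedLog`, `append`, `read` and `demo`
are hypothetical names, a single writer is assumed, and plain buffered
file I/O stands in for the O_DIRECT write so that the example runs on
any filesystem.

```python
import os
import tempfile
import threading


class CommittedLog:
    """Append-only log whose reader never reads past the published length.

    Hypothetical helper for illustration only. Assumes a single writer.
    """

    def __init__(self, path):
        self._path = path
        self._lock = threading.Lock()
        self._committed = 0  # bytes the writer has finished writing
        # Create the file if it does not exist, without truncating it.
        os.close(os.open(path, os.O_CREAT | os.O_WRONLY, 0o600))

    def append(self, payload):
        """Write one record at the end, then publish the new length."""
        with self._lock:
            start = self._committed
        with open(self._path, "r+b") as f:
            f.seek(start)
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
        # Publish only after the write has fully completed, so readers
        # never see a range that is still being extended.
        with self._lock:
            self._committed = start + len(payload)
        return start

    def read(self, offset, length):
        """Read up to `length` bytes, clamped to the committed length."""
        with self._lock:
            limit = self._committed
        end = min(offset + length, limit)
        if end <= offset:
            return b""
        with open(self._path, "rb") as f:
            f.seek(offset)
            return f.read(end - offset)


def demo():
    with tempfile.TemporaryDirectory() as d:
        log = CommittedLog(os.path.join(d, "relay.log"))
        log.append(b"INSERT 1;")
        log.append(b"INSERT 2;")
        # A large read is clamped to what the writer has published,
        # and a read starting at the committed length returns nothing.
        return log.read(0, 4096), log.read(9, 4), log.read(18, 10)
```

In a real multi-process setup the committed length would have to be
shared some other way (for example a small index file or shared
memory), which this sketch does not attempt.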