From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 1D901C4332F
	for <linux-mm@archiver.kernel.org>; Tue, 12 Dec 2023 02:15:44 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id 4064E6B026E; Mon, 11 Dec 2023 21:15:44 -0500 (EST)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 3B58E6B026F; Mon, 11 Dec 2023 21:15:44 -0500 (EST)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 27EBD6B0270; Mon, 11 Dec 2023 21:15:44 -0500 (EST)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17])
	by kanga.kvack.org (Postfix) with ESMTP id 19F116B026E
	for <linux-mm@kvack.org>; Mon, 11 Dec 2023 21:15:44 -0500 (EST)
Received: from smtpin06.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay09.hostedemail.com (Postfix) with ESMTP id CF395808EB
	for <linux-mm@kvack.org>; Tue, 12 Dec 2023 02:15:43 +0000 (UTC)
X-FDA: 81556550166.06.6F179B1
Received: from szxga08-in.huawei.com (szxga08-in.huawei.com [45.249.212.255])
	by imf01.hostedemail.com (Postfix) with ESMTP id 596C240009
	for <linux-mm@kvack.org>; Tue, 12 Dec 2023 02:15:39 +0000 (UTC)
Authentication-Results: imf01.hostedemail.com;
	dkim=none;
	spf=pass (imf01.hostedemail.com: domain of libaokun1@huawei.com designates 45.249.212.255 as permitted sender) smtp.mailfrom=libaokun1@huawei.com;
	dmarc=pass (policy=quarantine) header.from=huawei.com
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com;
	s=arc-20220608; t=1702347341;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=tGFaICDAZjVnebOCiYVm7diIYMqEbvcXdBU2zzc9kXM=;
	b=le1qOdXvyI89/O4IuOZJQoZ/hpCHQNOCOaztiLga1gclZr9v2AhmxnqlDD6hSzZv02M/WN
	mVh+VxlzRmgMLppoS2mKgPm667OwVEyzFCNwPHOLAh0RZpP+whhlPvDkjHeUr4nver517X
	SDBnFqd7f+6eW0kriWSfhLGqLOndZYk=
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1702347342; a=rsa-sha256;
	cv=none;
	b=S1DvQfOG1ECEFG7CX4SsV+sKPkIPHYDEJrY4Y8lwLs7QKGtqYPXRrMuWmMAar6Nm6vY/c2
	V3PF0Qy9RGoMJf5kSmkyVQjfbu9bCk97kG6gf9ByFePfTj504pmx7gK1Ld3nugpO2KqH+y
	16cgo5XvMQzDU6Bt9gGbLHv4UpCEdq0=
ARC-Authentication-Results: i=1;
	imf01.hostedemail.com;
	dkim=none;
	spf=pass (imf01.hostedemail.com: domain of libaokun1@huawei.com designates 45.249.212.255 as permitted sender) smtp.mailfrom=libaokun1@huawei.com;
	dmarc=pass (policy=quarantine) header.from=huawei.com
Received: from mail.maildlp.com (unknown [172.19.162.254])
	by szxga08-in.huawei.com (SkyGuard) with ESMTP id 4Sq2Jd5CNTz1Q6Zj;
	Tue, 12 Dec 2023 10:15:29 +0800 (CST)
Received: from dggpeml500021.china.huawei.com (unknown [7.185.36.21])
	by mail.maildlp.com (Postfix) with ESMTPS id A4C1318002F;
	Tue, 12 Dec 2023 10:15:35 +0800 (CST)
Received: from [10.174.177.174] (10.174.177.174) by
 dggpeml500021.china.huawei.com (7.185.36.21) with Microsoft SMTP Server
 (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id
 15.1.2507.35; Tue, 12 Dec 2023 10:15:34 +0800
Message-ID: <245b1784-5c92-5d15-df33-c50de687faf9@huawei.com>
Date: Tue, 12 Dec 2023 10:15:34 +0800
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
 Thunderbird/102.1.2
Subject: Re: [PATCH -RFC 0/2] mm/ext4: avoid data corruption when extending
 DIO write race with buffered read
Content-Language: en-US
To: Jan Kara <jack@suse.cz>
CC: <linux-mm@kvack.org>, <linux-ext4@vger.kernel.org>, <tytso@mit.edu>,
	<adilger.kernel@dilger.ca>, <willy@infradead.org>,
	<akpm@linux-foundation.org>, <ritesh.list@gmail.com>,
	<linux-kernel@vger.kernel.org>, <yi.zhang@huawei.com>,
	<yangerkun@huawei.com>, <yukuai3@huawei.com>, Baokun Li
	<libaokun1@huawei.com>
References: <20231202091432.8349-1-libaokun1@huawei.com>
 <20231204121120.mpxntey47rluhcfi@quack3>
 <b524ccf7-e5a0-4a55-db6e-b67989055a05@huawei.com>
 <20231204144106.fk4yxc422gppifsz@quack3>
 <70b274c2-c19a-103b-4cf4-b106c698ddcc@huawei.com>
 <20231206193757.k5cppxqew6zjmbx3@quack3>
 <63b1e234-e005-a62b-82c5-fa7acf26d53a@huawei.com>
 <20231211174952.mdwyh7fmu7rjzfn3@quack3>
From: Baokun Li <libaokun1@huawei.com>
In-Reply-To: <20231211174952.mdwyh7fmu7rjzfn3@quack3>
Content-Type: text/plain; charset="UTF-8"; format=flowed
Content-Transfer-Encoding: 8bit
X-Originating-IP: [10.174.177.174]
X-ClientProxiedBy: dggems706-chm.china.huawei.com (10.3.19.183) To
 dggpeml500021.china.huawei.com (7.185.36.21)
X-Rspamd-Queue-Id: 596C240009
X-Rspam-User: 
X-Stat-Signature: 1c64w96gr7kjfgka97rfid7ig5cwan7e
X-Rspamd-Server: rspam03
X-HE-Tag: 1702347339-804701
X-HE-Meta: U2FsdGVkX18Dj9yNbe9wpCdqsM1gNBSH4tPVnsatDfAktS4/6I6y1+gzBvLhTaVTz4HVkdIvqU8I4vMv03WNmDMlAxK9Lz6yFxIzqdPbA0+RQAQwbuRnXzQpUpIHg3Hp6BBsAw69Oajx5eFELWbTyBWsyeGZ3OtR1dNGuzF7pIoCt2j8m/WjjT8a1XtL/CcQUrgQCq3JDUYomNmFk1dY2jpOHC0S18+R0yUCAFfjd5RwNF8UvcQigUuQA7QsIis0eGdNjufpyp2YmwsPnpoU/nidxhJq5cdpXxT+bNpbSFHWBMq2JrNewM3SSwwfFi7ZuP4K95fpo/SP2BjV3FG/kO5fBdgSiF5R32lh28tKpUaeMYS9DPB/qFd8UbXjc4rBsmiUZdRxTSXlKcjZslH+NVUKlTA4qrBGIDqqG+9hzWAFWeudAds1ubzY4n/6Xn9ADUXpvivUmnMx602w6YG0GWlmKV2yyHYkpayRlTT653liLJGR/5aV/U0PXGX7Q9GSFIpU+i3cE1X1fmxiHWLgRTjou94FkQtDR7KDNLWzHA+Jc5uIpN49zySITcmLiv9zdZd6tkDXzjgn5buAsr7EX1g26yVqcwzMz3xJlwsuuP+eNfHlq+loSGtPvyKpq32BnAwrgEg9S0DFxfkccjYx6dGo+OJt07PxM4w7POMGxLprN9CcgZgLNoZlziyF2EnwUy+wtgBdBjsWVgv4e5POE+mSuuDCo0DvvozWvG3vEoMewOp3leaKdBYtEVd0eRSEhDcKYUCL3tN4ZFiiNOuzXx/beQ+trQwZyiXIxqAN0cGjwbjrEkXxZK/zbn/s/jVmTx1hF91UNTSNmVSAvAuUSwl/PZKkd8cc2n97Zjia1uAxfkHOVObMH7Us7cu/WBRINS7WqoGa5Mj8DzC2NCdhBMF6aaYVdG3xYEnAq2JDnThbQvoib4qKSFWShUq/Nb4i6Cnjibd4H1snjavHQqK
 mopMSAQb
 fIApkJFB0zy0q4Rxz/V4PCqfY38Sntd9k8cAnZAeoXbIQJM8l8wj7Jkjn+zjpSib/VUbeTdI6aZQc75NBRohQxC2USI5N7nUQq66QcS+UqZxh821L5qxF4P1VS8es0CWuFod4wpLHn1fglUiJpjyG4LJbDkLa+pfyAOO4JqiNC2ombxSayFFgELeWV4t/I/Mwr0a4bxvkq1Ff0SE=
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

On 2023/12/12 1:49, Jan Kara wrote:
> On Thu 07-12-23 22:15:55, Baokun Li wrote:
>> On 2023/12/7 3:37, Jan Kara wrote:
>>> On Tue 05-12-23 20:50:30, Baokun Li wrote:
>>>> On 2023/12/4 22:41, Jan Kara wrote:
>>>>> On Mon 04-12-23 21:50:18, Baokun Li wrote:
>>>>>> On 2023/12/4 20:11, Jan Kara wrote:
>>>>>> The problem is with a one-master-twoslave MYSQL database with three
>>>>>> physical machines, and using sysbench pressure testing on each of the
>>>>>> three machines, the problem occurs about once every two to three hours.
>>>>>>
>>>>>> The problem is with the relay log file, and when the problem occurs, the
>>>>>> middle dozens of bytes of the file are read as all zeros, while the data on
>>>>>> disk is not. This is a journal-like file where a write process gets the data
>>>>>> from
>>>>>> the master node and writes it locally, and another replay process reads the
>>>>>> file and performs the replay operation accordingly (some SQL statements).
>>>>>> The problem is that when replaying, it finds that the data read is
>>>>>> corrupted,
>>>>>> not valid SQL data, while the data on disk is normal.
>>>>>>
>>>>>> It's not confirmed that buffered reads vs direct IO writes is actually
>>>>>> causing this issue, but this is the only scenario that we can reproduce
>>>>>> with our local simplified scripts. Also, after merging in patch 1, the
>>>>>> MYSQL pressure test scenario has now been tested for 5 days and has not
>>>>>> been reproduced.
>>>>>>
>>>>>> I'll double-check the problem scenario, although buffered reads with
>>>>>> buffered writes doesn't seem to have this problem.
>>>>> Yeah, from what you write it seems that the replay code is using buffered
>>>>> reads on the journal file. I guess you could confirm that with a bit of
>>>>> kernel tracing but the symptoms look pretty convincing. Did you try talking
>>>>> to MYSQL guys about why they are doing this?
>>>> The operations performed on the relay log file are buffered reads and
>>>> writes, which I confirmed with the following bpftrace script:
>>>> ```
>>>> #include <linux/fs.h>
>>>> #include <linux/path.h>
>>>> #include <linux/dcache.h>
>>>>
>>>> kprobe:generic_file_buffered_read /!strncmp(str(((struct kiocb
>>>> *)arg0)->ki_filp->f_path.dentry->d_name.name), "relay", 5)/ {
>>>>       printf("read path: %s\n", str(((struct kiocb
>>>> *)arg0)->ki_filp->f_path.dentry->d_name.name));
>>>> }
>>>>
>>>> kprobe:ext4_buffered_write_iter /!strncmp(str(((struct kiocb
>>>> *)arg0)->ki_filp->f_path.dentry->d_name.name), "relay", 5)/ {
>>>>       printf("write path: %s\n", str(((struct kiocb
>>>> *)arg0)->ki_filp->f_path.dentry->d_name.name));
>>>> }
>>>> ```
>>>> I suspect there are DIO writes causing the problem, but I haven't caught
>>>> any DIO writes to such files via bpftrace.
>>> Interesting. Not sure how your partially zeroed-out buffers could happen
>>> with fully buffered IO.
>>>
>> After looking at the code again and again, the following concurrency
>> seems to bypass the memory barrier:
>>
>> ext4_buffered_write_iter
>>   generic_perform_write
>>    copy_page_from_iter_atomic
>>    ext4_da_write_end
>>     ext4_da_do_write_end
>>      block_write_end
>>       __block_commit_write
>>        folio_mark_uptodate
>>         smp_wmb()
>>         set_bit(PG_uptodate, folio_flags(folio, 0))
>>      i_size_write(inode, pos + copied)
>>      // write isize 2048
>>      unlock_page(page)
>>
>> ext4_file_read_iter
>>   generic_file_read_iter
>>    filemap_read
>>     filemap_get_pages
>>      filemap_get_read_batch
>>      folio_test_uptodate(folio)
>>       ret = test_bit(PG_uptodate, folio_flags(folio, 0));
>>       if (ret)
>>        smp_rmb();
>>        // The read barrier here ensures
>>        // that data 0-2048 in the page is synchronized.
>>                             ext4_buffered_write_iter
>>                              generic_perform_write
>>                               copy_page_from_iter_atomic
>>                               ext4_da_write_end
>>                                ext4_da_do_write_end
>>                                 block_write_end
>>                                  __block_commit_write
>>                                   folio_mark_uptodate
>>                                    smp_wmb()
>>                                    set_bit(PG_uptodate, folio_flags(folio,
>> 0))
>>                                 i_size_write(inode, pos + copied)
>>                                 // write isize 4096
>>                                 unlock_page(page)
>>     // read isize 4096
>>     isize = i_size_read(inode)
>>     // But there is no read barrier here,
>>     // so the data in the 2048-4096 range
>>     // may not be synchronized yet !!!
>>     copy_page_to_iter()
>>     // copyout 4096
>>
>> In the concurrency above, we read the updated i_size, but there is
>> no read barrier to ensure that the data in the page is the same as
>> the i_size at this point. Therefore, we may copy the unsynchronized
>> page out. Is it normal for us to read zero-filled data in this case?
> Indeed, I have checked and filemap_read() (but this dates back even to old
> do_generic_file_read() code) indeed does copy data only after checking
> uptodate flag and then sampling i_size so we may be copying state in the
> middle of the racing write and indeed there is nothing which would prevent
> prefetching page data before fetching inode size. I agree this is kind of
> nasty so I think adding a read barrier between i_size_read() and
> copy_page_to_iter() makes sense. Does it fix your issue with MYSQL?
>
> 								Honza
Thank you very much for confirming this issue!

Adding a read barrier between i_size_read() and copy_page_to_iter()
does seem to have solved the problem. The environment, which was
previously reproducing once every 2-3h, has been running for 60+h
without reproducing after adding the read barrier.

This is a Load-Load reordering issue, which only occurs on some
weak mem-ordering architectures (e.g. ARM64, ALPHA), but not
on strong mem-ordering architectures (e.g. X86).  This issue was
exactly discovered when migrating MYSQL services from X86 to
ARM64.

And theoretically the problem doesn't only happen on ext4,
filesystems that call filemap_read() but don't hold inode lock
(e.g. btrfs, f2fs, ubifs ...) will have this problem, while filesystems
with inode lock (e.g. xfs, nfs) won't have this problem.

Later I will send the patch that adds the memory barrier.

Thanks!
-- 
With Best Regards,
Baokun Li
.