From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id C82A1C4167B
	for <linux-mm@archiver.kernel.org>; Tue, 12 Dec 2023 13:16:27 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id 0E8936B02CE; Tue, 12 Dec 2023 08:16:27 -0500 (EST)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 09A086B02CF; Tue, 12 Dec 2023 08:16:27 -0500 (EST)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id EA25A6B02D0; Tue, 12 Dec 2023 08:16:26 -0500 (EST)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0013.hostedemail.com [216.40.44.13])
	by kanga.kvack.org (Postfix) with ESMTP id D47106B02CE
	for <linux-mm@kvack.org>; Tue, 12 Dec 2023 08:16:26 -0500 (EST)
Received: from smtpin11.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay03.hostedemail.com (Postfix) with ESMTP id A5D92A0999
	for <linux-mm@kvack.org>; Tue, 12 Dec 2023 13:16:26 +0000 (UTC)
X-FDA: 81558215172.11.005424E
Received: from szxga01-in.huawei.com (szxga01-in.huawei.com [45.249.212.187])
	by imf18.hostedemail.com (Postfix) with ESMTP id 3169D1C0010
	for <linux-mm@kvack.org>; Tue, 12 Dec 2023 13:16:21 +0000 (UTC)
Authentication-Results: imf18.hostedemail.com;
	dkim=none;
	dmarc=pass (policy=quarantine) header.from=huawei.com;
	spf=pass (imf18.hostedemail.com: domain of libaokun1@huawei.com designates 45.249.212.187 as permitted sender) smtp.mailfrom=libaokun1@huawei.com
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com;
	s=arc-20220608; t=1702386984;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=hhkT8JeexRNpBGO+G2bqhqjR470h9mzzjgApW1U30sY=;
	b=Yy9u8d2/gr2vaNbBnd276NQJ06uYqNVyjV4hWdzmplF/ipb5hxuC2U9AIYcrfQPQWZTlco
	hdq34gEd0nZxLphNTdR7dkZOCvVHV1N94Tqk6J5DrcEuH6KA8IloeYmtximqcn9K6ZtXbb
	vUg+DSSHV8BTPOAQDKiPJLKAc5x0W54=
ARC-Authentication-Results: i=1;
	imf18.hostedemail.com;
	dkim=none;
	dmarc=pass (policy=quarantine) header.from=huawei.com;
	spf=pass (imf18.hostedemail.com: domain of libaokun1@huawei.com designates 45.249.212.187 as permitted sender) smtp.mailfrom=libaokun1@huawei.com
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1702386984; a=rsa-sha256;
	cv=none;
	b=r5Jd46tvrKJjsY0IAypV/KfrLhIBEBF983BEpIQpodW3WGHG2A8Ko55A06YAUqTwkIfVWd
	aQBnuuf5GxRAPxwjqsRloBviUGai06Sy2rdrwZkyQZeljteub0GWTsuBKAx7PD7EuE6Rxk
	tr5jcDOzGEqtzXhGNv4mTfuezLkkoWQ=
Received: from mail.maildlp.com (unknown [172.19.163.174])
	by szxga01-in.huawei.com (SkyGuard) with ESMTP id 4SqJyC2ZFMzvRtJ;
	Tue, 12 Dec 2023 21:15:31 +0800 (CST)
Received: from dggpeml500021.china.huawei.com (unknown [7.185.36.21])
	by mail.maildlp.com (Postfix) with ESMTPS id 440D414044F;
	Tue, 12 Dec 2023 21:16:17 +0800 (CST)
Received: from [10.174.177.174] (10.174.177.174) by
 dggpeml500021.china.huawei.com (7.185.36.21) with Microsoft SMTP Server
 (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id
 15.1.2507.35; Tue, 12 Dec 2023 21:16:16 +0800
Content-Type: multipart/alternative;
	boundary="------------9tulGclcaU04ie3wDXE5gE60"
Message-ID: <9fdebd0a-ac10-e193-b245-7678fa708c82@huawei.com>
Date: Tue, 12 Dec 2023 21:16:16 +0800
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
 Thunderbird/102.1.2
Subject: Re: [RFC PATCH] mm/filemap: avoid buffered read/write race to read
 inconsistent data
Content-Language: en-US
To: Jan Kara <jack@suse.cz>
CC: <linux-mm@kvack.org>, <linux-ext4@vger.kernel.org>, <tytso@mit.edu>,
	<adilger.kernel@dilger.ca>, <willy@infradead.org>,
	<akpm@linux-foundation.org>, <david@fromorbit.com>, <hch@infradead.org>,
	<ritesh.list@gmail.com>, <linux-kernel@vger.kernel.org>,
	<yi.zhang@huawei.com>, <yangerkun@huawei.com>, <yukuai3@huawei.com>,
	<stable@kernel.org>, Baokun Li <libaokun1@huawei.com>
References: <20231212093634.2464108-1-libaokun1@huawei.com>
 <20231212124157.ew6q6jp2wsezvqzd@quack3>
From: Baokun Li <libaokun1@huawei.com>
In-Reply-To: <20231212124157.ew6q6jp2wsezvqzd@quack3>
X-Originating-IP: [10.174.177.174]
X-ClientProxiedBy: dggems701-chm.china.huawei.com (10.3.19.178) To
 dggpeml500021.china.huawei.com (7.185.36.21)
X-Rspamd-Queue-Id: 3169D1C0010
X-Rspam-User: 
X-Rspamd-Server: rspam05
X-Stat-Signature: 9y5e9a71w37m7g3ewskdquet9cwhpwoi
X-HE-Tag: 1702386981-100582
X-HE-Meta: U2FsdGVkX18f1SxJUw74kCAYA/EiUGDd2LpfgC8p5u4R6z9MY4vfuidi6hMtkDrbI47AoL3H8EsccRIYPfh63hzngWqsJnFS5qmy6o7SVgnVV6Y6wQUnkI4o2e4LxeTKkbj/2ckfJNBfzAyMF76N2Q5DrA8/V7C6KkhtL91BLQG43xPnAA8uDOeyVmcn4hWM2vPEKUAopLzdTF5VT0f5WzxI+W8HpSyTASRsW1nbzTKZ7ukRAEN35N5Xx9Tv1dRZW4AEiTJFdVeUB+vdlV/vRwp4tr5n3x4/Y+zDoE+mprP6QqkYXUqZNFhP1jXa1xpkI/cVnEvRx8JlXRofqSONj0hBNUVpIFvr95IKfmpSeR8kqvsitVtaWbOv8t1qkj1drsEXMjGYgUVp1txbdnusOAN7unnHy7ICBSkSqLiZ9QncxZMX6M/O61Mj8vR2bFGd6J3cl3MxgiCCph9Dl+JTENPHMlkqFAfGuuWSRTIyJ8Hv2OMVrBX9GjN7LCC/xQJOG2AyqPahE+7L21EdAp/PFKwI3OIwzsHfXl5MlQlIqBelk2A5DRJ0JPDZlIVJa3OIktqPitoqp69WLkI5lrF9/4qOE0BqyO9Mj6dxawdzIGG9ZqPcOG61UHaXXJf5vZLFEBB4IhBhh0aaTH2e2HkuapQCkn6sUx0Nc3pzZ4Q51ymap+McciyQ/uCjnmaNM770fkipSGologXO+B2SLtJvI93dsSAA6lkKe9RuryOSEmtv8lsofsJP4f362R6l64haxB1BjYokz9e4WAyGgJ+eUCXhts1+ctqYPtY0HVcC1ecauVfp1nH91sA6JbkAOGNPE/EUj5EBOLitByB5XvvmfI/o64cOUIZGsFSiCYvGdSVwL4gv/nE5F/jPPPG3b+H77Io84avVuCWmQUUdGrWFCXxsJCA3o8WDqYw4cOsb9ZPsI0Delu9oVmMKm/88qQ+kfMMnyuaSwWI4Q5zZG/4
 B6Rm0zWi
 T/koBNpaQpy6i5ym0PsCt5JYDJFdcKokOKIRoMdr+I+xqvQ7jbVKuDj1Z2/tUJlRIXMC5Shp+Kj26t7YFIj8IzUx9fOlOXwK+xlQqOFDu5YuFEZNG8o8QosmpXAuWmre4oaCwxd2BbvTv0zuX5tcm6C49ohe/Vf+wf6jb6DzQXrSH+3cFyAkxDmQPhsWOb656huN2jJ/1s8n7cm0Yqo/7mbfrgfa5ji0TS5iBXi5cxY0VaaAZJB1QtD623cb01C7h7pVQQXmCsI20L0iVEpjXdORWNYfTlWVY3uJNK+31ZrLTSYL8ICI7qbw8yLFzUbe6GXwdxE3A8TlvTgKr1i0bia7sCkP0Wn873PHpxNxpyTuR6dN3QaGeBSDR51SIqnqU310SRS4ecDNW3G8QcjhTSygslDGrpMAb4hlpQa3C1D7uYKc=
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

--------------9tulGclcaU04ie3wDXE5gE60
Content-Type: text/plain; charset="UTF-8"; format=flowed
Content-Transfer-Encoding: 8bit

On 2023/12/12 20:41, Jan Kara wrote:
> On Tue 12-12-23 17:36:34, Baokun Li wrote:
>> The following concurrency may cause the data read to be inconsistent with
>> the data on disk:
>>
>>               cpu1                           cpu2
>> ------------------------------|------------------------------
>>                                 // Buffered write 2048 from 0
>>                                 ext4_buffered_write_iter
>>                                  generic_perform_write
>>                                   copy_page_from_iter_atomic
>>                                   ext4_da_write_end
>>                                    ext4_da_do_write_end
>>                                     block_write_end
>>                                      __block_commit_write
>>                                       folio_mark_uptodate
>> // Buffered read 4096 from 0          smp_wmb()
>> ext4_file_read_iter                   set_bit(PG_uptodate, folio_flags)
>>   generic_file_read_iter            i_size_write // 2048
>>    filemap_read                     unlock_page(page)
>>     filemap_get_pages
>>      filemap_get_read_batch
>>      folio_test_uptodate(folio)
>>       ret = test_bit(PG_uptodate, folio_flags)
>>       if (ret)
>>        smp_rmb();
>>        // Ensure that the data in page 0-2048 is up-to-date.
>>
>>                                 // New buffered write 2048 from 2048
>>                                 ext4_buffered_write_iter
>>                                  generic_perform_write
>>                                   copy_page_from_iter_atomic
>>                                   ext4_da_write_end
>>                                    ext4_da_do_write_end
>>                                     block_write_end
>>                                      __block_commit_write
>>                                       folio_mark_uptodate
>>                                        smp_wmb()
>>                                        set_bit(PG_uptodate, folio_flags)
>>                                     i_size_write // 4096
>>                                     unlock_page(page)
>>
>>     isize = i_size_read(inode) // 4096
>>     // Read the latest isize 4096, but without smp_rmb(), there may be
>>     // Load-Load disorder resulting in the data in the 2048-4096 range
>>     // in the page is not up-to-date.
>>     copy_page_to_iter
>>     // copyout 4096
>>
>> In the concurrency above, we read the updated i_size, but there is no read
>> barrier to ensure that the data in the page is the same as the i_size at
>> this point, so we may copy the unsynchronized page out. Hence adding the
>> missing read memory barrier to fix this.
>>
>> This is a Load-Load reordering issue, which only occurs on some weak
>> mem-ordering architectures (e.g. ARM64, ALPHA), but not on strong
>> mem-ordering architectures (e.g. X86). And theoretically the problem
> AFAIK x86 can also reorder loads vs loads so the problem can in theory
> happen on x86 as well.

According to what I read in the /perfbook /at the link below,

  Loads Reordered After Loads does not happen on x86.

pdf sheet 562 corresponds to page 550,

    Table 15.5: Summary of Memory Ordering

https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook-1c.2023.06.11a.pdf

>> doesn't only happen on ext4, filesystems that call filemap_read() but
>> don't hold inode lock (e.g. btrfs, f2fs, ubifs ...) will have this
>> problem, while filesystems with inode lock (e.g. xfs, nfs) won't have
>> this problem.
>>
>> Cc:stable@kernel.org
>> Signed-off-by: Baokun Li<libaokun1@huawei.com>
>> ---
>>   mm/filemap.c | 3 +++
>>   1 file changed, 3 insertions(+)
>>
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 71f00539ac00..6324e2ac3e74 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -2607,6 +2607,9 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
>>   			goto put_folios;
>>   		end_offset = min_t(loff_t, isize, iocb->ki_pos + iter->count);
>>   
>> +		/* Ensure that the page cache within isize is updated. */
> Barries have to be in pairs to work and it is a good practice to document
> this. So here I'd have comment like:
> 		/*
> 		 * Pairs with a barrier in
> 		 * block_write_end()->mark_buffer_dirty() or other page
> 		 * dirtying routines like iomap_write_end() to ensure
> 		 * changes to page contents are visible before we see
> 		 * increased inode size.
> 		 */
>
> 								Honza

That's a very accurate description! Thanks a lot!

I will add this comment in the next version.

>> +		smp_rmb();
>> +
>>   		/*
>>   		 * Once we start copying data, we don't want to be touching any
>>   		 * cachelines that might be contended:
>> -- 
>> 2.31.1
>>

Thanks!
-- 
With Best Regards,
Baokun Li
.
--------------9tulGclcaU04ie3wDXE5gE60
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: 8bit

<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body>
    <div class="moz-cite-prefix">On 2023/12/12 20:41, Jan Kara wrote:<br>
    </div>
    <blockquote type="cite"
      cite="mid:20231212124157.ew6q6jp2wsezvqzd@quack3">
      <pre class="moz-quote-pre" wrap="">On Tue 12-12-23 17:36:34, Baokun Li wrote:
</pre>
      <blockquote type="cite">
        <pre class="moz-quote-pre" wrap="">The following concurrency may cause the data read to be inconsistent with
the data on disk:

             cpu1                           cpu2
------------------------------|------------------------------
                               // Buffered write 2048 from 0
                               ext4_buffered_write_iter
                                generic_perform_write
                                 copy_page_from_iter_atomic
                                 ext4_da_write_end
                                  ext4_da_do_write_end
                                   block_write_end
                                    __block_commit_write
                                     folio_mark_uptodate
// Buffered read 4096 from 0          smp_wmb()
ext4_file_read_iter                   set_bit(PG_uptodate, folio_flags)
 generic_file_read_iter            i_size_write // 2048
  filemap_read                     unlock_page(page)
   filemap_get_pages
    filemap_get_read_batch
    folio_test_uptodate(folio)
     ret = test_bit(PG_uptodate, folio_flags)
     if (ret)
      smp_rmb();
      // Ensure that the data in page 0-2048 is up-to-date.

                               // New buffered write 2048 from 2048
                               ext4_buffered_write_iter
                                generic_perform_write
                                 copy_page_from_iter_atomic
                                 ext4_da_write_end
                                  ext4_da_do_write_end
                                   block_write_end
                                    __block_commit_write
                                     folio_mark_uptodate
                                      smp_wmb()
                                      set_bit(PG_uptodate, folio_flags)
                                   i_size_write // 4096
                                   unlock_page(page)

   isize = i_size_read(inode) // 4096
   // Read the latest isize 4096, but without smp_rmb(), there may be
   // Load-Load disorder resulting in the data in the 2048-4096 range
   // in the page is not up-to-date.
   copy_page_to_iter
   // copyout 4096

In the concurrency above, we read the updated i_size, but there is no read
barrier to ensure that the data in the page is the same as the i_size at
this point, so we may copy the unsynchronized page out. Hence adding the
missing read memory barrier to fix this.

This is a Load-Load reordering issue, which only occurs on some weak
mem-ordering architectures (e.g. ARM64, ALPHA), but not on strong
mem-ordering architectures (e.g. X86). And theoretically the problem
</pre>
      </blockquote>
      <pre class="moz-quote-pre" wrap="">
AFAIK x86 can also reorder loads vs loads so the problem can in theory
happen on x86 as well.
</pre>
    </blockquote>
    <p>According to what I read in the <i>perfbook </i>at the link
      below,</p>
    <p> Loads Reordered After Loads does not happen on x86.</p>
    <p>pdf sheet 562 corresponds to page 550, <br>
    </p>
    <p>   Table 15.5: Summary of Memory Ordering<br>
    </p>
    <p><a class="moz-txt-link-freetext" href="https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook-1c.2023.06.11a.pdf">https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook-1c.2023.06.11a.pdf</a><br>
    </p>
    <blockquote type="cite"
      cite="mid:20231212124157.ew6q6jp2wsezvqzd@quack3">
      <pre class="moz-quote-pre" wrap="">
</pre>
      <blockquote type="cite">
        <pre class="moz-quote-pre" wrap="">doesn't only happen on ext4, filesystems that call filemap_read() but
don't hold inode lock (e.g. btrfs, f2fs, ubifs ...) will have this
problem, while filesystems with inode lock (e.g. xfs, nfs) won't have
this problem.

Cc: <a class="moz-txt-link-abbreviated" href="mailto:stable@kernel.org">stable@kernel.org</a>
Signed-off-by: Baokun Li <a class="moz-txt-link-rfc2396E" href="mailto:libaokun1@huawei.com">&lt;libaokun1@huawei.com&gt;</a>
---
 mm/filemap.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 71f00539ac00..6324e2ac3e74 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2607,6 +2607,9 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
 			goto put_folios;
 		end_offset = min_t(loff_t, isize, iocb-&gt;ki_pos + iter-&gt;count);
 
+		/* Ensure that the page cache within isize is updated. */
</pre>
      </blockquote>
      <pre class="moz-quote-pre" wrap="">
Barries have to be in pairs to work and it is a good practice to document
this. So here I'd have comment like:
		/*
		 * Pairs with a barrier in
		 * block_write_end()-&gt;mark_buffer_dirty() or other page
		 * dirtying routines like iomap_write_end() to ensure
		 * changes to page contents are visible before we see
		 * increased inode size.
		 */

								Honza
</pre>
    </blockquote>
    <p>That's a very accurate description! Thanks a lot!<br>
    </p>
    <p>I will add this comment in the next version.<br>
    </p>
    <blockquote type="cite"
      cite="mid:20231212124157.ew6q6jp2wsezvqzd@quack3">
      <pre class="moz-quote-pre" wrap="">
</pre>
      <blockquote type="cite">
        <pre class="moz-quote-pre" wrap="">+		smp_rmb();
+
 		/*
 		 * Once we start copying data, we don't want to be touching any
 		 * cachelines that might be contended:
-- 
2.31.1

</pre>
      </blockquote>
    </blockquote>
    <p><br>
    </p>
    Thanks!
    <br>
    <span class="moz-txt-tag">-- </span><br>
    <span class="moz-txt-tag"></span>With Best Regards,
    <br>
    Baokun Li
    <br>
    .
  </body>
</html>

--------------9tulGclcaU04ie3wDXE5gE60--