From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id C82A1C4167B for ; Tue, 12 Dec 2023 13:16:27 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 0E8936B02CE; Tue, 12 Dec 2023 08:16:27 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 09A086B02CF; Tue, 12 Dec 2023 08:16:27 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id EA25A6B02D0; Tue, 12 Dec 2023 08:16:26 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0013.hostedemail.com [216.40.44.13]) by kanga.kvack.org (Postfix) with ESMTP id D47106B02CE for ; Tue, 12 Dec 2023 08:16:26 -0500 (EST) Received: from smtpin11.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay03.hostedemail.com (Postfix) with ESMTP id A5D92A0999 for ; Tue, 12 Dec 2023 13:16:26 +0000 (UTC) X-FDA: 81558215172.11.005424E Received: from szxga01-in.huawei.com (szxga01-in.huawei.com [45.249.212.187]) by imf18.hostedemail.com (Postfix) with ESMTP id 3169D1C0010 for ; Tue, 12 Dec 2023 13:16:21 +0000 (UTC) Authentication-Results: imf18.hostedemail.com; dkim=none; dmarc=pass (policy=quarantine) header.from=huawei.com; spf=pass (imf18.hostedemail.com: domain of libaokun1@huawei.com designates 45.249.212.187 as permitted sender) smtp.mailfrom=libaokun1@huawei.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1702386984; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=hhkT8JeexRNpBGO+G2bqhqjR470h9mzzjgApW1U30sY=; b=Yy9u8d2/gr2vaNbBnd276NQJ06uYqNVyjV4hWdzmplF/ipb5hxuC2U9AIYcrfQPQWZTlco hdq34gEd0nZxLphNTdR7dkZOCvVHV1N94Tqk6J5DrcEuH6KA8IloeYmtximqcn9K6ZtXbb vUg+DSSHV8BTPOAQDKiPJLKAc5x0W54= ARC-Authentication-Results: i=1; imf18.hostedemail.com; dkim=none; dmarc=pass (policy=quarantine) header.from=huawei.com; spf=pass (imf18.hostedemail.com: domain of libaokun1@huawei.com designates 45.249.212.187 as permitted sender) smtp.mailfrom=libaokun1@huawei.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1702386984; a=rsa-sha256; cv=none; b=r5Jd46tvrKJjsY0IAypV/KfrLhIBEBF983BEpIQpodW3WGHG2A8Ko55A06YAUqTwkIfVWd aQBnuuf5GxRAPxwjqsRloBviUGai06Sy2rdrwZkyQZeljteub0GWTsuBKAx7PD7EuE6Rxk tr5jcDOzGEqtzXhGNv4mTfuezLkkoWQ= Received: from mail.maildlp.com (unknown [172.19.163.174]) by szxga01-in.huawei.com (SkyGuard) with ESMTP id 4SqJyC2ZFMzvRtJ; Tue, 12 Dec 2023 21:15:31 +0800 (CST) Received: from dggpeml500021.china.huawei.com (unknown [7.185.36.21]) by mail.maildlp.com (Postfix) with ESMTPS id 440D414044F; Tue, 12 Dec 2023 21:16:17 +0800 (CST) Received: from [10.174.177.174] (10.174.177.174) by dggpeml500021.china.huawei.com (7.185.36.21) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2507.35; Tue, 12 Dec 2023 21:16:16 +0800 Content-Type: multipart/alternative; boundary="------------9tulGclcaU04ie3wDXE5gE60" Message-ID: <9fdebd0a-ac10-e193-b245-7678fa708c82@huawei.com> Date: Tue, 12 Dec 2023 21:16:16 +0800 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Thunderbird/102.1.2 Subject: Re: [RFC PATCH] mm/filemap: avoid buffered read/write race to read inconsistent data Content-Language: en-US To: Jan Kara CC: , , , , , , , , , , , , , , Baokun Li References: <20231212093634.2464108-1-libaokun1@huawei.com> <20231212124157.ew6q6jp2wsezvqzd@quack3> From: Baokun Li In-Reply-To: <20231212124157.ew6q6jp2wsezvqzd@quack3> X-Originating-IP: [10.174.177.174] X-ClientProxiedBy: dggems701-chm.china.huawei.com (10.3.19.178) To dggpeml500021.china.huawei.com (7.185.36.21) X-Rspamd-Queue-Id: 3169D1C0010 X-Rspam-User: X-Rspamd-Server: rspam05 X-Stat-Signature: 9y5e9a71w37m7g3ewskdquet9cwhpwoi X-HE-Tag: 1702386981-100582 X-HE-Meta: U2FsdGVkX18f1SxJUw74kCAYA/EiUGDd2LpfgC8p5u4R6z9MY4vfuidi6hMtkDrbI47AoL3H8EsccRIYPfh63hzngWqsJnFS5qmy6o7SVgnVV6Y6wQUnkI4o2e4LxeTKkbj/2ckfJNBfzAyMF76N2Q5DrA8/V7C6KkhtL91BLQG43xPnAA8uDOeyVmcn4hWM2vPEKUAopLzdTF5VT0f5WzxI+W8HpSyTASRsW1nbzTKZ7ukRAEN35N5Xx9Tv1dRZW4AEiTJFdVeUB+vdlV/vRwp4tr5n3x4/Y+zDoE+mprP6QqkYXUqZNFhP1jXa1xpkI/cVnEvRx8JlXRofqSONj0hBNUVpIFvr95IKfmpSeR8kqvsitVtaWbOv8t1qkj1drsEXMjGYgUVp1txbdnusOAN7unnHy7ICBSkSqLiZ9QncxZMX6M/O61Mj8vR2bFGd6J3cl3MxgiCCph9Dl+JTENPHMlkqFAfGuuWSRTIyJ8Hv2OMVrBX9GjN7LCC/xQJOG2AyqPahE+7L21EdAp/PFKwI3OIwzsHfXl5MlQlIqBelk2A5DRJ0JPDZlIVJa3OIktqPitoqp69WLkI5lrF9/4qOE0BqyO9Mj6dxawdzIGG9ZqPcOG61UHaXXJf5vZLFEBB4IhBhh0aaTH2e2HkuapQCkn6sUx0Nc3pzZ4Q51ymap+McciyQ/uCjnmaNM770fkipSGologXO+B2SLtJvI93dsSAA6lkKe9RuryOSEmtv8lsofsJP4f362R6l64haxB1BjYokz9e4WAyGgJ+eUCXhts1+ctqYPtY0HVcC1ecauVfp1nH91sA6JbkAOGNPE/EUj5EBOLitByB5XvvmfI/o64cOUIZGsFSiCYvGdSVwL4gv/nE5F/jPPPG3b+H77Io84avVuCWmQUUdGrWFCXxsJCA3o8WDqYw4cOsb9ZPsI0Delu9oVmMKm/88qQ+kfMMnyuaSwWI4Q5zZG/4 B6Rm0zWi T/koBNpaQpy6i5ym0PsCt5JYDJFdcKokOKIRoMdr+I+xqvQ7jbVKuDj1Z2/tUJlRIXMC5Shp+Kj26t7YFIj8IzUx9fOlOXwK+xlQqOFDu5YuFEZNG8o8QosmpXAuWmre4oaCwxd2BbvTv0zuX5tcm6C49ohe/Vf+wf6jb6DzQXrSH+3cFyAkxDmQPhsWOb656huN2jJ/1s8n7cm0Yqo/7mbfrgfa5ji0TS5iBXi5cxY0VaaAZJB1QtD623cb01C7h7pVQQXmCsI20L0iVEpjXdORWNYfTlWVY3uJNK+31ZrLTSYL8ICI7qbw8yLFzUbe6GXwdxE3A8TlvTgKr1i0bia7sCkP0Wn873PHpxNxpyTuR6dN3QaGeBSDR51SIqnqU310SRS4ecDNW3G8QcjhTSygslDGrpMAb4hlpQa3C1D7uYKc= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: --------------9tulGclcaU04ie3wDXE5gE60 Content-Type: text/plain; charset="UTF-8"; format=flowed Content-Transfer-Encoding: 8bit On 2023/12/12 20:41, Jan Kara wrote: > On Tue 12-12-23 17:36:34, Baokun Li wrote: >> The following concurrency may cause the data read to be inconsistent with >> the data on disk: >> >> cpu1 cpu2 >> ------------------------------|------------------------------ >> // Buffered write 2048 from 0 >> ext4_buffered_write_iter >> generic_perform_write >> copy_page_from_iter_atomic >> ext4_da_write_end >> ext4_da_do_write_end >> block_write_end >> __block_commit_write >> folio_mark_uptodate >> // Buffered read 4096 from 0 smp_wmb() >> ext4_file_read_iter set_bit(PG_uptodate, folio_flags) >> generic_file_read_iter i_size_write // 2048 >> filemap_read unlock_page(page) >> filemap_get_pages >> filemap_get_read_batch >> folio_test_uptodate(folio) >> ret = test_bit(PG_uptodate, folio_flags) >> if (ret) >> smp_rmb(); >> // Ensure that the data in page 0-2048 is up-to-date. >> >> // New buffered write 2048 from 2048 >> ext4_buffered_write_iter >> generic_perform_write >> copy_page_from_iter_atomic >> ext4_da_write_end >> ext4_da_do_write_end >> block_write_end >> __block_commit_write >> folio_mark_uptodate >> smp_wmb() >> set_bit(PG_uptodate, folio_flags) >> i_size_write // 4096 >> unlock_page(page) >> >> isize = i_size_read(inode) // 4096 >> // Read the latest isize 4096, but without smp_rmb(), there may be >> // Load-Load disorder resulting in the data in the 2048-4096 range >> // in the page is not up-to-date. >> copy_page_to_iter >> // copyout 4096 >> >> In the concurrency above, we read the updated i_size, but there is no read >> barrier to ensure that the data in the page is the same as the i_size at >> this point, so we may copy the unsynchronized page out. Hence adding the >> missing read memory barrier to fix this. >> >> This is a Load-Load reordering issue, which only occurs on some weak >> mem-ordering architectures (e.g. ARM64, ALPHA), but not on strong >> mem-ordering architectures (e.g. X86). And theoretically the problem > AFAIK x86 can also reorder loads vs loads so the problem can in theory > happen on x86 as well. According to what I read in the /perfbook /at the link below,  Loads Reordered After Loads does not happen on x86. pdf sheet 562 corresponds to page 550,    Table 15.5: Summary of Memory Ordering https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook-1c.2023.06.11a.pdf >> doesn't only happen on ext4, filesystems that call filemap_read() but >> don't hold inode lock (e.g. btrfs, f2fs, ubifs ...) will have this >> problem, while filesystems with inode lock (e.g. xfs, nfs) won't have >> this problem. >> >> Cc:stable@kernel.org >> Signed-off-by: Baokun Li >> --- >> mm/filemap.c | 3 +++ >> 1 file changed, 3 insertions(+) >> >> diff --git a/mm/filemap.c b/mm/filemap.c >> index 71f00539ac00..6324e2ac3e74 100644 >> --- a/mm/filemap.c >> +++ b/mm/filemap.c >> @@ -2607,6 +2607,9 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter, >> goto put_folios; >> end_offset = min_t(loff_t, isize, iocb->ki_pos + iter->count); >> >> + /* Ensure that the page cache within isize is updated. */ > Barries have to be in pairs to work and it is a good practice to document > this. So here I'd have comment like: > /* > * Pairs with a barrier in > * block_write_end()->mark_buffer_dirty() or other page > * dirtying routines like iomap_write_end() to ensure > * changes to page contents are visible before we see > * increased inode size. > */ > > Honza That's a very accurate description! Thanks a lot! I will add this comment in the next version. >> + smp_rmb(); >> + >> /* >> * Once we start copying data, we don't want to be touching any >> * cachelines that might be contended: >> -- >> 2.31.1 >> Thanks! -- With Best Regards, Baokun Li . --------------9tulGclcaU04ie3wDXE5gE60 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: 8bit
On 2023/12/12 20:41, Jan Kara wrote:
On Tue 12-12-23 17:36:34, Baokun Li wrote:
The following concurrency may cause the data read to be inconsistent with
the data on disk:

             cpu1                           cpu2
------------------------------|------------------------------
                               // Buffered write 2048 from 0
                               ext4_buffered_write_iter
                                generic_perform_write
                                 copy_page_from_iter_atomic
                                 ext4_da_write_end
                                  ext4_da_do_write_end
                                   block_write_end
                                    __block_commit_write
                                     folio_mark_uptodate
// Buffered read 4096 from 0          smp_wmb()
ext4_file_read_iter                   set_bit(PG_uptodate, folio_flags)
 generic_file_read_iter            i_size_write // 2048
  filemap_read                     unlock_page(page)
   filemap_get_pages
    filemap_get_read_batch
    folio_test_uptodate(folio)
     ret = test_bit(PG_uptodate, folio_flags)
     if (ret)
      smp_rmb();
      // Ensure that the data in page 0-2048 is up-to-date.

                               // New buffered write 2048 from 2048
                               ext4_buffered_write_iter
                                generic_perform_write
                                 copy_page_from_iter_atomic
                                 ext4_da_write_end
                                  ext4_da_do_write_end
                                   block_write_end
                                    __block_commit_write
                                     folio_mark_uptodate
                                      smp_wmb()
                                      set_bit(PG_uptodate, folio_flags)
                                   i_size_write // 4096
                                   unlock_page(page)

   isize = i_size_read(inode) // 4096
   // Read the latest isize 4096, but without smp_rmb(), there may be
   // Load-Load disorder resulting in the data in the 2048-4096 range
   // in the page is not up-to-date.
   copy_page_to_iter
   // copyout 4096

In the concurrency above, we read the updated i_size, but there is no read
barrier to ensure that the data in the page is the same as the i_size at
this point, so we may copy the unsynchronized page out. Hence adding the
missing read memory barrier to fix this.

This is a Load-Load reordering issue, which only occurs on some weak
mem-ordering architectures (e.g. ARM64, ALPHA), but not on strong
mem-ordering architectures (e.g. X86). And theoretically the problem
AFAIK x86 can also reorder loads vs loads so the problem can in theory
happen on x86 as well.

According to what I read in the perfbook at the link below,

 Loads Reordered After Loads does not happen on x86.

pdf sheet 562 corresponds to page 550,

   Table 15.5: Summary of Memory Ordering

https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook-1c.2023.06.11a.pdf


      
doesn't only happen on ext4, filesystems that call filemap_read() but
don't hold inode lock (e.g. btrfs, f2fs, ubifs ...) will have this
problem, while filesystems with inode lock (e.g. xfs, nfs) won't have
this problem.

Cc: stable@kernel.org
Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
 mm/filemap.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 71f00539ac00..6324e2ac3e74 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2607,6 +2607,9 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
 			goto put_folios;
 		end_offset = min_t(loff_t, isize, iocb->ki_pos + iter->count);
 
+		/* Ensure that the page cache within isize is updated. */
Barries have to be in pairs to work and it is a good practice to document
this. So here I'd have comment like:
		/*
		 * Pairs with a barrier in
		 * block_write_end()->mark_buffer_dirty() or other page
		 * dirtying routines like iomap_write_end() to ensure
		 * changes to page contents are visible before we see
		 * increased inode size.
		 */

								Honza

That's a very accurate description! Thanks a lot!

I will add this comment in the next version.


      
+		smp_rmb();
+
 		/*
 		 * Once we start copying data, we don't want to be touching any
 		 * cachelines that might be contended:
-- 
2.31.1


Thanks!
-- 
With Best Regards,
Baokun Li
. --------------9tulGclcaU04ie3wDXE5gE60--