* [RFC PATCH] mm/filemap: avoid buffered read/write race to read inconsistent data
@ 2023-12-12 9:36 Baokun Li
2023-12-12 12:41 ` Jan Kara
0 siblings, 1 reply; 5+ messages in thread
From: Baokun Li @ 2023-12-12 9:36 UTC (permalink / raw)
To: linux-mm, linux-ext4
Cc: tytso, adilger.kernel, jack, willy, akpm, david, hch,
ritesh.list, linux-kernel, yi.zhang, yangerkun, yukuai3,
libaokun1, stable
The following concurrency may cause the data read to be inconsistent with
the data on disk:
cpu1 cpu2
------------------------------|------------------------------
// Buffered write 2048 from 0
ext4_buffered_write_iter
generic_perform_write
copy_page_from_iter_atomic
ext4_da_write_end
ext4_da_do_write_end
block_write_end
__block_commit_write
folio_mark_uptodate
// Buffered read 4096 from 0 smp_wmb()
ext4_file_read_iter set_bit(PG_uptodate, folio_flags)
generic_file_read_iter i_size_write // 2048
filemap_read unlock_page(page)
filemap_get_pages
filemap_get_read_batch
folio_test_uptodate(folio)
ret = test_bit(PG_uptodate, folio_flags)
if (ret)
smp_rmb();
// Ensure that the data in page 0-2048 is up-to-date.
// New buffered write 2048 from 2048
ext4_buffered_write_iter
generic_perform_write
copy_page_from_iter_atomic
ext4_da_write_end
ext4_da_do_write_end
block_write_end
__block_commit_write
folio_mark_uptodate
smp_wmb()
set_bit(PG_uptodate, folio_flags)
i_size_write // 4096
unlock_page(page)
isize = i_size_read(inode) // 4096
// Read the latest isize 4096, but without smp_rmb(), there may be
// Load-Load disorder resulting in the data in the 2048-4096 range
// in the page is not up-to-date.
copy_page_to_iter
// copyout 4096
In the concurrency above, we read the updated i_size, but there is no read
barrier to ensure that the data in the page is the same as the i_size at
this point, so we may copy the unsynchronized page out. Hence adding the
missing read memory barrier to fix this.
This is a Load-Load reordering issue, which only occurs on some weak
mem-ordering architectures (e.g. ARM64, ALPHA), but not on strong
mem-ordering architectures (e.g. X86). And theoretically the problem
doesn't only happen on ext4, filesystems that call filemap_read() but
don't hold inode lock (e.g. btrfs, f2fs, ubifs ...) will have this
problem, while filesystems with inode lock (e.g. xfs, nfs) won't have
this problem.
Cc: stable@kernel.org
Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
mm/filemap.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/mm/filemap.c b/mm/filemap.c
index 71f00539ac00..6324e2ac3e74 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2607,6 +2607,9 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
goto put_folios;
end_offset = min_t(loff_t, isize, iocb->ki_pos + iter->count);
+ /* Ensure that the page cache within isize is updated. */
+ smp_rmb();
+
/*
* Once we start copying data, we don't want to be touching any
* cachelines that might be contended:
--
2.31.1
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [RFC PATCH] mm/filemap: avoid buffered read/write race to read inconsistent data
2023-12-12 9:36 [RFC PATCH] mm/filemap: avoid buffered read/write race to read inconsistent data Baokun Li
@ 2023-12-12 12:41 ` Jan Kara
2023-12-12 13:16 ` Baokun Li
2023-12-12 13:21 ` Baokun Li
0 siblings, 2 replies; 5+ messages in thread
From: Jan Kara @ 2023-12-12 12:41 UTC (permalink / raw)
To: Baokun Li
Cc: linux-mm, linux-ext4, tytso, adilger.kernel, jack, willy, akpm,
david, hch, ritesh.list, linux-kernel, yi.zhang, yangerkun,
yukuai3, stable
On Tue 12-12-23 17:36:34, Baokun Li wrote:
> The following concurrency may cause the data read to be inconsistent with
> the data on disk:
>
> cpu1 cpu2
> ------------------------------|------------------------------
> // Buffered write 2048 from 0
> ext4_buffered_write_iter
> generic_perform_write
> copy_page_from_iter_atomic
> ext4_da_write_end
> ext4_da_do_write_end
> block_write_end
> __block_commit_write
> folio_mark_uptodate
> // Buffered read 4096 from 0 smp_wmb()
> ext4_file_read_iter set_bit(PG_uptodate, folio_flags)
> generic_file_read_iter i_size_write // 2048
> filemap_read unlock_page(page)
> filemap_get_pages
> filemap_get_read_batch
> folio_test_uptodate(folio)
> ret = test_bit(PG_uptodate, folio_flags)
> if (ret)
> smp_rmb();
> // Ensure that the data in page 0-2048 is up-to-date.
>
> // New buffered write 2048 from 2048
> ext4_buffered_write_iter
> generic_perform_write
> copy_page_from_iter_atomic
> ext4_da_write_end
> ext4_da_do_write_end
> block_write_end
> __block_commit_write
> folio_mark_uptodate
> smp_wmb()
> set_bit(PG_uptodate, folio_flags)
> i_size_write // 4096
> unlock_page(page)
>
> isize = i_size_read(inode) // 4096
> // Read the latest isize 4096, but without smp_rmb(), there may be
> // Load-Load disorder resulting in the data in the 2048-4096 range
> // in the page is not up-to-date.
> copy_page_to_iter
> // copyout 4096
>
> In the concurrency above, we read the updated i_size, but there is no read
> barrier to ensure that the data in the page is the same as the i_size at
> this point, so we may copy the unsynchronized page out. Hence adding the
> missing read memory barrier to fix this.
>
> This is a Load-Load reordering issue, which only occurs on some weak
> mem-ordering architectures (e.g. ARM64, ALPHA), but not on strong
> mem-ordering architectures (e.g. X86). And theoretically the problem
AFAIK x86 can also reorder loads vs loads so the problem can in theory
happen on x86 as well.
> doesn't only happen on ext4, filesystems that call filemap_read() but
> don't hold inode lock (e.g. btrfs, f2fs, ubifs ...) will have this
> problem, while filesystems with inode lock (e.g. xfs, nfs) won't have
> this problem.
>
> Cc: stable@kernel.org
> Signed-off-by: Baokun Li <libaokun1@huawei.com>
> ---
> mm/filemap.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 71f00539ac00..6324e2ac3e74 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2607,6 +2607,9 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
> goto put_folios;
> end_offset = min_t(loff_t, isize, iocb->ki_pos + iter->count);
>
> + /* Ensure that the page cache within isize is updated. */
Barries have to be in pairs to work and it is a good practice to document
this. So here I'd have comment like:
/*
* Pairs with a barrier in
* block_write_end()->mark_buffer_dirty() or other page
* dirtying routines like iomap_write_end() to ensure
* changes to page contents are visible before we see
* increased inode size.
*/
Honza
> + smp_rmb();
> +
> /*
> * Once we start copying data, we don't want to be touching any
> * cachelines that might be contended:
> --
> 2.31.1
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [RFC PATCH] mm/filemap: avoid buffered read/write race to read inconsistent data
2023-12-12 12:41 ` Jan Kara
@ 2023-12-12 13:16 ` Baokun Li
2023-12-12 13:37 ` Jan Kara
2023-12-12 13:21 ` Baokun Li
1 sibling, 1 reply; 5+ messages in thread
From: Baokun Li @ 2023-12-12 13:16 UTC (permalink / raw)
To: Jan Kara
Cc: linux-mm, linux-ext4, tytso, adilger.kernel, willy, akpm, david,
hch, ritesh.list, linux-kernel, yi.zhang, yangerkun, yukuai3,
stable, Baokun Li
[-- Attachment #1: Type: text/plain, Size: 4762 bytes --]
On 2023/12/12 20:41, Jan Kara wrote:
> On Tue 12-12-23 17:36:34, Baokun Li wrote:
>> The following concurrency may cause the data read to be inconsistent with
>> the data on disk:
>>
>> cpu1 cpu2
>> ------------------------------|------------------------------
>> // Buffered write 2048 from 0
>> ext4_buffered_write_iter
>> generic_perform_write
>> copy_page_from_iter_atomic
>> ext4_da_write_end
>> ext4_da_do_write_end
>> block_write_end
>> __block_commit_write
>> folio_mark_uptodate
>> // Buffered read 4096 from 0 smp_wmb()
>> ext4_file_read_iter set_bit(PG_uptodate, folio_flags)
>> generic_file_read_iter i_size_write // 2048
>> filemap_read unlock_page(page)
>> filemap_get_pages
>> filemap_get_read_batch
>> folio_test_uptodate(folio)
>> ret = test_bit(PG_uptodate, folio_flags)
>> if (ret)
>> smp_rmb();
>> // Ensure that the data in page 0-2048 is up-to-date.
>>
>> // New buffered write 2048 from 2048
>> ext4_buffered_write_iter
>> generic_perform_write
>> copy_page_from_iter_atomic
>> ext4_da_write_end
>> ext4_da_do_write_end
>> block_write_end
>> __block_commit_write
>> folio_mark_uptodate
>> smp_wmb()
>> set_bit(PG_uptodate, folio_flags)
>> i_size_write // 4096
>> unlock_page(page)
>>
>> isize = i_size_read(inode) // 4096
>> // Read the latest isize 4096, but without smp_rmb(), there may be
>> // Load-Load disorder resulting in the data in the 2048-4096 range
>> // in the page is not up-to-date.
>> copy_page_to_iter
>> // copyout 4096
>>
>> In the concurrency above, we read the updated i_size, but there is no read
>> barrier to ensure that the data in the page is the same as the i_size at
>> this point, so we may copy the unsynchronized page out. Hence adding the
>> missing read memory barrier to fix this.
>>
>> This is a Load-Load reordering issue, which only occurs on some weak
>> mem-ordering architectures (e.g. ARM64, ALPHA), but not on strong
>> mem-ordering architectures (e.g. X86). And theoretically the problem
> AFAIK x86 can also reorder loads vs loads so the problem can in theory
> happen on x86 as well.
According to what I read in the /perfbook /at the link below,
Loads Reordered After Loads does not happen on x86.
pdf sheet 562 corresponds to page 550,
Table 15.5: Summary of Memory Ordering
https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook-1c.2023.06.11a.pdf
>> doesn't only happen on ext4, filesystems that call filemap_read() but
>> don't hold inode lock (e.g. btrfs, f2fs, ubifs ...) will have this
>> problem, while filesystems with inode lock (e.g. xfs, nfs) won't have
>> this problem.
>>
>> Cc:stable@kernel.org
>> Signed-off-by: Baokun Li<libaokun1@huawei.com>
>> ---
>> mm/filemap.c | 3 +++
>> 1 file changed, 3 insertions(+)
>>
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 71f00539ac00..6324e2ac3e74 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -2607,6 +2607,9 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
>> goto put_folios;
>> end_offset = min_t(loff_t, isize, iocb->ki_pos + iter->count);
>>
>> + /* Ensure that the page cache within isize is updated. */
> Barries have to be in pairs to work and it is a good practice to document
> this. So here I'd have comment like:
> /*
> * Pairs with a barrier in
> * block_write_end()->mark_buffer_dirty() or other page
> * dirtying routines like iomap_write_end() to ensure
> * changes to page contents are visible before we see
> * increased inode size.
> */
>
> Honza
That's a very accurate description! Thanks a lot!
I will add this comment in the next version.
>> + smp_rmb();
>> +
>> /*
>> * Once we start copying data, we don't want to be touching any
>> * cachelines that might be contended:
>> --
>> 2.31.1
>>
Thanks!
--
With Best Regards,
Baokun Li
.
[-- Attachment #2: Type: text/html, Size: 6045 bytes --]
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [RFC PATCH] mm/filemap: avoid buffered read/write race to read inconsistent data
2023-12-12 12:41 ` Jan Kara
2023-12-12 13:16 ` Baokun Li
@ 2023-12-12 13:21 ` Baokun Li
1 sibling, 0 replies; 5+ messages in thread
From: Baokun Li @ 2023-12-12 13:21 UTC (permalink / raw)
To: Jan Kara
Cc: linux-mm, linux-ext4, tytso, adilger.kernel, willy, akpm, david,
hch, ritesh.list, linux-kernel, yi.zhang, yangerkun, yukuai3,
stable, Baokun Li
On 2023/12/12 20:41, Jan Kara wrote:
> On Tue 12-12-23 17:36:34, Baokun Li wrote:
>> The following concurrency may cause the data read to be inconsistent with
>> the data on disk:
>>
>> cpu1 cpu2
>> ------------------------------|------------------------------
>> // Buffered write 2048 from 0
>> ext4_buffered_write_iter
>> generic_perform_write
>> copy_page_from_iter_atomic
>> ext4_da_write_end
>> ext4_da_do_write_end
>> block_write_end
>> __block_commit_write
>> folio_mark_uptodate
>> // Buffered read 4096 from 0 smp_wmb()
>> ext4_file_read_iter set_bit(PG_uptodate, folio_flags)
>> generic_file_read_iter i_size_write // 2048
>> filemap_read unlock_page(page)
>> filemap_get_pages
>> filemap_get_read_batch
>> folio_test_uptodate(folio)
>> ret = test_bit(PG_uptodate, folio_flags)
>> if (ret)
>> smp_rmb();
>> // Ensure that the data in page 0-2048 is up-to-date.
>>
>> // New buffered write 2048 from 2048
>> ext4_buffered_write_iter
>> generic_perform_write
>> copy_page_from_iter_atomic
>> ext4_da_write_end
>> ext4_da_do_write_end
>> block_write_end
>> __block_commit_write
>> folio_mark_uptodate
>> smp_wmb()
>> set_bit(PG_uptodate, folio_flags)
>> i_size_write // 4096
>> unlock_page(page)
>>
>> isize = i_size_read(inode) // 4096
>> // Read the latest isize 4096, but without smp_rmb(), there may be
>> // Load-Load disorder resulting in the data in the 2048-4096 range
>> // in the page is not up-to-date.
>> copy_page_to_iter
>> // copyout 4096
>>
>> In the concurrency above, we read the updated i_size, but there is no read
>> barrier to ensure that the data in the page is the same as the i_size at
>> this point, so we may copy the unsynchronized page out. Hence adding the
>> missing read memory barrier to fix this.
>>
>> This is a Load-Load reordering issue, which only occurs on some weak
>> mem-ordering architectures (e.g. ARM64, ALPHA), but not on strong
>> mem-ordering architectures (e.g. X86). And theoretically the problem
> AFAIK x86 can also reorder loads vs loads so the problem can in theory
> happen on x86 as well.
According to what I read in the perfbook at the link below,
Loads Reordered After Loads does not happen on x86.
pdf sheet 562 corresponds to page 550,
Table 15.5: Summary of Memory Ordering
https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook-1c.2023.06.11a.pdf
>> doesn't only happen on ext4, filesystems that call filemap_read() but
>> don't hold inode lock (e.g. btrfs, f2fs, ubifs ...) will have this
>> problem, while filesystems with inode lock (e.g. xfs, nfs) won't have
>> this problem.
>>
>> Cc: stable@kernel.org
>> Signed-off-by: Baokun Li <libaokun1@huawei.com>
>> ---
>> mm/filemap.c | 3 +++
>> 1 file changed, 3 insertions(+)
>>
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 71f00539ac00..6324e2ac3e74 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -2607,6 +2607,9 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
>> goto put_folios;
>> end_offset = min_t(loff_t, isize, iocb->ki_pos + iter->count);
>>
>> + /* Ensure that the page cache within isize is updated. */
> Barries have to be in pairs to work and it is a good practice to document
> this. So here I'd have comment like:
> /*
> * Pairs with a barrier in
> * block_write_end()->mark_buffer_dirty() or other page
> * dirtying routines like iomap_write_end() to ensure
> * changes to page contents are visible before we see
> * increased inode size.
> */
>
> Honza
That's a very accurate description! Thanks a lot!
I will add this comment in the next version.
>> + smp_rmb();
>> +
>> /*
>> * Once we start copying data, we don't want to be touching any
>> * cachelines that might be contended:
>> --
>> 2.31.1
>>
Thanks!
--
With Best Regards,
Baokun Li
.
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [RFC PATCH] mm/filemap: avoid buffered read/write race to read inconsistent data
2023-12-12 13:16 ` Baokun Li
@ 2023-12-12 13:37 ` Jan Kara
0 siblings, 0 replies; 5+ messages in thread
From: Jan Kara @ 2023-12-12 13:37 UTC (permalink / raw)
To: Baokun Li
Cc: Jan Kara, linux-mm, linux-ext4, tytso, adilger.kernel, willy,
akpm, david, hch, ritesh.list, linux-kernel, yi.zhang, yangerkun,
yukuai3, stable
On Tue 12-12-23 21:16:16, Baokun Li wrote:
> On 2023/12/12 20:41, Jan Kara wrote:
> > On Tue 12-12-23 17:36:34, Baokun Li wrote:
> > > The following concurrency may cause the data read to be inconsistent with
> > > the data on disk:
> > >
> > > cpu1 cpu2
> > > ------------------------------|------------------------------
> > > // Buffered write 2048 from 0
> > > ext4_buffered_write_iter
> > > generic_perform_write
> > > copy_page_from_iter_atomic
> > > ext4_da_write_end
> > > ext4_da_do_write_end
> > > block_write_end
> > > __block_commit_write
> > > folio_mark_uptodate
> > > // Buffered read 4096 from 0 smp_wmb()
> > > ext4_file_read_iter set_bit(PG_uptodate, folio_flags)
> > > generic_file_read_iter i_size_write // 2048
> > > filemap_read unlock_page(page)
> > > filemap_get_pages
> > > filemap_get_read_batch
> > > folio_test_uptodate(folio)
> > > ret = test_bit(PG_uptodate, folio_flags)
> > > if (ret)
> > > smp_rmb();
> > > // Ensure that the data in page 0-2048 is up-to-date.
> > >
> > > // New buffered write 2048 from 2048
> > > ext4_buffered_write_iter
> > > generic_perform_write
> > > copy_page_from_iter_atomic
> > > ext4_da_write_end
> > > ext4_da_do_write_end
> > > block_write_end
> > > __block_commit_write
> > > folio_mark_uptodate
> > > smp_wmb()
> > > set_bit(PG_uptodate, folio_flags)
> > > i_size_write // 4096
> > > unlock_page(page)
> > >
> > > isize = i_size_read(inode) // 4096
> > > // Read the latest isize 4096, but without smp_rmb(), there may be
> > > // Load-Load disorder resulting in the data in the 2048-4096 range
> > > // in the page is not up-to-date.
> > > copy_page_to_iter
> > > // copyout 4096
> > >
> > > In the concurrency above, we read the updated i_size, but there is no read
> > > barrier to ensure that the data in the page is the same as the i_size at
> > > this point, so we may copy the unsynchronized page out. Hence adding the
> > > missing read memory barrier to fix this.
> > >
> > > This is a Load-Load reordering issue, which only occurs on some weak
> > > mem-ordering architectures (e.g. ARM64, ALPHA), but not on strong
> > > mem-ordering architectures (e.g. X86). And theoretically the problem
> > AFAIK x86 can also reorder loads vs loads so the problem can in theory
> > happen on x86 as well.
>
> According to what I read in the /perfbook /at the link below,
>
> Loads Reordered After Loads does not happen on x86.
>
> pdf sheet 562 corresponds to page 550,
>
> Table 15.5: Summary of Memory Ordering
>
> https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook-1c.2023.06.11a.pdf
Indeed. I stand corrected! Thanks for the link.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2023-12-12 13:37 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-12-12 9:36 [RFC PATCH] mm/filemap: avoid buffered read/write race to read inconsistent data Baokun Li
2023-12-12 12:41 ` Jan Kara
2023-12-12 13:16 ` Baokun Li
2023-12-12 13:37 ` Jan Kara
2023-12-12 13:21 ` Baokun Li
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox