* Hang when swapping huge=within_size tmpfs from zram
[not found] <1738717785.im3r5g2vxc.none.ref@localhost>
@ 2025-02-05 1:23 ` Alex Xu (Hello71)
2025-02-05 1:55 ` Baolin Wang
0 siblings, 1 reply; 11+ messages in thread
From: Alex Xu (Hello71) @ 2025-02-05 1:23 UTC (permalink / raw)
To: linux-mm, Baolin Wang, Daniel Gomez
Cc: Barry Song, David Hildenbrand, Hugh Dickins, Kefeng Wang,
Lance Yang, Matthew Wilcox, Ryan Roberts, linux-kernel,
Andrew Morton
Hi all,
On 6.14-rc1, I found that creating a lot of files in tmpfs then deleting
them reliably hangs when tmpfs is mounted with huge=within_size, and it
is swapped out to zram (zstd/zsmalloc/no backing dev). I bisected this
to acd7ccb284b "mm: shmem: add large folio support for tmpfs".
When the issue occurs, rm uses 100% CPU, cannot be killed, and has no
output in /proc/pid/stack or wchan. Eventually, an RCU stall is
detected:
rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu: Tasks blocked on level-0 rcu_node (CPUs 0-11): P25160
rcu: (detected by 10, t=2102 jiffies, g=532677, q=4997 ncpus=12)
task:rm state:R running task stack:0 pid:25160 tgid:25160 ppid:24309 task_flags:0x400000 flags:0x00004004
Call Trace:
<TASK>
? __schedule+0x388/0x1000
? kmem_cache_free.part.0+0x23d/0x280
? sysvec_apic_timer_interrupt+0xa/0x80
? asm_sysvec_apic_timer_interrupt+0x16/0x20
? xas_load+0x12/0xc0
? xas_load+0x8/0xc0
? xas_find+0x144/0x190
? find_lock_entries+0x75/0x260
? shmem_undo_range+0xe6/0x5f0
? shmem_evict_inode+0xe4/0x230
? mtree_erase+0x7e/0xe0
? inode_set_ctime_current+0x2e/0x1f0
? evict+0xe9/0x260
? _atomic_dec_and_lock+0x31/0x50
? do_unlinkat+0x270/0x2b0
? __x64_sys_unlinkat+0x30/0x50
? do_syscall_64+0x37/0xe0
? entry_SYSCALL_64_after_hwframe+0x50/0x58
</TASK>
Let me know what information is needed to further troubleshoot this
issue.
Thanks,
Alex.
* Re: Hang when swapping huge=within_size tmpfs from zram
2025-02-05 1:23 ` Hang when swapping huge=within_size tmpfs from zram Alex Xu (Hello71)
@ 2025-02-05 1:55 ` Baolin Wang
2025-02-05 6:38 ` Baolin Wang
0 siblings, 1 reply; 11+ messages in thread
From: Baolin Wang @ 2025-02-05 1:55 UTC (permalink / raw)
To: Alex Xu (Hello71), linux-mm, Daniel Gomez
Cc: Barry Song, David Hildenbrand, Hugh Dickins, Kefeng Wang,
Lance Yang, Matthew Wilcox, Ryan Roberts, linux-kernel,
Andrew Morton
Hi Alex,
On 2025/2/5 09:23, Alex Xu (Hello71) wrote:
> Hi all,
>
> On 6.14-rc1, I found that creating a lot of files in tmpfs then deleting
> them reliably hangs when tmpfs is mounted with huge=within_size, and it
> is swapped out to zram (zstd/zsmalloc/no backing dev). I bisected this
> to acd7ccb284b "mm: shmem: add large folio support for tmpfs".
>
> When the issue occurs, rm uses 100% CPU, cannot be killed, and has no
> output in /proc/pid/stack or wchan. Eventually, an RCU stall is
> detected:
Thanks for your report. Let me try to reproduce the issue locally and
see what happens.
> rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> rcu: Tasks blocked on level-0 rcu_node (CPUs 0-11): P25160
> rcu: (detected by 10, t=2102 jiffies, g=532677, q=4997 ncpus=12)
> task:rm state:R running task stack:0 pid:25160 tgid:25160 ppid:24309 task_flags:0x400000 flags:0x00004004
> Call Trace:
> <TASK>
> ? __schedule+0x388/0x1000
> ? kmem_cache_free.part.0+0x23d/0x280
> ? sysvec_apic_timer_interrupt+0xa/0x80
> ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> ? xas_load+0x12/0xc0
> ? xas_load+0x8/0xc0
> ? xas_find+0x144/0x190
> ? find_lock_entries+0x75/0x260
> ? shmem_undo_range+0xe6/0x5f0
> ? shmem_evict_inode+0xe4/0x230
> ? mtree_erase+0x7e/0xe0
> ? inode_set_ctime_current+0x2e/0x1f0
> ? evict+0xe9/0x260
> ? _atomic_dec_and_lock+0x31/0x50
> ? do_unlinkat+0x270/0x2b0
> ? __x64_sys_unlinkat+0x30/0x50
> ? do_syscall_64+0x37/0xe0
> ? entry_SYSCALL_64_after_hwframe+0x50/0x58
> </TASK>
>
> Let me know what information is needed to further troubleshoot this
> issue.
>
> Thanks,
> Alex.
* Re: Hang when swapping huge=within_size tmpfs from zram
2025-02-05 1:55 ` Baolin Wang
@ 2025-02-05 6:38 ` Baolin Wang
2025-02-05 14:39 ` Lance Yang
0 siblings, 1 reply; 11+ messages in thread
From: Baolin Wang @ 2025-02-05 6:38 UTC (permalink / raw)
To: Alex Xu (Hello71), linux-mm, Daniel Gomez
Cc: Barry Song, David Hildenbrand, Hugh Dickins, Kefeng Wang,
Lance Yang, Matthew Wilcox, Ryan Roberts, linux-kernel,
Andrew Morton
On 2025/2/5 09:55, Baolin Wang wrote:
> Hi Alex,
>
> On 2025/2/5 09:23, Alex Xu (Hello71) wrote:
>> Hi all,
>>
>> On 6.14-rc1, I found that creating a lot of files in tmpfs then deleting
>> them reliably hangs when tmpfs is mounted with huge=within_size, and it
>> is swapped out to zram (zstd/zsmalloc/no backing dev). I bisected this
>> to acd7ccb284b "mm: shmem: add large folio support for tmpfs".
>>
>> When the issue occurs, rm uses 100% CPU, cannot be killed, and has no
>> output in /proc/pid/stack or wchan. Eventually, an RCU stall is
>> detected:
>
> Thanks for your report. Let me try to reproduce the issue locally and
> see what happens.
>
>> rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
>> rcu: Tasks blocked on level-0 rcu_node (CPUs 0-11): P25160
>> rcu: (detected by 10, t=2102 jiffies, g=532677, q=4997 ncpus=12)
>> task:rm state:R running task stack:0 pid:25160
>> tgid:25160 ppid:24309 task_flags:0x400000 flags:0x00004004
>> Call Trace:
>> <TASK>
>> ? __schedule+0x388/0x1000
>> ? kmem_cache_free.part.0+0x23d/0x280
>> ? sysvec_apic_timer_interrupt+0xa/0x80
>> ? asm_sysvec_apic_timer_interrupt+0x16/0x20
>> ? xas_load+0x12/0xc0
>> ? xas_load+0x8/0xc0
>> ? xas_find+0x144/0x190
>> ? find_lock_entries+0x75/0x260
>> ? shmem_undo_range+0xe6/0x5f0
>> ? shmem_evict_inode+0xe4/0x230
>> ? mtree_erase+0x7e/0xe0
>> ? inode_set_ctime_current+0x2e/0x1f0
>> ? evict+0xe9/0x260
>> ? _atomic_dec_and_lock+0x31/0x50
>> ? do_unlinkat+0x270/0x2b0
>> ? __x64_sys_unlinkat+0x30/0x50
>> ? do_syscall_64+0x37/0xe0
>> ? entry_SYSCALL_64_after_hwframe+0x50/0x58
>> </TASK>
>>
>> Let me know what information is needed to further troubleshoot this
>> issue.
Sorry, I can't reproduce this issue. My testing process is as follows (also written out as a rough script below):
1. Mount tmpfs with huge=within_size
2. Create and write a tmpfs file
3. Swap out the large folios of the tmpfs file to zram
4. Execute 'rm' command to remove the tmpfs file
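For reference, here is that process as a rough script. The zram size, compressor, cgroup path, mount point and file size are illustrative assumptions, not an exact reproducer:

#!/bin/sh
# Rough sketch only -- sizes, paths and the reclaim mechanism below are
# assumptions for illustration.

# Swap on zram (zstd/zsmalloc, no backing device)
modprobe zram
echo zstd > /sys/block/zram0/comp_algorithm
echo 2G > /sys/block/zram0/disksize
mkswap /dev/zram0
swapon /dev/zram0

# Run the test from a memcg so its pages can be reclaimed on demand
# (assumes cgroup v2 mounted at /sys/fs/cgroup)
echo +memory > /sys/fs/cgroup/cgroup.subtree_control
mkdir -p /sys/fs/cgroup/shmem-test
echo $$ > /sys/fs/cgroup/shmem-test/cgroup.procs

# 1. Mount tmpfs with huge=within_size
mkdir -p /mnt/test
mount -t tmpfs -o huge=within_size tmpfs /mnt/test

# 2. Create and write a tmpfs file
dd if=/dev/urandom of=/mnt/test/file bs=1M count=256

# 3. Swap the file's large folios out to zram via proactive reclaim
echo 256M > /sys/fs/cgroup/shmem-test/memory.reclaim

# 4. Remove the tmpfs file
rm /mnt/test/file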
Do you have any reproduction scripts or a step-by-step reproduction guide? Thanks.
* Re: Hang when swapping huge=within_size tmpfs from zram
2025-02-05 6:38 ` Baolin Wang
@ 2025-02-05 14:39 ` Lance Yang
2025-02-07 7:23 ` Baolin Wang
0 siblings, 1 reply; 11+ messages in thread
From: Lance Yang @ 2025-02-05 14:39 UTC (permalink / raw)
To: Baolin Wang
Cc: Alex Xu (Hello71),
linux-mm, Daniel Gomez, Barry Song, David Hildenbrand,
Hugh Dickins, Kefeng Wang, Matthew Wilcox, Ryan Roberts,
linux-kernel, Andrew Morton
On Wed, Feb 5, 2025 at 2:38 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
>
>
> On 2025/2/5 09:55, Baolin Wang wrote:
> > Hi Alex,
> >
> > On 2025/2/5 09:23, Alex Xu (Hello71) wrote:
> >> Hi all,
> >>
> >> On 6.14-rc1, I found that creating a lot of files in tmpfs then deleting
> >> them reliably hangs when tmpfs is mounted with huge=within_size, and it
> >> is swapped out to zram (zstd/zsmalloc/no backing dev). I bisected this
> >> to acd7ccb284b "mm: shmem: add large folio support for tmpfs".
> >>
> >> When the issue occurs, rm uses 100% CPU, cannot be killed, and has no
> >> output in /proc/pid/stack or wchan. Eventually, an RCU stall is
> >> detected:
> >
> > Thanks for your report. Let me try to reproduce the issue locally and
> > see what happens.
> >
> >> rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> >> rcu: Tasks blocked on level-0 rcu_node (CPUs 0-11): P25160
> >> rcu: (detected by 10, t=2102 jiffies, g=532677, q=4997 ncpus=12)
> >> task:rm state:R running task stack:0 pid:25160
> >> tgid:25160 ppid:24309 task_flags:0x400000 flags:0x00004004
> >> Call Trace:
> >> <TASK>
> >> ? __schedule+0x388/0x1000
> >> ? kmem_cache_free.part.0+0x23d/0x280
> >> ? sysvec_apic_timer_interrupt+0xa/0x80
> >> ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> >> ? xas_load+0x12/0xc0
> >> ? xas_load+0x8/0xc0
> >> ? xas_find+0x144/0x190
> >> ? find_lock_entries+0x75/0x260
> >> ? shmem_undo_range+0xe6/0x5f0
> >> ? shmem_evict_inode+0xe4/0x230
> >> ? mtree_erase+0x7e/0xe0
> >> ? inode_set_ctime_current+0x2e/0x1f0
> >> ? evict+0xe9/0x260
> >> ? _atomic_dec_and_lock+0x31/0x50
> >> ? do_unlinkat+0x270/0x2b0
> >> ? __x64_sys_unlinkat+0x30/0x50
> >> ? do_syscall_64+0x37/0xe0
> >> ? entry_SYSCALL_64_after_hwframe+0x50/0x58
> >> </TASK>
> >>
> >> Let me know what information is needed to further troubleshoot this
> >> issue.
>
> Sorry, I can't reproduce this issue, and my testing process is as follows:
> 1. Mount tmpfs with huge=within_size
> 2. Create and write a tmpfs file
> 3. Swap out the large folios of the tmpfs file to zram
> 4. Execute 'rm' command to remove the tmpfs file
I’m unable to reproduce the issue either, following steps similar to
Baolin's process:
1) Mount tmpfs with the huge=within_size option and enable swap (using
zstd/zsmalloc without a backing device).
2) Create and write over 10,000 files in the tmpfs.
3) Swap out the large folios of these tmpfs files to zram.
4) Use the rm command to delete all the files from the tmpfs.
I tested with both 2MiB and 64KiB large folio sizes, and with
shmem_enabled=within_size, but everything works as expected.
Thanks,
Lance
>
> Do you have any reproduction scripts or reproduction steps guide? Thanks.
* Re: Hang when swapping huge=within_size tmpfs from zram
2025-02-05 14:39 ` Lance Yang
@ 2025-02-07 7:23 ` Baolin Wang
2025-02-23 17:53 ` Kairui Song
0 siblings, 1 reply; 11+ messages in thread
From: Baolin Wang @ 2025-02-07 7:23 UTC (permalink / raw)
To: Lance Yang
Cc: Alex Xu (Hello71),
linux-mm, Daniel Gomez, Barry Song, David Hildenbrand,
Hugh Dickins, Kefeng Wang, Matthew Wilcox, Ryan Roberts,
linux-kernel, Andrew Morton
On 2025/2/5 22:39, Lance Yang wrote:
> On Wed, Feb 5, 2025 at 2:38 PM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
>>
>>
>>
>> On 2025/2/5 09:55, Baolin Wang wrote:
>>> Hi Alex,
>>>
>>> On 2025/2/5 09:23, Alex Xu (Hello71) wrote:
>>>> Hi all,
>>>>
>>>> On 6.14-rc1, I found that creating a lot of files in tmpfs then deleting
>>>> them reliably hangs when tmpfs is mounted with huge=within_size, and it
>>>> is swapped out to zram (zstd/zsmalloc/no backing dev). I bisected this
>>>> to acd7ccb284b "mm: shmem: add large folio support for tmpfs".
>>>>
>>>> When the issue occurs, rm uses 100% CPU, cannot be killed, and has no
>>>> output in /proc/pid/stack or wchan. Eventually, an RCU stall is
>>>> detected:
>>>
>>> Thanks for your report. Let me try to reproduce the issue locally and
>>> see what happens.
>>>
>>>> rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
>>>> rcu: Tasks blocked on level-0 rcu_node (CPUs 0-11): P25160
>>>> rcu: (detected by 10, t=2102 jiffies, g=532677, q=4997 ncpus=12)
>>>> task:rm state:R running task stack:0 pid:25160
>>>> tgid:25160 ppid:24309 task_flags:0x400000 flags:0x00004004
>>>> Call Trace:
>>>> <TASK>
>>>> ? __schedule+0x388/0x1000
>>>> ? kmem_cache_free.part.0+0x23d/0x280
>>>> ? sysvec_apic_timer_interrupt+0xa/0x80
>>>> ? asm_sysvec_apic_timer_interrupt+0x16/0x20
>>>> ? xas_load+0x12/0xc0
>>>> ? xas_load+0x8/0xc0
>>>> ? xas_find+0x144/0x190
>>>> ? find_lock_entries+0x75/0x260
>>>> ? shmem_undo_range+0xe6/0x5f0
>>>> ? shmem_evict_inode+0xe4/0x230
>>>> ? mtree_erase+0x7e/0xe0
>>>> ? inode_set_ctime_current+0x2e/0x1f0
>>>> ? evict+0xe9/0x260
>>>> ? _atomic_dec_and_lock+0x31/0x50
>>>> ? do_unlinkat+0x270/0x2b0
>>>> ? __x64_sys_unlinkat+0x30/0x50
>>>> ? do_syscall_64+0x37/0xe0
>>>> ? entry_SYSCALL_64_after_hwframe+0x50/0x58
>>>> </TASK>
>>>>
>>>> Let me know what information is needed to further troubleshoot this
>>>> issue.
>>
>> Sorry, I can't reproduce this issue, and my testing process is as follows:
>> 1. Mount tmpfs with huge=within_size
>> 2. Create and write a tmpfs file
>> 3. Swap out the large folios of the tmpfs file to zram
>> 4. Execute 'rm' command to remove the tmpfs file
>
> I’m unable to reproduce the issue as well, and am following steps similar
> to Baolin's process:
>
> 1) Mount tmpfs with the huge=within_size option and enable swap (using
> zstd/zsmalloc without a backing device).
> 2) Create and write over 10,000 files in the tmpfs.
> 3) Swap out the large folios of these tmpfs files to zram.
> 4) Use the rm command to delete all the files from the tmpfs.
>
> Testing with both 2MiB and 64KiB large folio sizes, and with
> shmem_enabled=within_size, but everything works as expected.
Thanks Lance for confirming again.
Alex, could you give more hints on how to reproduce this issue?
* Re: Hang when swapping huge=within_size tmpfs from zram
2025-02-07 7:23 ` Baolin Wang
@ 2025-02-23 17:53 ` Kairui Song
2025-02-23 18:22 ` Kairui Song
0 siblings, 1 reply; 11+ messages in thread
From: Kairui Song @ 2025-02-23 17:53 UTC (permalink / raw)
To: Baolin Wang, Alex Xu (Hello71)
Cc: Lance Yang, linux-mm, Daniel Gomez, Barry Song,
David Hildenbrand, Hugh Dickins, Kefeng Wang, Matthew Wilcox,
Ryan Roberts, linux-kernel, Andrew Morton, ziy
On Fri, Feb 7, 2025 at 3:24 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
> On 2025/2/5 22:39, Lance Yang wrote:
> > On Wed, Feb 5, 2025 at 2:38 PM Baolin Wang
> > <baolin.wang@linux.alibaba.com> wrote:
> >> On 2025/2/5 09:55, Baolin Wang wrote:
> >>> Hi Alex,
> >>>
> >>> On 2025/2/5 09:23, Alex Xu (Hello71) wrote:
> >>>> Hi all,
> >>>>
> >>>> On 6.14-rc1, I found that creating a lot of files in tmpfs then deleting
> >>>> them reliably hangs when tmpfs is mounted with huge=within_size, and it
> >>>> is swapped out to zram (zstd/zsmalloc/no backing dev). I bisected this
> >>>> to acd7ccb284b "mm: shmem: add large folio support for tmpfs".
> >>>>
> >>>> When the issue occurs, rm uses 100% CPU, cannot be killed, and has no
> >>>> output in /proc/pid/stack or wchan. Eventually, an RCU stall is
> >>>> detected:
> >>>
> >>> Thanks for your report. Let me try to reproduce the issue locally and
> >>> see what happens.
> >>>
> >>>> rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> >>>> rcu: Tasks blocked on level-0 rcu_node (CPUs 0-11): P25160
> >>>> rcu: (detected by 10, t=2102 jiffies, g=532677, q=4997 ncpus=12)
> >>>> task:rm state:R running task stack:0 pid:25160
> >>>> tgid:25160 ppid:24309 task_flags:0x400000 flags:0x00004004
> >>>> Call Trace:
> >>>> <TASK>
> >>>> ? __schedule+0x388/0x1000
> >>>> ? kmem_cache_free.part.0+0x23d/0x280
> >>>> ? sysvec_apic_timer_interrupt+0xa/0x80
> >>>> ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> >>>> ? xas_load+0x12/0xc0
> >>>> ? xas_load+0x8/0xc0
> >>>> ? xas_find+0x144/0x190
> >>>> ? find_lock_entries+0x75/0x260
> >>>> ? shmem_undo_range+0xe6/0x5f0
> >>>> ? shmem_evict_inode+0xe4/0x230
> >>>> ? mtree_erase+0x7e/0xe0
> >>>> ? inode_set_ctime_current+0x2e/0x1f0
> >>>> ? evict+0xe9/0x260
> >>>> ? _atomic_dec_and_lock+0x31/0x50
> >>>> ? do_unlinkat+0x270/0x2b0
> >>>> ? __x64_sys_unlinkat+0x30/0x50
> >>>> ? do_syscall_64+0x37/0xe0
> >>>> ? entry_SYSCALL_64_after_hwframe+0x50/0x58
> >>>> </TASK>
> >>>>
> >>>> Let me know what information is needed to further troubleshoot this
> >>>> issue.
> >>
> >> Sorry, I can't reproduce this issue, and my testing process is as follows:
> >> 1. Mount tmpfs with huge=within_size
> >> 2. Create and write a tmpfs file
> >> 3. Swap out the large folios of the tmpfs file to zram
> >> 4. Execute 'rm' command to remove the tmpfs file
> >
> > I’m unable to reproduce the issue as well, and am following steps similar
> > to Baolin's process:
> >
> > 1) Mount tmpfs with the huge=within_size option and enable swap (using
> > zstd/zsmalloc without a backing device).
> > 2) Create and write over 10,000 files in the tmpfs.
> > 3) Swap out the large folios of these tmpfs files to zram.
> > 4) Use the rm command to delete all the files from the tmpfs.
> >
> > Testing with both 2MiB and 64KiB large folio sizes, and with
> > shmem_enabled=within_size, but everything works as expected.
>
> Thanks Lance for confirming again.
>
> Alex, could you give more hints on how to reproduce this issue?
>
Hi Baolin,
I can reproduce this issue very easily with a Linux kernel build test,
and the failure rate is very high. I'm not entirely sure it is the same
bug, but it is very likely. My test steps (also sketched as a rough
script below):
1. Create a 10G ZRAM device and set up SWAP on it.
2. Create a 1G memcg, and spawn a shell in it.
3. Mount tmpfs with huge=within_size, then untar the Linux kernel
source code into it.
4. Build with make -j32 (a higher or lower job count may also work);
the build always fails within 10s due to corrupted files.
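For concreteness, the steps above as a rough script; the cgroup path, kernel tarball name, compressor and job count are illustrative assumptions rather than the exact setup:

#!/bin/sh
# Illustrative sketch only; paths, tarball name and compressor are assumptions.

# 1. Create a 10G zram device and set up swap on it
modprobe zram
echo zstd > /sys/block/zram0/comp_algorithm   # compressor choice is an assumption
echo 10G > /sys/block/zram0/disksize
mkswap /dev/zram0
swapon /dev/zram0

# 2. Create a 1G memcg and move this shell into it (assumes cgroup v2)
echo +memory > /sys/fs/cgroup/cgroup.subtree_control
mkdir -p /sys/fs/cgroup/kbuild-test
echo 1G > /sys/fs/cgroup/kbuild-test/memory.max
echo $$ > /sys/fs/cgroup/kbuild-test/cgroup.procs

# 3. Mount tmpfs with huge=within_size and untar a kernel source tree into it
mkdir -p /mnt/test
mount -t tmpfs -o huge=within_size tmpfs /mnt/test
tar -xf linux-6.14-rc1.tar.xz -C /mnt/test

# 4. Build; with the bug present this fails within seconds due to corrupted files
cd /mnt/test/linux-6.14-rc1
make defconfig
make -j32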
After some debugging, the reason is in shmem_swapin_folio: when the swap
cache is hit, `folio = swap_cache_get_folio(swap, NULL, 0);` sets folio
to an order 0 folio, and the following shmem_add_to_page_cache then
inserts that order 0 folio over a high-order entry in shmem's xarray, so
data is lost. A swap cache hit can happen for many reasons; in this case
it is caused by readahead.
One quick fix is to just always split the entry upon a shmem fault of an
order 0 folio, like this:
diff --git a/mm/shmem.c b/mm/shmem.c
index 4ea6109a8043..c8e5c419c675 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2341,6 +2341,10 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
                 }
         }

+        /* Swapin of 0 order folio must always ensure the entries are split */
+        if (!folio_order(folio))
+                shmem_split_large_entry(inode, index, swap, gfp);
+
 alloced:
         /* We have to do this with folio locked to prevent races */
         folio_lock(folio);
And hi Alex, could you confirm whether the above patch fixes the bug you reported?
If we are OK with this, it should be merged into 6.14 I think, but for
the long term it might be a good idea to share similar logic with (or
just reuse) __filemap_add_folio for shmem. __filemap_add_folio splits
the entry on insert, and the code would be much cleaner.
* Re: Hang when swapping huge=within_size tmpfs from zram
2025-02-23 17:53 ` Kairui Song
@ 2025-02-23 18:22 ` Kairui Song
2025-02-24 3:21 ` Baolin Wang
0 siblings, 1 reply; 11+ messages in thread
From: Kairui Song @ 2025-02-23 18:22 UTC (permalink / raw)
To: Baolin Wang, Alex Xu (Hello71)
Cc: Lance Yang, linux-mm, Daniel Gomez, Barry Song,
David Hildenbrand, Hugh Dickins, Kefeng Wang, Matthew Wilcox,
Ryan Roberts, linux-kernel, Andrew Morton, ziy
On Mon, Feb 24, 2025 at 1:53 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Fri, Feb 7, 2025 at 3:24 PM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
> >
> > On 2025/2/5 22:39, Lance Yang wrote:
> > > On Wed, Feb 5, 2025 at 2:38 PM Baolin Wang
> > > <baolin.wang@linux.alibaba.com> wrote:
> > >> On 2025/2/5 09:55, Baolin Wang wrote:
> > >>> Hi Alex,
> > >>>
> > >>> On 2025/2/5 09:23, Alex Xu (Hello71) wrote:
> > >>>> Hi all,
> > >>>>
> > >>>> On 6.14-rc1, I found that creating a lot of files in tmpfs then deleting
> > >>>> them reliably hangs when tmpfs is mounted with huge=within_size, and it
> > >>>> is swapped out to zram (zstd/zsmalloc/no backing dev). I bisected this
> > >>>> to acd7ccb284b "mm: shmem: add large folio support for tmpfs".
> > >>>>
> > >>>> When the issue occurs, rm uses 100% CPU, cannot be killed, and has no
> > >>>> output in /proc/pid/stack or wchan. Eventually, an RCU stall is
> > >>>> detected:
> > >>>
> > >>> Thanks for your report. Let me try to reproduce the issue locally and
> > >>> see what happens.
> > >>>
> > >>>> rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > >>>> rcu: Tasks blocked on level-0 rcu_node (CPUs 0-11): P25160
> > >>>> rcu: (detected by 10, t=2102 jiffies, g=532677, q=4997 ncpus=12)
> > >>>> task:rm state:R running task stack:0 pid:25160
> > >>>> tgid:25160 ppid:24309 task_flags:0x400000 flags:0x00004004
> > >>>> Call Trace:
> > >>>> <TASK>
> > >>>> ? __schedule+0x388/0x1000
> > >>>> ? kmem_cache_free.part.0+0x23d/0x280
> > >>>> ? sysvec_apic_timer_interrupt+0xa/0x80
> > >>>> ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> > >>>> ? xas_load+0x12/0xc0
> > >>>> ? xas_load+0x8/0xc0
> > >>>> ? xas_find+0x144/0x190
> > >>>> ? find_lock_entries+0x75/0x260
> > >>>> ? shmem_undo_range+0xe6/0x5f0
> > >>>> ? shmem_evict_inode+0xe4/0x230
> > >>>> ? mtree_erase+0x7e/0xe0
> > >>>> ? inode_set_ctime_current+0x2e/0x1f0
> > >>>> ? evict+0xe9/0x260
> > >>>> ? _atomic_dec_and_lock+0x31/0x50
> > >>>> ? do_unlinkat+0x270/0x2b0
> > >>>> ? __x64_sys_unlinkat+0x30/0x50
> > >>>> ? do_syscall_64+0x37/0xe0
> > >>>> ? entry_SYSCALL_64_after_hwframe+0x50/0x58
> > >>>> </TASK>
> > >>>>
> > >>>> Let me know what information is needed to further troubleshoot this
> > >>>> issue.
> > >>
> > >> Sorry, I can't reproduce this issue, and my testing process is as follows:
> > >> 1. Mount tmpfs with huge=within_size
> > >> 2. Create and write a tmpfs file
> > >> 3. Swap out the large folios of the tmpfs file to zram
> > >> 4. Execute 'rm' command to remove the tmpfs file
> > >
> > > I’m unable to reproduce the issue as well, and am following steps similar
> > > to Baolin's process:
> > >
> > > 1) Mount tmpfs with the huge=within_size option and enable swap (using
> > > zstd/zsmalloc without a backing device).
> > > 2) Create and write over 10,000 files in the tmpfs.
> > > 3) Swap out the large folios of these tmpfs files to zram.
> > > 4) Use the rm command to delete all the files from the tmpfs.
> > >
> > > Testing with both 2MiB and 64KiB large folio sizes, and with
> > > shmem_enabled=within_size, but everything works as expected.
> >
> > Thanks Lance for confirming again.
> >
> > Alex, could you give more hints on how to reproduce this issue?
> >
>
> Hi Baolin,
>
> I can reproduce this issue very easily with the build linux kernel
> test, and the failure rate is very high. I'm not exactly sure this is
> the same bug but very likely, my test step:
>
> 1. Create a 10G ZRAM device and set up SWAP on it.
> 2. Create a 1G memcg, and spawn a shell in it.
> 3. Mount tmpfs with huge=within_size, and then untar linux kernel
> source code into it.
> 4. Build with make -j32 (higher or lower job number may also work),
> the build will always fall within 10s due to file corrupted.
>
> After some debugging, the reason is in shmem_swapin_folio, when swap
> cache is hit `folio = swap_cache_get_folio(swap, NULL, 0);` sets folio
> to a 0 order folio, then the following shmem_add_to_page_cache will
> insert a order 0 folio overriding a high order entry in shmem's
> xarray, so data are lost. Swap cache hit could be due to many reasons,
> in this case it's the readahead.
>
> One quick fix is just always split the entry upon shmem fault of 0
> order folio like this:
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 4ea6109a8043..c8e5c419c675 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2341,6 +2341,10 @@ static int shmem_swapin_folio(struct inode
> *inode, pgoff_t index,
> }
> }
>
> + /* Swapin of 0 order folio must always ensure the entries are split */
> + if (!folio_order(folio))
> + shmem_split_large_entry(inode, index, swap, gfp);
> +
> alloced:
> /* We have to do this with folio locked to prevent races */
> folio_lock(folio);
>
> And Hi Alex, can you help confirm if the above patch fixes your reported bug?
>
> If we are OK with this, this should be merged into 6.14 I think, but
> for the long term, it might be a good idea to just share a similar
> logic of (or just reuse) __filemap_add_folio for shmem?
> __filemap_add_folio will split the entry on insert, and code will be
> much cleaner.
Some extra comments on the above patch: if it races with another split,
or the entry used for the swap cache lookup is wrongly aligned due to a
large entry, the shmem_add_to_page_cache below will fail with -EEXIST
and retry. So that seems to work well in my test.
* Re: Hang when swapping huge=within_size tmpfs from zram
2025-02-23 18:22 ` Kairui Song
@ 2025-02-24 3:21 ` Baolin Wang
2025-02-24 8:47 ` [PATCH] mm: shmem: fix potential data corruption during shmem swapin Baolin Wang
0 siblings, 1 reply; 11+ messages in thread
From: Baolin Wang @ 2025-02-24 3:21 UTC (permalink / raw)
To: Kairui Song, Alex Xu (Hello71)
Cc: Lance Yang, linux-mm, Daniel Gomez, Barry Song,
David Hildenbrand, Hugh Dickins, Kefeng Wang, Matthew Wilcox,
Ryan Roberts, linux-kernel, Andrew Morton, ziy
Hi Kairui,
On 2025/2/24 02:22, Kairui Song wrote:
> On Mon, Feb 24, 2025 at 1:53 AM Kairui Song <ryncsn@gmail.com> wrote:
>>
>> On Fri, Feb 7, 2025 at 3:24 PM Baolin Wang
>> <baolin.wang@linux.alibaba.com> wrote:
>>>
>>> On 2025/2/5 22:39, Lance Yang wrote:
>>>> On Wed, Feb 5, 2025 at 2:38 PM Baolin Wang
>>>> <baolin.wang@linux.alibaba.com> wrote:
>>>>> On 2025/2/5 09:55, Baolin Wang wrote:
>>>>>> Hi Alex,
>>>>>>
>>>>>> On 2025/2/5 09:23, Alex Xu (Hello71) wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> On 6.14-rc1, I found that creating a lot of files in tmpfs then deleting
>>>>>>> them reliably hangs when tmpfs is mounted with huge=within_size, and it
>>>>>>> is swapped out to zram (zstd/zsmalloc/no backing dev). I bisected this
>>>>>>> to acd7ccb284b "mm: shmem: add large folio support for tmpfs".
>>>>>>>
>>>>>>> When the issue occurs, rm uses 100% CPU, cannot be killed, and has no
>>>>>>> output in /proc/pid/stack or wchan. Eventually, an RCU stall is
>>>>>>> detected:
>>>>>>
>>>>>> Thanks for your report. Let me try to reproduce the issue locally and
>>>>>> see what happens.
>>>>>>
>>>>>>> rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
>>>>>>> rcu: Tasks blocked on level-0 rcu_node (CPUs 0-11): P25160
>>>>>>> rcu: (detected by 10, t=2102 jiffies, g=532677, q=4997 ncpus=12)
>>>>>>> task:rm state:R running task stack:0 pid:25160
>>>>>>> tgid:25160 ppid:24309 task_flags:0x400000 flags:0x00004004
>>>>>>> Call Trace:
>>>>>>> <TASK>
>>>>>>> ? __schedule+0x388/0x1000
>>>>>>> ? kmem_cache_free.part.0+0x23d/0x280
>>>>>>> ? sysvec_apic_timer_interrupt+0xa/0x80
>>>>>>> ? asm_sysvec_apic_timer_interrupt+0x16/0x20
>>>>>>> ? xas_load+0x12/0xc0
>>>>>>> ? xas_load+0x8/0xc0
>>>>>>> ? xas_find+0x144/0x190
>>>>>>> ? find_lock_entries+0x75/0x260
>>>>>>> ? shmem_undo_range+0xe6/0x5f0
>>>>>>> ? shmem_evict_inode+0xe4/0x230
>>>>>>> ? mtree_erase+0x7e/0xe0
>>>>>>> ? inode_set_ctime_current+0x2e/0x1f0
>>>>>>> ? evict+0xe9/0x260
>>>>>>> ? _atomic_dec_and_lock+0x31/0x50
>>>>>>> ? do_unlinkat+0x270/0x2b0
>>>>>>> ? __x64_sys_unlinkat+0x30/0x50
>>>>>>> ? do_syscall_64+0x37/0xe0
>>>>>>> ? entry_SYSCALL_64_after_hwframe+0x50/0x58
>>>>>>> </TASK>
>>>>>>>
>>>>>>> Let me know what information is needed to further troubleshoot this
>>>>>>> issue.
>>>>>
>>>>> Sorry, I can't reproduce this issue, and my testing process is as follows:
>>>>> 1. Mount tmpfs with huge=within_size
>>>>> 2. Create and write a tmpfs file
>>>>> 3. Swap out the large folios of the tmpfs file to zram
>>>>> 4. Execute 'rm' command to remove the tmpfs file
>>>>
>>>> I’m unable to reproduce the issue as well, and am following steps similar
>>>> to Baolin's process:
>>>>
>>>> 1) Mount tmpfs with the huge=within_size option and enable swap (using
>>>> zstd/zsmalloc without a backing device).
>>>> 2) Create and write over 10,000 files in the tmpfs.
>>>> 3) Swap out the large folios of these tmpfs files to zram.
>>>> 4) Use the rm command to delete all the files from the tmpfs.
>>>>
>>>> Testing with both 2MiB and 64KiB large folio sizes, and with
>>>> shmem_enabled=within_size, but everything works as expected.
>>>
>>> Thanks Lance for confirming again.
>>>
>>> Alex, could you give more hints on how to reproduce this issue?
>>>
>>
>> Hi Baolin,
>>
>> I can reproduce this issue very easily with the build linux kernel
>> test, and the failure rate is very high. I'm not exactly sure this is
>> the same bug but very likely, my test step:
>>
>> 1. Create a 10G ZRAM device and set up SWAP on it.
>> 2. Create a 1G memcg, and spawn a shell in it.
>> 3. Mount tmpfs with huge=within_size, and then untar linux kernel
>> source code into it.
>> 4. Build with make -j32 (higher or lower job number may also work),
>> the build will always fall within 10s due to file corrupted.
Much appreciated for your reproducer; I can now reproduce the issue
locally.
>> After some debugging, the reason is in shmem_swapin_folio, when swap
>> cache is hit `folio = swap_cache_get_folio(swap, NULL, 0);` sets folio
>> to a 0 order folio, then the following shmem_add_to_page_cache will
>> insert a order 0 folio overriding a high order entry in shmem's
>> xarray, so data are lost. Swap cache hit could be due to many reasons,
>> in this case it's the readahead.
Yes, thanks for your analysis. I missed that the swap readahead can swap
in order 0 folios asynchronously.
>>
>> One quick fix is just always split the entry upon shmem fault of 0
>> order folio like this:
>>
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index 4ea6109a8043..c8e5c419c675 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -2341,6 +2341,10 @@ static int shmem_swapin_folio(struct inode
>> *inode, pgoff_t index,
>> }
>> }
>>
>> + /* Swapin of 0 order folio must always ensure the entries are split */
>> + if (!folio_order(folio))
>> + shmem_split_large_entry(inode, index, swap, gfp);
>> +
>> alloced:
>> /* We have to do this with folio locked to prevent races */
>> folio_lock(folio);
I don't think we should always split the large entry when getting a folio
from the swap cache. Instead, splitting should only be done when the
order stored in the shmem mapping is inconsistent with the folio order,
and the swap value should be updated as well.
Could you help try the fix below? I tested it and it works well with
your reproducer. Thanks a lot.
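(A worked example of the offset recalculation, assuming an order-4 (64KiB) swap entry covering indices 32-47: after splitting, a fault at index 37 needs the swap entry recalculated as the original swp_offset plus 37 - round_down(37, 16) = 5, which is what the round_down() arithmetic in the fix below computes.)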
diff --git a/mm/shmem.c b/mm/shmem.c
index 671f63063fd4..7e70081a96d4 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2253,7 +2253,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
         struct folio *folio = NULL;
         bool skip_swapcache = false;
         swp_entry_t swap;
-        int error, nr_pages;
+        int error, nr_pages, order, split_order;

         VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
         swap = radix_to_swp_entry(*foliop);
@@ -2272,10 +2272,9 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,

         /* Look it up and read it in.. */
         folio = swap_cache_get_folio(swap, NULL, 0);
+        order = xa_get_order(&mapping->i_pages, index);
         if (!folio) {
-                int order = xa_get_order(&mapping->i_pages, index);
                 bool fallback_order0 = false;
-                int split_order;

                 /* Or update major stats only when swapin succeeds?? */
                 if (fault_type) {
@@ -2339,6 +2338,29 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
                         error = -ENOMEM;
                         goto failed;
                 }
+        } else if (order != folio_order(folio)) {
+                /*
+                 * Swap readahead may swap in order 0 folios into swapcache
+                 * asynchronously, while the shmem mapping can still stores
+                 * large swap entries. In such cases, we should split the
+                 * large swap entry to prevent possible data loss.
+                 */
+                split_order = shmem_split_large_entry(inode, index, swap, gfp);
+                if (split_order < 0) {
+                        error = split_order;
+                        goto failed;
+                }
+
+                /*
+                 * If the large swap entry has already been split, it is
+                 * necessary to recalculate the new swap entry based on
+                 * the old order alignment.
+                 */
+                if (split_order > 0) {
+                        pgoff_t offset = index - round_down(index, 1 << split_order);
+
+                        swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
+                }
         }

 alloced:
@@ -2346,7 +2368,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
         folio_lock(folio);
         if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
             folio->swap.val != swap.val ||
-            !shmem_confirm_swap(mapping, index, swap)) {
+            !shmem_confirm_swap(mapping, index, swap) ||
+            xa_get_order(&mapping->i_pages, index) != folio_order(folio)) {
                 error = -EEXIST;
                 goto unlock;
         }
>> And Hi Alex, can you help confirm if the above patch fixes your reported bug?
>>
>> If we are OK with this, this should be merged into 6.14 I think, but
>> for the long term, it might be a good idea to just share a similar
>> logic of (or just reuse) __filemap_add_folio for shmem?
>> __filemap_add_folio will split the entry on insert, and code will be
>> much cleaner.
>
> Some extra comments for above patch: If it raced with another split,
> or the entry used for swap cache lookup is wrongly aligned due to
> large entry, the shmem_add_to_page_cache below will fail with -EEXIST
> and try again. So that seems to be working well in my test.
* [PATCH] mm: shmem: fix potential data corruption during shmem swapin
2025-02-24 3:21 ` Baolin Wang
@ 2025-02-24 8:47 ` Baolin Wang
2025-02-24 17:50 ` Kairui Song
0 siblings, 1 reply; 11+ messages in thread
From: Baolin Wang @ 2025-02-24 8:47 UTC (permalink / raw)
To: baolin.wang
Cc: akpm, alex_y_xu, baohua, da.gomez, david, hughd, ioworker0,
linux-kernel, linux-mm, ryan.roberts, ryncsn, wangkefeng.wang,
willy, ziy
Alex and Kairui reported some issues (system hang or data corruption) when
swapping out or swapping in large shmem folios. This is especially easy to
reproduce when the tmpfs is mounted with the 'huge=within_size' parameter.
Thanks to Kairui's reproducer, the issue can be easily replicated.
The root cause of the problem is that swap readahead may asynchronously
swap in order 0 folios into the swap cache, while the shmem mapping can
still store large swap entries. Then an order 0 folio is inserted into
the shmem mapping without splitting the large swap entry, which overwrites
the original large swap entry, leading to data corruption.
When getting a folio from the swap cache, we should split the large swap
entry stored in the shmem mapping if the orders do not match, to fix this
issue.
Fixes: 809bc86517cc ("mm: shmem: support large folio swap out")
Reported-by: Alex Xu (Hello71) <alex_y_xu@yahoo.ca>
Reported-by: Kairui Song <ryncsn@gmail.com>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
mm/shmem.c | 31 +++++++++++++++++++++++++++----
1 file changed, 27 insertions(+), 4 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index 4ea6109a8043..cebbac97a221 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2253,7 +2253,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
         struct folio *folio = NULL;
         bool skip_swapcache = false;
         swp_entry_t swap;
-        int error, nr_pages;
+        int error, nr_pages, order, split_order;

         VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
         swap = radix_to_swp_entry(*foliop);
@@ -2272,10 +2272,9 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,

         /* Look it up and read it in.. */
         folio = swap_cache_get_folio(swap, NULL, 0);
+        order = xa_get_order(&mapping->i_pages, index);
         if (!folio) {
-                int order = xa_get_order(&mapping->i_pages, index);
                 bool fallback_order0 = false;
-                int split_order;

                 /* Or update major stats only when swapin succeeds?? */
                 if (fault_type) {
@@ -2339,6 +2338,29 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
                         error = -ENOMEM;
                         goto failed;
                 }
+        } else if (order != folio_order(folio)) {
+                /*
+                 * Swap readahead may swap in order 0 folios into swapcache
+                 * asynchronously, while the shmem mapping can still stores
+                 * large swap entries. In such cases, we should split the
+                 * large swap entry to prevent possible data corruption.
+                 */
+                split_order = shmem_split_large_entry(inode, index, swap, gfp);
+                if (split_order < 0) {
+                        error = split_order;
+                        goto failed;
+                }
+
+                /*
+                 * If the large swap entry has already been split, it is
+                 * necessary to recalculate the new swap entry based on
+                 * the old order alignment.
+                 */
+                if (split_order > 0) {
+                        pgoff_t offset = index - round_down(index, 1 << split_order);
+
+                        swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
+                }
         }

 alloced:
@@ -2346,7 +2368,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
         folio_lock(folio);
         if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
             folio->swap.val != swap.val ||
-            !shmem_confirm_swap(mapping, index, swap)) {
+            !shmem_confirm_swap(mapping, index, swap) ||
+            xa_get_order(&mapping->i_pages, index) != folio_order(folio)) {
                 error = -EEXIST;
                 goto unlock;
         }
--
2.43.5
* Re: [PATCH] mm: shmem: fix potential data corruption during shmem swapin
2025-02-24 8:47 ` [PATCH] mm: shmem: fix potential data corruption during shmem swapin Baolin Wang
@ 2025-02-24 17:50 ` Kairui Song
2025-02-25 1:07 ` Baolin Wang
0 siblings, 1 reply; 11+ messages in thread
From: Kairui Song @ 2025-02-24 17:50 UTC (permalink / raw)
To: Baolin Wang
Cc: akpm, alex_y_xu, baohua, da.gomez, david, hughd, ioworker0,
linux-kernel, linux-mm, ryan.roberts, wangkefeng.wang, willy,
ziy
On Mon, Feb 24, 2025 at 4:47 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
> Alex and Kairui reported some issues (system hang or data corruption) when
> swapping out or swapping in large shmem folios. This is especially easy to
> reproduce when the tmpfs is mount with the 'huge=within_size' parameter.
> Thanks to Kairui's reproducer, the issue can be easily replicated.
>
> The root cause of the problem is that swap readahead may asynchronously
> swap in order 0 folios into the swap cache, while the shmem mapping can
> still store large swap entries. Then an order 0 folio is inserted into
> the shmem mapping without splitting the large swap entry, which overwrites
> the original large swap entry, leading to data corruption.
>
> When getting a folio from the swap cache, we should split the large swap
> entry stored in the shmem mapping if the orders do not match, to fix this
> issue.
>
> Fixes: 809bc86517cc ("mm: shmem: support large folio swap out")
> Reported-by: Alex Xu (Hello71) <alex_y_xu@yahoo.ca>
> Reported-by: Kairui Song <ryncsn@gmail.com>
Maybe you can add a Closes:?
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
> mm/shmem.c | 31 +++++++++++++++++++++++++++----
> 1 file changed, 27 insertions(+), 4 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 4ea6109a8043..cebbac97a221 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2253,7 +2253,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> struct folio *folio = NULL;
> bool skip_swapcache = false;
> swp_entry_t swap;
> - int error, nr_pages;
> + int error, nr_pages, order, split_order;
>
> VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
> swap = radix_to_swp_entry(*foliop);
> @@ -2272,10 +2272,9 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>
> /* Look it up and read it in.. */
> folio = swap_cache_get_folio(swap, NULL, 0);
> + order = xa_get_order(&mapping->i_pages, index);
> if (!folio) {
> - int order = xa_get_order(&mapping->i_pages, index);
> bool fallback_order0 = false;
> - int split_order;
>
> /* Or update major stats only when swapin succeeds?? */
> if (fault_type) {
> @@ -2339,6 +2338,29 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> error = -ENOMEM;
> goto failed;
> }
> + } else if (order != folio_order(folio)) {
> + /*
> + * Swap readahead may swap in order 0 folios into swapcache
> + * asynchronously, while the shmem mapping can still stores
> + * large swap entries. In such cases, we should split the
> + * large swap entry to prevent possible data corruption.
> + */
> + split_order = shmem_split_large_entry(inode, index, swap, gfp);
> + if (split_order < 0) {
> + error = split_order;
> + goto failed;
> + }
> +
> + /*
> + * If the large swap entry has already been split, it is
> + * necessary to recalculate the new swap entry based on
> + * the old order alignment.
> + */
> + if (split_order > 0) {
> + pgoff_t offset = index - round_down(index, 1 << split_order);
> +
> + swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
> + }
> }
>
> alloced:
> @@ -2346,7 +2368,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> folio_lock(folio);
> if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
> folio->swap.val != swap.val ||
> - !shmem_confirm_swap(mapping, index, swap)) {
> + !shmem_confirm_swap(mapping, index, swap) ||
> + xa_get_order(&mapping->i_pages, index) != folio_order(folio)) {
> error = -EEXIST;
> goto unlock;
> }
> --
> 2.43.5
>
Thanks for the fix, it works for me.
Tested-by: Kairui Song <kasong@tencent.com>
* Re: [PATCH] mm: shmem: fix potential data corruption during shmem swapin
2025-02-24 17:50 ` Kairui Song
@ 2025-02-25 1:07 ` Baolin Wang
0 siblings, 0 replies; 11+ messages in thread
From: Baolin Wang @ 2025-02-25 1:07 UTC (permalink / raw)
To: Kairui Song
Cc: akpm, alex_y_xu, baohua, da.gomez, david, hughd, ioworker0,
linux-kernel, linux-mm, ryan.roberts, wangkefeng.wang, willy,
ziy
On 2025/2/25 01:50, Kairui Song wrote:
> On Mon, Feb 24, 2025 at 4:47 PM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
>>
>> Alex and Kairui reported some issues (system hang or data corruption) when
>> swapping out or swapping in large shmem folios. This is especially easy to
>> reproduce when the tmpfs is mount with the 'huge=within_size' parameter.
>> Thanks to Kairui's reproducer, the issue can be easily replicated.
>>
>> The root cause of the problem is that swap readahead may asynchronously
>> swap in order 0 folios into the swap cache, while the shmem mapping can
>> still store large swap entries. Then an order 0 folio is inserted into
>> the shmem mapping without splitting the large swap entry, which overwrites
>> the original large swap entry, leading to data corruption.
>>
>> When getting a folio from the swap cache, we should split the large swap
>> entry stored in the shmem mapping if the orders do not match, to fix this
>> issue.
>>
>> Fixes: 809bc86517cc ("mm: shmem: support large folio swap out")
>> Reported-by: Alex Xu (Hello71) <alex_y_xu@yahoo.ca>
>> Reported-by: Kairui Song <ryncsn@gmail.com>
>
> Maybe you can add a Closes:?
Yes. Hope Andrew can help add this:
Closes: https://lore.kernel.org/all/1738717785.im3r5g2vxc.none@localhost/
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>> mm/shmem.c | 31 +++++++++++++++++++++++++++----
>> 1 file changed, 27 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index 4ea6109a8043..cebbac97a221 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -2253,7 +2253,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>> struct folio *folio = NULL;
>> bool skip_swapcache = false;
>> swp_entry_t swap;
>> - int error, nr_pages;
>> + int error, nr_pages, order, split_order;
>>
>> VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
>> swap = radix_to_swp_entry(*foliop);
>> @@ -2272,10 +2272,9 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>>
>> /* Look it up and read it in.. */
>> folio = swap_cache_get_folio(swap, NULL, 0);
>> + order = xa_get_order(&mapping->i_pages, index);
>> if (!folio) {
>> - int order = xa_get_order(&mapping->i_pages, index);
>> bool fallback_order0 = false;
>> - int split_order;
>>
>> /* Or update major stats only when swapin succeeds?? */
>> if (fault_type) {
>> @@ -2339,6 +2338,29 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>> error = -ENOMEM;
>> goto failed;
>> }
>> + } else if (order != folio_order(folio)) {
>> + /*
>> + * Swap readahead may swap in order 0 folios into swapcache
>> + * asynchronously, while the shmem mapping can still stores
>> + * large swap entries. In such cases, we should split the
>> + * large swap entry to prevent possible data corruption.
>> + */
>> + split_order = shmem_split_large_entry(inode, index, swap, gfp);
>> + if (split_order < 0) {
>> + error = split_order;
>> + goto failed;
>> + }
>> +
>> + /*
>> + * If the large swap entry has already been split, it is
>> + * necessary to recalculate the new swap entry based on
>> + * the old order alignment.
>> + */
>> + if (split_order > 0) {
>> + pgoff_t offset = index - round_down(index, 1 << split_order);
>> +
>> + swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
>> + }
>> }
>>
>> alloced:
>> @@ -2346,7 +2368,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>> folio_lock(folio);
>> if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
>> folio->swap.val != swap.val ||
>> - !shmem_confirm_swap(mapping, index, swap)) {
>> + !shmem_confirm_swap(mapping, index, swap) ||
>> + xa_get_order(&mapping->i_pages, index) != folio_order(folio)) {
>> error = -EEXIST;
>> goto unlock;
>> }
>> --
>> 2.43.5
>>
>
> Thanks for the fix, it works for me.
>
> Tested-by: Kairui Song <kasong@tencent.com>
Thanks for testing :)