On Tue, Mar 30, 2021 at 12:57 AM Huang, Ying wrote: > > Yu Zhao writes: > > > On Mon, Mar 29, 2021 at 9:44 PM Huang, Ying wrote: > >> > >> Miaohe Lin writes: > >> > >> > On 2021/3/30 9:57, Huang, Ying wrote: > >> >> Hi, Miaohe, > >> >> > >> >> Miaohe Lin writes: > >> >> > >> >>> Hi all, > >> >>> I am investigating the swap code, and I found the below possible race window: > >> >>> > >> >>> CPU 1 CPU 2 > >> >>> ----- ----- > >> >>> do_swap_page > >> >>> skip swapcache case (synchronous swap_readpage) > >> >>> alloc_page_vma > >> >>> swapoff > >> >>> release swap_file, bdev, or ... > >> >>> swap_readpage > >> >>> check sis->flags is ok > >> >>> access swap_file, bdev or ...[oops!] > >> >>> si->flags = 0 > >> >>> > >> >>> The swapcache case is ok because swapoff will wait on the page_lock of swapcache page. > >> >>> Is this will really happen or Am I miss something ? > >> >>> Any reply would be really grateful. Thanks! :) > >> >> > >> >> This appears possible. Even for swapcache case, we can't guarantee the > >> > > >> > Many thanks for reply! > >> > > >> >> swap entry gotten from the page table is always valid too. The > >> > > >> > The page table may change at any time. And we may thus do some useless work. > >> > But the pte_same() check could handle these races correctly if these do not > >> > result in oops. > >> > > >> >> underlying swap device can be swapped off at the same time. So we use > >> >> get/put_swap_device() for that. Maybe we need similar stuff here. > >> > > >> > Using get/put_swap_device() to guard against swapoff for swap_readpage() sounds > >> > really bad as swap_readpage() may take really long time. Also such race may not be > >> > really hurtful because swapoff is usually done when system shutdown only. > >> > I can not figure some simple and stable stuff out to fix this. Any suggestions or > >> > could anyone help get rid of such race? > >> > >> Some reference counting on the swap device can prevent swap device from > >> swapping-off. To reduce the performance overhead on the hot-path as > >> much as possible, it appears we can use the percpu_ref. > > > > Hi, > > > > I've been seeing crashes when testing the latest kernels with > > stress-ng --class vm -a 20 -t 600s --temp-path /tmp > > > > I haven't had time to look into them yet: > > > > DEBUG_VM: > > BUG: unable to handle page fault for address: ffff905c33c9a000 > > Call Trace: > > get_swap_pages+0x278/0x590 > > get_swap_page+0x1ab/0x280 > > add_to_swap+0x7d/0x130 > > shrink_page_list+0xf84/0x25f0 > > reclaim_pages+0x313/0x430 > > madvise_cold_or_pageout_pte_range+0x95c/0xaa0 > > If my understanding were correct, two bugs are reported? One above and > one below? If so, and the above one is reported firstly. Can you share > the full bug message reported in dmesg? No, they are from two different kernel configs. I saw the first crash and didn't know what to look. So I turned on KASAN to see if it gives more clue. Unfortunately I haven't had time to spend more time on it. > Can you convert the call trace to source line? And the commit of the > kernel? Or the full kconfig? So I can build it by myself. It seems to be very reproducible if you enable these three options, on 5.12, 5.11, 5.10 which is where I gave up trying. > > CONFIG_MEMCG_SWAP=y > > CONFIG_THP_SWAP=y > > CONFIG_ZSWAP=y I'll dig into the log and see if I could at least give you the line numbers. Kernel config attached. Thanks! And the command line I used, which is nothing fancy: > > stress-ng --class vm -a 20 -t 600s --temp-path /tmp > > KASAN: > > ================================================================== > > BUG: KASAN: slab-out-of-bounds in __frontswap_store+0xc9/0x2e0 > > Read of size 8 at addr ffff88901f646f18 by task stress-ng-mrema/31329 > > CPU: 2 PID: 31329 Comm: stress-ng-mrema Tainted: G S I L > > 5.12.0-smp-DEV #2 > > Call Trace: > > dump_stack+0xff/0x165 > > print_address_description+0x81/0x390 > > __kasan_report+0x154/0x1b0 > > ? __frontswap_store+0xc9/0x2e0 > > ? __frontswap_store+0xc9/0x2e0 > > kasan_report+0x47/0x60 > > kasan_check_range+0x2f3/0x340 > > __kasan_check_read+0x11/0x20 > > __frontswap_store+0xc9/0x2e0 > > swap_writepage+0x52/0x80 > > pageout+0x489/0x7f0 > > shrink_page_list+0x1b11/0x2c90 > > reclaim_pages+0x6ca/0x930 > > madvise_cold_or_pageout_pte_range+0x1260/0x13a0 > > > > Allocated by task 16813: > > ____kasan_kmalloc+0xb0/0xe0 > > __kasan_kmalloc+0x9/0x10 > > __kmalloc_node+0x52/0x70 > > kvmalloc_node+0x50/0x90 > > __se_sys_swapon+0x353a/0x4860 > > __x64_sys_swapon+0x5b/0x70 > > > > The buggy address belongs to the object at ffff88901f640000 > > which belongs to the cache kmalloc-32k of size 32768 > > The buggy address is located 28440 bytes inside of > > 32768-byte region [ffff88901f640000, ffff88901f648000) > > The buggy address belongs to the page: > > page:0000000032d23e33 refcount:1 mapcount:0 mapping:0000000000000000 > > index:0x0 pfn:0x101f640 > > head:0000000032d23e33 order:4 compound_mapcount:0 compound_pincount:0 > > flags: 0x400000000010200(slab|head) > > raw: 0400000000010200 ffffea00062b8408 ffffea000a6e9008 ffff888100040300 > > raw: 0000000000000000 ffff88901f640000 0000000100000001 000000000000000 > > page dumped because: kasan: bad access detected > > > > Memory state around the buggy address: > > ffff88901f646e00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > ffff88901f646e80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > >ffff88901f646f00: 00 00 00 fc fc fc fc fc fc fc fc fc fc fc fc fc > > ^ > > ffff88901f646f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc > > ffff88901f647000: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc > > ================================================================== > > > > Relevant config options I could think of: > > > > CONFIG_MEMCG_SWAP=y > > CONFIG_THP_SWAP=y > > CONFIG_ZSWAP=y