Dear Andrew, dear Linux folks,


On a four-socket Dell PowerEdge R815 with AMD Opteron 6276 processors,
Linux 5.4.14 hit the kernel bug below. We have *not* been able to
reproduce it yet.

[10834.604899] ------------[ cut here ]------------
[10834.604906] kernel BUG at mm/vmscan.c:1740!
[10834.604917] invalid opcode: 0000 [#1] SMP NOPTI
[10834.609485] CPU: 46 PID: 409 Comm: kswapd3 Kdump: loaded Not tainted 5.4.14.mx64.317 #1
[10834.617505] Hardware name: Dell Inc. PowerEdge R815/0THJFH, BIOS 3.2.2 09/15/2014
[10834.625014] RIP: 0010:isolate_lru_pages+0x367/0x370
[10834.629904] Code: e9 53 4d 89 f8 41 54 48 8b 4c 24 18 8b 54 24 28 8b 74 24 40 e8 4a 3b c4 00 49 8b 06 48 83 c4 18 48 85 c0 75 d0 e9 c4 fe ff ff <0f> 0b e8 42 c0 ea ff 66 90 0f 1f 44 00 00 41 57 41 56 41 55 41 54
[10834.648716] RSP: 0018:ffffc9000d407ae0 EFLAGS: 00010082
[10834.653955] RAX: 00000000ffffffea RBX: ffffea00e0800008 RCX: ffff88bfdc595420
[10834.661103] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffea00e0800000
[10834.668253] RBP: ffffc9000d407de8 R08: ffffc9000d407de8 R09: 0000000000000020
[10834.675401] R10: 00000000f0000000 R11: 0000000000000000 R12: ffff88bfdc595420
[10834.682549] R13: 0000000000000001 R14: 0000000000000010 R15: 0000000000000010
[10834.689698] FS:  0000000000000000(0000) GS:ffff88bfdfb80000(0000) knlGS:0000000000000000
[10834.697802] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[10834.703559] CR2: 000000000060d348 CR3: 0000004fd3876000 CR4: 00000000000406e0
[10834.710709] Call Trace:
[10834.713170]  shrink_inactive_list+0x113/0x3d0
[10834.717543]  shrink_node_memcg+0x3c8/0x800
[10834.721655]  ? shrink_slab+0x295/0x2c0
[10834.725417]  ? shrink_slab+0x295/0x2c0
[10834.729179]  ? shrink_node+0xb6/0x420
[10834.732866]  shrink_node+0xb6/0x420
[10834.736367]  balance_pgdat+0x250/0x550
[10834.740130]  kswapd+0x15d/0x3f0
[10834.743286]  ? wait_woken+0x80/0x80
[10834.746785]  ? balance_pgdat+0x550/0x550
[10834.750724]  kthread+0x117/0x130
[10834.753968]  ? kthread_create_worker_on_cpu+0x70/0x70
[10834.759039]  ret_from_fork+0x22/0x40
[10834.762627] Modules linked in: nfsv4 nfs rpcsec_gss_krb5 ext4 mbcache jbd2 8021q garp stp mrp llc input_leds led_class mgag200 drm_vram_helper ttm kvm_amd drm_kms_helper kvm drm fb_sys_fops syscopyarea sysfillrect sysimgblt ixgbe irqbypass 3w_9xxx crc32c_intel acpi_cpufreq nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc ip_tables x_tables unix ipv6 nf_defrag_ipv6 autofs4
[10834.796386] ---[ end trace 28611096f6473c90 ]---

More traces follow in the log. Please find the full log attached.

Here is the code in question from `mm/vmscan.c`:

1699         while (scan < nr_to_scan && !list_empty(src)) {
1700                 struct page *page;
1701 
1702                 page = lru_to_page(src);
1703                 prefetchw_prev_lru_page(page, src, flags);
1704 
1705                 VM_BUG_ON_PAGE(!PageLRU(page), page);
1706 
1707                 nr_pages = compound_nr(page);
1708                 total_scan += nr_pages;
1709 
1710                 if (page_zonenum(page) > sc->reclaim_idx) {
1711                         list_move(&page->lru, &pages_skipped);
1712                         nr_skipped[page_zonenum(page)] += nr_pages;
1713                         continue;
1714                 }
1715 
1716                 /*
1717                  * Do not count skipped pages because that makes the function
1718                  * return with no isolated pages if the LRU mostly contains
1719                  * ineligible pages.  This causes the VM to not reclaim any
1720                  * pages, triggering a premature OOM.
1721                  *
1722                  * Account all tail pages of THP.  This would not cause
1723                  * premature OOM since __isolate_lru_page() returns -EBUSY
1724                  * only when the page is being freed somewhere else.
1725                  */
1726                 scan += nr_pages;
1727                 switch (__isolate_lru_page(page, mode)) {
1728                 case 0:
1729                         nr_taken += nr_pages;
1730                         nr_zone_taken[page_zonenum(page)] += nr_pages;
1731                         list_move(&page->lru, dst);
1732                         break;
1733 
1734                 case -EBUSY:
1735                         /* else it is being freed elsewhere */
1736                         list_move(&page->lru, src);
1737                         continue;
1738 
1739                 default:
1740                         BUG();
1741                 }
1742         }
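
One possible hint is in the register dump: RAX holds 00000000ffffffea.
Assuming RAX still contains the int return value of
__isolate_lru_page() at the point of the trap (an assumption, not
verified against the disassembly), the low 32 bits read as a signed
integer give -22, which is -EINVAL on Linux. That is neither 0 nor
-EBUSY, so the switch above would fall through to the default branch.
The small userspace sketch below only illustrates that decoding; the
helper names reg_low32_as_int() and isolate_branch() are made up for
this example and are not kernel functions.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: interpret the low 32 bits of a dumped 64-bit
 * register as a signed int, the way an int return value in EAX
 * would be read.
 */
static int reg_low32_as_int(uint64_t reg)
{
	uint32_t low = (uint32_t)reg;
	int32_t val;

	/* Copy the bit pattern instead of casting, to stay well defined. */
	memcpy(&val, &low, sizeof(val));
	return (int)val;
}

/*
 * Hypothetical helper: mirror the three branches of the switch in
 * isolate_lru_pages() and name the one a given return value selects.
 */
static const char *isolate_branch(int ret)
{
	switch (ret) {
	case 0:
		return "case 0";
	case -EBUSY:
		return "case -EBUSY";
	default:
		return "default (BUG)";
	}
}
```

Calling isolate_branch(reg_low32_as_int(0x00000000ffffffeaULL)) selects
the default branch, which would match the BUG() at line 1740 if the
assumption about RAX holds.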

We have not seen this with Linux versions up to 4.19.57, but we also
only just started using this system as a cluster node; before that,
it was used interactively.

Could this be a regression from 4.19.x to 5.4?


Kind regards,

Paul