I was unable to reproduce the problem but I'll forward this on to my user and see if they can test it. I imagine the users would prefer it backported though they have worked around the problem by turning off splicing. On Wed, Sep 14, 2016 at 10:31 AM, Johannes Weiner wrote: > Hi Miklos, > > On Tue, Sep 13, 2016 at 10:42:17AM +0200, Miklos Szeredi wrote: > > Fuse allows pages to be spliced into the page cache when reading the > > file. It does this with replace_page_cache_page(), which is an atomic > > version of delete_from_page_cache()+add_to_page_cache(). > > > > Fuse is the only user of replace_page_cache_page(), so I imagine bugs > > can more easily escape notice than the more commonly used variants. > > > > Could you please take a look at this function. "git blame" shows that > > it's older than the add/remove variants, but I haven't gone into the > > details. > > Indeed, replace_page_cache_page() uses a properly accounted deletion > of the old page followed by a raw, untracked radix_tree_insert(). It > would lead to an underflow that triggers the page counter assertion. > > Thanks for the pointer, Miklos. This has been broken for a while. > > Antonio, does the following patch resolve the issue for you? It > applies to the head of Linus's tree, let me know if you need it > backported to a different base. > > --- > > From 3a2bb511f5e04019ccc487ef995b94700db172e7 Mon Sep 17 00:00:00 2001 > From: Johannes Weiner > Date: Wed, 14 Sep 2016 09:50:42 -0400 > Subject: [PATCH] mm: workingset: fix shadow node leak in > replace_page_cache_page() > > Antonio reports the following crash when using fuse under memory > pressure: > > [25192.515454] kernel BUG at /build/linux-a2WvEb/linux-4.4. > 0/mm/workingset.c:346! > [25192.517521] invalid opcode: 0000 [#1] SMP > [25192.519602] Modules linked in: netconsole ip6t_REJECT nf_reject_ipv6 > ipt_REJECT nf_reject_ipv4 configfs binfmt_misc veth bridge stp llc > nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack ip6table_filter ip6_tables > xt_multiport iptable_filter ipt_MASQUERADE nf_nat_masquerade_ipv4 > xt_comment xt_nat iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 > nf_nat nf_conntrack xt_CHECKSUM xt_tcpudp iptable_mangle ip_tables x_tables > intel_rapl x86_pkg_temp_thermal intel_powerclamp eeepc_wmi asus_wmi > coretemp sparse_keymap kvm_intel ppdev kvm irqbypass mei_me 8250_fintek > input_leds serio_raw parport_pc tpm_infineon mei shpchp mac_hid parport > lpc_ich autofs4 drbg ansi_cprng dm_crypt algif_skcipher af_alg btrfs > raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor > raid6_pq libcrc32c raid0 multipath linear raid10 raid1 i915 > crct10dif_pclmul crc32_pclmul aesni_intel i2c_algo_bit aes_x86_64 > drm_kms_helper lrw gf128mul glue_helper ablk_helper syscopyarea cryptd > sysfillrect sysimgblt fb_sys_fops drm ahci r8169 libahci mii wmi fjes video > [last unloaded: netconsole] > [25192.540910] CPU: 2 PID: 63 Comm: kswapd0 Not tainted 4.4.0-36-generic > #55-Ubuntu > [25192.543411] Hardware name: System manufacturer System Product > Name/P8H67-M PRO, BIOS 3904 04/27/2013 > [25192.545840] task: ffff88040cae6040 ti: ffff880407488000 task.ti: > ffff880407488000 > [25192.548277] RIP: 0010:[] [] > shadow_lru_isolate+0x181/0x190 > [25192.550706] RSP: 0018:ffff88040748bbe0 EFLAGS: 00010002 > [25192.553127] RAX: 0000000000001c81 RBX: ffff8802f91ee928 RCX: > ffff8802f91eeb38 > [25192.555544] RDX: ffff8802f91ee938 RSI: ffff8802f91ee928 RDI: > ffff8804099ba2c0 > [25192.557914] RBP: ffff88040748bc08 R08: 000000000001a7b6 R09: > 000000000000003f > [25192.560237] R10: 000000000001a750 R11: 0000000000000000 R12: > ffff8804099ba2c0 > [25192.562512] R13: ffff8803157e9680 R14: ffff8803157e9668 R15: > ffff8804099ba2c8 > [25192.564724] FS: 0000000000000000(0000) GS:ffff88041f280000(0000) > knlGS:0000000000000000 > [25192.566990] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [25192.569201] CR2: 00007ffabb690000 CR3: 0000000001e0a000 CR4: > 00000000000406e0 > [25192.571419] Stack: > [25192.573550] ffff8804099ba2c0 ffff88039e4f86f0 ffff8802f91ee928 > ffff8804099ba2c8 > [25192.575695] ffff88040748bd08 ffff88040748bc58 ffffffff811b99bf > 0000000000000052 > [25192.577814] 0000000000000000 ffffffff811ba380 000000000000008a > 0000000000000080 > [25192.579947] Call Trace: > [25192.582022] [] __list_lru_walk_one.isra.3+0x8f/0x130 > [25192.584137] [] ? memcg_drain_all_list_lrus+ > 0x190/0x190 > [25192.586165] [] list_lru_walk_one+0x23/0x30 > [25192.588145] [] scan_shadow_nodes+0x34/0x50 > [25192.590074] [] shrink_slab.part.40+0x1ed/0x3d0 > [25192.591985] [] shrink_zone+0x2ca/0x2e0 > [25192.593863] [] kswapd+0x51e/0x990 > [25192.595737] [] ? mem_cgroup_shrink_node_zone+ > 0x1c0/0x1c0 > [25192.597613] [] kthread+0xd8/0xf0 > [25192.599495] [] ? kthread_create_on_node+0x1e0/0x1e0 > [25192.601335] [] ret_from_fork+0x3f/0x70 > [25192.603193] [] ? kthread_create_on_node+0x1e0/0x1e0 > [25192.605083] Code: 8d 7e 08 4c 89 fe e8 4f cc 23 00 84 c0 74 20 4c 89 ef > c6 07 00 66 66 66 90 bb 01 00 00 00 e9 c5 fe ff ff 0f 0b 0f 0b 0f 0b 0f 0b > <0f> 0b 0f 0b 0f 0b 66 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 > [25192.609252] RIP [] shadow_lru_isolate+0x181/0x190 > [25192.611304] RSP > > which corresponds to the following sanity check in the shadow node > tracking: > > BUG_ON(node->count & RADIX_TREE_COUNT_MASK); > > The workingset code tracks radix tree nodes that exclusively contain > shadow entries of evicted pages in them, and this (somewhat obscure) > checks if there are real pages left that would interfere with reclaim > of the radix tree node under memory pressure. > > Discussing ways of how fuse might sneak pages into the radix tree past > the workingset code, Miklos pointed to replace_page_cache_page(), and > indeed there is a problem there: it properly accounts for the old page > being removed (__delete_from_page_cache() does that), but then does a > raw raw radix_tree_insert(), not accounting for the replacement page; > the page counter bits in node->count eventually underflow. > > To address this, make sure replace_page_cache_page() uses the tracked > page insertion code, page_cache_tree_insert(). > > Also, make the sanity checks a bit less obscure by using the helpers > for checking the number of pages and shadows in a radix tree node. > > Fixes: 449dd6984d0e ("mm: keep page cache radix tree nodes in check") > Cc: stable@vger.kernel.org # 3.15+ > Reported-by: Antonio SJ Musumeci > Debugged-by: Miklos Szeredi > Signed-off-by: Johannes Weiner > --- > include/linux/swap.h | 2 + > mm/filemap.c | 114 +++++++++++++++++++++++++----- > --------------------- > mm/workingset.c | 10 ++--- > 3 files changed, 63 insertions(+), 63 deletions(-) > > diff --git a/include/linux/swap.h b/include/linux/swap.h > index b17cc4830fa6..4a529c984a3f 100644 > --- a/include/linux/swap.h > +++ b/include/linux/swap.h > @@ -257,6 +257,7 @@ static inline void workingset_node_pages_inc(struct > radix_tree_node *node) > > static inline void workingset_node_pages_dec(struct radix_tree_node > *node) > { > + VM_BUG_ON(!workingset_node_pages(node)); > node->count--; > } > > @@ -272,6 +273,7 @@ static inline void workingset_node_shadows_inc(struct > radix_tree_node *node) > > static inline void workingset_node_shadows_dec(struct radix_tree_node > *node) > { > + VM_BUG_ON(!workingset_node_shadows(node)); > node->count -= 1U << RADIX_TREE_COUNT_SHIFT; > } > > diff --git a/mm/filemap.c b/mm/filemap.c > index 8a287dfc5372..2d0986a64f1f 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -110,6 +110,62 @@ > * ->tasklist_lock (memory_failure, collect_procs_ao) > */ > > +static int page_cache_tree_insert(struct address_space *mapping, > + struct page *page, void **shadowp) > +{ > + struct radix_tree_node *node; > + void **slot; > + int error; > + > + error = __radix_tree_create(&mapping->page_tree, page->index, 0, > + &node, &slot); > + if (error) > + return error; > + if (*slot) { > + void *p; > + > + p = radix_tree_deref_slot_protected(slot, > &mapping->tree_lock); > + if (!radix_tree_exceptional_entry(p)) > + return -EEXIST; > + > + mapping->nrexceptional--; > + if (!dax_mapping(mapping)) { > + if (shadowp) > + *shadowp = p; > + if (node) > + workingset_node_shadows_dec(node); > + } else { > + /* DAX can replace empty locked entry with a hole > */ > + WARN_ON_ONCE(p != > + (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | > + RADIX_DAX_ENTRY_LOCK)); > + /* DAX accounts exceptional entries as normal > pages */ > + if (node) > + workingset_node_pages_dec(node); > + /* Wakeup waiters for exceptional entry lock */ > + dax_wake_mapping_entry_waiter(mapping, > page->index, > + false); > + } > + } > + radix_tree_replace_slot(slot, page); > + mapping->nrpages++; > + if (node) { > + workingset_node_pages_inc(node); > + /* > + * Don't track node that contains actual pages. > + * > + * Avoid acquiring the list_lru lock if already > + * untracked. The list_empty() test is safe as > + * node->private_list is protected by > + * mapping->tree_lock. > + */ > + if (!list_empty(&node->private_list)) > + list_lru_del(&workingset_shadow_nodes, > + &node->private_list); > + } > + return 0; > +} > + > static void page_cache_tree_delete(struct address_space *mapping, > struct page *page, void *shadow) > { > @@ -561,7 +617,7 @@ int replace_page_cache_page(struct page *old, struct > page *new, gfp_t gfp_mask) > > spin_lock_irqsave(&mapping->tree_lock, flags); > __delete_from_page_cache(old, NULL); > - error = radix_tree_insert(&mapping->page_tree, offset, > new); > + error = page_cache_tree_insert(mapping, new, NULL); > BUG_ON(error); > mapping->nrpages++; > > @@ -584,62 +640,6 @@ int replace_page_cache_page(struct page *old, struct > page *new, gfp_t gfp_mask) > } > EXPORT_SYMBOL_GPL(replace_page_cache_page); > > -static int page_cache_tree_insert(struct address_space *mapping, > - struct page *page, void **shadowp) > -{ > - struct radix_tree_node *node; > - void **slot; > - int error; > - > - error = __radix_tree_create(&mapping->page_tree, page->index, 0, > - &node, &slot); > - if (error) > - return error; > - if (*slot) { > - void *p; > - > - p = radix_tree_deref_slot_protected(slot, > &mapping->tree_lock); > - if (!radix_tree_exceptional_entry(p)) > - return -EEXIST; > - > - mapping->nrexceptional--; > - if (!dax_mapping(mapping)) { > - if (shadowp) > - *shadowp = p; > - if (node) > - workingset_node_shadows_dec(node); > - } else { > - /* DAX can replace empty locked entry with a hole > */ > - WARN_ON_ONCE(p != > - (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | > - RADIX_DAX_ENTRY_LOCK)); > - /* DAX accounts exceptional entries as normal > pages */ > - if (node) > - workingset_node_pages_dec(node); > - /* Wakeup waiters for exceptional entry lock */ > - dax_wake_mapping_entry_waiter(mapping, > page->index, > - false); > - } > - } > - radix_tree_replace_slot(slot, page); > - mapping->nrpages++; > - if (node) { > - workingset_node_pages_inc(node); > - /* > - * Don't track node that contains actual pages. > - * > - * Avoid acquiring the list_lru lock if already > - * untracked. The list_empty() test is safe as > - * node->private_list is protected by > - * mapping->tree_lock. > - */ > - if (!list_empty(&node->private_list)) > - list_lru_del(&workingset_shadow_nodes, > - &node->private_list); > - } > - return 0; > -} > - > static int __add_to_page_cache_locked(struct page *page, > struct address_space *mapping, > pgoff_t offset, gfp_t gfp_mask, > diff --git a/mm/workingset.c b/mm/workingset.c > index 69551cfae97b..617475f529f4 100644 > --- a/mm/workingset.c > +++ b/mm/workingset.c > @@ -418,21 +418,19 @@ static enum lru_status shadow_lru_isolate(struct > list_head *item, > * no pages, so we expect to be able to remove them all and > * delete and free the empty node afterwards. > */ > - > - BUG_ON(!node->count); > - BUG_ON(node->count & RADIX_TREE_COUNT_MASK); > + BUG_ON(!workingset_node_shadows(node)); > + BUG_ON(workingset_node_pages(node)); > > for (i = 0; i < RADIX_TREE_MAP_SIZE; i++) { > if (node->slots[i]) { > BUG_ON(!radix_tree_exceptional_entry(node->slots[ > i])); > node->slots[i] = NULL; > - BUG_ON(node->count < (1U << > RADIX_TREE_COUNT_SHIFT)); > - node->count -= 1U << RADIX_TREE_COUNT_SHIFT; > + workingset_node_shadows_dec(node); > BUG_ON(!mapping->nrexceptional); > mapping->nrexceptional--; > } > } > - BUG_ON(node->count); > + BUG_ON(workingset_node_shadows(node)); > inc_node_state(page_pgdat(virt_to_page(node)), > WORKINGSET_NODERECLAIM); > if (!__radix_tree_delete_node(&mapping->page_tree, node)) > BUG(); > -- > 2.9.3 >