On 18-Jul-24 5:41 PM, Mateusz Guzik wrote:
> On Thu, Jul 18, 2024 at 11:00 AM Bharata B Rao wrote:
>>
>> On 17-Jul-24 4:59 PM, Mateusz Guzik wrote:
>>> As for clear_shadow_entry mentioned in the opening mail, the content is:
>>>         spin_lock(&mapping->host->i_lock);
>>>         xa_lock_irq(&mapping->i_pages);
>>>         __clear_shadow_entry(mapping, index, entry);
>>>         xa_unlock_irq(&mapping->i_pages);
>>>         if (mapping_shrinkable(mapping))
>>>                 inode_add_lru(mapping->host);
>>>         spin_unlock(&mapping->host->i_lock);
>>>
>>> so for all I know it's all about the xarray thing, not the i_lock per se.
>>
>> The soft lockup signature has _raw_spin_lock and not _raw_spin_lock_irq
>> and hence concluded it to be i_lock.
>
> I'm not disputing it was i_lock. I am claiming that the i_pages is
> taken immediately after and it may be that in your workload this is
> the thing with the actual contention problem, making i_lock a red
> herring.
>
> I tried to match up offsets to my own kernel binary, but things went haywire.
>
> Can you please resolve a bunch of symbols, like this:
> ./scripts/faddr2line vmlinux clear_shadow_entry+92
>
> and then paste the source code from reported lines? (I presume you are
> running with some local patches, so opening relevant files in my repo
> may still give bogus results)
>
> Addresses are: clear_shadow_entry+92 __remove_mapping+98 __filemap_add_folio+332

clear_shadow_entry+92

$ ./scripts/faddr2line vmlinux clear_shadow_entry+92
clear_shadow_entry+92/0x180:
spin_lock_irq at include/linux/spinlock.h:376
(inlined by) clear_shadow_entry at mm/truncate.c:51

 42 static void clear_shadow_entry(struct address_space *mapping,
 43                                struct folio_batch *fbatch, pgoff_t *indices)
 44 {
 45         int i;
 46
 47         if (shmem_mapping(mapping) || dax_mapping(mapping))
 48                 return;
 49
 50         spin_lock(&mapping->host->i_lock);
 51         xa_lock_irq(&mapping->i_pages);

__remove_mapping+98

$ ./scripts/faddr2line vmlinux __remove_mapping+98
__remove_mapping+98/0x230:
spin_lock_irq at include/linux/spinlock.h:376
(inlined by) __remove_mapping at mm/vmscan.c:695

684 static int __remove_mapping(struct address_space *mapping, struct folio *folio,
685                             bool reclaimed, struct mem_cgroup *target_memcg)
686 {
687         int refcount;
688         void *shadow = NULL;
689
690         BUG_ON(!folio_test_locked(folio));
691         BUG_ON(mapping != folio_mapping(folio));
692
693         if (!folio_test_swapcache(folio))
694                 spin_lock(&mapping->host->i_lock);
695         xa_lock_irq(&mapping->i_pages);

__filemap_add_folio+332

$ ./scripts/faddr2line vmlinux __filemap_add_folio+332
__filemap_add_folio+332/0x480:
spin_lock_irq at include/linux/spinlock.h:377
(inlined by) __filemap_add_folio at mm/filemap.c:878

851 noinline int __filemap_add_folio(struct address_space *mapping,
852                 struct folio *folio, pgoff_t index, gfp_t gfp, void **shadowp)
853 {
854         XA_STATE(xas, &mapping->i_pages, index);
...
874         for (;;) {
875                 int order = -1, split_order = 0;
876                 void *entry, *old = NULL;
877
878                 xas_lock_irq(&xas);
879                 xas_for_each_conflict(&xas, entry) {

>
> Most notably in __remove_mapping i_lock is conditional:
>         if (!folio_test_swapcache(folio))
>                 spin_lock(&mapping->host->i_lock);
>         xa_lock_irq(&mapping->i_pages);
>
> and the disasm of the offset in my case does not match either acquire.
> For all I know i_lock in this routine is *not* taken and all the
> queued up __remove_mapping callers increase i_lock -> i_pages wait
> times in clear_shadow_entry.

So the first two are on the i_pages lock and the last one is the xa_lock.
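FWIW, both spellings appear to land on the same spinlock, the one embedded in the i_pages xarray. A rough sketch of the relevant macros as I read them in mainline include/linux/xarray.h (quoted from memory, so worth double-checking against your tree):

#define xa_lock_irq(xa)         spin_lock_irq(&(xa)->xa_lock)
#define xas_lock_irq(xas)       xa_lock_irq((xas)->xa)

clear_shadow_entry() and __remove_mapping() go through the first macro on &mapping->i_pages directly, while __filemap_add_folio() goes through the second one on an xa_state set up by XA_STATE(xas, &mapping->i_pages, index), so all three offsets are waiting on mapping->i_pages.xa_lock. That would fit your reading that i_lock is mostly queueing behind i_pages rather than being the contended lock itself.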
>
> To my cursory reading i_lock in clear_shadow_entry can be hacked away
> with some effort, but should this happen the contention is going to
> shift to i_pages presumably with more soft lockups (except on that
> lock). I am not convinced messing with it is justified. From looking
> at other places the i_lock is not a problem in other spots fwiw.
>
> All that said even if it is i_lock in both cases *and* someone whacks
> it, the mm folk should look into what happens when (maybe i_lock ->)
> i_pages lock is held. To that end perhaps you could provide a
> flamegraph or output of perf record -a -g, I don't know what's
> preferred.

I have attached the flamegraph, but note that it is from the kernel that has been running with all the accumulated fixes so far. The original flamegraph (from the kernel without those fixes) did show considerable time spent in native_queued_spin_lock_slowpath, but unfortunately I am unable to locate it now.

Regards,
Bharata.
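P.S. In case you want to compare against a capture on your side, an equivalent flamegraph can be produced with something along these lines (assuming perf plus Brendan Gregg's FlameGraph scripts from https://github.com/brendangregg/FlameGraph; the 60-second window and output file name are just placeholders):

# system-wide capture with call graphs while the workload is running
$ perf record -a -g -- sleep 60
# fold the recorded stacks and render the SVG
$ perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > flamegraph.svg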