On 18-Jul-24 5:41 PM, Mateusz Guzik wrote:
> On Thu, Jul 18, 2024 at 11:00 AM Bharata B Rao wrote:
>>
>> On 17-Jul-24 4:59 PM, Mateusz Guzik wrote:
>>> As for clear_shadow_entry mentioned in the opening mail, the content is:
>>>         spin_lock(&mapping->host->i_lock);
>>>         xa_lock_irq(&mapping->i_pages);
>>>         __clear_shadow_entry(mapping, index, entry);
>>>         xa_unlock_irq(&mapping->i_pages);
>>>         if (mapping_shrinkable(mapping))
>>>                 inode_add_lru(mapping->host);
>>>         spin_unlock(&mapping->host->i_lock);
>>>
>>> so for all I know it's all about the xarray thing, not the i_lock per se.
>>
>> The soft lockup signature has _raw_spin_lock and not _raw_spin_lock_irq
>> and hence concluded it to be i_lock.
>
> I'm not disputing it was i_lock. I am claiming that the i_pages is
> taken immediately after and it may be that in your workload this is
> the thing with the actual contention problem, making i_lock a red
> herring.
>
> I tried to match up offsets to my own kernel binary, but things went haywire.
>
> Can you please resolve a bunch of symbols, like this:
> ./scripts/faddr2line vmlinux clear_shadow_entry+92
>
> and then paste the source code from reported lines? (I presume you are
> running with some local patches, so opening relevant files in my repo
> may still give bogus results)
>
> Addresses are: clear_shadow_entry+92 __remove_mapping+98 __filemap_add_folio+332

clear_shadow_entry+92

$ ./scripts/faddr2line vmlinux clear_shadow_entry+92
clear_shadow_entry+92/0x180:
spin_lock_irq at include/linux/spinlock.h:376
(inlined by) clear_shadow_entry at mm/truncate.c:51

 42 static void clear_shadow_entry(struct address_space *mapping,
 43                                struct folio_batch *fbatch, pgoff_t *indices)
 44 {
 45         int i;
 46
 47         if (shmem_mapping(mapping) || dax_mapping(mapping))
 48                 return;
 49
 50         spin_lock(&mapping->host->i_lock);
 51         xa_lock_irq(&mapping->i_pages);

__remove_mapping+98

$ ./scripts/faddr2line vmlinux __remove_mapping+98
__remove_mapping+98/0x230:
spin_lock_irq at include/linux/spinlock.h:376
(inlined by) __remove_mapping at mm/vmscan.c:695

684 static int __remove_mapping(struct address_space *mapping, struct folio *folio,
685                             bool reclaimed, struct mem_cgroup *target_memcg)
686 {
687         int refcount;
688         void *shadow = NULL;
689
690         BUG_ON(!folio_test_locked(folio));
691         BUG_ON(mapping != folio_mapping(folio));
692
693         if (!folio_test_swapcache(folio))
694                 spin_lock(&mapping->host->i_lock);
695         xa_lock_irq(&mapping->i_pages);

__filemap_add_folio+332

$ ./scripts/faddr2line vmlinux __filemap_add_folio+332
__filemap_add_folio+332/0x480:
spin_lock_irq at include/linux/spinlock.h:377
(inlined by) __filemap_add_folio at mm/filemap.c:878

851 noinline int __filemap_add_folio(struct address_space *mapping,
852                 struct folio *folio, pgoff_t index, gfp_t gfp, void **shadowp)
853 {
854         XA_STATE(xas, &mapping->i_pages, index);
...
874         for (;;) {
875                 int order = -1, split_order = 0;
876                 void *entry, *old = NULL;
877
878                 xas_lock_irq(&xas);
879                 xas_for_each_conflict(&xas, entry) {

>
> Most notably in __remove_mapping i_lock is conditional:
>         if (!folio_test_swapcache(folio))
>                 spin_lock(&mapping->host->i_lock);
>         xa_lock_irq(&mapping->i_pages);
>
> and the disasm of the offset in my case does not match either acquire.
> For all I know i_lock in this routine is *not* taken and all the
> queued up __remove_mapping callers increase i_lock -> i_pages wait
> times in clear_shadow_entry.

So the first two are on the i_pages lock and the last one is the xa_lock.
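FWIW, both spellings appear to land on the same spinlock, the one embedded in the i_pages xarray. A rough sketch of the relevant macros as I read them in mainline include/linux/xarray.h (quoted from memory, so worth double-checking against your tree):

#define xa_lock_irq(xa)         spin_lock_irq(&(xa)->xa_lock)
#define xas_lock_irq(xas)       xa_lock_irq((xas)->xa)

clear_shadow_entry() and __remove_mapping() go through the first macro on &mapping->i_pages directly, while __filemap_add_folio() goes through the second one on an xa_state set up by XA_STATE(xas, &mapping->i_pages, index), so all three offsets are waiting on mapping->i_pages.xa_lock. That would fit your reading that i_lock is mostly queueing behind i_pages rather than being the contended lock itself.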
>
> To my cursory reading i_lock in clear_shadow_entry can be hacked away
> with some effort, but should this happen the contention is going to
> shift to i_pages presumably with more soft lockups (except on that
> lock). I am not convinced messing with it is justified. From looking
> at other places the i_lock is not a problem in other spots fwiw.
>
> All that said even if it is i_lock in both cases *and* someone whacks
> it, the mm folk should look into what happens when (maybe i_lock ->)
> i_pages lock is held. To that end perhaps you could provide a
> flamegraph or output of perf record -a -g, I don't know what's
> preferred.

I have attached the flamegraph, but note that it is from the kernel that has been running with all the accumulated fixes so far. The original flamegraph (from the kernel without those fixes) did show considerable time spent in native_queued_spin_lock_slowpath, but unfortunately I am unable to locate it now.

Regards,
Bharata.
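P.S. In case you want to compare against a capture on your side, an equivalent flamegraph can be produced with something along these lines (assuming perf plus Brendan Gregg's FlameGraph scripts from https://github.com/brendangregg/FlameGraph; the 60-second window and output file name are just placeholders):

# system-wide capture with call graphs while the workload is running
$ perf record -a -g -- sleep 60
# fold the recorded stacks and render the SVG
$ perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > flamegraph.svg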