From: Nico Pache <npache@redhat.com>
To: Dev Jain <dev.jain@arm.com>
Cc: akpm@linux-foundation.org, david@redhat.com, willy@infradead.org,
kirill.shutemov@linux.intel.com, ryan.roberts@arm.com,
anshuman.khandual@arm.com, catalin.marinas@arm.com,
cl@gentwo.org, vbabka@suse.cz, mhocko@suse.com,
apopple@nvidia.com, dave.hansen@linux.intel.com,
will@kernel.org, baohua@kernel.org, jack@suse.cz,
srivatsa@csail.mit.edu, haowenchao22@gmail.com,
hughd@google.com, aneesh.kumar@kernel.org,
yang@os.amperecomputing.com, peterx@redhat.com,
ioworker0@gmail.com, wangkefeng.wang@huawei.com, ziy@nvidia.com,
jglisse@google.com, surenb@google.com, vishal.moola@gmail.com,
zokeefe@google.com, zhengqi.arch@bytedance.com,
jhubbard@nvidia.com, 21cnbao@gmail.com, linux-mm@kvack.org,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2 00/17] khugepaged: Asynchronous mTHP collapse
Date: Fri, 14 Feb 2025 18:47:13 -0700
Message-ID: <CAA1CXcACBMTPN-y4-J6wgiTf90ia5576xOJKFULnvGPCt56U6w@mail.gmail.com>
In-Reply-To: <20250211111326.14295-1-dev.jain@arm.com>

Hi Dev,

I tried to run your kernel to get some performance numbers out of it,
but ran into the following issue while running my defer-mthp-test.sh
workload:

[ 297.393032] =====================================
[ 297.393618] WARNING: bad unlock balance detected!
[ 297.394201] 6.14.0-rc2mthpDEV #2 Not tainted
[ 297.394732] -------------------------------------
[ 297.395421] khugepaged/111 is trying to release lock (&mm->mmap_lock) at:
[ 297.396509] [<ffffffff947cb76a>] khugepaged+0x23a/0xb40
[ 297.397205] but there are no more locks to release!
[ 297.397865]
[ 297.397865] other info that might help us debug this:
[ 297.398684] no locks held by khugepaged/111.
[ 297.399155]
[ 297.399155] stack backtrace:
[ 297.399591] CPU: 10 UID: 0 PID: 111 Comm: khugepaged Not tainted 6.14.0-rc2mthpDEV #2
[ 297.399593] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-2.fc40 04/01/2014
[ 297.399595] Call Trace:
[ 297.399599] <TASK>
[ 297.399602] dump_stack_lvl+0x6e/0xa0
[ 297.399607] ? khugepaged+0x23a/0xb40
[ 297.399610] print_unlock_imbalance_bug.part.0+0xfb/0x110
[ 297.399612] ? khugepaged+0x23a/0xb40
[ 297.399614] lock_release+0x283/0x3f0
[ 297.399620] up_read+0x1b/0x30
[ 297.399622] khugepaged+0x23a/0xb40
[ 297.399631] ? __pfx_khugepaged+0x10/0x10
[ 297.399633] kthread+0xf2/0x240
[ 297.399636] ? __pfx_kthread+0x10/0x10
[ 297.399638] ret_from_fork+0x34/0x50
[ 297.399640] ? __pfx_kthread+0x10/0x10
[ 297.399642] ret_from_fork_asm+0x1a/0x30
[ 297.399649] </TASK>
[ 297.505555] ------------[ cut here ]------------
[ 297.506044] DEBUG_RWSEMS_WARN_ON(tmp < 0): count = 0xffffffffffffff00, magic = 0xffff8c6e03bc1f88, owner = 0x1, curr 0xffff8c6e0eccb700, list empty
[ 297.507362] WARNING: CPU: 8 PID: 1946 at kernel/locking/rwsem.c:1346 __up_read+0x1ba/0x220
[ 297.508220] Modules linked in: nft_fib_inet nft_fib_ipv4
nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6
nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6
nf_defrag_ipv4 rfkill nf_tables intel_rapl_msr intel_rapl_common
kvm_amd iTCO_wdt intel_pmc_bxt iTCO_vendor_support kvm i2c_i801
i2c_smbus lpc_ich virtio_net net_failover failover virtio_balloon
joydev fuse loop nfnetlink zram xfs polyval_clmulni polyval_generic
ghash_clmulni_intel sha512_ssse3 sha256_ssse3 virtio_console
virtio_blk sha1_ssse3 serio_raw qemu_fw_cfg
[ 297.513474] CPU: 8 UID: 0 PID: 1946 Comm: thp_test Not tainted 6.14.0-rc2mthpDEV #2
[ 297.514314] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-2.fc40 04/01/2014
[ 297.515265] RIP: 0010:__up_read+0x1ba/0x220
[ 297.515756] Code: c6 78 8b e1 95 48 c7 c7 88 0e d3 95 48 39 c2 48
c7 c2 be 39 e4 95 48 c7 c0 29 8b e1 95 48 0f 44 c2 48 8b 13 50 e8 e6
44 f5 ff <0f> 0b 58 e9 20 ff ff ff 48 8b 57 60 48 8d 47 60 4c 8b 47 08
c6 05
[ 297.517659] RSP: 0018:ffffa8a943533ac8 EFLAGS: 00010282
[ 297.518209] RAX: 0000000000000000 RBX: ffff8c6e03bc1f88 RCX: 0000000000000000
[ 297.518884] RDX: ffff8c7366ff0980 RSI: ffff8c7366fe1a80 RDI: ffff8c7366fe1a80
[ 297.519577] RBP: ffffa8a943533b58 R08: 0000000000000000 R09: 0000000000000001
[ 297.520272] R10: 0000000000000000 R11: 0770076d07650720 R12: ffffa8a943533b10
[ 297.520949] R13: ffff8c6e03bc1f88 R14: ffffa8a943533b58 R15: ffffa8a943533b10
[ 297.521651] FS: 00007f24de01b740(0000) GS:ffff8c7366e00000(0000) knlGS:0000000000000000
[ 297.522425] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 297.522990] CR2: 0000000a7ffef000 CR3: 000000010d9d6000 CR4: 0000000000750ef0
[ 297.523799] PKRU: 55555554
[ 297.524100] Call Trace:
[ 297.524367] <TASK>
[ 297.524597] ? __warn.cold+0xb7/0x151
[ 297.525072] ? __up_read+0x1ba/0x220
[ 297.525442] ? report_bug+0xff/0x140
[ 297.525804] ? console_unlock+0x9d/0x150
[ 297.526233] ? handle_bug+0x58/0x90
[ 297.526590] ? exc_invalid_op+0x17/0x70
[ 297.526993] ? asm_exc_invalid_op+0x1a/0x20
[ 297.527420] ? __up_read+0x1ba/0x220
[ 297.527783] ? __up_read+0x1ba/0x220
[ 297.528160] vms_complete_munmap_vmas+0x19c/0x1f0
[ 297.528628] do_vmi_align_munmap+0x20a/0x280
[ 297.529069] ? srso_alias_return_thunk+0x5/0xfbef5
[ 297.529552] do_vmi_munmap+0xd0/0x190
[ 297.529920] __vm_munmap+0xb1/0x1b0
[ 297.530293] __x64_sys_munmap+0x1b/0x30
[ 297.530677] do_syscall_64+0x95/0x180
[ 297.531058] ? srso_alias_return_thunk+0x5/0xfbef5
[ 297.531534] ? lockdep_hardirqs_on_prepare+0xdb/0x190
[ 297.532167] ? srso_alias_return_thunk+0x5/0xfbef5
[ 297.532640] ? syscall_exit_to_user_mode+0x97/0x290
[ 297.533226] ? srso_alias_return_thunk+0x5/0xfbef5
[ 297.533701] ? do_syscall_64+0xa1/0x180
[ 297.534097] ? srso_alias_return_thunk+0x5/0xfbef5
[ 297.534587] ? lockdep_hardirqs_on_prepare+0xdb/0x190
[ 297.535129] ? srso_alias_return_thunk+0x5/0xfbef5
[ 297.535603] ? syscall_exit_to_user_mode+0x97/0x290
[ 297.536092] ? srso_alias_return_thunk+0x5/0xfbef5
[ 297.536568] ? do_syscall_64+0xa1/0x180
[ 297.536954] ? srso_alias_return_thunk+0x5/0xfbef5
[ 297.537444] ? lockdep_hardirqs_on_prepare+0xdb/0x190
[ 297.537936] ? srso_alias_return_thunk+0x5/0xfbef5
[ 297.538524] ? syscall_exit_to_user_mode+0x97/0x290
[ 297.539044] ? srso_alias_return_thunk+0x5/0xfbef5
[ 297.539526] ? do_syscall_64+0xa1/0x180
[ 297.539931] ? srso_alias_return_thunk+0x5/0xfbef5
[ 297.540597] ? do_user_addr_fault+0x5a9/0x8a0
[ 297.541102] ? srso_alias_return_thunk+0x5/0xfbef5
[ 297.541580] ? trace_hardirqs_off+0x4b/0xc0
[ 297.542011] ? srso_alias_return_thunk+0x5/0xfbef5
[ 297.542488] ? lockdep_hardirqs_on_prepare+0xdb/0x190
[ 297.542991] ? srso_alias_return_thunk+0x5/0xfbef5
[ 297.543466] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 297.543960] RIP: 0033:0x7f24de1367eb
[ 297.544344] Code: 73 01 c3 48 8b 0d 2d f6 0c 00 f7 d8 64 89 01 48
83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 0b 00 00
00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d fd f5 0c 00 f7 d8 64 89
01 48
[ 297.546074] RSP: 002b:00007ffc7bb2e2b8 EFLAGS: 00000206 ORIG_RAX: 000000000000000b
[ 297.546796] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f24de1367eb
[ 297.547488] RDX: 0000000080000000 RSI: 0000000080000000 RDI: 0000000480000000
[ 297.548182] RBP: 00007ffc7bb2e390 R08: 0000000000000064 R09: 00000000fffffffe
[ 297.548884] R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000006
[ 297.549594] R13: 0000000000000000 R14: 00007f24de258000 R15: 0000000000403e00
[ 297.550292] </TASK>
[ 297.550530] irq event stamp: 64417291
[ 297.550903] hardirqs last enabled at (64417291): [<ffffffff94749232>] seqcount_lockdep_reader_access+0x82/0x90
[ 297.551859] hardirqs last disabled at (64417290): [<ffffffff947491fe>] seqcount_lockdep_reader_access+0x4e/0x90
[ 297.552810] softirqs last enabled at (64413640): [<ffffffff943bf3c2>] __irq_exit_rcu+0xe2/0x100
[ 297.553654] softirqs last disabled at (64413627): [<ffffffff943bf3c2>] __irq_exit_rcu+0xe2/0x100
[ 297.554504] ---[ end trace 0000000000000000 ]---
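
As a reading aid for the second splat, here is a minimal sketch of how the reported "count = 0xffffffffffffff00" can be decoded. It assumes a per-reader bias of 1 << 8 in the rwsem counter, which is my recollection of kernel/locking/rwsem.c and not something the log itself states. The helpers as_signed64() and up_read() are hypothetical names for this illustration; up_read() here models only the counter arithmetic, not the kernel function.

```python
# Assumption: each reader contributes 1 << 8 to the rwsem count
# (recalled from kernel/locking/rwsem.c, not taken from the log).
RWSEM_READER_SHIFT = 8
RWSEM_READER_BIAS = 1 << RWSEM_READER_SHIFT


def as_signed64(value: int) -> int:
    """Interpret a 64-bit value as a signed long, as the kernel's check does."""
    value &= (1 << 64) - 1
    return value - (1 << 64) if value & (1 << 63) else value


def up_read(count: int) -> int:
    """Hypothetical model: a reader release subtracts one reader bias."""
    return count - RWSEM_READER_BIAS


# "count =" value copied from the DEBUG_RWSEMS_WARN_ON line of the splat.
reported = as_signed64(0xFFFFFFFFFFFFFF00)

print(reported)                       # -256
print(reported // RWSEM_READER_BIAS)  # -1
print(up_read(0) == reported)         # True
```

Under that assumption the count sits exactly one reader bias below zero, which would be consistent with one more reader release than acquisitions on mm->mmap_lock, matching the "bad unlock balance" report for khugepaged above. The sketch does not show where the extra release comes from.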
On Tue, Feb 11, 2025 at 4:13 AM Dev Jain <dev.jain@arm.com> wrote:
>
> This patchset extends khugepaged from collapsing only PMD-sized THPs to
> collapsing anonymous mTHPs.
>
> mTHPs were introduced in the kernel to improve memory management by allocating
> chunks of larger memory, so as to reduce number of page faults, TLB misses (due
> to TLB coalescing), reduce length of LRU lists, etc. However, the mTHP property
> is often lost due to CoW, swap-in/out, and when the kernel just cannot find
> enough physically contiguous memory to allocate on fault. Henceforth, there is a
> need to regain mTHPs in the system asynchronously. This work is an attempt in
> this direction, starting with anonymous folios.
>
> In the fault handler, we select the THP order in a greedy manner; the same has
> been used here, along with the same sysfs interface to control the order of
> collapse. In contrast to PMD-collapse, we (hopefully) get rid of the mmap_write_lock().
>
> ---------------------------------------------------------
> Testing
> ---------------------------------------------------------
>
> The set has been build tested on x86_64.
> For Aarch64,
> 1. mm-selftests: No regressions.
> 2. Analyzing with tools/mm/thpmaps on different userspace programs mapping
> aligned VMAs of a large size, faulting in basepages/mTHPs (according to sysfs),
> and then madvise()'ing the VMA, khugepaged is able to 100% collapse the VMAs.
>
> This patchset is rebased on mm-unstable (4637fa5d47a49c977116321cc575ea22215df22d).
>
> v1->v2:
> - Handle VMAs less than PMD size (patches 12-15)
> - Do not add mTHP into deferred split queue
> - Drop lock optimization and collapse mTHP under mmap_write_lock()
> - Define policy on what to do when we encounter a folio order larger than
> the order we are scanning for
> - Prevent the creep problem by enforcing tunable simplification
> - Update Documentation
> - Drop patch 12 from v1 updating selftest w.r.t the creep problem
> - Drop patch 1 from v1
>
> v1:
> https://lore.kernel.org/all/20241216165105.56185-1-dev.jain@arm.com/
>
> Dev Jain (17):
> khugepaged: Generalize alloc_charge_folio()
> khugepaged: Generalize hugepage_vma_revalidate()
> khugepaged: Generalize __collapse_huge_page_swapin()
> khugepaged: Generalize __collapse_huge_page_isolate()
> khugepaged: Generalize __collapse_huge_page_copy()
> khugepaged: Abstract PMD-THP collapse
> khugepaged: Scan PTEs order-wise
> khugepaged: Introduce vma_collapse_anon_folio()
> khugepaged: Define collapse policy if a larger folio is already mapped
> khugepaged: Exit early on fully-mapped aligned mTHP
> khugepaged: Enable sysfs to control order of collapse
> khugepaged: Enable variable-sized VMA collapse
> khugepaged: Lock all VMAs mapping the PTE table
> khugepaged: Reset scan address to correct alignment
> khugepaged: Delay cond_resched()
> khugepaged: Implement strict policy for mTHP collapse
> Documentation: transhuge: Define khugepaged mTHP collapse policy
>
> Documentation/admin-guide/mm/transhuge.rst | 49 +-
> include/linux/huge_mm.h | 2 +
> mm/huge_memory.c | 4 +
> mm/khugepaged.c | 603 ++++++++++++++++-----
> 4 files changed, 511 insertions(+), 147 deletions(-)
>
> --
> 2.30.2
>