linux-mm.kvack.org archive mirror
* Re: [BUG] potential hugetlb css refcounting issues
       [not found] <20210827151146.GA25472@bender.morinfr.org>
@ 2021-08-27 22:22 ` Mike Kravetz
  0 siblings, 0 replies; 3+ messages in thread
From: Mike Kravetz @ 2021-08-27 22:22 UTC (permalink / raw)
  To: Mina Almasry, cgroups; +Cc: linux Memory Management List

CC'ing linux-mm 

On 8/27/21 8:11 AM, Guillaume Morin wrote:
> Hi,
> 
> After upgrading to 5.10 from 5.4 (though I believe that these issues are
> still present in 5.14), we noticed refcount warnings pertaining to a
> corrupted reference count of the hugetlb css, e.g.
> 
> [  704.259734] percpu ref (css_release) <= 0 (-1) after switching to atomic
> [  704.259755] WARNING: CPU: 23 PID: 130 at lib/percpu-refcount.c:196 percpu_ref_switch_to_atomic_rcu+0x127/0x130
> [  704.259911] CPU: 23 PID: 130 Comm: ksoftirqd/23 Kdump: loaded Tainted: G           O      5.10.60 #1
> [  704.259916] RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x127/0x130
> [  704.259920] Code: eb b1 80 3d 37 4f 0a 01 00 0f 85 5d ff ff ff 49 8b 55 e0 48 c7 c7 38 57 0d 94 c6 05 1f 4f 0a 01 01 49 8b 75 e8 e8 a9 e5 c1 ff <0f> 0b e9 3b ff ff ff 66 90 55 48 89 e5 41 56 49 89 f6 41 55 49 89
> [  704.259922] RSP: 0000:ffffb19b4684bdd0 EFLAGS: 00010282
> [  704.259924] RAX: 0000000000000000 RBX: 7ffffffffffffffe RCX: 0000000000000027
> [  704.259926] RDX: 0000000000000000 RSI: ffff9a81ffb58b40 RDI: ffff9a81ffb58b48
> [  704.259927] RBP: ffffb19b4684bde8 R08: 0000000000000003 R09: 0000000000000001
> [  704.259929] R10: 0000000000000003 R11: ffffb19b4684bb70 R12: 0000370946a03b50
> [  704.259931] R13: ffff9a72c9ceb860 R14: 0000000000000000 R15: ffff9a72c42f4000
> [  704.259933] FS:  0000000000000000(0000) GS:ffff9a81ffb40000(0000) knlGS:0000000000000000
> [  704.259935] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  704.259936] CR2: 0000000001416318 CR3: 000000011e1ac003 CR4: 00000000003706e0
> [  704.259938] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  704.259939] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  704.259941] Call Trace:
> [  704.259950]  rcu_core+0x30f/0x530
> [  704.259955]  rcu_core_si+0xe/0x10
> [  704.259959]  __do_softirq+0x103/0x2a2
> [  704.259964]  ? sort_range+0x30/0x30
> [  704.259968]  run_ksoftirqd+0x2b/0x40
> [  704.259971]  smpboot_thread_fn+0x11a/0x170
> [  704.259975]  kthread+0x10a/0x140
> [  704.259978]  ? kthread_create_worker_on_cpu+0x70/0x70
> [  704.259983]  ret_from_fork+0x22/0x30
> 
> The box would soon crash due to some GPF or NULL pointer dereference
> either in cgroups_destroy or in the kill_css path. We confirmed the
> issue was specific to the hugetlb css by manually disabling its use and
> verifying that the box then stayed up and happy.

Thanks for reporting this Guillaume!

There have been other hugetlb cgroup fixes since 5.10.  I do not believe
they are related to the underflow issue you have seen.  Just FYI.

I do hope Mina is taking a look as well.  His recent/ongoing mremap work
is also touching these areas of the code.

> I believe there might be 2 distinct bugs leading to this. I am not
> familiar with the cgroup code so I might be off base here. I did my best
> to track this and understand the logic. Any feedback will be welcome.
> 
> I have not provided patches because, if I am correct, they are fairly
> trivial; and since I am unfamiliar with this code, I am afraid they
> might not be that helpful.  But I can provide them if anybody is interested.
> 
> 1. Since e9fe92ae0cd2, hugetlb_vm_op_close() decreases the refcount of
> the css (if present) through the hugetlb_cgroup_uncharge_counter() call
> if a resv map is set on the vma and the owner flag is present (i.e
> private mapping).  In the most basic case, the corresponding refcount
> increase is done in hugetlb_reserve_pages().
> However when sibling vmas are opened, hugetlb_vm_op_open() is called
> and the resv map reference count is increased (if vma_resv_map(vma) is
> not NULL for private mappings), but not the refcount of a potential
> resv->css (i.e. if resv->css != NULL).
> When these sibling vmas are closed, the refcount will still be
> decreased once per such vma, leading to an underflow and premature
> release (potentially use after free) of the hugetlb css.  The fix would
> be to call css_get() if resv->css != NULL in hugetlb_vm_op_open().

That does appear to be a real issue.  In the 'common' fork case, a vma
is duplicated but reset_vma_resv_huge_pages() is called to clear the
pointer to the reserve map for private mappings before calling
hugetlb_vm_op_open.  However, when a vma is split both resulting vmas
would be 'owners' of private mapping reserves without incrementing the
refcount which would lead to the underflow you describe.

> 2. After 08cf9faf75580, __free_huge_page() decrements the css refcount
> for _each_ page unconditionally by calling
> hugetlb_cgroup_uncharge_page_rsvd().
> But a per-page css reference is only taken outside the reserve case
> in alloc_huge_page() (i.e.
> hugetlb_cgroup_charge_cgroup_rsvd() is called only if deferred_reserve
> is true).  In the reserve case, there is only one css reference linked
> to the resv map (taken in hugetlb_reserve_pages()).
> This also leads to an underflow of the counter.  A similar scheme to
> HPageRestoreReserve can be used to track which pages were allocated in
> the deferred_reserve case and call hugetlb_cgroup_uncharge_page_rsvd()
> only for these during freeing.

I am not sure about the above analysis.  It is true that
hugetlb_cgroup_uncharge_page_rsvd is called unconditionally in
free_huge_page.  However, IIUC hugetlb_cgroup_uncharge_page_rsvd will
only decrement the css refcount if there is a non-NULL hugetlb_cgroup
pointer in the page.  And, the pointer in the page would only be set in
the 'deferred_reserve' path of alloc_huge_page.  Unless I am missing
something, they seem to balance.

In any case I will look closer at the code with an eye on underflows.
Thanks for the report and help!
-- 
Mike Kravetz


^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [BUG] potential hugetlb css refcounting issues
  2021-08-27 22:58 ` Guillaume Morin
@ 2021-08-28 19:37   ` Guillaume Morin
  0 siblings, 0 replies; 3+ messages in thread
From: Guillaume Morin @ 2021-08-28 19:37 UTC (permalink / raw)
  To: almasrymina, mike.kravetz, cgroups, guillaume, linux-mm

On 28 Aug  0:58, Guillaume Morin wrote:
> > I am not sure about the above analysis.  It is true that
> > hugetlb_cgroup_uncharge_page_rsvd is called unconditionally in
> > free_huge_page.  However, IIUC hugetlb_cgroup_uncharge_page_rsvd will
> > only decrement the css refcount if there is a non-NULL hugetlb_cgroup
> > pointer in the page.  And, the pointer in the page would only be set
> > in the 'deferred_reserve' path of alloc_huge_page.  Unless I am
> > missing something, they seem to balance.
> 
> Now that you explain, I am pretty sure that you're right and I was
> wrong.
> 
> I'll confirm that I can't reproduce without my change for 2.

Confirmed. With the patch for the first issue, the problem is indeed
fixed. I must have messed up something during my testing...

Anyway, this is the change for 1:

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8ea35ba6699f..00ad4af0399b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4033,8 +4033,11 @@ static void hugetlb_vm_op_open(struct vm_area_struct *vma)
 	 * after this open call completes.  It is therefore safe to take a
 	 * new reference here without additional locking.
 	 */
-	if (resv && is_vma_resv_set(vma, HPAGE_RESV_OWNER))
+	if (resv && is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
+		if (resv->css)
+			css_get(resv->css);
 		kref_get(&resv->refs);
+	}
 }
 
 static void hugetlb_vm_op_close(struct vm_area_struct *vma)

-- 
Guillaume Morin <guillaume@morinfr.org>



* Re: [BUG] potential hugetlb css refcounting issues
       [not found] <8a4f2fbc-76e8-b67b-f110-30beff2228f5@oracle-com>
@ 2021-08-27 22:58 ` Guillaume Morin
  2021-08-28 19:37   ` Guillaume Morin
  0 siblings, 1 reply; 3+ messages in thread
From: Guillaume Morin @ 2021-08-27 22:58 UTC (permalink / raw)
  To: almasrymina, mike.kravetz, cgroups, guillaume, linux-mm

Hello Mike,

I really appreciate the quick reply.

Mike Kravetz wrote:
> There have been other hugetlb cgroup fixes since 5.10.  I do not believe
> they are related to the underflow issue you have seen.  Just FYI.

Yes, I am aware. Actually, I did my best to look at all recent changes
not backported to 5.10 and couldn't find anything related. I tried to
cherry-pick a couple of fixes just in case, but the bug did not go away.

> However, when a vma is split both resulting vmas would be 'owners' of
> private mapping reserves without incrementing the refcount which would
> lead to the underflow you describe.

Indeed and I do know that programs running on my reproducer machines do
split vmas.

>> 2. After 08cf9faf75580, __free_huge_page() decrements the css
>> refcount for _each_ page unconditionally by calling
>> hugetlb_cgroup_uncharge_page_rsvd().  But a per-page css reference
>> is only taken outside the reserve case in
>> alloc_huge_page() (i.e. hugetlb_cgroup_charge_cgroup_rsvd() is called
>> only if deferred_reserve is true).  In the reserve case, there is
>> only one css reference linked to the resv map (taken in
>> hugetlb_reserve_pages()).  This also leads to an underflow of the
>> counter.  A similar scheme to HPageRestoreReserve can be used to
>> track which pages were allocated in the deferred_reserve case and
>> call hugetlb_cgroup_uncharge_page_rsvd() only for these during
>> freeing.

> I am not sure about the above analysis.  It is true that
> hugetlb_cgroup_uncharge_page_rsvd is called unconditionally in
> free_huge_page.  However, IIUC hugetlb_cgroup_uncharge_page_rsvd will
> only decrement the css refcount if there is a non-NULL hugetlb_cgroup
> pointer in the page.  And, the pointer in the page would only be set
> in the 'deferred_reserve' path of alloc_huge_page.  Unless I am
> missing something, they seem to balance.

Now that you explain, I am pretty sure that you're right and I was
wrong.

I'll confirm that I can't reproduce without my change for 2.

Thank you,

Guillaume.

-- 
Guillaume Morin <guillaume@morinfr.org>


