From: "Liam R. Howlett" <Liam.Howlett@oracle.com>
To: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
zhongjinji <zhongjinji@honor.com>,
mhocko@suse.com, rientjes@google.com, akpm@linux-foundation.org,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
tglx@linutronix.de, liulu.liu@honor.com, feng.han@honor.com
Subject: Re: [PATCH v5 2/2] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite order
Date: Wed, 27 Aug 2025 00:12:34 -0400
Message-ID: <6cqsibtttohay7ucs4ncmq6nf7xndtyvc4635i2e5ygjsppua6@7iwkgpp245jf>
In-Reply-To: <oogiwonpxtqurkad7rt2zxc3ffgeujtilivno3ibcybzucsliw@ym7jm6r5kdil>
+ Cc Suren since he has worked on the exit_mmap() path a lot.
* Shakeel Butt <shakeel.butt@linux.dev> [250826 18:26]:
> On Tue, Aug 26, 2025 at 11:21:13AM -0400, Liam R. Howlett wrote:
> > * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250826 09:50]:
> > > On Tue, Aug 26, 2025 at 09:37:22AM -0400, Liam R. Howlett wrote:
> > > > I really don't think this is worth doing. We're avoiding a race between
> > > > oom and a task unmap - the MMF bits should be used to avoid this race -
> > > > or at least mitigate it.
> > >
> > > Yes for sure, as explored at length in previous discussions this feels like
> > > we're papering over cracks here.
> > >
> > > _However_, I'm sort of ok with a minimalistic fix that solves the proximate
> > > issue even if it is that, as long as it doesn't cause issues in doing so.
> > >
> > > So this is my take on the below and why I'm open to it!
> > >
> > > >
> > > > They are probably both under the read lock, but considering how rare it
> > > > would be, would a racy flag check be enough - it is hardly critical to
> > > > get right. Either would reduce the probability.
> > >
> > > Zhongjinji - I'm still not sure that you've really indicated _why_ you're
> > > seeing such a tight and unusual race. Presumably some truly massive number
> > > of tasks being OOM'd and unmapping but... yeah that seems odd anyway.
> > >
> > > But again, if we can safely fix this in a way that doesn't hurt stuff too
> > > much I'm ok with it (of course, these are famous last words in the kernel
> > > often...!)
> > >
> > > Liam - are you open to a solution on the basis above, or do you feel we
> > > ought simply to fix the underlying issue here?
> >
> > At least this is a benign race.
>
> Is this really a race or rather a contention? IIUC exit_mmap and the oom
> reaper are trying to unmap the address space of the oom-killed process
> and can compete on page table locks. If both are running concurrently on
> two cpus then the contention can continue across the whole address space
> and slow down the actual memory freeing. Making the oom reaper traverse
> in the opposite direction can drastically reduce the contention and lead
> to faster memory freeing.
It is two readers of the vma tree racing to lock the page tables for
each vma, so I guess you can see it as contention as well.. but since
the pte is a split lock, I see it as racing through vmas to see who hits
which lock first. The smart money is on the oom reaper as it skips some
vmas :)
If it were just contention, then the loop direction wouldn't matter..
but I do see your point.
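For the sake of discussion, the reversal I'm picturing is something like
this - just a sketch with the vma iterator, not necessarily how the patch
does it, with the per-vma reaping elided:

/*
 * Sketch only: exit_mmap() walks the vma tree from low to high
 * addresses, so have the reaper start at the top and walk backward.
 * The two walkers then meet somewhere in the middle instead of
 * chasing the same split pte locks vma by vma.
 */
static void oom_reap_backwards_sketch(struct mm_struct *mm)
{
        struct vm_area_struct *vma;
        VMA_ITERATOR(vmi, mm, ULONG_MAX);

        /* vma_prev() moves toward lower addresses */
        while ((vma = vma_prev(&vmi)) != NULL) {
                /* per-vma reaping (notifier, tlb gather, unmap) as today */
        }
}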
> > I'd think using MMF_ to reduce the race
> > would achieve the same goal with less risk - which is why I bring it up.
> >
>
> With the MMF_ flag, are you suggesting the oom reaper skip the unmapping of
> the oom-killed process?
Yes, specifically: move setting the MMF_OOM_SKIP flag earlier in the exit
path to reduce the possibility of the race/contention.
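Roughly what I have in mind - an untested sketch, not a real patch, and
the actual placement vs. the notifier release and the rest of exit_mmap()
would need more thought:

/*
 * Sketch: flag the mm before exit_mmap() starts unmapping, instead of
 * after the unmap loop as it is today.  The reaper already tests
 * MMF_OOM_SKIP under the read lock and bails, so a concurrent reap
 * backs off early instead of racing through the same vmas.
 */
void exit_mmap_sketch(struct mm_struct *mm)
{
        mmap_read_lock(mm);
        set_bit(MMF_OOM_SKIP, &mm->flags);      /* moved earlier */

        /* ... tlb_gather_mmu_fullmm(), unmap_vmas(), etc. unchanged ... */

        mmap_read_unlock(mm);
}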
>
> > Really, both methods should be low risk, so I'm fine with either way.
> >
> > But I am interested in hearing how this race is happening enough to
> > necessitate a fix. Reversing the iterator is a one-spot fix - if this
> > happens elsewhere then we're out of options. Using the MMF_ flags is
> > more of a scalable fix, if it achieves the same results.
>
> On the question of whether this is a rare situation and worth the patch, I
> would say this scenario is not that rare particularly on low memory
> devices and on highly utilized overcommitted systems. Memory pressure
> and oom-kills are the norm on such systems. The point of the oom reaper is to
> bring the system out of the oom situation quickly and having two cpus
> unmapping the oom-killed process can potentially bring the system out of
> oom situation faster.
The exit_mmap() path used to run the oom reaper if it was an oom victim,
until recently [1]. The part that makes me nervous is the exit_mmap()
call to mmu_notifier_release(mm), while the oom reaper uses an
mmu_notifier. I am not sure whether any platform has an ordering issue
between those two, or what the associated cost of the calls is.
I mean, it's already pretty crazy that we have this in the exit:

        mmap_read_lock()
        tlb_gather_mmu_fullmm()
        unmap vmas..
        mmap_read_unlock()
        mmap_write_lock()
        tlb_finish_mmu()..
So not only do we now have two tasks iterating over the vmas, but we
also have mmu notifiers and tlb calls happening across the ranges.. At
least doing all the work on a single cpu means that the hardware view is
consistent. But I don't see this being worse than a forward race?
As it is written here, we'll have one CPU working in one direction while
the other works in the other, until both hit the end of the VMAs. Only
when both tasks stop iterating the vmas can the exit continue since it
requires the write lock.
So the tlb_finish_mmu() in exit_mmap() will always be called after
tlb_finish_mmu() on each individual vma has run in the
__oom_reap_task_mm() context (when the race happens).
There is also a window here, between exit_mmap() dropping the read lock,
setting MMF_OOM_SKIP, and taking the write lock, where the oom reaper
can iterate through a list of vmas with zero memory left to free and
delay the task exiting. That is, it wastes cpu and stops the memory
associated with the mm_struct (vmas and such) from being freed.
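Spelling that window out (roughly today's exit_mmap() ordering, from
memory):

        unmap vmas..
        mmap_read_unlock()
                                <- reaper can grab the read lock here and
                                   walk vmas with nothing left to free
        set_bit(MMF_OOM_SKIP)
        mmap_write_lock()       <- waits for the reaper to drop the read lock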
I'm also not sure about the cpu cache effects of what we are doing and how
much they would play into the speedup. My guess is that they're
insignificant compared to the time we spend under the pte lock, but we
have no numbers to go on.
So I'd like to know how likely the simultaneous runs are and whether there
is a measurable gain?
I agree that, at face value, two cpus should be able to split the work..
but I don't know about the notifiers or about holding up the memory
associated with the mm_struct. And it could slow things down by holding up
an exiting task.
>
> I think the patch (with your suggestions) is simple enough and I don't
> see any risk in including it.
>
Actually, the more I look at this, the worse I feel about it.. Am I
overreacting?
Thanks,
Liam
[1] https://elixir.bootlin.com/linux/v6.0.19/source/mm/mmap.c#L3085