Re: [RFC]: userspace memory reaping

From: Suren Baghdasaryan <surenb@google.com>
To: linux-api@vger.kernel.org, linux-mm <linux-mm@kvack.org>,
	 Andrew Morton <akpm@linux-foundation.org>,
	Michal Hocko <mhocko@kernel.org>,
	 David Rientjes <rientjes@google.com>,
	Matthew Wilcox <willy@infradead.org>,
	 Johannes Weiner <hannes@cmpxchg.org>,
	Roman Gushchin <guro@fb.com>, Rik van Riel <riel@surriel.com>,
	 Minchan Kim <minchan@kernel.org>,
	Christian Brauner <christian@brauner.io>,
	 Oleg Nesterov <oleg@redhat.com>,
	Tim Murray <timmurray@google.com>,
	 kernel-team <kernel-team@android.com>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [RFC]: userspace memory reaping
Date: Mon, 14 Sep 2020 17:45:44 -0700
Message-ID: <CAJuCfpGjuUz5FPpR5iQ7oURJAhnP1ffBAnERuTUp9uPxQCRhDg@mail.gmail.com>
In-Reply-To: <CAJuCfpGz1kPM3G1gZH+09Z7aoWKg05QSAMMisJ7H5MdmRrRhNQ@mail.gmail.com>

+ linux-kernel@vger.kernel.org

On Mon, Sep 14, 2020 at 5:43 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> Last year I sent an RFC about using the oom-reaper while killing a
> process: https://patchwork.kernel.org/cover/10894999. During the
> LSFMM 2019 discussion (https://lwn.net/Articles/787217) a couple of
> alternatives were raised. The most promising one, outlined in the
> last paragraph of that article, was to use a remote version of the
> madvise(MADV_DONTNEED) operation to force memory reclaim of a killed
> process. With process_madvise() making its way through reviews
> (https://patchwork.kernel.org/patch/11747133/), I would like to
> revive this discussion and get feedback on several possible options
> and their pros and cons.
>
> The need is similar to why the oom-reaper was introduced: when a
> process is being killed to free memory, we want to make sure the
> memory is freed even if the victim is in uninterruptible sleep or is
> busy and its reaction to SIGKILL is delayed by an unpredictable amount
> of time. I experimented with enabling the
> process_madvise(MADV_DONTNEED) operation and using it to force memory
> reclaim of the target process after sending SIGKILL. Unfortunately,
> this approach requires the caller to read /proc/pid/maps to extract
> the list of VMAs to pass as input to process_madvise(). This is a
> time-consuming operation. I measured times similar to what Minchan
> indicated in
> https://lore.kernel.org/linux-mm/20190528032632.GF6879@google.com/ and
> the reason reading /proc/pid/maps takes that long is the number of
> read syscalls required to read the file. The /proc/pid/maps file,
> being a seq_file, can be read in chunks of up to 4096 bytes (1 page).
> Even if userspace provides a bigger buffer, at most 4096 bytes are
> read per syscall. Measured on a Qualcomm® Snapdragon 855™ using its
> big core at 2.84GHz, a single read syscall takes between 50 and 200us
> (when there is no contention on mmap_sem or some other lock during
> the syscall). Taking one typical example from my tests, a 219232-byte
> /proc/pid/maps file describing 1623 VMAs required 55 read syscalls.
> With mmap_sem contention, reading /proc/pid/maps can take even
> longer. In my tests I measured typical delays of 3-7ms, with
> occasional delays of up to 20ms when a read syscall was blocked and
> the process went into uninterruptible sleep.
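
The chunking described above can be observed from userspace. The following is a minimal sketch, not code from the RFC: count_reads() and struct read_stats are hypothetical names, and the 1 MiB buffer size is an arbitrary choice that is simply much larger than one page. It drains a file with read(2) and records how many calls returned data and the largest chunk any single call returned. At the quoted 50-200us per call, the 55 calls in the example above add up to roughly 2.75-11ms by themselves.

```c
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Result of draining one file with read(2). Hypothetical helper type. */
struct read_stats {
    size_t bytes;     /* total bytes returned by all calls */
    size_t calls;     /* number of read() calls that returned data */
    size_t max_chunk; /* largest number of bytes returned by one call */
};

/*
 * Read `path` to EOF with a buffer far larger than one page and record
 * how the data arrived. Returns 0 on success, -1 on open/read failure.
 * Not reentrant: the buffer is static to keep it off the stack.
 */
static int count_reads(const char *path, struct read_stats *st)
{
    static char buf[1 << 20]; /* 1 MiB, arbitrary but well above 4096 */
    ssize_t n;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;

    st->bytes = 0;
    st->calls = 0;
    st->max_chunk = 0;

    while ((n = read(fd, buf, sizeof(buf))) > 0) {
        st->bytes += (size_t)n;
        st->calls++;
        if ((size_t)n > st->max_chunk)
            st->max_chunk = (size_t)n;
    }

    close(fd);
    return n < 0 ? -1 : 0;
}
```

Calling count_reads("/proc/self/maps", &st) on a kernel with the behaviour described above would be expected to report max_chunk of at most one page even though the buffer is 1 MiB; the sketch only measures this and does not assume it.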
>
> While the objective is to guarantee forward progress even when the
> victim cannot terminate, we still want this mechanism to be efficient
> because we perform these operations to relieve memory pressure before
> it affects user experience.
>
> Alternative options I would like your feedback on are:
> 1. Introduce a dedicated process_madvise(MADV_DONTNEED_MM)
> specifically for this case to indicate that the whole mm can be freed.
> 2. A new syscall to efficiently obtain a vector of VMAs (start,
> length, flags) of the process instead of reading /proc/pid/maps. The
> size of the vector is still limited by UIO_MAXIOV (1024), so several
> calls might be needed to query a larger number of VMAs; however, it
> will still be an order of magnitude more efficient than reading the
> /proc/pid/maps file in 4K or smaller chunks.
> 3. Use process_madvise() flags parameter to indicate a bulk operation
> which ignores input vectors. Sample usage: process_madvise(pidfd,
> MADV_DONTNEED, vector=NULL, vlen=0, flags=PMADV_FLAG_FILE |
> PMADV_FLAG_ANON);
> 4. madvise()/process_madvise() handle gaps between VMAs, so we could
> provide one vector element spanning the entire address space. There
> are technical issues with this approach (the process_madvise() return
> value can't represent such a large number of bytes, and there is a
> MAX_RW_COUNT limit on the number of bytes one process_madvise() call
> can handle), but I would still like to hear opinions about it. If this
> option is preferable, maybe we can deal with these limitations.
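
To put numbers on the limits mentioned in options 2 and 4, here is a small sketch under stated assumptions. struct vma_range, calls_needed(), fill_batch() and max_rw_count() are hypothetical names, not existing APIs. VEC_MAX mirrors the UIO_MAXIOV value of 1024 quoted above, and max_rw_count() mirrors the kernel's definition of MAX_RW_COUNT as INT_MAX masked down to a page boundary. No syscall is issued; the code only prepares iovec batches and computes the limits.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

#define VEC_MAX 1024 /* mirrors UIO_MAXIOV as quoted in option 2 */

/* One VMA as (start, length). Hypothetical type for this sketch. */
struct vma_range {
    uintptr_t start;
    size_t len;
};

/* Number of vectored calls needed to cover nr_vmas entries. */
static size_t calls_needed(size_t nr_vmas)
{
    return (nr_vmas + VEC_MAX - 1) / VEC_MAX;
}

/*
 * Fill iov[] with up to VEC_MAX entries starting at vmas[first].
 * iov must have room for VEC_MAX elements. Returns the entry count.
 */
static size_t fill_batch(const struct vma_range *vmas, size_t nr_vmas,
                         size_t first, struct iovec *iov)
{
    size_t n = 0;

    while (first + n < nr_vmas && n < VEC_MAX) {
        iov[n].iov_base = (void *)vmas[first + n].start;
        iov[n].iov_len = vmas[first + n].len;
        n++;
    }
    return n;
}

/* Byte limit per call, mirroring MAX_RW_COUNT (INT_MAX & PAGE_MASK). */
static unsigned long max_rw_count(unsigned long page_size)
{
    return (unsigned long)INT_MAX & ~(page_size - 1);
}
```

For the 1623-VMA example from earlier in the message, calls_needed(1623) is 2. With 4096-byte pages max_rw_count() is 2147479552 bytes, just under 2 GiB, which is why a single element spanning an entire 64-bit address space does not fit in one call as things stand.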
>
> We can also go back to reclaiming the victim's memory asynchronously,
> but the synchronous method has the following advantages:
> - reaping is performed in the caller's context and therefore with the
> caller's priority, CPU affinity and CPU bandwidth; the reaping
> workload is charged to the caller and accounted for.
> - reaping is a blocking/synchronous operation for the caller, so when
> it's finished, the caller can be sure the mm is freed (or almost freed
> considering lazy freeing and batching mechanisms) and it can reassess
> the memory conditions right away.
> - for very large MMs (not really my case) the caller could split the
> VMA vector and perform reaping from multiple threads to make it faster.
> This would not be possible with options (1) and (3).
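
The multi-threaded split in the last point could be organised along these lines. This is only an illustrative sketch: struct vma_slice and slice_for_worker() are hypothetical names, and the split is balanced by VMA count, not by the number of bytes each VMA maps, which is a simplification.

```c
#include <stddef.h>

/* Half-open slice [first, first + count) of a VMA vector. */
struct vma_slice {
    size_t first;
    size_t count;
};

/*
 * Slice of an nr_vmas-entry vector assigned to worker `idx` out of
 * nr_workers. The first (nr_vmas % nr_workers) workers get one extra
 * entry, so slice sizes differ by at most one. An empty slice is
 * returned for invalid arguments.
 */
static struct vma_slice slice_for_worker(size_t nr_vmas, size_t nr_workers,
                                         size_t idx)
{
    struct vma_slice s = { 0, 0 };
    size_t base, extra;

    if (nr_workers == 0 || idx >= nr_workers)
        return s;

    base = nr_vmas / nr_workers;
    extra = nr_vmas % nr_workers;
    s.first = idx * base + (idx < extra ? idx : extra);
    s.count = base + (idx < extra ? 1 : 0);
    return s;
}
```

Each thread would then pass only its own slice of the vector to the reaping call. With 1623 VMAs and 4 workers, the slices hold 406, 406, 406 and 405 entries.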
>
> Would really appreciate your feedback on these options for future development.
> Thanks,
> Suren.
