From mboxrd@z Thu Jan  1 00:00:00 1970
MIME-Version: 1.0
References: <CALvZod7vtDxJZtNhn81V=oE-EPOf=4KZB2Bv6Giz+u3bFFyOLg@mail.gmail.com>
 <YH8o5iIau85FaeLw@carbon.DHCP.thefacebook.com>
In-Reply-To: <YH8o5iIau85FaeLw@carbon.DHCP.thefacebook.com>
From: Shakeel Butt <shakeelb@google.com>
Date: Tue, 20 Apr 2021 18:18:29 -0700
Message-ID: <CALvZod7dXuFPeMv5NGu96uCosFpWY_Gy07iDsfSORCA0dT_zsA@mail.gmail.com>
Subject: Re: [RFC] memory reserve for userspace oom-killer
To: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>, Michal Hocko <mhocko@kernel.org>, Linux MM <linux-mm@kvack.org>, 
	Andrew Morton <akpm@linux-foundation.org>, Cgroups <cgroups@vger.kernel.org>, 
	David Rientjes <rientjes@google.com>, LKML <linux-kernel@vger.kernel.org>, 
	Suren Baghdasaryan <surenb@google.com>, Greg Thelen <gthelen@google.com>, 
	Dragos Sbirlea <dragoss@google.com>, Priya Duraisamy <padmapriyad@google.com>
Content-Type: text/plain; charset="UTF-8"
List-ID: <linux-mm.kvack.org>

On Tue, Apr 20, 2021 at 12:18 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Mon, Apr 19, 2021 at 06:44:02PM -0700, Shakeel Butt wrote:
[...]
> > 1. prctl(PF_MEMALLOC)
> >
> > The idea is to give userspace oom-killer (just one thread which is
> > finding the appropriate victims and will be sending SIGKILLs) access
> > to MEMALLOC reserves. Most of the time the preallocation, mlock and
> > memory.min will be good enough but for rare occasions, when the
> > userspace oom-killer needs to allocate, the PF_MEMALLOC flag will
> > protect it from reclaim and let the allocation dip into the memory
> > reserves.
> >
> > The misuse of this feature would be risky but it can be limited to
> > privileged applications. Userspace oom-killer is the only appropriate
> > user of this feature. This option is simple to implement.
>
> Hello Shakeel!
>
> If ordinary PAGE_SIZE and smaller kernel allocations start to fail,
> the system is already in a relatively bad shape. Arguably the userspace
> OOM killer should kick in earlier, it's already a bit too late.

Please note that these are not allocation failures but rather reclaim
on allocations (which is very normal). Our observation is that this
reclaim is very unpredictable and depends on the type of memory
present on the system, which in turn depends on the workload. If there
is a good amount of easily reclaimable memory (e.g. clean file pages),
reclaim is really fast. For other types of reclaimable memory,
however, the reclaim time varies a lot. Unreclaimable memory, pinned
memory, too many direct reclaimers, too much isolated memory and many
other things/heuristics/assumptions make reclaim even less
deterministic.

In our observation, global reclaim is very non-deterministic at the
tail and dramatically impacts the reliability of the system. We are
looking for a solution which is independent of global reclaim.

> Allowing to use reserves just pushes this even further, so we're risking
> the kernel stability for no good reason.

Michal has suggested ALLOC_OOM which is less risky.

>
> But I agree that throttling the oom daemon in direct reclaim makes no sense.
> I wonder if we can introduce a per-task flag which will exclude the task from
> throttling, but instead all (large) allocations will just fail under a
> significant memory pressure more easily. In this case if there is a significant
> memory shortage the oom daemon will not be fully functional (will get -ENOMEM
> for an attempt to read some stats, for example), but still will be able to kill
> some processes and make the forward progress.

So, the suggestion is to have a per-task flag that (1) exempts the
task from throttling and (2) makes its allocations fail easily under
significant memory pressure.

For (1), the challenge I see is that there are a lot of places in the
reclaim code paths where a task can get throttled. There are
filesystems that block/throttle in slab shrinking. Any process can get
blocked on an unrelated page or inode writeback within reclaim.

For (2), I am not sure how to deterministically define "significant
memory pressure". One idea is to follow the __GFP_NORETRY semantics;
along with (1), the userspace oom-killer would then see ENOMEM more
reliably instead of getting stuck in reclaim.

So, the oom-killer maintains a list of processes to kill in extreme
conditions, keeps their pidfds open and keeps that list fresh. Whenever
any syscall returns ENOMEM, it starts doing
pidfd_send_signal(SIGKILL) to that list of processes, right?
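
To make sure we are talking about the same thing, here is a rough
userspace sketch of that flow. The struct, the function names and
MAX_VICTIMS are made up for illustration; only the pidfd_open(2) and
pidfd_send_signal(2) syscalls are real, and the sketch assumes kernel
headers new enough to define their syscall numbers. The victim table
is fixed-size so that the kill path itself does not allocate.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical fixed-size victim table, filled ahead of time so the
 * emergency path never has to allocate. */
#define MAX_VICTIMS 64

struct victim_list {
	int pidfd[MAX_VICTIMS];
	size_t count;
};

/* Raw syscalls, since older glibc versions ship no wrappers. */
static int victim_pidfd_open(pid_t pid)
{
	return (int)syscall(SYS_pidfd_open, pid, 0);
}

static int victim_pidfd_kill(int pidfd)
{
	return (int)syscall(SYS_pidfd_send_signal, pidfd, SIGKILL, NULL, 0);
}

/* Refresh path: runs while memory is still available.
 * Returns 0 on success, -1 if the table is full or pidfd_open fails. */
static int victim_list_add(struct victim_list *vl, pid_t pid)
{
	int fd;

	if (vl->count >= MAX_VICTIMS)
		return -1;
	fd = victim_pidfd_open(pid);
	if (fd < 0)
		return -1;
	vl->pidfd[vl->count++] = fd;
	return 0;
}

/* Emergency path: SIGKILL every listed process and drop the pidfds.
 * Returns the number of processes successfully signalled. */
static size_t victim_list_kill_all(struct victim_list *vl)
{
	size_t i, killed = 0;

	for (i = 0; i < vl->count; i++) {
		if (victim_pidfd_kill(vl->pidfd[i]) == 0)
			killed++;
		close(vl->pidfd[i]);
	}
	vl->count = 0;
	return killed;
}

/* Called with errno after any failed syscall in the daemon's core loop. */
static size_t on_syscall_error(struct victim_list *vl, int err)
{
	return err == ENOMEM ? victim_list_kill_all(vl) : 0;
}
```

The daemon would call victim_list_add() from its periodic refresh and
on_syscall_error(&vl, errno) after every failed syscall. Nothing here
addresses the throttling in (1); it only shows the ENOMEM-triggered
kill path.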

The idea has merit but I don't see how this is any simpler. Part (1)
is challenging on its own and my main concern is that it will be very
hard to maintain, as the reclaim code (particularly the shrinkers)
calls back into many diverse subsystems.

> But maybe it can be done in userspace too: by splitting the daemon into
> a core- and extended part and avoid doing anything behind bare minimum
> in the core part.
>
> >
> > 2. Mempool
> >
> > The idea is to preallocate mempool with a given amount of memory for
> > userspace oom-killer. Preferably this will be per-thread and
> > oom-killer can preallocate mempool for its specific threads. The core
> > page allocator can check before going to the reclaim path if the task
> > has private access to the mempool and return page from it if yes.
> >
> > This option would be more complicated than the previous option as the
> > lifecycle of the page from the mempool would be more sophisticated.
> > Additionally the current mempool does not handle higher order pages
> > and we might need to extend it to allow such allocations. Though this
> > feature might have more use-cases and it would be less risky than the
> > previous option.
>
> It looks like an over-kill for the oom daemon protection, but if there
> are other good use cases, maybe it's a good feature to have.
>

IMHO it is not an over-kill and is easier to do than removing all
instances of potential blocking/throttling sites in memory reclaim.
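
As a loose userspace analogy only (this is not the kernel mempool API
and not a proposed interface), the "take from a private preallocated
pool instead of entering reclaim" idea looks roughly like the sketch
below. All names and sizes (reserve_pool, POOL_SLOTS, SLOT_SIZE) are
hypothetical, and mlock() is treated as best effort because it can
fail under RLIMIT_MEMLOCK.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

#define POOL_SLOTS 16
#define SLOT_SIZE  4096

/* Hypothetical fixed pool of page-sized slots, set up at startup. */
struct reserve_pool {
	unsigned char *base;
	unsigned char in_use[POOL_SLOTS];
};

/* Map, touch and (best effort) lock the pool while memory is plentiful. */
static int reserve_pool_init(struct reserve_pool *p)
{
	size_t len = (size_t)POOL_SLOTS * SLOT_SIZE;
	void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (mem == MAP_FAILED)
		return -1;
	memset(mem, 0, len);	/* fault every page in now, not later */
	(void)mlock(mem, len);	/* may fail; pages are at least populated */
	p->base = mem;
	memset(p->in_use, 0, sizeof(p->in_use));
	return 0;
}

/* Hand out the first free slot, or NULL when the pool is exhausted.
 * No allocation happens here. */
static void *reserve_pool_get(struct reserve_pool *p)
{
	size_t i;

	for (i = 0; i < POOL_SLOTS; i++) {
		if (!p->in_use[i]) {
			p->in_use[i] = 1;
			return p->base + i * SLOT_SIZE;
		}
	}
	return NULL;
}

/* Return a slot obtained from reserve_pool_get() to the pool. */
static void reserve_pool_put(struct reserve_pool *p, void *slot)
{
	size_t i = (size_t)((unsigned char *)slot - p->base) / SLOT_SIZE;

	if (i < POOL_SLOTS)
		p->in_use[i] = 0;
}
```

The kernel-side version would have the page allocator do the
equivalent of reserve_pool_get() for the marked thread before it goes
into reclaim; the lifecycle of such pages is the complicated part and
is not modelled here.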

> >
> > Another idea I had was to use kthread based oom-killer and provide the
> > policies through eBPF program. Though I am not sure how to make it
> > monitor arbitrary metrics and if that can be done without any
> > allocations.
>
> To start this effort it would be nice to understand what metrics various
> oom daemons use and how easy is to gather them from the bpf side. I like
> this idea long-term, but not sure if it has been settled down enough.
> I imagine it will require a fair amount of work on the bpf side, so we
> need a good understanding of features we need.
>

Are there any examples of gathering existing metrics from bpf? Suren
has given a list of metrics useful for Android. Is it possible to
gather those metrics?

BTW thanks a lot for taking a look and I really appreciate your time.

thanks,
Shakeel