From mboxrd@z Thu Jan  1 00:00:00 1970
MIME-Version: 1.0
References: <CAJuCfpGz1kPM3G1gZH+09Z7aoWKg05QSAMMisJ7H5MdmRrRhNQ@mail.gmail.com>
 <CAJuCfpGjuUz5FPpR5iQ7oURJAhnP1ffBAnERuTUp9uPxQCRhDg@mail.gmail.com>
 <20201014120937.GC4440@dhcp22.suse.cz> <CAJuCfpEQ_ADYsMrF_zjfAeQ3d-FALSP+CeYsvgH2H1-FSoGGqg@mail.gmail.com>
 <20201015092030.GB22589@dhcp22.suse.cz>
In-Reply-To: <20201015092030.GB22589@dhcp22.suse.cz>
From: Suren Baghdasaryan <surenb@google.com>
Date: Thu, 15 Oct 2020 12:25:43 -0700
Message-ID: <CAJuCfpHwXcq1PfzHgqyYBR3N53TtV2WMt_Oubz0JZkvJHbFKGw@mail.gmail.com>
Subject: Re: [RFC]: userspace memory reaping
To: Michal Hocko <mhocko@suse.com>
Cc: linux-api@vger.kernel.org, linux-mm <linux-mm@kvack.org>, 
	Andrew Morton <akpm@linux-foundation.org>, David Rientjes <rientjes@google.com>, 
	Matthew Wilcox <willy@infradead.org>, Johannes Weiner <hannes@cmpxchg.org>, Roman Gushchin <guro@fb.com>, 
	Rik van Riel <riel@surriel.com>, Minchan Kim <minchan@kernel.org>, 
	Christian Brauner <christian@brauner.io>, Oleg Nesterov <oleg@redhat.com>, 
	Tim Murray <timmurray@google.com>, kernel-team <kernel-team@android.com>, 
	LKML <linux-kernel@vger.kernel.org>, Mel Gorman <mgorman@techsingularity.net>
Content-Type: text/plain; charset="UTF-8"
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On Thu, Oct 15, 2020 at 2:20 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 14-10-20 09:57:20, Suren Baghdasaryan wrote:
> > On Wed, Oct 14, 2020 at 5:09 AM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> > > > > The need is similar to why oom-reaper was introduced - when a process
> > > > > is being killed to free memory we want to make sure memory is freed
> > > > > even if the victim is in uninterruptible sleep or is busy and reaction
> > > > > to SIGKILL is delayed by an unpredictable amount of time. I
> > > > > experimented with enabling process_madvise(MADV_DONTNEED) operation
> > > > > and using it to force memory reclaim of the target process after
> > > > > sending SIGKILL. Unfortunately this approach requires the caller to
> > > > > read proc/pid/maps to extract the list of VMAs to pass as an input to
> > > > > process_madvise().
> > >
> > > Well I would argue that this is not really necessary. You can simply
> > > call process_madvise with the full address range and let the kernel
> > > operate only on ranges which are safe to tear down asynchronously.
> > > Sure that would require some changes to the existing code to not fail
> > > on those ranges if they contain incompatible vmas but that should be
> > > possible. If we are worried about backward compatibility then a
> > > dedicated flag could override.
> > >
> >
> > IIUC this is very similar to the last option I proposed. I think this
> > is doable if we treat it as a special case. process_madvise() return
> > value not being able to handle a large range would still be a problem.
> > Maybe we can return MAX_INT in those cases?
>
> madvise is documented to return
>        On success, madvise() returns zero.  On error, it returns -1 and
>        errno is set appropriately.
> [...]
> NOTES
>    Linux notes
>        The Linux implementation requires that the address addr be
>        page-aligned, and allows length to be zero.  If there are some
>        parts of the specified address range that are not mapped, the
>        Linux version of madvise() ignores them and applies the call to
>        the rest (but returns ENOMEM from the system call, as it should).
>
> I have learned about ENOMEM case only now. And it seems this is indeed
> what we are implementing. So if we want to add a new mode to
> opportunistically attempt madvise on the whole given range without a
> failure then we need a specific flag for that. Advice is a number rather
> than a bitmask but (ab)using the top bit or use negative number space
> (e.g. -MADV_DONTNEED) for that sounds possible albeit a bit hackish.

process_madvise() has an additional flags parameter. Why not have a
separate flag there to denote that we want to just skip VMA gaps and
proceed without error? Something like MADVF_SKIP_GAPS?
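
To make the gap behaviour from the man page concrete, here is a minimal
sketch using plain madvise() on the calling process (not
process_madvise() on a remote one). It maps three anonymous pages,
unmaps the middle one and issues MADV_DONTNEED over the whole range.
The names madvise_across_gap() and struct gap_result are made up for
illustration. MADVF_SKIP_GAPS does not exist today and is not used here.

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper type, for illustration only. */
struct gap_result {
	int rc;         /* madvise() return value */
	int err;        /* errno if madvise() failed, otherwise 0 */
	int first_byte; /* byte 0 of page 0 after the call, -1 if setup failed */
	int last_byte;  /* byte 0 of page 2 after the call, -1 if setup failed */
};

/* Apply MADV_DONTNEED to a 3-page range whose middle page is unmapped. */
struct gap_result madvise_across_gap(void)
{
	struct gap_result r = { -1, 0, -1, -1 };
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	unsigned char *base;

	base = mmap(NULL, 3 * page, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (base == MAP_FAILED) {
		r.err = errno;
		return r;
	}

	/* Dirty all three pages so that a later zero read is meaningful. */
	memset(base, 0xAA, 3 * page);

	/* Punch a hole: the middle page becomes a VMA gap. */
	if (munmap(base + page, page) != 0) {
		r.err = errno;
		munmap(base, 3 * page);
		return r;
	}

	/* One call over the whole range, including the gap. */
	errno = 0;
	r.rc = madvise(base, 3 * page, MADV_DONTNEED);
	r.err = (r.rc == -1) ? errno : 0;

	/* Private anonymous pages read back as zero once dropped. */
	r.first_byte = base[0];
	r.last_byte = base[2 * page];

	/* munmap() tolerates the hole, so this releases both pages. */
	munmap(base, 3 * page);
	return r;
}
```

On Linux I would expect rc == -1 with err == ENOMEM while both
surviving pages read back as zero, i.e. the advice is applied around
the gap but the call still reports failure. A skip-gaps flag would only
need to change the return value for this case.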

>
> [...]
> > > I do have a vague recollection that we have discussed a kill(2) based
> > > approach as well in the past. Essentially SIG_KILL_SYNC which would
> > > not only send the signal but it would start a teardown of resources
> > > owned by the task - at least those we can remove safely. The interface
> > > would be much more simple and less tricky to use. You just make your
> > > userspace oom killer or potentially other users call SIG_KILL_SYNC which
> > > will be more expensive but you would at least know that as many
> > > resources have been freed as the kernel can afford at the moment.
> >
> > Correct, my early RFC here
> > https://patchwork.kernel.org/project/linux-mm/patch/20190411014353.113252-3-surenb@google.com
> > was using a new flag for pidfd_send_signal() to request mm reaping by
> > oom-reaper kthread. IIUC you propose to have a new SIG_KILL_SYNC
> > signal instead of a new pidfd_send_signal() flag and otherwise a very
> > similar solution. Is my understanding correct?
>
> Well, I think you shouldn't focus too much on the oom-reaper aspect
> of it. Sure it can be used for that but I believe that a new signal
> should provide a sync behavior. People more familiar with the process
> management would be better off defining what is possible for a new sync
> signal.  Ideally not only pro-active process destruction but also sync
> waiting until the target process is released so that you know that once
> kill syscall returns the process is gone.

If your suggestion is for SIG_KILL_SYNC to perform the victim's resource
cleanup in the context of the caller while the victim is in
uninterruptible sleep, that would definitely be useful. I assume there
are some resources which can't be reclaimed until the process itself
wakes up and handles the SIGKILL. If so, I hope kill(SIG_KILL_SYNC)
would not have to wait for the victim to wake up and handle the
signal. That would really complicate userspace in cases where we
just want to reclaim whatever we can without the victim's involvement
and continue. For cases where waiting is required, waitid() with
P_PIDFD can be used.
Would these semantics work?
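
For the waiting case, here is a rough sketch of what userspace can
already do today with a plain SIGKILL. SIG_KILL_SYNC is only a proposal
and is not used here. The sketch assumes a kernel with pidfd_open(),
pidfd_send_signal() and waitid(P_PIDFD) support (roughly 5.4 or newer)
and uapi headers that define the syscall numbers. The helper name
kill_and_wait() is made up, and the forked child merely stands in for a
real victim.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef P_PIDFD
#define P_PIDFD 3 /* value from the kernel uapi headers, for older libcs */
#endif

/*
 * Fork a stand-in victim, SIGKILL it through a pidfd and wait for it.
 * Returns the terminating signal number, or -1 on any error.
 */
int kill_and_wait(void)
{
	siginfo_t info;
	int pidfd, rc;
	pid_t pid;

	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		/* Child: sleep until killed. */
		for (;;)
			pause();
	}

	/* A pidfd keeps referring to this child even if the PID is reused. */
	pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
	if (pidfd < 0) {
		kill(pid, SIGKILL);
		waitpid(pid, NULL, 0);
		return -1;
	}

	/* Step 1: send the signal. This does not wait for the victim. */
	if (syscall(SYS_pidfd_send_signal, pidfd, SIGKILL, NULL, 0) != 0) {
		kill(pid, SIGKILL);
		waitpid(pid, NULL, 0);
		close(pidfd);
		return -1;
	}

	/* Step 2 (optional): block until the victim has actually exited. */
	memset(&info, 0, sizeof(info));
	rc = waitid(P_PIDFD, (id_t)pidfd, &info, WEXITED);
	close(pidfd);

	if (rc != 0 || info.si_code != CLD_KILLED)
		return -1;
	return info.si_status;
}
```

A caller that only wants to reclaim what it can and move on would stop
after step 1. Only callers that need to know the process is gone would
do step 2. Note that waitid() works here only because the victim is our
own child; for an unrelated process, polling the pidfd for readability
would be the alternative.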

>
> --
> Michal Hocko
> SUSE Labs