From: Suren Baghdasaryan
Date: Mon, 14 Sep 2020 17:43:58 -0700
Subject: [RFC]: userspace memory reaping
To: linux-api@vger.kernel.org, linux-mm, Andrew Morton, Michal Hocko, David Rientjes, Matthew Wilcox, Johannes Weiner, Roman Gushchin, Rik van Riel, Minchan Kim, Christian Brauner, Oleg Nesterov, Tim Murray, kernel-team

Last year I sent an RFC about using oom-reaper while killing a process:
https://patchwork.kernel.org/cover/10894999. During the LSFMM2019 discussion
(https://lwn.net/Articles/787217) a couple of alternative options were
discussed, with the most promising one (outlined in the last paragraph of
https://lwn.net/Articles/787217) suggesting the use of a remote version of the
madvise(MADV_DONTNEED) operation to force memory reclaim of a killed process.
With process_madvise() making its way through reviews
(https://patchwork.kernel.org/patch/11747133/), I would like to revive this
discussion and get feedback on several possible options, their pros and cons.

The need is similar to why oom-reaper was introduced - when a process is being
killed to free memory, we want to make sure memory is freed even if the victim
is in uninterruptible sleep or is busy and its reaction to SIGKILL is delayed
by an unpredictable amount of time.

I experimented with enabling the process_madvise(MADV_DONTNEED) operation and
using it to force memory reclaim of the target process after sending SIGKILL.
Unfortunately this approach requires the caller to read /proc/pid/maps to
extract the list of VMAs to pass as an input to process_madvise(). This is a
time-consuming operation. I measured times similar to what Minchan indicated
in https://lore.kernel.org/linux-mm/20190528032632.GF6879@google.com/ and the
reason reading /proc/pid/maps consumes that much time is the number of read
syscalls required to read this file. The /proc/pid/maps file, being a
seq_file, can be read in chunks of up to 4096 bytes (1 page). Even if
userspace provides a bigger buffer, only up to 4096 bytes will be read with
one syscall. Measured on a Qualcomm® Snapdragon 855™ using its big core at
2.84GHz, a single read syscall takes between 50 and 200us (when there is no
contention on mmap_sem or some other lock during the syscall). Taking one
typical example from my tests, a 219232-byte /proc/pid/maps file describing
1623 VMAs required 55 read syscalls. With mmap_sem contention a /proc/pid/maps
read can take even longer. In my tests I measured typical delays of 3-7ms,
with occasional delays of up to 20ms when a read syscall was blocked and the
process went into uninterruptible sleep.
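For reference, here is a minimal sketch of that experiment. It assumes the
process_madvise() argument order from the current patchset (pidfd, iovec
vector, vlen, advice, flags) and uses a raw syscall since there is no libc
wrapper yet; MADV_DONTNEED support in process_madvise() is the behavior being
proposed here, not something current kernels accept, and the syscall number is
assumed:

#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef __NR_process_madvise
#define __NR_process_madvise 440	/* assumed; not yet in released headers */
#endif

/*
 * Kill the target and reap its memory. pidfd must refer to the same
 * process as pid. Error handling and batching beyond UIO_MAXIOV VMAs
 * are omitted for brevity.
 */
static long kill_and_reap(int pidfd, pid_t pid)
{
	struct iovec vec[UIO_MAXIOV];
	char path[64], line[512];
	size_t vlen = 0;
	FILE *maps;

	snprintf(path, sizeof(path), "/proc/%d/maps", (int)pid);
	maps = fopen(path, "r");
	if (!maps)
		return -1;

	/*
	 * Each read() of /proc/pid/maps returns at most one page (4096
	 * bytes) no matter how big the buffer is; this loop is where the
	 * 3-7ms (occasionally 20ms) quoted above is spent.
	 */
	while (vlen < UIO_MAXIOV && fgets(line, sizeof(line), maps)) {
		unsigned long start, end;

		if (sscanf(line, "%lx-%lx", &start, &end) != 2)
			continue;
		vec[vlen].iov_base = (void *)start;
		vec[vlen].iov_len = end - start;
		vlen++;
	}
	fclose(maps);

	kill(pid, SIGKILL);

	/* Assumed argument order: pidfd, iovec, vlen, advice, flags. */
	return syscall(__NR_process_madvise, pidfd, vec, vlen,
		       MADV_DONTNEED, 0);
}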
While the objective is to guarantee forward progress even when the victim
cannot terminate, we still want this mechanism to be efficient because we
perform these operations to relieve memory pressure before it affects user
experience.

Alternative options I would like your feedback on are:

1. Introduce a dedicated process_madvise(MADV_DONTNEED_MM) specifically for
this case to indicate that the whole mm can be freed.

2. A new syscall to efficiently obtain a vector of VMAs (start, length, flags)
of the process instead of reading /proc/pid/maps. The size of the vector is
still limited by UIO_MAXIOV (1024), so several calls might be needed to query
a larger number of VMAs, however it would still be an order of magnitude more
efficient than reading the /proc/pid/maps file in 4K or smaller chunks.

3. Use the process_madvise() flags parameter to indicate a bulk operation
which ignores the input vectors. Sample usage: process_madvise(pidfd,
MADV_DONTNEED, vector=NULL, vlen=0, flags=PMADV_FLAG_FILE | PMADV_FLAG_ANON).
A rough caller-side sketch is appended at the end of this mail.

4. madvise()/process_madvise() handle gaps between VMAs, so we could provide
one vector element spanning the entire address space. There are technical
issues with this approach (the process_madvise return value can't represent
such a large number of bytes and there is a MAX_RW_COUNT limit on the number
of bytes one process_madvise call can handle), but I would still like to hear
opinions about it. If this option is preferable, maybe we can deal with these
limitations.

We can also go back to reclaiming the victim's memory asynchronously, but the
synchronous method has the following advantages:

- reaping will be performed in the caller's context and therefore with the
caller's priority, CPU affinity and CPU bandwidth; the reaping workload will
be charged to the caller and accounted for.

- reaping is a blocking/synchronous operation for the caller, so when it is
finished the caller can be sure the mm is freed (or almost freed, considering
lazy freeing and batching mechanisms) and it can reassess the memory
conditions right away.

- for very large mms (not really my case) the caller could split the VMA
vector and perform reaping from multiple threads to make it faster. This would
not be possible with options (1) and (3).

Would really appreciate your feedback on these options for future development.

Thanks,
Suren.
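P.S. For concreteness, here is roughly what the caller side could look like
under options (3) and (4), reusing the includes and the assumed
__NR_process_madvise definition from the sketch above. PMADV_FLAG_FILE and
PMADV_FLAG_ANON are the hypothetical flag names from the sample usage in
option (3); none of this exists in the current patchset:

/* Hypothetical flag values, purely for illustration. */
#define PMADV_FLAG_ANON	(1u << 0)
#define PMADV_FLAG_FILE	(1u << 1)

/* Option (3): bulk operation, no input vector needed. */
static long reap_bulk(int pidfd)
{
	return syscall(__NR_process_madvise, pidfd, NULL, 0,
		       MADV_DONTNEED, PMADV_FLAG_FILE | PMADV_FLAG_ANON);
}

/* Option (4): one iovec spanning the whole address space, relying on
 * madvise() skipping gaps between VMAs. The ssize_t return value and
 * MAX_RW_COUNT cannot represent such a range today, as noted above. */
static long reap_whole_range(int pidfd)
{
	struct iovec whole = {
		.iov_base = NULL,
		.iov_len = ~0UL >> 1,	/* placeholder upper bound */
	};

	return syscall(__NR_process_madvise, pidfd, &whole, 1,
		       MADV_DONTNEED, 0);
}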