From: Suren Baghdasaryan
Date: Mon, 14 Sep 2020 17:45:44 -0700
Subject: Re: [RFC]: userspace memory reaping
To: linux-api@vger.kernel.org, linux-mm, Andrew Morton, Michal Hocko,
 David Rientjes, Matthew Wilcox, Johannes Weiner, Roman Gushchin,
 Rik van Riel, Minchan Kim, Christian Brauner, Oleg Nesterov, Tim Murray,
 kernel-team, LKML

+ linux-kernel@vger.kernel.org

On Mon, Sep 14, 2020 at 5:43 PM Suren Baghdasaryan wrote:
>
> Last year I sent an RFC about using oom-reaper while killing a
> process: https://patchwork.kernel.org/cover/10894999. During LSFMM2019
> discussion https://lwn.net/Articles/787217 a couple of alternative
> options were discussed with the most promising one (outlined in the
> last paragraph of https://lwn.net/Articles/787217) suggesting to use a
> remote version of madvise(MADV_DONTNEED) operation to force memory
> reclaim of a killed process. With process_madvise() making its way
> through reviews (https://patchwork.kernel.org/patch/11747133/), I
> would like to revive this discussion and get feedback on several
> possible options, their pros and cons.
>
> The need is similar to why oom-reaper was introduced - when a process
> is being killed to free memory we want to make sure memory is freed
> even if the victim is in uninterruptible sleep or is busy and reaction
> to SIGKILL is delayed by an unpredictable amount of time. I
> experimented with enabling process_madvise(MADV_DONTNEED) operation
> and using it to force memory reclaim of the target process after
> sending SIGKILL.
> Unfortunately this approach requires the caller to
> read /proc/pid/maps to extract the list of VMAs to pass as an input to
> process_madvise(). This is a time-consuming operation. I measured
> times similar to what Minchan indicated in
> https://lore.kernel.org/linux-mm/20190528032632.GF6879@google.com/ and
> the reason reading /proc/pid/maps consumes that much time is the number
> of read syscalls required to read this file. The /proc/pid/maps file,
> being a seq_file, can be read in chunks of up to 4096 bytes (1 page).
> Even if userspace provides a bigger buffer, only up to 4096 bytes will
> be read with one syscall. Measured on a Qualcomm® Snapdragon 855™ using
> its Big core at 2.84GHz, a single read syscall takes between 50 and
> 200us (in case there was no contention on mmap_sem or some other lock
> during the syscall). Taking one typical example from my tests, a
> 219232-byte /proc/pid/maps file describing 1623 VMAs required 55 read
> syscalls. With mmap_sem contention a /proc/pid/maps read can take even
> longer. In my tests I measured typical delays of 3-7ms with occasional
> delays of up to 20ms when a read syscall was blocked and the process
> got into uninterruptible sleep.
>
> While the objective is to guarantee forward progress even when the
> victim cannot terminate, we still want this mechanism to be efficient
> because we perform these operations to relieve memory pressure before
> it affects user experience.
>
> Alternative options I would like your feedback on are:
> 1. Introduce a dedicated process_madvise(MADV_DONTNEED_MM)
> specifically for this case to indicate that the whole mm can be freed.
> 2. A new syscall to efficiently obtain a vector of VMAs (start,
> length, flags) of the process instead of reading /proc/pid/maps.
> The size of the vector is still limited by UIO_MAXIOV (1024), so
> several calls might be needed to query a larger number of VMAs; however,
> it will still be an order of magnitude more efficient than reading
> the /proc/pid/maps file in 4K or smaller chunks.
> 3. Use the process_madvise() flags parameter to indicate a bulk
> operation which ignores input vectors. Sample usage:
> process_madvise(pidfd, MADV_DONTNEED, vector=NULL, vlen=0,
> flags=PMADV_FLAG_FILE | PMADV_FLAG_ANON);
> 4. madvise()/process_madvise() handle gaps between VMAs, so we could
> provide one vector element spanning the entire address space. There
> are technical issues with this approach (the process_madvise return
> value can't handle such a large number of bytes and there is a
> MAX_RW_COUNT limit on the max number of bytes one process_madvise call
> can handle) but I would still like to hear opinions about it. If this
> option is preferable maybe we can deal with these limitations.
>
> We can also go back to reclaiming the victim's memory asynchronously,
> but the synchronous method has the following advantages:
> - reaping will be performed in the caller's context and therefore with
> the caller's priority, CPU affinity and CPU bandwidth; the reaping
> workload will be charged to the caller and accounted for.
> - reaping is a blocking/synchronous operation for the caller, so when
> it's finished, the caller can be sure the mm is freed (or almost freed
> considering lazy freeing and batching mechanisms) and it can reassess
> the memory conditions right away.
> - for very large MMs (not really my case) the caller could split the
> VMA vector and perform reaping from multiple threads to make it faster.
> This would not be possible with options (1) and (3).
>
> Would really appreciate your feedback on these options for future
> development.
> Thanks,
> Suren.
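
The figures quoted above can be sanity-checked with a little arithmetic.
The Python sketch below is illustrative only and is not part of the RFC:
parse_maps(), min_reads() and batches() are hypothetical helper names,
and the two constants are simply the 4096-byte seq_file chunk and the
UIO_MAXIOV value of 1024 mentioned in the text. It turns
/proc/<pid>/maps-style text into (start, length, perms) tuples, computes
a lower bound on the number of read syscalls for a file of a given size,
and splits a VMA list into batches small enough for one vectored call.

```python
import math

PAGE_SIZE = 4096   # per the text: at most one page is returned per read()
IOV_MAX = 1024     # per the text: UIO_MAXIOV limit on vector length


def parse_maps(text):
    """Parse /proc/<pid>/maps-style text into (start, length, perms) tuples.

    Hypothetical helper for illustration; only the address range and the
    permissions column of each line are used.
    """
    vmas = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        start_s, end_s = fields[0].split("-")
        start, end = int(start_s, 16), int(end_s, 16)
        vmas.append((start, end - start, fields[1]))
    return vmas


def min_reads(file_size, chunk=PAGE_SIZE):
    """Lower bound on read() calls needed to fetch file_size bytes."""
    return math.ceil(file_size / chunk)


def batches(vmas, limit=IOV_MAX):
    """Split a VMA list into chunks that each fit one vectored call."""
    return [vmas[i:i + limit] for i in range(0, len(vmas), limit)]
```

For the example in the text, min_reads(219232) gives 54, one below the
55 reads that were measured; that is consistent with a lower bound,
presumably because a read can return less than a full page. A list of
1623 VMAs needs two batches under the 1024-entry limit. At 50 to 200us
per read, 55 reads come to roughly 2.75 to 11ms before any lock
contention is counted.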