From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version:
1.0
References: <20201014120937.GC4440@dhcp22.suse.cz> <20201015092030.GB22589@dhcp22.suse.cz>
In-Reply-To: <20201015092030.GB22589@dhcp22.suse.cz>
From: Suren Baghdasaryan
Date: Thu, 15 Oct 2020 12:25:43 -0700
Message-ID:
Subject: Re: [RFC]: userspace memory reaping
To: Michal Hocko
Cc: linux-api@vger.kernel.org, linux-mm, Andrew Morton, David Rientjes,
 Matthew Wilcox, Johannes Weiner, Roman Gushchin, Rik van Riel,
 Minchan Kim, Christian Brauner, Oleg Nesterov, Tim Murray, kernel-team,
 LKML, Mel Gorman
Content-Type: text/plain; charset="UTF-8"
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID:

On Thu, Oct 15, 2020 at 2:20 AM Michal Hocko wrote:
>
> On Wed 14-10-20 09:57:20, Suren Baghdasaryan wrote:
> > On Wed, Oct 14, 2020 at 5:09 AM Michal Hocko wrote:
> [...]
> > > > > The need is similar to why oom-reaper was introduced - when a process
> > > > > is being killed to free memory we want to make sure memory is freed
> > > > > even if the victim is in uninterruptible sleep or is busy and reaction
> > > > > to SIGKILL is delayed by an unpredictable amount of time. I
> > > > > experimented with enabling process_madvise(MADV_DONTNEED) operation
> > > > > and using it to force memory reclaim of the target process after
> > > > > sending SIGKILL. Unfortunately this approach requires the caller to
> > > > > read proc/pid/maps to extract the list of VMAs to pass as an input to
> > > > > process_madvise().
> > >
> > > Well I would argue that this is not really necessary. You can simply
> > > call process_madvise with the full address range and let the kernel
> > > operated only on ranges which are safe to tear down asynchronously.
> > > Sure that would require some changes to the existing code to not fail
> > > on those ranges if they contain incompatible vmas but that should be
> > > possible.
> > > If we are worried about backward compatibility then a
> > > dedicated flag could override.
> > >
> >
> > IIUC this is very similar to the last option I proposed. I think this
> > is doable if we treat it as a special case. process_madvise() return
> > value not being able to handle a large range would still be a problem.
> > Maybe we can return MAX_INT in those cases?
>
> madvise is documented to return
>        On success, madvise() returns zero. On error, it returns -1 and
>        errno is set appropriately.
> [...]
> NOTES
>    Linux notes
>        The Linux implementation requires that the address addr be
>        page-aligned, and allows length to be zero. If there are some
>        parts of the specified address range that are not mapped, the
>        Linux version of madvise() ignores them and applies the call to
>        the rest (but returns ENOMEM from the system call, as it should).
>
> I have learned about ENOMEM case only now. And it seems this is indeed
> what we are implementing. So if we want to add a new mode to
> opportunistically attempt madvise on the whole given range without a
> failure then we need a specific flag for that. Advice is a number rather
> than a bitmask but (ab)using the top bit or use negative number space
> (e.g. -MADV_DONTNEED) for that sounds possible albeit bit hackish.

process_madvise() has an additional flag parameter. Why not have a
separate flag to denote that we want to just skip VMA gaps and proceed
without error? Something like MADVF_SKIP_GAPS?

>
> [...]
> > > I do have a vague recollection that we have discussed a kill(2) based
> > > approach as well in the past. Essentially SIG_KILL_SYNC which would
> > > not only send the signal but it would start a teardown of resources
> > > owned by the task - at least those we can remove safely. The interface
> > > would be much more simple and less tricky to use.
> > > You just make your
> > > userspace oom killer or potentially other users call SIG_KILL_SYNC which
> > > will be more expensive but you would at least know that as many
> > > resources have been freed as the kernel can afford at the moment.
> >
> > Correct, my early RFC here
> > https://patchwork.kernel.org/project/linux-mm/patch/20190411014353.113252-3-surenb@google.com
> > was using a new flag for pidfd_send_signal() to request mm reaping by
> > oom-reaper kthread. IIUC you propose to have a new SIG_KILL_SYNC
> > signal instead of a new pidfd_send_signal() flag and otherwise a very
> > similar solution. Is my understanding correct?
>
> Well, I think you shouldn't focus too much on the oom-reaper aspect
> of it. Sure it can be used for that but I believe that a new signal
> should provide a sync behavior. People more familiar with the process
> management would be better off defining what is possible for a new sync
> signal. Ideally not only pro-active process destruction but also sync
> waiting until the target process is released so that you know that once
> kill syscall returns the process is gone.

If your suggestion is for SIG_KILL_SYNC to perform the victim's resource
cleanup in the context of the caller while the victim is in
uninterruptible sleep, that would definitely be useful. I assume there
are some resources which can't be reclaimed until the process itself
wakes up and handles the SIGKILL. If so, I hope kill(SIG_KILL_SYNC)
would not have to wait for the victim to wake up and handle the signal.
That would really complicate userspace in cases where we just want to
reclaim whatever we can without the victim's involvement and continue.
For cases where waiting is required, waitid() with P_PIDFD can be used.
Would these semantics work?

>
> --
> Michal Hocko
> SUSE Labs