Subject: Re: [PATCH v7 1/2] mm: introduce process_mrelease system call
To: Suren Baghdasaryan
Cc: Andrew Morton, Michal Hocko, Michal Hocko, David Rientjes,
 Matthew Wilcox, Johannes Weiner, Roman Gushchin, Rik van Riel,
 Minchan Kim, Christian Brauner, Christoph Hellwig, Oleg Nesterov,
 Jann Horn, Shakeel Butt, Andy Lutomirski, Christian Brauner,
 Florian Weimer, Jan Engelhardt, Tim Murray, Linux API, linux-mm,
 LKML, kernel-team
References: <20210805170859.2389276-1-surenb@google.com>
 <46998d10-d0ca-aeeb-8dcd-41b8130fb756@redhat.com>
From: David Hildenbrand
Organization: Red Hat
Message-ID: <96203d51-5948-d063-4a9c-2eb33e631502@redhat.com>
Date: Thu, 5 Aug 2021 19:55:26 +0200

On 05.08.21 19:49, Suren Baghdasaryan wrote:
> On Thu, Aug 5, 2021 at 10:29 AM David Hildenbrand wrote:
>>
>> On 05.08.21 19:08, Suren Baghdasaryan wrote:
>>> In modern systems it's not unusual to have a system component monitoring
>>> memory conditions of the system and tasked with keeping system memory
>>> pressure under control. One way to accomplish that is to kill
>>> non-essential processes to free up memory for more important ones.
>>> Examples of this are Facebook's OOM killer daemon called oomd and
>>> Android's low memory killer daemon called lmkd.
>>> For such system component it's important to be able to free memory
>>> quickly and efficiently. Unfortunately the time process takes to free
>>> up its memory after receiving a SIGKILL might vary based on the state
>>> of the process (uninterruptible sleep), size and OPP level of the core
>>> the process is running. A mechanism to free resources of the target
>>> process in a more predictable way would improve system's ability to
>>> control its memory pressure.
>>> Introduce process_mrelease system call that releases memory of a dying
>>> process from the context of the caller. This way the memory is freed in
>>> a more controllable way with CPU affinity and priority of the caller.
>>> The workload of freeing the memory will also be charged to the caller.
>>> The operation is allowed only on a dying process.
>>>
>>> After previous discussions [1, 2, 3] the decision was made [4] to introduce
>>> a dedicated system call to cover this use case.
>>>
>>> The API is as follows,
>>>
>>>     int process_mrelease(int pidfd, unsigned int flags);
>>>
>>> DESCRIPTION
>>>     The process_mrelease() system call is used to free the memory of
>>>     an exiting process.
>>>
>>>     The pidfd selects the process referred to by the PID file
>>>     descriptor.
>>>     (See pidfd_open(2) for further information)
>>>
>>>     The flags argument is reserved for future use; currently, this
>>>     argument must be specified as 0.
>>>
>>> RETURN VALUE
>>>     On success, process_mrelease() returns 0. On error, -1 is
>>>     returned and errno is set to indicate the error.
>>>
>>> ERRORS
>>>     EBADF  pidfd is not a valid PID file descriptor.
>>>
>>>     EAGAIN Failed to release part of the address space.
>>>
>>>     EINTR  The call was interrupted by a signal; see signal(7).
>>>
>>>     EINVAL flags is not 0.
>>>
>>>     EINVAL The memory of the task cannot be released because the
>>>            process is not exiting, the address space is shared
>>>            with another live process or there is a core dump in
>>>            progress.
>>>
>>>     ENOSYS This system call is not supported, for example, without
>>>            MMU support built into Linux.
>>>
>>>     ESRCH  The target process does not exist (i.e., it has terminated
>>>            and been waited on).
>>>
>>> [1] https://lore.kernel.org/lkml/20190411014353.113252-3-surenb@google.com/
>>> [2] https://lore.kernel.org/linux-api/20201113173448.1863419-1-surenb@google.com/
>>> [3] https://lore.kernel.org/linux-api/20201124053943.1684874-3-surenb@google.com/
>>> [4] https://lore.kernel.org/linux-api/20201223075712.GA4719@lst.de/
>>>
>>> Signed-off-by: Suren Baghdasaryan
>>> ---
>>> changes in v7:
>>> - Fixed pidfd_open misspelling, per Andrew Morton
>>> - Fixed wrong task pinning after find_lock_task_mm() issue, per Michal Hocko
>>> - Moved MMF_OOM_SKIP check before task_will_free_mem(), per Michal Hocko
>>>
>>>  mm/oom_kill.c | 73 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>>  1 file changed, 73 insertions(+)
>>>
>>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>>> index c729a4c4a1ac..a4d917b43c73 100644
>>> --- a/mm/oom_kill.c
>>> +++ b/mm/oom_kill.c
>>> @@ -28,6 +28,7 @@
>>>  #include
>>>  #include
>>>  #include
>>> +#include
>>>  #include
>>>  #include
>>>  #include
>>> @@ -1141,3 +1142,75 @@ void pagefault_out_of_memory(void)
>>>  	out_of_memory(&oc);
>>>  	mutex_unlock(&oom_lock);
>>>  }
>>> +
>>> +SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
>>> +{
>>> +#ifdef CONFIG_MMU
>>> +	struct mm_struct *mm = NULL;
>>> +	struct task_struct *task;
>>> +	struct task_struct *p;
>>> +	unsigned int f_flags;
>>> +	struct pid *pid;
>>> +	long ret = 0;
>>> +
>>> +	if (flags)
>>> +		return -EINVAL;
>>> +
>>> +	pid = pidfd_get_pid(pidfd, &f_flags);
>>> +	if (IS_ERR(pid))
>>> +		return PTR_ERR(pid);
>>> +
>>> +	task = get_pid_task(pid, PIDTYPE_PID);
>>> +	if (!task) {
>>> +		ret = -ESRCH;
>>> +		goto put_pid;
>>> +	}
>>> +
>>> +	/*
>>> +	 * If the task is dying and in the process of releasing its memory
>>> +	 * then get its mm.
>>> +	 */
>>> +	p = find_lock_task_mm(task);
>>> +	if (!p) {
>>> +		ret = -ESRCH;
>>> +		goto put_pid;
>>> +	}
>>> +	if (task != p) {
>>> +		get_task_struct(p);
>>
>>
>> Wouldn't we want to obtain the mm from p ? I thought that was the whole
>> exercise of going via find_lock_task_mm().
>
> Yes, that's what we do after checking task_will_free_mem().
> task_will_free_mem() requires us to hold task_lock and
> find_lock_task_mm() achieves that ensuring that mm is still valid, but
> it might return a task other than the original one. That's why we do
> this dance with pinning the new task and unpinning the original one.
> The same dance is performed in __oom_kill_process(). I was
> contemplating adding a parameter to find_lock_task_mm() to request
> this unpin/pin be done within that function but then decided to keep
> it simple for now.
> Did I address your question or did I misunderstand it?

Excuse my tired eyes, I missed the "task = p;"

Feel free to carry my ack along, even if there are minor changes.

-- 
Thanks,

David / dhildenb
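
For reference, the API described in the quoted changelog could be exercised from userspace roughly as follows. This is only a minimal sketch, not part of the patch: the helper names process_mrelease_sketch() and kill_and_release() are hypothetical, and the fallback syscall numbers (424, 434, 448) are an assumption taken from later mainline kernels on most architectures, since patch 1/2 itself does not assign a number. On a kernel without the call, the wrapper is expected to fail with ENOSYS.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Assumption: fallback numbers as used by later mainline kernels on most
 * architectures. They are not defined anywhere in the quoted patch.
 */
#ifndef __NR_pidfd_send_signal
#define __NR_pidfd_send_signal 424
#endif
#ifndef __NR_pidfd_open
#define __NR_pidfd_open 434
#endif
#ifndef __NR_process_mrelease
#define __NR_process_mrelease 448
#endif

/* Hypothetical thin wrapper; returns 0 on success, -1 with errno set. */
int process_mrelease_sketch(int pidfd, unsigned int flags)
{
	return (int)syscall(__NR_process_mrelease, pidfd, flags);
}

/*
 * Hypothetical helper: SIGKILL the target through a pidfd, then reap its
 * memory from the caller's context. Returns 0 on success, -1 with errno set.
 */
int kill_and_release(pid_t pid)
{
	int pidfd, ret, saved_errno;

	pidfd = (int)syscall(__NR_pidfd_open, pid, 0);
	if (pidfd < 0)
		return -1;

	/* The target must be dying before its memory can be released. */
	ret = (int)syscall(__NR_pidfd_send_signal, pidfd, SIGKILL, NULL, 0);
	if (ret == 0)
		ret = process_mrelease_sketch(pidfd, 0);

	/* Preserve the errno of the failing call across close(). */
	saved_errno = errno;
	close(pidfd);
	errno = saved_errno;
	return ret;
}
```

A daemon in the style of lmkd or oomd would presumably call something like kill_and_release() on its chosen victim and treat EAGAIN or EINTR as a cue to retry, per the ERRORS list in the changelog.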