From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 7F78BC433EF for ; Fri, 22 Oct 2021 17:38:59 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 0D0C56103E for ; Fri, 22 Oct 2021 17:38:59 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 0D0C56103E Authentication-Results: mail.kernel.org; dmarc=fail (p=reject dis=none) header.from=google.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=kvack.org Received: by kanga.kvack.org (Postfix) id A713C94000A; Fri, 22 Oct 2021 13:38:58 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 9D1F1940007; Fri, 22 Oct 2021 13:38:58 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 874D394000A; Fri, 22 Oct 2021 13:38:58 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0165.hostedemail.com [216.40.44.165]) by kanga.kvack.org (Postfix) with ESMTP id 71142940007 for ; Fri, 22 Oct 2021 13:38:58 -0400 (EDT) Received: from smtpin09.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay03.hostedemail.com (Postfix) with ESMTP id 3954F8249980 for ; Fri, 22 Oct 2021 17:38:58 +0000 (UTC) X-FDA: 78724783956.09.EEEF183 Received: from mail-yb1-f169.google.com (mail-yb1-f169.google.com [209.85.219.169]) by imf01.hostedemail.com (Postfix) with ESMTP id 8B6B85083D4F for ; Fri, 22 Oct 2021 17:38:52 +0000 (UTC) Received: by mail-yb1-f169.google.com with SMTP id v7so8826672ybq.0 for ; Fri, 22 Oct 2021 10:38:57 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=r/8J8TR4uHJ1X4W6TznT1rAj7h/OTN7aNlkEVgGiT3k=; b=LP7en/nydEvtAFRctjt7frpaybeS3xAVyS4NXj3p3ir55MZBGyzoBHMEyWvi+VW6PT BwXDhkwDH07ax89kj2XT2dNMx4ckyVZan4DxLHH2lIjIHbWMqvq/EbTqniXG7EwtSvS4 +CfXvJsgsY20ATw2fvtp/OyXy4SPk0ONvFKzV5hf2GBwbGKks6vtz1+DTPq9IzbICuYI kqS/klK5DiAvNo6refJCuZJZ9kMLNEf2EmQnAMmyrHFplSCk/4Q1IdGi7GrKSN/kyvPW rR3yW1l0MXUB3MAV4mE8gq4QHsCO+8bqIAkwVWFRD5Jgpo4wmosHJ7JGffPBxY3ScHba nlPg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=r/8J8TR4uHJ1X4W6TznT1rAj7h/OTN7aNlkEVgGiT3k=; b=5atyI7uNT/o/jlldiRzCsQQRFuAMD3ajwPXzypH+HpcnXXEuopsyFgu3KfgUleZymq XHxkqqTPP0wTUgMojwCnoMCfpzJstXjDGrzn4x7qJw88mJElDL4TQBI5x3FYUsKmff6L JgbJkwCd3/3tIPDX3uz0JFdwH1ruMxJ7uSVkrj/oTZRp4cyyGYpKPtPJYdNL0El9Zx4i Qf2P4hwYD3/DDJZsttsV0aEOETKaN48Xbz5gk+okvYKSkfarfMdtNysstk1ZCzlnbLcW h4cvDfnvYlBcJGndl4GcQrhU58bDr6ZmyFSmsUOVYWtpuznif1jo0VPg6bRJVuoWUhZ7 ehKA== X-Gm-Message-State: AOAM5306quu/596HOgzKdVuWQcjJJmlbmD5sMURgMj+MQVdJZ7AhpeHk R3ykieMvrLZbF+oHHOxmlgwxYs+gANeVrAMhIGCV1w== X-Google-Smtp-Source: ABdhPJzqy77u6YQ7w9HgETig0ZT17flo890EWUfxm8QGQC6SHPCyKeC6hHbuYdXWUpWZZMOnfP5nxkD02ITzgvHJkBo= X-Received: by 2002:a25:81c6:: with SMTP id n6mr1180149ybm.412.1634924336836; Fri, 22 Oct 2021 10:38:56 -0700 (PDT) MIME-Version: 1.0 References: <20211022014658.263508-1-surenb@google.com> In-Reply-To: From: Suren Baghdasaryan Date: Fri, 22 Oct 2021 10:38:45 -0700 Message-ID: Subject: Re: [PATCH 1/1] mm: prevent a race between process_mrelease and exit_mmap To: Michal Hocko Cc: Andrew Morton , David Rientjes , Matthew Wilcox , Johannes Weiner , Roman Gushchin , Rik van Riel , Minchan Kim , Christian Brauner , Christoph Hellwig , Oleg Nesterov , David Hildenbrand , Jann Horn , Shakeel Butt , Andy Lutomirski , Christian Brauner , Florian Weimer , Jan Engelhardt , Linux API , linux-mm , LKML , kernel-team Content-Type: text/plain; charset="UTF-8" X-Stat-Signature: fi4wmci8ds61x7uaiqms7y5k9jyzqm1t X-Rspamd-Server: rspam01 X-Rspamd-Queue-Id: 8B6B85083D4F Authentication-Results: imf01.hostedemail.com; dkim=pass header.d=google.com header.s=20210112 header.b="LP7en/ny"; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf01.hostedemail.com: domain of surenb@google.com designates 209.85.219.169 as permitted sender) smtp.mailfrom=surenb@google.com X-HE-Tag: 1634924332-901355 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Fri, Oct 22, 2021 at 1:03 AM Michal Hocko wrote: > > On Thu 21-10-21 18:46:58, Suren Baghdasaryan wrote: > > Race between process_mrelease and exit_mmap, where free_pgtables is > > called while __oom_reap_task_mm is in progress, leads to kernel crash > > during pte_offset_map_lock call. oom-reaper avoids this race by setting > > MMF_OOM_VICTIM flag and causing exit_mmap to take and release > > mmap_write_lock, blocking it until oom-reaper releases mmap_read_lock. > > Reusing MMF_OOM_VICTIM for process_mrelease would be the simplest way to > > fix this race, however that would be considered a hack. Fix this race > > by elevating mm->mm_users and preventing exit_mmap from executing until > > process_mrelease is finished. Patch slightly refactors the code to adapt > > for a possible mmget_not_zero failure. > > This fix has considerable negative impact on process_mrelease performance > > and will likely need later optimization. > > I am not sure there is any promise that process_mrelease will run in > parallel with the exiting process. In fact the primary purpose of this > syscall is to provide a reliable way to oom kill from user space. If you > want to optimize process exit resp. its exit_mmap part then you should > be using other means. So I would be careful calling this a regression. > > I do agree that taking the reference count is the right approach here. I > was wrong previously [1] when saying that pinning the mm struct is > sufficient. I have completely forgot about the subtle sync in exit_mmap. > One way we can approach that would be to take exclusive mmap_sem > throughout the exit_mmap unconditionally. I agree, that would probably be the cleanest way. > There was a push back against > that though so arguments would have to be re-evaluated. I'll review that discussion to better understand the reasons for the push back. Thanks for the link. > > [1] http://lkml.kernel.org/r/YQzZqFwDP7eUxwcn@dhcp22.suse.cz > > That being said > Acked-by: Michal Hocko Thanks! > > Thanks! > > > Fixes: 884a7e5964e0 ("mm: introduce process_mrelease system call") > > Signed-off-by: Suren Baghdasaryan > > --- > > mm/oom_kill.c | 23 ++++++++++++----------- > > 1 file changed, 12 insertions(+), 11 deletions(-) > > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > > index 831340e7ad8b..989f35a2bbb1 100644 > > --- a/mm/oom_kill.c > > +++ b/mm/oom_kill.c > > @@ -1150,7 +1150,7 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags) > > struct task_struct *task; > > struct task_struct *p; > > unsigned int f_flags; > > - bool reap = true; > > + bool reap = false; > > struct pid *pid; > > long ret = 0; > > > > @@ -1177,15 +1177,15 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags) > > goto put_task; > > } > > > > - mm = p->mm; > > - mmgrab(mm); > > - > > - /* If the work has been done already, just exit with success */ > > - if (test_bit(MMF_OOM_SKIP, &mm->flags)) > > - reap = false; > > - else if (!task_will_free_mem(p)) { > > - reap = false; > > - ret = -EINVAL; > > + if (mmget_not_zero(p->mm)) { > > + mm = p->mm; > > + if (task_will_free_mem(p)) > > + reap = true; > > + else { > > + /* Error only if the work has not been done already */ > > + if (!test_bit(MMF_OOM_SKIP, &mm->flags)) > > + ret = -EINVAL; > > + } > > } > > task_unlock(p); > > > > @@ -1201,7 +1201,8 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags) > > mmap_read_unlock(mm); > > > > drop_mm: > > - mmdrop(mm); > > + if (mm) > > + mmput(mm); > > put_task: > > put_task_struct(task); > > put_pid: > > -- > > 2.33.0.1079.g6e70778dc9-goog > > -- > Michal Hocko > SUSE Labs