From: ebiederm@xmission.com (Eric W. Biederman)
To: Claudio Imbrenda
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, thuth@redhat.com,
	frankja@linux.ibm.com, borntraeger@de.ibm.com,
	Ulrich.Weigand@de.ibm.com, david@redhat.com, ultrachin@163.com,
	akpm@linux-foundation.org, vbabka@suse.cz, brookxu.cn@gmail.com,
	xiaoggchen@tencent.com, linuszeng@tencent.com, yihuilu@tencent.com,
	mhocko@suse.com, daniel.m.jordan@oracle.com, axboe@kernel.dk,
	legion@kernel.org, peterz@infradead.org, aarcange@redhat.com,
	christian@brauner.io, tglx@linutronix.de
Subject: Re: [RFC v1 2/4] kernel/fork.c: implement new process_mmput_async syscall
Date: Fri, 12 Nov 2021 08:57:13 -0600
Message-ID: <87v90xv2uu.fsf@email.froward.int.ebiederm.org>
In-Reply-To: <20211112103439.441b4c12@p-imbrenda> (Claudio Imbrenda's message of "Fri, 12 Nov 2021 10:34:39 +0100")
References: <20211111095008.264412-1-imbrenda@linux.ibm.com>
	<20211111095008.264412-4-imbrenda@linux.ibm.com>
	<874k8ixzx0.fsf@email.froward.int.ebiederm.org>
	<20211112103439.441b4c12@p-imbrenda>

Claudio Imbrenda writes:

> On Thu, 11 Nov 2021 13:20:11 -0600
> ebiederm@xmission.com (Eric W. Biederman) wrote:
>
>> Claudio Imbrenda writes:
>>
>> > The goal of this new syscall is to be able to asynchronously free the
>> > mm of a dying process. This is especially useful for processes that use
>> > huge amounts of memory (e.g. databases or KVM guests). The process is
>> > allowed to terminate immediately, while its mm is cleaned/reclaimed
>> > asynchronously.
>> >
>> > A separate process needs to use the process_mmput_async syscall to attach
>> > itself to the mm of a running target process. The process will then
>> > sleep until the last user of the target mm has gone.
>> >
>> > When the last user of the mm has gone, instead of synchronously freeing
>> > the mm, the attached process is awoken. The syscall will then continue
>> > and clean up the target mm.
>> >
>> > This solution has the advantage that the cleanup of the target mm can
>> > be both asynchronous and properly accounted for (e.g. cgroups).
>> >
>> > Tested on s390x.
>> >
>> > A separate patch will actually wire up the syscall.
>>
>> I am a bit confused.
>>
>> You want the process to report that it has finished immediately,
>> and you want the cleanup work to continue on in the background.
>>
>> Why do you need a separate process?
>>
>> Why not just modify the process cleanup code to keep the task_struct
>> running while allowing waitpid to reap the process (aka allowing
>> release_task to run)?  All tasks can already be reaped after
>> exit_notify in do_exit.
>>
>> I can see some reasons for wanting an opt-in.
>> It is nice to know all of a process's resources have been freed when
>> waitpid succeeds.
>>
>> Still I don't see why this whole thing isn't exit_mm returning
>> the mm_struct when a flag is set, and then having an exit_mm_late
>> being called and passed the returned mm after exit_notify.
>
> nevermind, exit_notify is done after cgroup_exit, the teardown would
> then not be accounted properly

So you want this new mechanism so you can separate the cleanup from the
exit notification, and so that things are accounted properly.

It would have helped if you had included a link to the previous
conversation.

I think Michal Hocko has a point.  This looks like a job for
"clone(CLONE_VM)" and "prctl(PR_SET_PDEATHSIG)".  Maybe using a pidfd
instead of the prctl.

AKA just create a child that shares the parent's memory, waits for
the parent to exit, and then cleans things up.

That should not need any new kernel mechanisms.

There is the other question: why is this disastrously slow on s390?
Is this an s390-specific issue?  Can the issue be fixed by optimizing
what is happening on s390?

Eric
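
For illustration, the clone(CLONE_VM) plus prctl(PR_SET_PDEATHSIG) idea
suggested above could be sketched in userspace roughly as follows.  This
is a minimal sketch under stated assumptions, not code from the patch
series: spawn_mm_reaper(), helper_main(), demo_async_teardown() and the
report pipe are hypothetical names that exist only to make the example
observable.  The helper shares the owner's mm, sleeps until the owner
dies, and then exits; being the last user of the mm, its exit should be
where the teardown happens.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <sys/mman.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define HELPER_STACK_SIZE (256 * 1024)

struct helper_args {
	pid_t owner;     /* pid of the process whose mm the helper shares */
	int report_fd;   /* hypothetical: pipe used only to observe the demo */
};

/*
 * Runs in the helper.  It shares the owner's TLS (no CLONE_SETTLS), so it
 * sticks to thin system call wrappers.
 */
static int helper_main(void *p)
{
	struct helper_args *args = p;
	sigset_t set;
	int sig;

	/* Block SIGUSR1 so it stays pending until sigwait() collects it. */
	sigemptyset(&set);
	sigaddset(&set, SIGUSR1);
	sigprocmask(SIG_BLOCK, &set, NULL);

	/* Ask the kernel for SIGUSR1 when the parent (the owner) exits. */
	if (prctl(PR_SET_PDEATHSIG, SIGUSR1) != 0)
		_exit(1);

	/* If the owner already died before prctl(), do not wait. */
	if (getppid() == args->owner)
		sigwait(&set, &sig);

	/* The owner is gone; report, then exit as the last user of the mm. */
	if (write(args->report_fd, "x", 1) != 1)
		_exit(1);
	_exit(0);
}

/* Called by the memory-heavy process before it exits. */
static pid_t spawn_mm_reaper(int report_fd)
{
	static struct helper_args args; /* stays valid: the mm is shared */
	char *stack;

	assert(report_fd >= 0);
	stack = mmap(NULL, HELPER_STACK_SIZE, PROT_READ | PROT_WRITE,
		     MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
	if (stack == MAP_FAILED)
		return -1;

	args.owner = getpid();
	args.report_fd = report_fd;

	/* Pass the top of the mapping, assuming a downward-growing stack. */
	return clone(helper_main, stack + HELPER_STACK_SIZE, CLONE_VM, &args);
}

/* Demo driver: returns 0 if the helper outlived the owner and reported. */
static int demo_async_teardown(void)
{
	int pipefd[2], status;
	pid_t owner;
	ssize_t n;
	char c = 0;

	if (pipe(pipefd) != 0)
		return -1;

	owner = fork();
	if (owner < 0)
		return -1;
	if (owner == 0) {
		/* Stand-in for the memory-heavy process: exit immediately. */
		close(pipefd[0]);
		_exit(spawn_mm_reaper(pipefd[1]) < 0 ? 1 : 0);
	}

	close(pipefd[1]);
	if (waitpid(owner, &status, 0) != owner)
		return -1;
	if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
		return -1;

	/* Blocks until the helper writes, or returns 0 if it died silently. */
	n = read(pipefd[0], &c, 1);
	close(pipefd[0]);
	return (n == 1 && c == 'x') ? 0 : -1;
}
```

In this sketch waitpid() on the owner returns without waiting for the
helper, which matches the "terminate immediately, clean up later"
behaviour discussed in the thread.  Whether the cgroup accounting of the
teardown ends up where it is wanted is a separate question that this
example does not answer.  A pidfd-based variant would replace the
prctl()/sigwait() pair with polling a pidfd for the owner.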