Date: Fri, 29 Jan 2021 10:13:39 +0100
From: Michal Hocko
To: Suren Baghdasaryan
Cc: linux-man@vger.kernel.org, mtk.manpages@gmail.com,
 akpm@linux-foundation.org, jannh@google.com, keescook@chromium.org,
 jeffv@google.com, minchan@kernel.org, shakeelb@google.com,
 rientjes@google.com, edgararriaga@google.com, timmurray@google.com,
 linux-mm@kvack.org, selinux@vger.kernel.org,
 linux-security-module@vger.kernel.org, linux-api@vger.kernel.org,
 linux-kernel@vger.kernel.org, kernel-team@android.com
Subject: Re: [PATCH v2 1/1] process_madvise.2: Add process_madvise man page
References: <20210129070340.566340-1-surenb@google.com>
In-Reply-To: <20210129070340.566340-1-surenb@google.com>

On Thu 28-01-21 23:03:40, Suren Baghdasaryan wrote:
> Initial version of process_madvise(2) manual page.
> Initial text was extracted from [1], amended after fix [2] and more
> details added using man pages of madvise(2) and process_vm_readv(2) as
> examples. It also includes the changes to required permission proposed
> in [3].
>
> [1] https://lore.kernel.org/patchwork/patch/1297933/
> [2] https://lkml.org/lkml/2020/12/8/1282
> [3] https://patchwork.kernel.org/project/selinux/patch/20210111170622.2613577-1-surenb@google.com/#23888311
>
> Signed-off-by: Suren Baghdasaryan

Reviewed-by: Michal Hocko
Thanks!

> ---
> changes in v2:
> - Changed description of MADV_COLD per Michal Hocko's suggestion
> - Applied fixes suggested by Michael Kerrisk
>
> NAME
>     process_madvise - give advice about use of memory to a process
>
> SYNOPSIS
>     #include <sys/uio.h>
>
>     ssize_t process_madvise(int pidfd,
>                             const struct iovec *iovec,
>                             unsigned long vlen,
>                             int advice,
>                             unsigned int flags);
>
> DESCRIPTION
>     The process_madvise() system call is used to give advice or directions
>     to the kernel about the address ranges of other processes as well as of
>     the calling process. It provides the advice to address ranges of process
>     described by iovec and vlen. The goal of such advice is to improve system
>     or application performance.
>
>     The pidfd argument is a PID file descriptor (see pidfd_open(2)) that
>     specifies the process to which the advice is to be applied.
>
>     The pointer iovec points to an array of iovec structures, defined in
>     <sys/uio.h> as:
>
>         struct iovec {
>             void  *iov_base;    /* Starting address */
>             size_t iov_len;     /* Number of bytes to transfer */
>         };
>
>     The iovec structure describes address ranges beginning at iov_base address
>     and with the size of iov_len bytes.
>
>     The vlen represents the number of elements in the iovec structure.
>
>     The advice argument is one of the values listed below.
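
For readers following along, here is a minimal sketch of how the arguments
described above fit together. It is an illustration under stated assumptions,
not something taken from the patch: the helper names requested_bytes() and
advise_cold() are hypothetical, the raw syscall(2) route is used on the
assumption that the C library offers no wrapper, and the fallback numbers
(434 for pidfd_open, 440 for process_madvise, 20 for MADV_COLD) are the
x86-64/generic values, used only when the installed headers lack them.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Assumed fallbacks for older headers (x86-64 / generic syscall table). */
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif
#ifndef SYS_process_madvise
#define SYS_process_madvise 440
#endif
#ifndef MADV_COLD
#define MADV_COLD 20
#endif

/*
 * Hypothetical helper: total number of bytes described by the iovec
 * array, i.e. the sum of iov_len over vlen elements.
 */
size_t requested_bytes(const struct iovec *iov, unsigned long vlen)
{
	size_t total = 0;

	assert(iov != NULL || vlen == 0);
	for (unsigned long i = 0; i < vlen; i++)
		total += iov[i].iov_len;
	return total;
}

/*
 * Hypothetical wrapper: open a pidfd for pid, apply MADV_COLD to the
 * ranges in iov (addresses are interpreted in the target process), and
 * return what process_madvise() returned. flags is passed as 0, as the
 * man page text requires.
 */
ssize_t advise_cold(pid_t pid, const struct iovec *iov, unsigned long vlen)
{
	ssize_t ret;
	int saved_errno;
	int pidfd = (int)syscall(SYS_pidfd_open, pid, 0U);

	if (pidfd < 0)
		return -1;

	ret = (ssize_t)syscall(SYS_process_madvise, pidfd, iov, vlen,
			       MADV_COLD, 0U);

	/* Preserve errno from the advice call across close(). */
	saved_errno = errno;
	close(pidfd);
	errno = saved_errno;
	return ret;
}
```

A caller could compare the value returned by advise_cold() with
requested_bytes() for the same array; a smaller non-negative result would
correspond to the partial advice case covered under RETURN VALUE.
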
>
>    Linux-specific advice values
>     The following Linux-specific advice values have no counterparts in the
>     POSIX-specified posix_madvise(3), and may or may not have counterparts
>     in the madvise(2) interface available on other implementations.
>
>     MADV_COLD (since Linux 5.4.1)
>         Deactivate a given range of pages which will make them a more
>         probable reclaim target should there be memory pressure. This is a
>         non-destructive operation. The advice might be ignored for some
>         pages in the range when it is not applicable.
>
>     MADV_PAGEOUT (since Linux 5.4.1)
>         Reclaim a given range of pages. This is done to free up memory occupied
>         by these pages. If a page is anonymous it will be swapped out. If a
>         page is file-backed and dirty it will be written back to the backing
>         storage. The advice might be ignored for some pages in the range when
>         it is not applicable.
>
>     The flags argument is reserved for future use; currently, this argument
>     must be specified as 0.
>
>     The value specified in the vlen argument must be less than or equal to
>     IOV_MAX (defined in <limits.h> or accessible via the call
>     sysconf(_SC_IOV_MAX)).
>
>     The vlen and iovec arguments are checked before applying any hints. If
>     the vlen is too big, or iovec is invalid, an error will be returned
>     immediately.
>
>     The hint might be applied to a part of iovec if one of its elements points
>     to an invalid memory region in the remote process. No further elements will
>     be processed beyond that point.
>
>     Permission to provide a hint to another process is governed by a ptrace
>     access mode PTRACE_MODE_READ_REALCREDS check (see ptrace(2)); in addition,
>     the caller must have the CAP_SYS_ADMIN capability due to performance
>     implications of applying the hint.
>
> RETURN VALUE
>     On success, process_madvise() returns the number of bytes advised. This
>     return value may be less than the total number of requested bytes, if an
>     error occurred after some iovec elements were already processed.
>     The caller should check the return value to determine whether a partial
>     advice occurred.
>
>     On error, -1 is returned and errno is set to indicate the error.
>
> ERRORS
>     EFAULT The memory described by iovec is outside the accessible address
>            space of the process referred to by pidfd.
>     EINVAL flags is not 0.
>     EINVAL The sum of the iov_len values of iovec overflows a ssize_t value.
>     EINVAL vlen is too large.
>     ENOMEM Could not allocate memory for internal copies of the iovec
>            structures.
>     EPERM  The caller does not have permission to access the address space of
>            the process pidfd.
>     ESRCH  The target process does not exist (i.e., it has terminated and been
>            waited on).
>     EBADF  pidfd is not a valid PID file descriptor.
>
> VERSIONS
>     This system call first appeared in Linux 5.10. Support for this system
>     call is optional, depending on the setting of the CONFIG_ADVISE_SYSCALLS
>     configuration option.
>
> SEE ALSO
>     madvise(2), pidfd_open(2), process_vm_readv(2), process_vm_writev(2)
>
>  man2/process_madvise.2 | 222 +++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 222 insertions(+)
>  create mode 100644 man2/process_madvise.2
>
> diff --git a/man2/process_madvise.2 b/man2/process_madvise.2
> new file mode 100644
> index 000000000..07553289f
> --- /dev/null
> +++ b/man2/process_madvise.2
> @@ -0,0 +1,222 @@
> +.\" Copyright (C) 2021 Suren Baghdasaryan
> +.\" and Copyright (C) 2021 Minchan Kim
> +.\"
> +.\" %%%LICENSE_START(VERBATIM)
> +.\" Permission is granted to make and distribute verbatim copies of this
> +.\" manual provided the copyright notice and this permission notice are
> +.\" preserved on all copies.
> +.\"
> +.\" Permission is granted to copy and distribute modified versions of this
> +.\" manual under the conditions for verbatim copying, provided that the
> +.\" entire resulting derived work is distributed under the terms of a
> +.\" permission notice identical to this one.
> +.\"
> +.\" Since the Linux kernel and libraries are constantly changing, this
> +.\" manual page may be incorrect or out-of-date. The author(s) assume no
> +.\" responsibility for errors or omissions, or for damages resulting from
> +.\" the use of the information contained herein. The author(s) may not
> +.\" have taken the same level of care in the production of this manual,
> +.\" which is licensed free of charge, as they might when working
> +.\" professionally.
> +.\"
> +.\" Formatted or processed versions of this manual, if unaccompanied by
> +.\" the source, must acknowledge the copyright and authors of this work.
> +.\" %%%LICENSE_END
> +.\"
> +.\" Commit ecb8ac8b1f146915aa6b96449b66dd48984caacc
> +.\"
> +.TH PROCESS_MADVISE 2 2021-01-12 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +process_madvise \- give advice about use of memory to a process
> +.SH SYNOPSIS
> +.nf
> +.B #include <sys/uio.h>
> +.PP
> +.BI "ssize_t process_madvise(int " pidfd ,
> +.BI "                        const struct iovec *" iovec ,
> +.BI "                        unsigned long " vlen ,
> +.BI "                        int " advice ,
> +.BI "                        unsigned int " flags ");"
> +.fi
> +.SH DESCRIPTION
> +The
> +.BR process_madvise ()
> +system call is used to give advice or directions to the kernel about the
> +address ranges of other processes as well as of the calling process.
> +It provides the advice to address ranges of process described by
> +.I iovec
> +and
> +.IR vlen .
> +The goal of such advice is to improve system or application performance.
> +.PP
> +The
> +.I pidfd
> +argument is a PID file descriptor (see
> +.BR pidfd_open (2))
> +that specifies the process to which the advice is to be applied.
> +.PP
> +The pointer
> +.I iovec
> +points to an array of
> +.I iovec
> +structures, defined in
> +.IR <sys/uio.h>
> +as:
> +.PP
> +.in +4n
> +.EX
> +struct iovec {
> +    void  *iov_base;    /* Starting address */
> +    size_t iov_len;     /* Number of bytes to transfer */
> +};
> +.EE
> +.in
> +.PP
> +The
> +.I iovec
> +structure describes address ranges beginning at
> +.I iov_base
> +address and with the size of
> +.I iov_len
> +bytes.
> +.PP
> +The
> +.I vlen
> +represents the number of elements in the
> +.I iovec
> +structure.
> +.PP
> +The
> +.I advice
> +argument is one of the values listed below.
> +.\"
> +.\" ======================================================================
> +.\"
> +.SS Linux-specific advice values
> +The following Linux-specific
> +.I advice
> +values have no counterparts in the POSIX-specified
> +.BR posix_madvise (3),
> +and may or may not have counterparts in the
> +.BR madvise (2)
> +interface available on other implementations.
> +.TP
> +.BR MADV_COLD " (since Linux 5.4.1)"
> +.\" commit 9c276cc65a58faf98be8e56962745ec99ab87636
> +Deactivate a given range of pages which will make them a more probable
> +reclaim target should there be memory pressure.
> +This is a non-destructive operation.
> +The advice might be ignored for some pages in the range when it is not
> +applicable.
> +.TP
> +.BR MADV_PAGEOUT " (since Linux 5.4.1)"
> +.\" commit 1a4e58cce84ee88129d5d49c064bd2852b481357
> +Reclaim a given range of pages.
> +This is done to free up memory occupied by these pages.
> +If a page is anonymous it will be swapped out.
> +If a page is file-backed and dirty it will be written back to the backing
> +storage.
> +The advice might be ignored for some pages in the range when it is not
> +applicable.
> +.PP
> +The
> +.I flags
> +argument is reserved for future use; currently, this argument must be
> +specified as 0.
> +.PP
> +The value specified in the
> +.I vlen
> +argument must be less than or equal to
> +.BR IOV_MAX
> +(defined in
> +.I <limits.h>
> +or accessible via the call
> +.IR sysconf(_SC_IOV_MAX) ).
> +.PP
> +The
> +.I vlen
> +and
> +.I iovec
> +arguments are checked before applying any hints.
> +If the
> +.I vlen
> +is too big, or
> +.I iovec
> +is invalid, an error will be returned immediately.
> +.PP
> +The hint might be applied to a part of
> +.I iovec
> +if one of its elements points to an invalid memory region in the
> +remote process.
> +No further elements will be processed beyond that point.
> +.PP
> +Permission to provide a hint to another process is governed by a
> +ptrace access mode
> +.B PTRACE_MODE_READ_REALCREDS
> +check (see
> +.BR ptrace (2));
> +in addition, the caller must have the
> +.B CAP_SYS_ADMIN
> +capability due to performance implications of applying the hint.
> +.SH RETURN VALUE
> +On success, process_madvise() returns the number of bytes advised.
> +This return value may be less than the total number of requested bytes,
> +if an error occurred after some iovec elements were already processed.
> +The caller should check the return value to determine whether a partial
> +advice occurred.
> +.PP
> +On error, \-1 is returned and
> +.I errno
> +is set to indicate the error.
> +.SH ERRORS
> +.TP
> +.B EFAULT
> +The memory described by
> +.I iovec
> +is outside the accessible address space of the process referred to by
> +.IR pidfd .
> +.TP
> +.B EINVAL
> +.I flags
> +is not 0.
> +.TP
> +.B EINVAL
> +The sum of the
> +.I iov_len
> +values of
> +.I iovec
> +overflows a
> +.I ssize_t
> +value.
> +.TP
> +.B EINVAL
> +.I vlen
> +is too large.
> +.TP
> +.B ENOMEM
> +Could not allocate memory for internal copies of the
> +.I iovec
> +structures.
> +.TP
> +.B EPERM
> +The caller does not have permission to access the address space of the process
> +.IR pidfd .
> +.TP
> +.B ESRCH
> +The target process does not exist (i.e., it has terminated and been waited on).
> +.TP
> +.B EBADF
> +.I pidfd
> +is not a valid PID file descriptor.
> +.SH VERSIONS
> +This system call first appeared in Linux 5.10.
> +.\" commit ecb8ac8b1f146915aa6b96449b66dd48984caacc
> +Support for this system call is optional,
> +depending on the setting of the
> +.B CONFIG_ADVISE_SYSCALLS
> +configuration option.
> +.SH SEE ALSO
> +.BR madvise (2),
> +.BR pidfd_open (2),
> +.BR process_vm_readv (2),
> +.BR process_vm_writev (2)
> --
> 2.30.0.365.g02bc693789-goog
>

--
Michal Hocko
SUSE Labs