Subject: Re: [PATCH v3] madvise.2: Document MADV_POPULATE_READ and MADV_POPULATE_WRITE
To: linux-man@vger.kernel.org
Cc: Pankaj Gupta, Alejandro Colomar, Michael Kerrisk, Andrew Morton, Michal Hocko, Oscar Salvador, Jann Horn, Mike Rapoport, Linux API, linux-mm@kvack.org
References: <20210823120645.8223-1-david@redhat.com>
From: David Hildenbrand
Organization: Red Hat
Message-ID: <50357269-d227-5fda-a450-c47b035b9586@redhat.com>
Date: Tue, 28 Sep 2021 18:34:31 +0200
In-Reply-To: <20210823120645.8223-1-david@redhat.com>

On 23.08.21 14:06, David Hildenbrand wrote:
> MADV_POPULATE_READ and MADV_POPULATE_WRITE have been merged into
> upstream Linux via commit 4ca9b3859dac ("mm/madvise: introduce
> MADV_POPULATE_(READ|WRITE) to prefault page tables"), part of v5.14-rc1.
>
> Further, commit eb2faa513c24 ("mm/madvise: report SIGBUS as -EFAULT for
> MADV_POPULATE_(READ|WRITE)"), part of v5.14-rc6, made sure that SIGBUS is
> converted to -EFAULT instead of -EINVAL.
>
> Let's document the behavior and error conditions of these new madvise()
> options.
>
> Acked-by: Pankaj Gupta
> Cc: Alejandro Colomar
> Cc: Michael Kerrisk
> Cc: Andrew Morton
> Cc: Michal Hocko
> Cc: Oscar Salvador
> Cc: Jann Horn
> Cc: Mike Rapoport
> Cc: Linux API
> Cc: linux-mm@kvack.org
> Signed-off-by: David Hildenbrand
> ---
>
> v2 -> v3:
> - Refine what "populating readable/writable" means
> - Compare each version with MAP_POPULATE and give an example use case
> - Reword SIGBUS handling
> - Reword comment regarding special mappings and also add memfd_secret(2)
> - Reference MADV_HWPOISON when talking about HW poisoned pages
> - Minor cosmetic fixes
>
> v1 -> v2:
> - Use semantic newlines in all cases
> - Add two missing "
> - Document -EFAULT handling
> - Rephrase some parts to make it more generic: VM_PFNMAP and VM_IO are only
>   examples for special mappings
>
> ---
>  man2/madvise.2 | 156 +++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 156 insertions(+)
>
> diff --git a/man2/madvise.2 b/man2/madvise.2
> index f1f384c0c..37f6dd6fa 100644
> --- a/man2/madvise.2
> +++ b/man2/madvise.2
> @@ -469,6 +469,106 @@ If a page is file-backed and dirty, it will be written back to the backing
>  storage.
>  The advice might be ignored for some pages in the range when it is not
>  applicable.
> +.TP
> +.BR MADV_POPULATE_READ " (since Linux 5.14)"
> +Populate (prefault) page tables readable,
> +faulting in all pages in the range just as if manually reading from each page;
> +however,
> +avoid the actual memory access that would have been performed after handling
> +the fault.
> +.IP
> +In contrast to
> +.BR MAP_POPULATE ,
> +.B MADV_POPULATE_READ
> +does not hide errors,
> +can be applied to (parts of) existing mappings and will always populate
> +(prefault) page tables readable.
> +One example use case is prefaulting a file mapping,
> +reading all file content from disk;
> +however,
> +pages won't be dirtied and consequently won't have to be written back to disk
> +when evicting the pages from memory.
> +.IP
> +Depending on the underlying mapping,
> +map the shared zeropage,
> +preallocate memory or read the underlying file;
> +files with holes might or might not preallocate blocks.
> +If populating fails,
> +a
> +.B SIGBUS
> +signal is not generated; instead, an error is returned.
> +.IP
> +If
> +.B MADV_POPULATE_READ
> +succeeds,
> +all page tables have been populated (prefaulted) readable once.
> +If
> +.B MADV_POPULATE_READ
> +fails,
> +some page tables might have been populated.
> +.IP
> +.B MADV_POPULATE_READ
> +cannot be applied to mappings without read permissions
> +and special mappings,
> +for example,
> +mappings marked with kernel-internal flags such as
> +.B VM_PFNMAP
> +or
> +.BR VM_IO ,
> +or secret memory regions created using
> +.BR memfd_secret (2).
> +.IP
> +Note that with
> +.BR MADV_POPULATE_READ ,
> +the process can be killed at any moment when the system runs out of memory.
> +.TP
> +.BR MADV_POPULATE_WRITE " (since Linux 5.14)"
> +Populate (prefault) page tables writable,
> +faulting in all pages in the range just as if manually writing to each
> +page;
> +however,
> +avoid the actual memory access that would have been performed after handling
> +the fault.
> +.IP
> +In contrast to
> +.BR MAP_POPULATE ,
> +MADV_POPULATE_WRITE does not hide errors,
> +can be applied to (parts of) existing mappings and will always populate
> +(prefault) page tables writable.
> +One example use case is preallocating memory,
> +breaking any CoW (Copy on Write).
> +.IP
> +Depending on the underlying mapping,
> +preallocate memory or read the underlying file;
> +files with holes will preallocate blocks.
> +If populating fails,
> +a
> +.B SIGBUS
> +signal is not generated; instead, an error is returned.
> +.IP
> +If
> +.B MADV_POPULATE_WRITE
> +succeeds,
> +all page tables have been populated (prefaulted) writable once.
> +If
> +.B MADV_POPULATE_WRITE
> +fails,
> +some page tables might have been populated.
> +.IP
> +.B MADV_POPULATE_WRITE
> +cannot be applied to mappings without write permissions
> +and special mappings,
> +for example,
> +mappings marked with kernel-internal flags such as
> +.B VM_PFNMAP
> +or
> +.BR VM_IO ,
> +or secret memory regions created using
> +.BR memfd_secret (2).
> +.IP
> +Note that with
> +.BR MADV_POPULATE_WRITE ,
> +the process can be killed at any moment when the system runs out of memory.
>  .SH RETURN VALUE
>  On success,
>  .BR madvise ()
> @@ -490,6 +590,22 @@ A kernel resource was temporarily unavailable.
>  .B EBADF
>  The map exists, but the area maps something that isn't a file.
>  .TP
> +.B EFAULT
> +.I advice
> +is
> +.B MADV_POPULATE_READ
> +or
> +.BR MADV_POPULATE_WRITE ,
> +and populating (prefaulting) page tables failed because a
> +.B SIGBUS
> +would have been generated on actual memory access and the reason is not a
> +HW poisoned page
> +(HW poisoned pages can,
> +for example,
> +be created using the
> +.B MADV_HWPOISON
> +flag described elsewhere in this page).
> +.TP
>  .B EINVAL
>  .I addr
>  is not page-aligned or
> @@ -533,6 +649,22 @@ or
>  .BR VM_PFNMAP
>  ranges.
>  .TP
> +.B EINVAL
> +.I advice
> +is
> +.B MADV_POPULATE_READ
> +or
> +.BR MADV_POPULATE_WRITE ,
> +but the specified address range includes ranges with insufficient permissions
> +or special mappings,
> +for example,
> +mappings marked with kernel-internal flags such as
> +.B VM_IO
> +or
> +.BR VM_PFNMAP ,
> +or secret memory regions created using
> +.BR memfd_secret (2).
> +.TP
>  .B EIO
>  (for
>  .BR MADV_WILLNEED )
> @@ -548,6 +680,15 @@ Not enough memory: paging in failed.
>  Addresses in the specified range are not currently
>  mapped, or are outside the address space of the process.
>  .TP
> +.B ENOMEM
> +.I advice
> +is
> +.B MADV_POPULATE_READ
> +or
> +.BR MADV_POPULATE_WRITE ,
> +and populating (prefaulting) page tables failed because there was not enough
> +memory.
> +.TP
>  .B EPERM
>  .I advice
>  is
> @@ -555,6 +696,20 @@ is
>  but the caller does not have the
>  .B CAP_SYS_ADMIN
>  capability.
> +.TP
> +.B EHWPOISON
> +.I advice
> +is
> +.B MADV_POPULATE_READ
> +or
> +.BR MADV_POPULATE_WRITE ,
> +and populating (prefaulting) page tables failed because a HW poisoned page
> +(HW poisoned pages can,
> +for example,
> +be created using the
> +.B MADV_HWPOISON
> +flag described elsewhere in this page)
> +was encountered.
>  .SH VERSIONS
>  Since Linux 3.18,
>  .\" commit d3ac21cacc24790eb45d735769f35753f5b56ceb
> @@ -602,6 +757,7 @@ from the system call, as it should).
>  .\" function first appeared in 4.4BSD.
>  .SH SEE ALSO
>  .BR getrlimit (2),
> +.BR memfd_secret (2),
>  .BR mincore (2),
>  .BR mmap (2),
>  .BR mprotect (2),
>

Gentle ping.

-- 
Thanks,

David / dhildenb
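
For anyone following the thread who wants to try the documented behaviour, here is a minimal sketch in C of preallocating an anonymous mapping with MADV_POPULATE_WRITE. The helper names prefault_writable() and alloc_prefaulted() are made up for illustration and are not part of any library. The fallback constants assume the values from the generic Linux uapi header, for toolchains whose libc headers predate Linux 5.14. Kernels older than 5.14 are expected to reject the advice with EINVAL, so the sketch reports the error to the caller instead of treating it as fatal.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Assumption: fallback values taken from the generic Linux uapi header
 * (asm-generic/mman-common.h), used only if libc does not define them.
 */
#ifndef MADV_POPULATE_READ
#define MADV_POPULATE_READ 22
#endif
#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE 23
#endif

/*
 * Hypothetical helper: prefault [addr, addr + len) writable.
 * Returns 0 on success or a negative errno value; no SIGBUS is raised
 * on failure, the error is reported through the return value.
 */
int prefault_writable(void *addr, size_t len)
{
	if (madvise(addr, len, MADV_POPULATE_WRITE) == 0)
		return 0;
	return -errno;
}

/*
 * Hypothetical helper: map len bytes of private anonymous memory and try
 * to populate it writable. The mapping is returned even if populating
 * failed (for example with -EINVAL on kernels without this advice);
 * *populate_err then holds the negative errno value. Returns NULL only
 * if mmap() itself failed.
 */
void *alloc_prefaulted(size_t len, int *populate_err)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED) {
		*populate_err = -errno;
		return NULL;
	}
	*populate_err = prefault_writable(p, len);
	return p;
}
```

A caller would do something like "int err; char *buf = alloc_prefaulted(len, &err);" and, when err is -EINVAL, either carry on with lazy faulting or touch the pages manually. Keeping the mapping on a populate failure is a design choice of this sketch: as the text above says, some page tables might already have been populated, and the memory remains usable either way.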