Date: Mon, 27 Feb 2023 19:36:23 -0500
From: Peter Xu
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: Andrea Arcangeli, Andrew Morton, Muhammad Usama Anjum, Mike Rapoport, Axel Rasmussen, Nadav Amit, David Hildenbrand
Subject: Re: [PATCH v2] mm/uffd: UFFD_FEATURE_WP_UNPOPULATED
References: <20230227230044.1596744-1-peterx@redhat.com>
In-Reply-To: <20230227230044.1596744-1-peterx@redhat.com>

On Mon, Feb 27, 2023 at 06:00:44PM -0500, Peter Xu wrote:
> This is a new feature that controls how uffd-wp handles none ptes.  When
> it's set, the kernel will handle anonymous memory the same way as file
> memory, by allowing the user to wr-protect unpopulated ptes.
>
> File memory handles none ptes consistently by allowing wr-protecting of
> none ptes, because it is unknown whether the page cache exists or not.
> For anonymous memory it was not as consistent, because we used to assume
> that we don't need protections on none ptes or known zero pages.
>
> One use case of such a feature bit is VM live snapshot: without
> wr-protecting empty ptes, the snapshot can contain random rubbish in the
> holes of the anonymous memory, which can cause the guest to misbehave when
> the guest OS assumes the pages should be all zeros.
>
> QEMU worked around it by pre-populating the section with reads to fill in
> zero page entries before starting the whole snapshot process [1].
>
> Recently another need was raised for using userfaultfd wr-protect for
> detecting dirty pages (to replace soft-dirty in some cases) [2].
> In that case, without being able to wr-protect none ptes by default, the
> dirty info can get lost, since we cannot treat every none pte as dirty
> (the current design identifies a page as dirty based on the uffd-wp bit
> being cleared).
>
> In general, we want to be able to wr-protect empty ptes too, even for
> anonymous memory.
>
> This patch implements UFFD_FEATURE_WP_UNPOPULATED so that uffd-wp handling
> of none ptes is consistent no matter what the memory type is underneath.
> It doesn't have any impact on file memory so far, because we already have
> pte markers taking care of that.  So it only affects anonymous memory.
>
> The feature bit is off by default, so the old behavior is maintained.
> Sometimes the old behavior may be wanted, because wr-protecting none ptes
> adds overhead not only during UFFDIO_WRITEPROTECT (by applying pte markers
> to anonymous memory), but also when creating the pgtables to store the pte
> markers.  So there's potentially less chance of using thp on the first
> fault for a none pmd or larger than a pmd.
>
> The major implementation part is teaching the whole kernel to understand
> pte markers even for anonymously mapped ranges, meanwhile allowing the
> UFFDIO_WRITEPROTECT ioctl to apply pte markers for anonymous memory too
> when the new feature bit is set.
>
> Note that even though the patch subject starts with mm/uffd, there are a
> few small refactors to the major mm path of handling anonymous page
> faults.  But they should be straightforward.
>
> So far, add a very light smoke test within the userfaultfd kselftest
> pagemap unit test to make sure anon pte markers work.
>
> [1] https://lore.kernel.org/all/20210401092226.102804-4-andrey.gruzdev@virtuozzo.com/
> [2] https://lore.kernel.org/all/Y+v2HJ8+3i%2FKzDBu@x1n/
>
> Signed-off-by: Peter Xu
> ---
> v1->v2:
> - Use pte markers rather than populating zero pages when protecting [David]
> - Rename WP_ZEROPAGE to WP_UNPOPULATED [David]

Some very initial performance numbers are below, as requested (test
program at [1]).  The measurement is the time spent, in "us", when
wr-protecting a 10G range of empty but mapped memory.  I only ran it in a
VM, assuming we'll get similar results on bare metal.

Four test cases:

  - default UFFDIO_WP
  - pre-read the memory, then UFFDIO_WP (what QEMU does right now)
  - pre-fault using MADV_POPULATE_READ, then default UFFDIO_WP
  - UFFDIO_WP with WP_UNPOPULATED

Results:

  Test DEFAULT: 2
  Test PRE-READ: 3277099 (pre-fault 3253826)
  Test MADVISE: 2250361 (pre-fault 2226310)
  Test WP-UNPOPULATE: 20850

I'll add this information into the commit message when there's a new
version.

[1] https://github.com/xzpeter/clibs/blob/master/uffd-test/uffd-wp-perf.c

-- 
Peter Xu
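
To make the numbers above easier to compare, here is a small worked
calculation.  It is only a sketch under two assumptions that the mail does
not state: that "10G" means 10 GiB and that the base page size is 4 KiB.
The helper names (wp_only_us, ns_per_page, print_summary) are made up for
illustration and are not part of the test program.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumptions (not stated in the mail): 10 GiB range, 4 KiB base pages. */
#define RANGE_BYTES (10ULL << 30)
#define PAGE_BYTES  4096ULL

/* Time left once the reported pre-fault phase is subtracted from the total. */
static uint64_t wp_only_us(uint64_t total_us, uint64_t prefault_us)
{
    assert(total_us >= prefault_us); /* guard against unsigned underflow */
    return total_us - prefault_us;
}

/* Average cost per page, in nanoseconds, for a run that took total_us. */
static double ns_per_page(uint64_t total_us)
{
    uint64_t pages = RANGE_BYTES / PAGE_BYTES;
    return (double)total_us * 1000.0 / (double)pages;
}

/* Print the derived figures for the results reported in the mail. */
static void print_summary(void)
{
    printf("pages in range:        %llu\n",
           (unsigned long long)(RANGE_BYTES / PAGE_BYTES));
    printf("PRE-READ wp step:      %llu us\n",
           (unsigned long long)wp_only_us(3277099ULL, 3253826ULL));
    printf("MADVISE wp step:       %llu us\n",
           (unsigned long long)wp_only_us(2250361ULL, 2226310ULL));
    printf("PRE-READ ns/page:      %.2f\n", ns_per_page(3277099ULL));
    printf("MADVISE ns/page:       %.2f\n", ns_per_page(2250361ULL));
    printf("WP-UNPOPULATE ns/page: %.2f\n", ns_per_page(20850ULL));
}
```

Calling print_summary() from a main() prints the derived values.  Under
these assumptions, the wr-protect step that remains after pre-faulting
(23273 us for PRE-READ, 24051 us for MADVISE) is of the same order as the
whole WP-UNPOPULATE run (20850 us), so nearly all of the difference between
the cases comes from the pre-fault phase itself.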