From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 938EEC3DA6E
	for <linux-mm@archiver.kernel.org>; Wed, 20 Dec 2023 10:56:36 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id AC3818D0002; Wed, 20 Dec 2023 05:56:35 -0500 (EST)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id A73A08D0001; Wed, 20 Dec 2023 05:56:35 -0500 (EST)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 89F1B8D0002; Wed, 20 Dec 2023 05:56:35 -0500 (EST)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0016.hostedemail.com [216.40.44.16])
	by kanga.kvack.org (Postfix) with ESMTP id 6DA728D0001
	for <linux-mm@kvack.org>; Wed, 20 Dec 2023 05:56:35 -0500 (EST)
Received: from smtpin15.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay07.hostedemail.com (Postfix) with ESMTP id 3E0591601FC
	for <linux-mm@kvack.org>; Wed, 20 Dec 2023 10:56:35 +0000 (UTC)
X-FDA: 81586893150.15.E424D66
Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124])
	by imf24.hostedemail.com (Postfix) with ESMTP id CC5A6180003
	for <linux-mm@kvack.org>; Wed, 20 Dec 2023 10:56:32 +0000 (UTC)
Authentication-Results: imf24.hostedemail.com;
	dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=WWl1VPKr;
	dmarc=pass (policy=none) header.from=redhat.com;
	spf=pass (imf24.hostedemail.com: domain of david@redhat.com designates 170.10.129.124 as permitted sender) smtp.mailfrom=david@redhat.com
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com;
	s=arc-20220608; t=1703069793;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references:dkim-signature;
	bh=3LDXbStXynX0T4Mxxob6LJQXBFex+Ms+cavTOWNyphA=;
	b=fjm5oxgJnmBzqLF6V+dwTBd8YRYCNTAvxuCqo1h85vtZ1Qiyt+9zKWYtPW/x1xEE58jO08
	lNu/CXq5xnmpz6tI/nys2Vv9TRmAp26VC9l/3ZTcNTBSvMqyEdkfkDZyMkk2lW8qC9xltb
	tyIy7+qrH2USMmG7TnubINiRkD2u/5o=
ARC-Authentication-Results: i=1;
	imf24.hostedemail.com;
	dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=WWl1VPKr;
	dmarc=pass (policy=none) header.from=redhat.com;
	spf=pass (imf24.hostedemail.com: domain of david@redhat.com designates 170.10.129.124 as permitted sender) smtp.mailfrom=david@redhat.com
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1703069793; a=rsa-sha256;
	cv=none;
	b=7WrG4UOvs4m/Y64BTx52QTf9Z78L8d6McBAq56piC7QS7VlGF+EwYn1dCrDqLR6ziRBDhW
	EnCfv86LQ5UsABXcoYhRQj/dKc3qX+luKp6yur+zmFjKRP6pcYh1srKi3Q94tWfII/n3/z
	XJk6bRLdNj6WPzdxK9nEUWVsMpL8ckk=
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1703069792;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references:autocrypt:autocrypt;
	bh=3LDXbStXynX0T4Mxxob6LJQXBFex+Ms+cavTOWNyphA=;
	b=WWl1VPKrJdkprHnbXrWILrFlAW1q84Gaf0tkw2InTOK0iMwwRRxttqDedY9vFVgQQp7mH+
	Q38MKgx2ioUsgj3D6kPudH3gWqYnAHw89h1kMz4Q/kCr9FWclYyBD4tuH8/WoDRLO4V8He
	lrksRAcXaaq8mnpQTnwcFpiHk4PNc+o=
Received: from mail-wm1-f71.google.com (mail-wm1-f71.google.com
 [209.85.128.71]) by relay.mimecast.com with ESMTP with STARTTLS
 (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id
 us-mta-615-JqdX5QdePAOnAbPcpDty2Q-1; Wed, 20 Dec 2023 05:56:30 -0500
X-MC-Unique: JqdX5QdePAOnAbPcpDty2Q-1
Received: by mail-wm1-f71.google.com with SMTP id 5b1f17b1804b1-40c25973861so41645555e9.2
        for <linux-mm@kvack.org>; Wed, 20 Dec 2023 02:56:30 -0800 (PST)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20230601; t=1703069789; x=1703674589;
        h=content-transfer-encoding:in-reply-to:organization:autocrypt:from
         :references:cc:to:content-language:subject:user-agent:mime-version
         :date:message-id:x-gm-message-state:from:to:cc:subject:date
         :message-id:reply-to;
        bh=3LDXbStXynX0T4Mxxob6LJQXBFex+Ms+cavTOWNyphA=;
        b=BjLefBn23vFqrwyYrIBLun1C/7lz305aHmKYVihVtDnc9LhMqKDdf1PNaGS2lYT9f6
         DVGXIelrpgzohvIPTmHcGCTuZvmEMlcJmVwXGsFjAAum2RrDFbK0w9WMOaaFgvUkmXyH
         zJi7n86SmPGX51PV8NLSulYjkf4nf31AFeWW2ct30Lq2boj/pjre/ZIWVUQuyFn814cU
         x7l4dlVB7krMAtWijN/SL4bIfNcSkA8JSwOrfwH240WWcaBBvm3ApnlUBU4rUb7llEZi
         lcPFa99zPl9h4ID1olM5QipTKaHV3m/HXFPGiwM6QXGjaL0/yOKIahZzwy10b5xo86A7
         Mxeg==
X-Gm-Message-State: AOJu0YylK/2+NoQ95TjjYyVGbZ2G1FdGay/cYAwTaCJNpgEEUopjfj5j
	vBiDmBg2lDUBRkKdlzhAYt3kb9IXz28bJyUCo2V2OUYtuDLNfrEpOKB02tJ1EQQYF5ZoqfxOOvW
	Y3uJAjpUbvyI=
X-Received: by 2002:a1c:7c11:0:b0:40c:55a7:772f with SMTP id x17-20020a1c7c11000000b0040c55a7772fmr7289066wmc.146.1703069789333;
        Wed, 20 Dec 2023 02:56:29 -0800 (PST)
X-Google-Smtp-Source: AGHT+IH3ab73NcCHmA2HFIUnD86YS7awwuMGPGKp7xdcA2cfKbsQgb25u6HQPW7k40LvQYe14aRuyQ==
X-Received: by 2002:a1c:7c11:0:b0:40c:55a7:772f with SMTP id x17-20020a1c7c11000000b0040c55a7772fmr7289047wmc.146.1703069788835;
        Wed, 20 Dec 2023 02:56:28 -0800 (PST)
Received: from ?IPV6:2003:cb:c73b:eb00:8e25:6953:927:1802? (p200300cbc73beb008e25695309271802.dip0.t-ipconnect.de. [2003:cb:c73b:eb00:8e25:6953:927:1802])
        by smtp.gmail.com with ESMTPSA id n11-20020a05600c3b8b00b0040b3d8907fesm6114763wms.29.2023.12.20.02.56.27
        (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128);
        Wed, 20 Dec 2023 02:56:28 -0800 (PST)
Message-ID: <699cb1db-51eb-460e-9ceb-1ce08ca03050@redhat.com>
Date: Wed, 20 Dec 2023 11:56:26 +0100
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH v4 02/16] mm: Batch-copy PTE ranges during fork()
To: Ryan Roberts <ryan.roberts@arm.com>,
 Catalin Marinas <catalin.marinas@arm.com>, Will Deacon <will@kernel.org>,
 Ard Biesheuvel <ardb@kernel.org>, Marc Zyngier <maz@kernel.org>,
 Oliver Upton <oliver.upton@linux.dev>, James Morse <james.morse@arm.com>,
 Suzuki K Poulose <suzuki.poulose@arm.com>, Zenghui Yu
 <yuzenghui@huawei.com>, Andrey Ryabinin <ryabinin.a.a@gmail.com>,
 Alexander Potapenko <glider@google.com>,
 Andrey Konovalov <andreyknvl@gmail.com>, Dmitry Vyukov <dvyukov@google.com>,
 Vincenzo Frascino <vincenzo.frascino@arm.com>,
 Andrew Morton <akpm@linux-foundation.org>,
 Anshuman Khandual <anshuman.khandual@arm.com>,
 Matthew Wilcox <willy@infradead.org>, Yu Zhao <yuzhao@google.com>,
 Mark Rutland <mark.rutland@arm.com>, Kefeng Wang
 <wangkefeng.wang@huawei.com>, John Hubbard <jhubbard@nvidia.com>,
 Zi Yan <ziy@nvidia.com>, Barry Song <21cnbao@gmail.com>,
 Alistair Popple <apopple@nvidia.com>, Yang Shi <shy828301@gmail.com>
Cc: linux-arm-kernel@lists.infradead.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org
References: <20231218105100.172635-1-ryan.roberts@arm.com>
 <20231218105100.172635-3-ryan.roberts@arm.com>
 <0bef5423-6eea-446b-8854-980e9c23a948@redhat.com>
 <db1be625-33e4-4d07-8500-3f7d3c8f9937@arm.com>
 <be8b5181-be2c-4800-ba53-c65a6c3ed803@redhat.com>
 <dd227e51-c4b2-420b-a92a-65da85ab4018@arm.com>
 <7c0236ad-01f3-437f-8b04-125d69e90dc0@redhat.com>
 <9a58b1a2-2c13-4fa0-8ffa-2b3d9655f1b6@arm.com>
 <28968568-f920-47ac-b6fd-87528ffd8f77@redhat.com>
 <10b0b562-c1c0-4a66-9aeb-a6bff5c218f6@arm.com>
 <8f8023cb-3c31-4ead-a9e6-03a10e9490c6@redhat.com>
 <da16a7e5-76dd-4150-9ade-54b0d227a1e1@arm.com>
From: David Hildenbrand <david@redhat.com>
Autocrypt: addr=david@redhat.com; keydata=
 xsFNBFXLn5EBEAC+zYvAFJxCBY9Tr1xZgcESmxVNI/0ffzE/ZQOiHJl6mGkmA1R7/uUpiCjJ
 dBrn+lhhOYjjNefFQou6478faXE6o2AhmebqT4KiQoUQFV4R7y1KMEKoSyy8hQaK1umALTdL
 QZLQMzNE74ap+GDK0wnacPQFpcG1AE9RMq3aeErY5tujekBS32jfC/7AnH7I0v1v1TbbK3Gp
 XNeiN4QroO+5qaSr0ID2sz5jtBLRb15RMre27E1ImpaIv2Jw8NJgW0k/D1RyKCwaTsgRdwuK
 Kx/Y91XuSBdz0uOyU/S8kM1+ag0wvsGlpBVxRR/xw/E8M7TEwuCZQArqqTCmkG6HGcXFT0V9
 PXFNNgV5jXMQRwU0O/ztJIQqsE5LsUomE//bLwzj9IVsaQpKDqW6TAPjcdBDPLHvriq7kGjt
 WhVhdl0qEYB8lkBEU7V2Yb+SYhmhpDrti9Fq1EsmhiHSkxJcGREoMK/63r9WLZYI3+4W2rAc
 UucZa4OT27U5ZISjNg3Ev0rxU5UH2/pT4wJCfxwocmqaRr6UYmrtZmND89X0KigoFD/XSeVv
 jwBRNjPAubK9/k5NoRrYqztM9W6sJqrH8+UWZ1Idd/DdmogJh0gNC0+N42Za9yBRURfIdKSb
 B3JfpUqcWwE7vUaYrHG1nw54pLUoPG6sAA7Mehl3nd4pZUALHwARAQABzSREYXZpZCBIaWxk
 ZW5icmFuZCA8ZGF2aWRAcmVkaGF0LmNvbT7CwZgEEwEIAEICGwMGCwkIBwMCBhUIAgkKCwQW
 AgMBAh4BAheAAhkBFiEEG9nKrXNcTDpGDfzKTd4Q9wD/g1oFAl8Ox4kFCRKpKXgACgkQTd4Q
 9wD/g1oHcA//a6Tj7SBNjFNM1iNhWUo1lxAja0lpSodSnB2g4FCZ4R61SBR4l/psBL73xktp
 rDHrx4aSpwkRP6Epu6mLvhlfjmkRG4OynJ5HG1gfv7RJJfnUdUM1z5kdS8JBrOhMJS2c/gPf
 wv1TGRq2XdMPnfY2o0CxRqpcLkx4vBODvJGl2mQyJF/gPepdDfcT8/PY9BJ7FL6Hrq1gnAo4
 3Iv9qV0JiT2wmZciNyYQhmA1V6dyTRiQ4YAc31zOo2IM+xisPzeSHgw3ONY/XhYvfZ9r7W1l
 pNQdc2G+o4Di9NPFHQQhDw3YTRR1opJaTlRDzxYxzU6ZnUUBghxt9cwUWTpfCktkMZiPSDGd
 KgQBjnweV2jw9UOTxjb4LXqDjmSNkjDdQUOU69jGMUXgihvo4zhYcMX8F5gWdRtMR7DzW/YE
 BgVcyxNkMIXoY1aYj6npHYiNQesQlqjU6azjbH70/SXKM5tNRplgW8TNprMDuntdvV9wNkFs
 9TyM02V5aWxFfI42+aivc4KEw69SE9KXwC7FSf5wXzuTot97N9Phj/Z3+jx443jo2NR34XgF
 89cct7wJMjOF7bBefo0fPPZQuIma0Zym71cP61OP/i11ahNye6HGKfxGCOcs5wW9kRQEk8P9
 M/k2wt3mt/fCQnuP/mWutNPt95w9wSsUyATLmtNrwccz63XOwU0EVcufkQEQAOfX3n0g0fZz
 Bgm/S2zF/kxQKCEKP8ID+Vz8sy2GpDvveBq4H2Y34XWsT1zLJdvqPI4af4ZSMxuerWjXbVWb
 T6d4odQIG0fKx4F8NccDqbgHeZRNajXeeJ3R7gAzvWvQNLz4piHrO/B4tf8svmRBL0ZB5P5A
 2uhdwLU3NZuK22zpNn4is87BPWF8HhY0L5fafgDMOqnf4guJVJPYNPhUFzXUbPqOKOkL8ojk
 CXxkOFHAbjstSK5Ca3fKquY3rdX3DNo+EL7FvAiw1mUtS+5GeYE+RMnDCsVFm/C7kY8c2d0G
 NWkB9pJM5+mnIoFNxy7YBcldYATVeOHoY4LyaUWNnAvFYWp08dHWfZo9WCiJMuTfgtH9tc75
 7QanMVdPt6fDK8UUXIBLQ2TWr/sQKE9xtFuEmoQGlE1l6bGaDnnMLcYu+Asp3kDT0w4zYGsx
 5r6XQVRH4+5N6eHZiaeYtFOujp5n+pjBaQK7wUUjDilPQ5QMzIuCL4YjVoylWiBNknvQWBXS
 lQCWmavOT9sttGQXdPCC5ynI+1ymZC1ORZKANLnRAb0NH/UCzcsstw2TAkFnMEbo9Zu9w7Kv
 AxBQXWeXhJI9XQssfrf4Gusdqx8nPEpfOqCtbbwJMATbHyqLt7/oz/5deGuwxgb65pWIzufa
 N7eop7uh+6bezi+rugUI+w6DABEBAAHCwXwEGAEIACYCGwwWIQQb2cqtc1xMOkYN/MpN3hD3
 AP+DWgUCXw7HsgUJEqkpoQAKCRBN3hD3AP+DWrrpD/4qS3dyVRxDcDHIlmguXjC1Q5tZTwNB
 boaBTPHSy/Nksu0eY7x6HfQJ3xajVH32Ms6t1trDQmPx2iP5+7iDsb7OKAb5eOS8h+BEBDeq
 3ecsQDv0fFJOA9ag5O3LLNk+3x3q7e0uo06XMaY7UHS341ozXUUI7wC7iKfoUTv03iO9El5f
 XpNMx/YrIMduZ2+nd9Di7o5+KIwlb2mAB9sTNHdMrXesX8eBL6T9b+MZJk+mZuPxKNVfEQMQ
 a5SxUEADIPQTPNvBewdeI80yeOCrN+Zzwy/Mrx9EPeu59Y5vSJOx/z6OUImD/GhX7Xvkt3kq
 Er5KTrJz3++B6SH9pum9PuoE/k+nntJkNMmQpR4MCBaV/J9gIOPGodDKnjdng+mXliF3Ptu6
 3oxc2RCyGzTlxyMwuc2U5Q7KtUNTdDe8T0uE+9b8BLMVQDDfJjqY0VVqSUwImzTDLX9S4g/8
 kC4HRcclk8hpyhY2jKGluZO0awwTIMgVEzmTyBphDg/Gx7dZU1Xf8HFuE+UZ5UDHDTnwgv7E
 th6RC9+WrhDNspZ9fJjKWRbveQgUFCpe1sa77LAw+XFrKmBHXp9ZVIe90RMe2tRL06BGiRZr
 jPrnvUsUUsjRoRNJjKKA/REq+sAnhkNPPZ/NNMjaZ5b8Tovi8C0tmxiCHaQYqj7G2rgnT0kt
 WNyWQQ==
Organization: Red Hat
In-Reply-To: <da16a7e5-76dd-4150-9ade-54b0d227a1e1@arm.com>
X-Mimecast-Spam-Score: 0
X-Mimecast-Originator: redhat.com
Content-Language: en-US
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
X-Rspamd-Queue-Id: CC5A6180003
X-Rspam-User: 
X-Rspamd-Server: rspam04
X-Stat-Signature: 6c1jijayzp5j9ohoa3ontgsyjczrccoa
X-HE-Tag: 1703069792-445148
X-HE-Meta: U2FsdGVkX1/ouOUA1KLFfl8jmaxQrMGXZnC/Ct/2ajMIi8oC5GJB7bfCw/X+R7xK3DRnzJinQouL5N8EpowYt8WMgr2M684M5Gok450nXga/VN4zZjXzpX5VcWSNTAyzIeHWISlg9WjIVOqye8PgHtCY7a7hYTOV6GpWGAlG3nbNRC7hCu2nkMwQFc/eSdlGUfEF3PxCbH5r3hKJiitJndj0ZZEQzjefGH+lPdwkDe59MLfc03OPycPNtCL7YucZt3kh80j+RIvE00EpHxepUQlzbL1GzMMJoSurtyZPUqYtYC93LtsJRXWR0UlVTmY+zEkoJXQFBvPqH4mtLCzeVv/z4Nh1HZSvAHeCkn7dKyvbVLSn3N7Jr38B0pHEZucNwC/qSj13zFRLE8r1COWIbicrOuweE+aksKjAyFNwZMpodPd5HrGMF4hK9ZKdp9V38Ru6heCGI2UhCMiU2A7c+DDO12kX2yVgkRHeCEoqXXPtLDrbHLJcVMaj7X3n9vByWTv/9PTyC5vvQMMyolrhmVpZUEMxwWdGd8KbBjElZwTtmbbo6a2lu46ySwx26aUTdNG9och6yJyh+uBAFzVV96Gepwc0KtKRZxok7Jpqp1S8fWDR3U/gRo45BUsMRra7T1IcngOOiFbSNdIX8u9UOK84P9hhxxP/o2BQxdneDg1ktI/AVoCf6RAQVCj23BUfsOpmkDn6951EzGqdAZ/bK6oe5UuBj67Abtfj6WVW26aMkkXVik5u7zmIrvgSS0wnb3H9L6KXEOHBmF50kvv8b2d5hmXshCWkCpYaYdJvnIHd2T/i5zbhaHbnQLrzVJDPOcTRMHIrsxv0v6w/GHjYBt+PTUutwY7bCTxxZ0+YioQ78b3lXsx1+HdZRh3I8J7DhRROEO8oGzSl408Gg1kqceJut0amlT6g1t4wCw3INwMDSXP9Dcl+x5MzYNpZu7CT94XMjYLa3HCs9tyDs/r
 L6UXxwh5
 8Jfm8CC/EeS0+7CLPWl8tdJ+oKR1a4g/zxbJm2HnCq6bBfLFPh6y5/T8aG2WHEe8lOe8hjzEuN9A7DalwCL10lrk591MM3UvIOll5FXSf+8AZtnvjLaF9MTwziQvFyoHxnkQ/AUwRbs5Q4X6AG2xy6tuOvrAe54D9gQNgFRB8/pksrLST3FBAzzKOlgjh+HRqivvnk4Zk/P45BB4tRFhFUtmmW/eUvKr1vPN9ThSREqi2eUYtI9oYkCKzWl+5QRclJUp4IL40cH3k+KQVzHG5x4nT/hl64U3EKmXVNnQIqaDh05jX/MGpCyAKeI2s7Ku+tVPxlTt8TPRhN34OLrbqKCtYs2GhoAB+phPrBMmdy3B5w6w8otbEBbTwhXQ416vEITCS56JnsEKLkxTXNvg4FTpEFoOOA62mRwE4SMD/JYCEPejeqjDhh4SE/f7I4KjmJKMqHkpM5BaqoBX6cmc4tvm+tpGbVge4p9SBaEAlvwoeYWu0mJ7VeOQy0jxHnGARrPm6yQE+8+cuWxSPYToNHYhlDD3sddqSZvSFgipbgQ1xzaIXLaeJ3ZRik1cCrsZ0xCCR
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

On 20.12.23 11:41, Ryan Roberts wrote:
> On 20/12/2023 10:16, David Hildenbrand wrote:
>> On 20.12.23 11:11, Ryan Roberts wrote:
>>> On 20/12/2023 09:54, David Hildenbrand wrote:
>>>> On 20.12.23 10:51, Ryan Roberts wrote:
>>>>> On 20/12/2023 09:17, David Hildenbrand wrote:
>>>>>> On 19.12.23 18:42, Ryan Roberts wrote:
>>>>>>> On 19/12/2023 17:22, David Hildenbrand wrote:
>>>>>>>> On 19.12.23 09:30, Ryan Roberts wrote:
>>>>>>>>> On 18/12/2023 17:47, David Hildenbrand wrote:
>>>>>>>>>> On 18.12.23 11:50, Ryan Roberts wrote:
>>>>>>>>>>> Convert copy_pte_range() to copy a batch of ptes in one go. A given
>>>>>>>>>>> batch is determined by the architecture with the new helper,
>>>>>>>>>>> pte_batch_remaining(), and maps a physically contiguous block of memory,
>>>>>>>>>>> all belonging to the same folio. A pte batch is then write-protected in
>>>>>>>>>>> one go in the parent using the new helper, ptep_set_wrprotects() and is
>>>>>>>>>>> set in one go in the child using the new helper, set_ptes_full().
>>>>>>>>>>>
>>>>>>>>>>> The primary motivation for this change is to reduce the number of tlb
>>>>>>>>>>> maintenance operations that the arm64 backend has to perform during
>>>>>>>>>>> fork, as it is about to add transparent support for the "contiguous bit"
>>>>>>>>>>> in its ptes. By write-protecting the parent using the new
>>>>>>>>>>> ptep_set_wrprotects() (note the 's' at the end) function, the backend
>>>>>>>>>>> can avoid having to unfold contig ranges of PTEs, which is expensive,
>>>>>>>>>>> when all ptes in the range are being write-protected. Similarly, by
>>>>>>>>>>> using set_ptes_full() rather than set_pte_at() to set up ptes in the
>>>>>>>>>>> child, the backend does not need to fold a contiguous range once they
>>>>>>>>>>> are all populated - they can be initially populated as a contiguous
>>>>>>>>>>> range in the first place.
>>>>>>>>>>>
>>>>>>>>>>> This code is very performance sensitive, and a significant amount of
>>>>>>>>>>> effort has been put into not regressing performance for the order-0
>>>>>>>>>>> folio case. By default, pte_batch_remaining() is compile constant 1,
>>>>>>>>>>> which enables the compiler to simplify the extra loops that are added
>>>>>>>>>>> for batching and produce code that is equivalent (and equally
>>>>>>>>>>> performant) as the previous implementation.
>>>>>>>>>>>
>>>>>>>>>>> This change addresses the core-mm refactoring only and a separate change
>>>>>>>>>>> will implement pte_batch_remaining(), ptep_set_wrprotects() and
>>>>>>>>>>> set_ptes_full() in the arm64 backend to realize the performance
>>>>>>>>>>> improvement as part of the work to enable contpte mappings.
>>>>>>>>>>>
>>>>>>>>>>> To ensure the arm64 is performant once implemented, this change is very
>>>>>>>>>>> careful to only call ptep_get() once per pte batch.
>>>>>>>>>>>
>>>>>>>>>>> The following microbenchmark results demonstate that there is no
>>>>>>>>>>> significant performance change after this patch. Fork is called in a
>>>>>>>>>>> tight loop in a process with 1G of populated memory and the time for the
>>>>>>>>>>> function to execute is measured. 100 iterations per run, 8 runs
>>>>>>>>>>> performed on both Apple M2 (VM) and Ampere Altra (bare metal). Tests
>>>>>>>>>>> performed for case where 1G memory is comprised of order-0 folios and
>>>>>>>>>>> case where comprised of pte-mapped order-9 folios. Negative is faster,
>>>>>>>>>>> positive is slower, compared to baseline upon which the series is based:
>>>>>>>>>>>
>>>>>>>>>>> | Apple M2 VM   | order-0 (pte-map) | order-9 (pte-map) |
>>>>>>>>>>> | fork          |-------------------|-------------------|
>>>>>>>>>>> | microbench    |    mean |   stdev |    mean |   stdev |
>>>>>>>>>>> |---------------|---------|---------|---------|---------|
>>>>>>>>>>> | baseline      |    0.0% |    1.1% |    0.0% |    1.2% |
>>>>>>>>>>> | after-change  |   -1.0% |    2.0% |   -0.1% |    1.1% |
>>>>>>>>>>>
>>>>>>>>>>> | Ampere Altra  | order-0 (pte-map) | order-9 (pte-map) |
>>>>>>>>>>> | fork          |-------------------|-------------------|
>>>>>>>>>>> | microbench    |    mean |   stdev |    mean |   stdev |
>>>>>>>>>>> |---------------|---------|---------|---------|---------|
>>>>>>>>>>> | baseline      |    0.0% |    1.0% |    0.0% |    0.1% |
>>>>>>>>>>> | after-change  |   -0.1% |    1.2% |   -0.1% |    0.1% |
>>>>>>>>>>>
>>>>>>>>>>> Tested-by: John Hubbard <jhubbard@nvidia.com>
>>>>>>>>>>> Reviewed-by: Alistair Popple <apopple@nvidia.com>
>>>>>>>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>>> ---
>>>>>>>>>>>        include/linux/pgtable.h | 80 +++++++++++++++++++++++++++++++++++
>>>>>>>>>>>        mm/memory.c             | 92
>>>>>>>>>>> ++++++++++++++++++++++++++---------------
>>>>>>>>>>>        2 files changed, 139 insertions(+), 33 deletions(-)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>>>>>>>> index af7639c3b0a3..db93fb81465a 100644
>>>>>>>>>>> --- a/include/linux/pgtable.h
>>>>>>>>>>> +++ b/include/linux/pgtable.h
>>>>>>>>>>> @@ -205,6 +205,27 @@ static inline int pmd_young(pmd_t pmd)
>>>>>>>>>>>        #define arch_flush_lazy_mmu_mode()    do {} while (0)
>>>>>>>>>>>        #endif
>>>>>>>>>>>        +#ifndef pte_batch_remaining
>>>>>>>>>>> +/**
>>>>>>>>>>> + * pte_batch_remaining - Number of pages from addr to next batch
>>>>>>>>>>> boundary.
>>>>>>>>>>> + * @pte: Page table entry for the first page.
>>>>>>>>>>> + * @addr: Address of the first page.
>>>>>>>>>>> + * @end: Batch ceiling (e.g. end of vma).
>>>>>>>>>>> + *
>>>>>>>>>>> + * Some architectures (arm64) can efficiently modify a contiguous
>>>>>>>>>>> batch of
>>>>>>>>>>> ptes.
>>>>>>>>>>> + * In such cases, this function returns the remaining number of pages to
>>>>>>>>>>> the end
>>>>>>>>>>> + * of the current batch, as defined by addr. This can be useful when
>>>>>>>>>>> iterating
>>>>>>>>>>> + * over ptes.
>>>>>>>>>>> + *
>>>>>>>>>>> + * May be overridden by the architecture, else batch size is always 1.
>>>>>>>>>>> + */
>>>>>>>>>>> +static inline unsigned int pte_batch_remaining(pte_t pte, unsigned long
>>>>>>>>>>> addr,
>>>>>>>>>>> +                        unsigned long end)
>>>>>>>>>>> +{
>>>>>>>>>>> +    return 1;
>>>>>>>>>>> +}
>>>>>>>>>>> +#endif
>>>>>>>>>>
>>>>>>>>>> It's a shame we now lose the optimization for all other archtiectures.
>>>>>>>>>>
>>>>>>>>>> Was there no way to have some basic batching mechanism that doesn't
>>>>>>>>>> require
>>>>>>>>>> arch
>>>>>>>>>> specifics?
>>>>>>>>>
>>>>>>>>> I tried a bunch of things but ultimately the way I've done it was the only
>>>>>>>>> way
>>>>>>>>> to reduce the order-0 fork regression to 0.
>>>>>>>>>
>>>>>>>>> My original v3 posting was costing 5% extra and even my first attempt at an
>>>>>>>>> arch-specific version that didn't resolve to a compile-time constant 1
>>>>>>>>> still
>>>>>>>>> cost an extra 3%.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I'd have thought that something very basic would have worked like:
>>>>>>>>>>
>>>>>>>>>> * Check if PTE is the same when setting the PFN to 0.
>>>>>>>>>> * Check that PFN is consecutive
>>>>>>>>>> * Check that all PFNs belong to the same folio
>>>>>>>>>
>>>>>>>>> I haven't tried this exact approach, but I'd be surprised if I can get the
>>>>>>>>> regression under 4% with this. Further along the series I spent a lot of
>>>>>>>>> time
>>>>>>>>> having to fiddle with the arm64 implementation; every conditional and every
>>>>>>>>> memory read (even when in cache) was a problem. There is just so little in
>>>>>>>>> the
>>>>>>>>> inner loop that every instruction matters. (At least on Ampere Altra and
>>>>>>>>> Apple
>>>>>>>>> M2).
>>>>>>>>>
>>>>>>>>> Of course if you're willing to pay that 4-5% for order-0 then the
>>>>>>>>> benefit to
>>>>>>>>> order-9 is around 10% in my measurements. Personally though, I'd prefer to
>>>>>>>>> play
>>>>>>>>> safe and ensure the common order-0 case doesn't regress, as you previously
>>>>>>>>> suggested.
>>>>>>>>>
>>>>>>>>
>>>>>>>> I just hacked something up, on top of my beloved rmap cleanup/batching
>>>>>>>> series. I
>>>>>>>> implemented very generic and simple batching for large folios (all PTE bits
>>>>>>>> except the PFN have to match).
>>>>>>>>
>>>>>>>> Some very quick testing (don't trust each last % ) on Intel(R) Xeon(R)
>>>>>>>> Silver
>>>>>>>> 4210R CPU.
>>>>>>>>
>>>>>>>> order-0: 0.014210 -> 0.013969
>>>>>>>>
>>>>>>>> -> Around 1.7 % faster
>>>>>>>>
>>>>>>>> order-9: 0.014373 -> 0.009149
>>>>>>>>
>>>>>>>> -> Around 36.3 % faster
>>>>>>>
>>>>>>> Well I guess that shows me :)
>>>>>>>
>>>>>>> I'll do a review and run the tests on my HW to see if it concurs.
>>>>>>
>>>>>>
>>>>>> I pushed a simple compile fixup (we need pte_next_pfn()).
>>>>>
>>>>> I've just been trying to compile and noticed this. Will take a look at your
>>>>> update.
>>>>>
>>>>> But upon review, I've noticed the part that I think makes this difficult for
>>>>> arm64 with the contpte optimization; You are calling ptep_get() for every
>>>>> pte in
>>>>> the batch. While this is functionally correct, once arm64 has the contpte
>>>>> changes, its ptep_get() has to read every pte in the contpte block in order to
>>>>> gather the access and dirty bits. So if your batching function ends up wealking
>>>>> a 16 entry contpte block, that will cause 16 x 16 reads, which kills
>>>>> performance. That's why I added the arch-specific pte_batch_remaining()
>>>>> function; this allows the core-mm to skip to the end of the contpte block and
>>>>> avoid ptep_get() for the 15 tail ptes. So we end up with 16 READ_ONCE()s
>>>>> instead
>>>>> of 256.
>>>>>
>>>>> I considered making a ptep_get_noyoungdirty() variant, which would avoid the
>>>>> bit
>>>>> gathering. But we have a similar problem in zap_pte_range() and that function
>>>>> needs the dirty bit to update the folio. So it doesn't work there. (see patch 3
>>>>> in my series).
>>>>>
>>>>> I guess you are going to say that we should combine both approaches, so that
>>>>> your batching loop can skip forward an arch-provided number of ptes? That would
>>>>> certainly work, but feels like an orthogonal change to what I'm trying to
>>>>> achieve :). Anyway, I'll spend some time playing with it today.
>>>>
>>>> You can overwrite the function or add special-casing internally, yes.
>>>>
>>>> Right now, your patch is called "mm: Batch-copy PTE ranges during fork()" and it
>>>> doesn't do any of that besides preparing for some arm64 work.
>>>>
>>>
>>> Well it allows an arch to opt-in to batching. But I see your point.
>>>
>>> How do you want to handle your patches? Do you want to clean them up and I'll
>>> base my stuff on top? Or do you want me to take them and sort it all out?
>>
>> Whatever you prefer, it was mostly a quick prototype to see if we can achieve
>> decent performance.
> 
> I'm about to run it on Altra and M2. But I assume it will show similar results.
> 
>>
>> I can fixup the arch thingies (most should be easy, some might require a custom
>> pte_next_pfn())
> 
> Well if you're happy to do that, great! I'm keen to get the contpte stuff into
> v6.9 if at all possible, and I'm concious that I'm introducing more dependencies
> on you. And its about to be holiday season...

There is still plenty of time for 6.9. I'll try to get the rmap cleanup 
finished asap.

> 
>> and you can focus on getting cont-pte sorted out on top [I
>> assume that's what you want to work on :) ].
> 
> That's certainly what I'm focussed on. But I'm happy to do whatever is required
> to get it over the line. I guess I'll start by finishing my review of your v1
> rmap stuff.

I'm planning on sending out a new version today.

> 
>>
>>>
>>> As I see it at the moment, I would keep your folio_pte_batch() always core, but
>>> in subsequent patch, have it use pte_batch_remaining() (the arch function I have
>>> in my series, which defaults to one).
>>
>> Just double-checking, how would it use pte_batch_remaining() ?
> 
> I think something like this would do it (untested):
> 
> static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
> 		pte_t *start_ptep, pte_t pte, int max_nr)
> {
> 	unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
> 	pte_t expected_pte = pte_next_pfn(pte);
> 	pte_t *ptep = start_ptep;
> 	int nr;
> 
> 	for (;;) {
> 		nr = min(max_nr, pte_batch_remaining());
> 		ptep += nr;
> 		max_nr -= nr;
> 
> 		if (max_nr == 0)
> 			break;
> 

expected_pte would be messed up. We'd have to increment it a couple of 
times to make it match the nr of pages we're skipping.

> 		pte = ptep_get(ptep);
> 
> 		/* Do all PTE bits match, and the PFN is consecutive? */
> 		if (!pte_same(pte, expected_pte))
> 			break;
> 
> 		/*
> 		 * Stop immediately once we reached the end of the folio. In
> 		 * corner cases the next PFN might fall into a different
> 		 * folio.
> 		 */
> 		if (pte_pfn(pte) == folio_end_pfn - 1)
> 			break;
> 
> 		expected_pte = pte_next_pfn(expected_pte);
> 	}
> 
> 	return ptep - start_ptep;
> }
> 
> Of course, if we have the concept of a "pte batch" in the core-mm, then we might
> want to call the arch's thing something different; pte span? pte cont? pte cont
> batch? ...

So, you mean something like

/*
  * The architecture might be able to tell us efficiently using cont-pte
  * bits how many next PTEs are certainly compatible. So in that case,
  * simply skip forward.
  */
nr = min(max_nr, nr_cont_ptes(ptep));
...

I wonder if something simple at the start of the function might be good 
enough for arm with cont-pte as a first step:

nr = nr_cont_ptes(start_ptep)
if (nr != 1) {
	return min(max_nr, nr);
}

Which would get optimized out on other architectures.


-- 
Cheers,

David / dhildenb