Message-ID: <9378073b-06bb-47b4-8435-32aba67d5845@redhat.com>
Date: Wed, 11 Dec 2024 15:49:12 +0100
Subject: Re: The future of PageAnonExclusive
From: David Hildenbrand
Organization: Red Hat
To: Ryan Roberts, "linux-mm@kvack.org"
Cc: Matthew Wilcox
References: <9c2a17af-4df8-42f0-93c8-83133b6104fd@redhat.com> <3a9e7c3b-7b69-4f16-80ea-b1a0d4dba853@redhat.com> <41e5c2c6-01bb-40ab-a7bf-5b4e827377f8@arm.com>
In-Reply-To: <41e5c2c6-01bb-40ab-a7bf-5b4e827377f8@arm.com>

On 11.12.24 15:25, Ryan Roberts wrote:
> On 11/12/2024 11:56, David Hildenbrand wrote:
>> Now CCing the correct Willy :)
>>
>> On 11.12.24 12:55, David Hildenbrand wrote:
>>> Hi,
>>>
>>> PageAnonExclusive (PAE) is working very reliably at this point. But
>>> especially in the context of THPs (large folios) we'd like to do better:
>>>
>>> (1) For PTE-mapped THP, we have to maintain it per page. We'd like to
>>>     avoid per-page flags as much as possible (e.g., waste in "struct
>>>     page", touching many cachelines).
>
> Presumably also important for the Glorious Future where struct page is just a
> pointer and struct folio (et al) is allocated dynamically?

I think Willy mentioned that there might be ways to encode it in the
8 bytes for the "tail" pages.

>
>>>
>>> (2) We currently have to use atomics to set/clear the flag, even when
>>>     working on tail pages. While there would be ways to mitigate this
>>>     when modifying the flags of multiple tail pages (bitlock protecting
>>>     all tail page flag updates), I'd much rather avoid messing with
>>>     tail page flags at all.
>>>
>>>
>>> In general, the PAE bit can be considered an extended PTE bit that we
>>> currently store in the "struct page" that is mapped by the PTE. Ideally,
>>> we'd just store that information in the PTE, or alongside the PTE:
>>>
>>> A writable PTE implies PAE. A write-protected PTE needs additional
>>> information whether it is PAE (-> whether we can just remap it writable,
>>> FOLL_FORCE to it, PIN it ...).
>>>
>>> We are out of PTE bits, especially when having to implement it across
>>> *all* architectures. That's one of the reasons we went with PAE back
>>> then. As a nice side-effect it allowed for sanity checks when unpinning
>>> folios (-> PAE must still be set, which is impossible when the
>>> information is stored in the PTE).
>>>
>>>
>>> There are 3 main approaches I've been looking into:
>>>
>>> (A) Make it a per-folio flag.
>>>     I've spent endless hours trying to get it
>>>     conceptually right, but it's just a big pain: as soon as we clear
>>>     the flag, we have to make sure that all PTEs are write-protected,
>>>     that the folio is not pinned, and that concurrent GUP cannot work.
>>>     So far the page table lock protected the PAE bit, but with a
>>>     per-folio flag that is not guaranteed for THPs.
>>>
>>>     fork() with things like VM_DONTCOPY, VM_DONTFORK, early-abort, page
>>>     migration/swapout that can happen any time during fork etc. make
>>>     this really tough to get right with THPs. My head hurts any time I
>>>     think about it.
>>>
>>>     While I think fork() itself can be handled, the concurrent page
>>>     migration / swapout is where it gets extremely tricky.
>>>
>>>     This can be done somehow I'm sure, but the devil is in the corner
>>>     cases when having multiple PTEs mapping a large folio. We'd still
>>>     need atomics to set/clear the single folio flag, because of the
>>>     nature of concurrent folio flag updates.
>>>
>>> (B) Allocate additional metadata (PAE bitmap) for page tables, protected
>>>     by the PTL. That is, when we want to *clear* PAE (fork(), KSM), we'd
>>>     lazily allocate the bitmap and store it for our page table.
>>>
>>>     On x86: 512 PTEs -> 512 bits -> 64 bytes
>>>
>>>     fork() gets a bit more expensive, because we'd have to allocate this
>>>     bitmap for the parent and the child page table we are working on, so
>>>     we can mark the pages as "!PAE" in both page tables.
>>>
>>>     This could work, I have not prototyped it. We'd have to support it
>>>     on the PTE/PMD/PUD-table level.
>>>
>>>     One tricky thing is having multiple pagetables per page, but I
>>>     assume it can be handled (we should have a single PT lock for all of
>>>     them IIRC, and only need to address the bitmap at the right offset).
>>>
>>>     Another challenge is how to link to this metadata from ptdesc on all
>>>     archs. So far, __page_mapping is unused, and could maybe be used to
>>>     link to such metadata -- at least page tables can be identified
>>>     reliably using the page type.
>
> FWIW, I did a prototype of this sort of thing a while back to try to create some
> extra (general purpose) PTE bits for arm64. There is already a union of various
> arch-specific things in ptdesc, none of which were used by arm64 so I just added
> an arm64 field to that. That doesn't help you though.

Right.

>
> As I recall it all got horrible because I couldn't read the extra bits
> atomically with the rest of the PTE and I couldn't convince myself that it was
> always safe for lockless walkers. (I think we had a fairly long thread talking
> about it). Anyway, I suspect that this is not a problem for your case because
> you'll be operating at a higher level where you can always guarantee the PTL is
> held?

Well, there is GUP-fast, without any locking :(

Lazily allocating it would probably work: if there is no bitmap pointer,
everything is exclusive. If there is a bitmap pointer, it cannot vanish.
But the RCU freeing of page tables etc. does not make this particularly
easy to implement.

>
> arm64 uses slab allocator for its top level tables when that level is not an
> entire page. So there is no ptdesc to attach to in that case. My prototype
> swerved that by disallowing block mappings at the top level.

After writing it here, I realized that lazy allocation will be a bit of
a problem for remapping a PMD-mapped THP using PTEs. The page table we
deposit would already need to have that bitmap allocated; otherwise we
might not have the bitmap when remapping a PMD-mapped THP that is
shared ... and we are not guaranteed to be able to allocate memory at
that point.

>
>>>
>>> (C) Encode it in the PTE.
>>>
>>>     pte_write() -> PAE
>>>
>>>     !pte_write() && pte_dirty() -> PAE
>>>
>>>     !pte_write() && !pte_dirty() -> !PAE
>>>
>>>     That implies that, when wrprotecting a PTE, we'd have to move the
>>>     dirty bit to the folio. When wr-unprotecting it, we could mark the
>>>     PTE dirty if the folio is dirty.
>>>
>>>     I suspect that most anon folios are dirty most of the time either
>>>     way, and the common case of having them just writable in the PTE
>>>     wouldn't change.
>>>
>>>     The main idea is that nobody (including HW) should ever be marking a
>>>     readonly PTE dirty (that is the theory behind it). We have to take
>>>     good care whenever we modify/query the dirty bit or modify the
>>>     writable bit.
>>>
>>>     There is quite some code to audit/sanitize. Further, we'd have to
>>>     decouple softdirty PTE handling from dirty PTE handling (pte_mkdirty
>>>     sets the pte softdirty), and adjust arm64 cont-pte and similar PTE
>>>     batching code to respect the per-PTE dirty bit when
>>>     the PTE is write-protected.
>>>
>>>     This would be the most elegant solution, but requires a bit of care
>>>     + sanity checks.
>
> This sounds like it could all get quite fragile to me. Lots of potential to get
> accidentally broken over time...

It could be fairly well sanity checked I think. Of all the options, it's
the clearest regarding locking, memory allocation ... and the rules can
be documented extremely easily. Whereas (A) is just a nightmare, and I
get the feeling that (B) is as well.

Yeah, ideally we'd have a spare PTE bit, but that is pretty much out of
the picture .. :(

>
>>>
>>>
>>> Any thoughts or other ideas?
>
> What happened to the idea in your "every mapping counts" paper?

I'm still working on getting something simpler upstream (I sent a v1 for
MM owner tracking, which I am reworking as we speak to be a bit simpler
and maybe also work for 32bit ...
somehow so we can enable it unconditionally) first. It's all tricky ...

> Doesn't that
> provide this info?

Unfortunately not. "mapped exclusively" vs. "Anon Exclusive" are two
things. "Anon exclusive" implies "mapped exclusively", but not the other
way around.

We could detect "this is mapped by one process" (the old broken
page_mapcount()==1 check) but have vmsplice/O_DIRECT/swapcache
references from another process; for example, the famous vmsplice issue.
And because we cannot decide whether the references are from *this*
process or from another one, we can only get it wrong.

To reset PAE, one can use "mapped exclusively + all references from
pinnings", but it cannot replace PAE.

-- 
Cheers,

David / dhildenb
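
For illustration only, the rules of option (C) quoted above can be
modelled in a few lines of plain C. This is a userspace sketch under
stated assumptions, not kernel code: struct model_pte, struct
model_folio and the model_*() helpers are hypothetical names that only
mirror pte_write()/pte_dirty() and the folio dirty flag, and the
softdirty and cont-pte handling mentioned in the thread is left out
entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one PTE, reduced to the two bits option (C) uses. */
struct model_pte {
	bool write;	/* models pte_write() */
	bool dirty;	/* models pte_dirty() */
};

/* Hypothetical stand-in for the folio-level dirty flag. */
struct model_folio {
	bool dirty;
};

/*
 * Option (C): a writable PTE implies PAE; a write-protected PTE is PAE
 * only if its dirty bit is set.
 */
static bool model_pte_anon_exclusive(struct model_pte pte)
{
	return pte.write || pte.dirty;
}

/*
 * Write-protect a PTE. If the PTE was writable, its dirty bit is real
 * dirty state and is moved to the folio first. Afterwards the dirty bit
 * of the read-only PTE only encodes exclusive vs. shared.
 */
static struct model_pte model_wrprotect(struct model_pte pte,
					struct model_folio *folio,
					bool keep_exclusive)
{
	/* Sanity check: write-protecting can never turn !PAE into PAE. */
	assert(!keep_exclusive || model_pte_anon_exclusive(pte));

	if (pte.write && pte.dirty)
		folio->dirty = true;
	pte.write = false;
	pte.dirty = keep_exclusive;
	return pte;
}

/*
 * Remap a PTE writable. Only allowed for exclusive PTEs; the PTE is
 * marked dirty again if the folio is dirty.
 */
static struct model_pte model_mkwrite(struct model_pte pte,
				      const struct model_folio *folio)
{
	assert(model_pte_anon_exclusive(pte));

	pte.write = true;
	pte.dirty = folio->dirty;
	return pte;
}
```

In this model, write-protecting a writable, dirty PTE with
keep_exclusive == true moves the dirty state to the folio and leaves a
read-only PTE that still reads as exclusive. Calling model_wrprotect()
again with keep_exclusive == false, as a fork()-style sharing would,
clears the bit, and the PTE then reads as !PAE.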