From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id E109CC3DA4A for ; Mon, 29 Jul 2024 17:46:35 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 69EAC6B009A; Mon, 29 Jul 2024 13:46:35 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 64E946B009B; Mon, 29 Jul 2024 13:46:35 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 4C7E26B009C; Mon, 29 Jul 2024 13:46:35 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0016.hostedemail.com [216.40.44.16]) by kanga.kvack.org (Postfix) with ESMTP id 2EB306B009A for ; Mon, 29 Jul 2024 13:46:35 -0400 (EDT) Received: from smtpin26.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay08.hostedemail.com (Postfix) with ESMTP id C9371140544 for ; Mon, 29 Jul 2024 17:46:34 +0000 (UTC) X-FDA: 82393519908.26.828E926 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by imf17.hostedemail.com (Postfix) with ESMTP id 3739240021 for ; Mon, 29 Jul 2024 17:46:32 +0000 (UTC) Authentication-Results: imf17.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=SzwHbVGR; spf=pass (imf17.hostedemail.com: domain of david@redhat.com designates 170.10.133.124 as permitted sender) smtp.mailfrom=david@redhat.com; dmarc=pass (policy=none) header.from=redhat.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1722275188; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=H3cjyU7DcRzlmh09Caqy7nDbYhyFo4sNrIlJX/deP18=; b=mdM//XQvucAYxHs9Sl0teYFSxA/JPWIV0mL70Vfxgq4dMg5mJCL3esHP0ZHkVhmLZgjKKX nmOFkrOp+Lk8wGCVk+y8XAKcUQeTh9EEq7xn+JeSfgMFIV7PV4pBZDi2aZ+29rKJotISh/ gAbiTajrn8hpZODCrLJuLwOS+Wahup8= ARC-Authentication-Results: i=1; imf17.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=SzwHbVGR; spf=pass (imf17.hostedemail.com: domain of david@redhat.com designates 170.10.133.124 as permitted sender) smtp.mailfrom=david@redhat.com; dmarc=pass (policy=none) header.from=redhat.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1722275188; a=rsa-sha256; cv=none; b=HjGe6n2xLEtZncyQfaTF5oVzG0DWO2mSIaQdiYgXKDUaXowlwth3iDvYp/k6ZFYaD9P/MO VOpEEFkMj4Wy04nTlH5tscc1dUFOTD+RZHtVmsDv+wfSTGbO7AnGFqppaTZ/QAe7hTNONe UEoLe77XkXSAIgSwNuUJAaJ4vKsinuc= DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1722275191; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:autocrypt:autocrypt; bh=H3cjyU7DcRzlmh09Caqy7nDbYhyFo4sNrIlJX/deP18=; b=SzwHbVGRyOAj4v/IgcjgCzMnp1KWhWqrpF2xdD91bXlSarM9/Mfter+SyYN+twoBT/JmZm GkitDpy3XEEmESAAHrihJi+dp+OHpn7dDW7Lgkjbtq8kAP6uufOCocI9fKHOgl4g373CHj JME4TzE8obx7GaBZSqd5AYeG6ohUiLo= Received: from mail-wr1-f71.google.com (mail-wr1-f71.google.com [209.85.221.71]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-182-fPYgc9bKPLqb36jFelM_Hg-1; Mon, 29 Jul 2024 13:46:29 -0400 X-MC-Unique: fPYgc9bKPLqb36jFelM_Hg-1 Received: by mail-wr1-f71.google.com with SMTP id ffacd0b85a97d-369bbbdb5a1so1072791f8f.1 for ; Mon, 29 Jul 2024 10:46:29 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1722275188; x=1722879988; h=content-transfer-encoding:in-reply-to:organization:autocrypt :content-language:from:references:cc:to:subject:user-agent :mime-version:date:message-id:x-gm-message-state:from:to:cc:subject :date:message-id:reply-to; bh=H3cjyU7DcRzlmh09Caqy7nDbYhyFo4sNrIlJX/deP18=; b=k+SC0UYzz7QmevRLImQBwTzYVfbvQfK2m+KsU9PP/WE0USoyJsrcEZBVxab6/F44XF 44ZWXzVADMEEpuC9e0WuR+EBEhJKx0/yEh48jdhju0tfwm8ZOnJaNg+qkFH6A0PhyBNt dZdYbb3Vb6E5a7V0mm/s2jv5nIguJxhFghL7Sob1StrCbtDVJObdvkiSouY2bZxgK7X4 rMISgz4Iyrrw0xJfIT5pjN6gsNuwdLLFI/p6MensFn6/er4/zma9jnNVe2sm7wf0Uokd YMPV104V5c6YScAceefSmkreer+voLiu2iL4AGiW21FNvVb35TfR9EVl48oEqtV6OWj6 ImLQ== X-Forwarded-Encrypted: i=1; AJvYcCVdouk9ImARinMyzgfAeR1IuUAPV8N6+a/cz5U5n/6+tmVdzViqXd6OfrNxUPeyOnaujIzOTToPFeFLjbatJWv3Cfo= X-Gm-Message-State: AOJu0YybRbjBEcPD1IY7J8BpznLTAx71uFIk8UaxUJdesORPcCzWb2SF qoUGjDJsKCb4PN8sg+TtrFZhFnXnQIM+OQG8t37FhkMulRTlquD0qNIzf92DalaNjgS8vlnbVJv UiKFgFReiJTmJq6l+UwlSFL2do/T6xUcYOiDC8Z3IDiyui1tF X-Received: by 2002:adf:e7c9:0:b0:366:eadc:573f with SMTP id ffacd0b85a97d-36b5d83c14fmr4906305f8f.27.1722275188612; Mon, 29 Jul 2024 10:46:28 -0700 (PDT) X-Google-Smtp-Source: AGHT+IGZZdz9GwKmRLcZjIxmGDbi30MVDRDKGHp+RJFOkOUOAEfvMMbnz5F5ov80upTpSYDKQf74Ng== X-Received: by 2002:adf:e7c9:0:b0:366:eadc:573f with SMTP id ffacd0b85a97d-36b5d83c14fmr4906291f8f.27.1722275188152; Mon, 29 Jul 2024 10:46:28 -0700 (PDT) Received: from ?IPV6:2003:cb:c732:e100:8ba:76bc:e880:d17? (p200300cbc732e10008ba76bce8800d17.dip0.t-ipconnect.de. [2003:cb:c732:e100:8ba:76bc:e880:d17]) by smtp.gmail.com with ESMTPSA id 5b1f17b1804b1-4280573f935sm187168345e9.14.2024.07.29.10.46.27 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Mon, 29 Jul 2024 10:46:27 -0700 (PDT) Message-ID: Date: Mon, 29 Jul 2024 19:46:26 +0200 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH v1 1/2] mm: let pte_lockptr() consume a pte_t pointer To: Peter Xu Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andrew Morton , Muchun Song , Oscar Salvador , Qi Zheng , Hugh Dickins References: <20240725183955.2268884-1-david@redhat.com> <20240725183955.2268884-2-david@redhat.com> <9e671388-a5c6-4de1-8c85-b7af8aee7f44@redhat.com> From: David Hildenbrand Autocrypt: addr=david@redhat.com; keydata= xsFNBFXLn5EBEAC+zYvAFJxCBY9Tr1xZgcESmxVNI/0ffzE/ZQOiHJl6mGkmA1R7/uUpiCjJ dBrn+lhhOYjjNefFQou6478faXE6o2AhmebqT4KiQoUQFV4R7y1KMEKoSyy8hQaK1umALTdL QZLQMzNE74ap+GDK0wnacPQFpcG1AE9RMq3aeErY5tujekBS32jfC/7AnH7I0v1v1TbbK3Gp XNeiN4QroO+5qaSr0ID2sz5jtBLRb15RMre27E1ImpaIv2Jw8NJgW0k/D1RyKCwaTsgRdwuK Kx/Y91XuSBdz0uOyU/S8kM1+ag0wvsGlpBVxRR/xw/E8M7TEwuCZQArqqTCmkG6HGcXFT0V9 PXFNNgV5jXMQRwU0O/ztJIQqsE5LsUomE//bLwzj9IVsaQpKDqW6TAPjcdBDPLHvriq7kGjt WhVhdl0qEYB8lkBEU7V2Yb+SYhmhpDrti9Fq1EsmhiHSkxJcGREoMK/63r9WLZYI3+4W2rAc UucZa4OT27U5ZISjNg3Ev0rxU5UH2/pT4wJCfxwocmqaRr6UYmrtZmND89X0KigoFD/XSeVv jwBRNjPAubK9/k5NoRrYqztM9W6sJqrH8+UWZ1Idd/DdmogJh0gNC0+N42Za9yBRURfIdKSb B3JfpUqcWwE7vUaYrHG1nw54pLUoPG6sAA7Mehl3nd4pZUALHwARAQABzSREYXZpZCBIaWxk ZW5icmFuZCA8ZGF2aWRAcmVkaGF0LmNvbT7CwZgEEwEIAEICGwMGCwkIBwMCBhUIAgkKCwQW AgMBAh4BAheAAhkBFiEEG9nKrXNcTDpGDfzKTd4Q9wD/g1oFAl8Ox4kFCRKpKXgACgkQTd4Q 9wD/g1oHcA//a6Tj7SBNjFNM1iNhWUo1lxAja0lpSodSnB2g4FCZ4R61SBR4l/psBL73xktp rDHrx4aSpwkRP6Epu6mLvhlfjmkRG4OynJ5HG1gfv7RJJfnUdUM1z5kdS8JBrOhMJS2c/gPf wv1TGRq2XdMPnfY2o0CxRqpcLkx4vBODvJGl2mQyJF/gPepdDfcT8/PY9BJ7FL6Hrq1gnAo4 3Iv9qV0JiT2wmZciNyYQhmA1V6dyTRiQ4YAc31zOo2IM+xisPzeSHgw3ONY/XhYvfZ9r7W1l pNQdc2G+o4Di9NPFHQQhDw3YTRR1opJaTlRDzxYxzU6ZnUUBghxt9cwUWTpfCktkMZiPSDGd KgQBjnweV2jw9UOTxjb4LXqDjmSNkjDdQUOU69jGMUXgihvo4zhYcMX8F5gWdRtMR7DzW/YE BgVcyxNkMIXoY1aYj6npHYiNQesQlqjU6azjbH70/SXKM5tNRplgW8TNprMDuntdvV9wNkFs 9TyM02V5aWxFfI42+aivc4KEw69SE9KXwC7FSf5wXzuTot97N9Phj/Z3+jx443jo2NR34XgF 89cct7wJMjOF7bBefo0fPPZQuIma0Zym71cP61OP/i11ahNye6HGKfxGCOcs5wW9kRQEk8P9 M/k2wt3mt/fCQnuP/mWutNPt95w9wSsUyATLmtNrwccz63XOwU0EVcufkQEQAOfX3n0g0fZz Bgm/S2zF/kxQKCEKP8ID+Vz8sy2GpDvveBq4H2Y34XWsT1zLJdvqPI4af4ZSMxuerWjXbVWb T6d4odQIG0fKx4F8NccDqbgHeZRNajXeeJ3R7gAzvWvQNLz4piHrO/B4tf8svmRBL0ZB5P5A 2uhdwLU3NZuK22zpNn4is87BPWF8HhY0L5fafgDMOqnf4guJVJPYNPhUFzXUbPqOKOkL8ojk CXxkOFHAbjstSK5Ca3fKquY3rdX3DNo+EL7FvAiw1mUtS+5GeYE+RMnDCsVFm/C7kY8c2d0G NWkB9pJM5+mnIoFNxy7YBcldYATVeOHoY4LyaUWNnAvFYWp08dHWfZo9WCiJMuTfgtH9tc75 7QanMVdPt6fDK8UUXIBLQ2TWr/sQKE9xtFuEmoQGlE1l6bGaDnnMLcYu+Asp3kDT0w4zYGsx 5r6XQVRH4+5N6eHZiaeYtFOujp5n+pjBaQK7wUUjDilPQ5QMzIuCL4YjVoylWiBNknvQWBXS lQCWmavOT9sttGQXdPCC5ynI+1ymZC1ORZKANLnRAb0NH/UCzcsstw2TAkFnMEbo9Zu9w7Kv AxBQXWeXhJI9XQssfrf4Gusdqx8nPEpfOqCtbbwJMATbHyqLt7/oz/5deGuwxgb65pWIzufa N7eop7uh+6bezi+rugUI+w6DABEBAAHCwXwEGAEIACYCGwwWIQQb2cqtc1xMOkYN/MpN3hD3 AP+DWgUCXw7HsgUJEqkpoQAKCRBN3hD3AP+DWrrpD/4qS3dyVRxDcDHIlmguXjC1Q5tZTwNB boaBTPHSy/Nksu0eY7x6HfQJ3xajVH32Ms6t1trDQmPx2iP5+7iDsb7OKAb5eOS8h+BEBDeq 3ecsQDv0fFJOA9ag5O3LLNk+3x3q7e0uo06XMaY7UHS341ozXUUI7wC7iKfoUTv03iO9El5f XpNMx/YrIMduZ2+nd9Di7o5+KIwlb2mAB9sTNHdMrXesX8eBL6T9b+MZJk+mZuPxKNVfEQMQ a5SxUEADIPQTPNvBewdeI80yeOCrN+Zzwy/Mrx9EPeu59Y5vSJOx/z6OUImD/GhX7Xvkt3kq Er5KTrJz3++B6SH9pum9PuoE/k+nntJkNMmQpR4MCBaV/J9gIOPGodDKnjdng+mXliF3Ptu6 3oxc2RCyGzTlxyMwuc2U5Q7KtUNTdDe8T0uE+9b8BLMVQDDfJjqY0VVqSUwImzTDLX9S4g/8 kC4HRcclk8hpyhY2jKGluZO0awwTIMgVEzmTyBphDg/Gx7dZU1Xf8HFuE+UZ5UDHDTnwgv7E th6RC9+WrhDNspZ9fJjKWRbveQgUFCpe1sa77LAw+XFrKmBHXp9ZVIe90RMe2tRL06BGiRZr jPrnvUsUUsjRoRNJjKKA/REq+sAnhkNPPZ/NNMjaZ5b8Tovi8C0tmxiCHaQYqj7G2rgnT0kt WNyWQQ== Organization: Red Hat In-Reply-To: X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Language: en-US Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-Rspam-User: X-Stat-Signature: ncq4bopfwss6w69ggjwuchrbp9f5dfkt X-Rspamd-Queue-Id: 3739240021 X-Rspamd-Server: rspam11 X-HE-Tag: 1722275192-546735 X-HE-Meta: U2FsdGVkX1/QYr03YWehfeKEjRds/DV2AkydDp/OnvdUshFDs9/ymxQ1Jy3xlgDQSisgSZFz/2bZynEPQSScgJH34MpIKx2sscD4TnDqZNV/458owuLbHDRnKILxo4eDLamZSTLUjKfe8BgS159SJFeC9dQxJ+7smyr/rprOmbXxu3QGShh3HWJzsj38f5bTBDIuu8jOan/tScsX/iqNpmtdLhsOkdjymGXFjn8YPYayNssTw9Qe+sH10fk2ySOGJ96dTarWkdUYirIiQO2MY3AUiMwDJGxa8jepdGr2zNn7Wr/xKquhQoI7GRNB5df9L+IJrqpXZMk6DZcfwgolprEu9uwVeOvnRvEwM5DP/vx9lAEHvtbzMKqBkcVZXpdafMQTIC3K0Pk9Q5OBoFQUjzcKEEfOS/8V8xBTeHBIE/aQX9yvpDlaIZKFWe/DnQaaBJyYAYuTtO5NvxMXed1HgDydBRzSdlPEO28hUbfn7J8AbNpKW9l7u9Bexl9v0VWpP8/+AbISXeHkxnNtcu9qA3RLzwRIR5DKofVzb7PgEB4JnO62NyzPet8YUsCml21o7F+hOD7QD3DbyB0wRyL1ldz8z4VhRV8WN4ncRJ57n5/yYbxgJEI6tYsl98MqznqC0i1rddJEpfYAKZkG6Jd+1VKEtzkPKb2HHHZDNQXo88x1lfeAsajVVMWJddIzr92Fch3/aVpnrppwxHIFvz5E7UnZgOLIMxFxxpO1thubXBdedkdtRvfTmnIZJHdYP8otaAOufFiLXo0nhuUncBA2r5LtQTb5zAwAWZzWODa8dVGmmJWB9Gl7mOvckSC76CadFtRB6pER7AuJC/TnL1G4/6NKKrEneXEFJRHENuT1VpG42WCICrggHmo6zIrAtjItYdddUqOlXUPPzCpqq79noVCgpAPXU3G4KsblcLEGjzNX0xpZRO8IfeLj3mg5LrYrToebSP9BpAB78fRXifj Us2Kq0vN /PTjZl/6X2UjnbEV+MZW647aa1D2SfmOl/hBj7END1WIU+jgjqbFRcLNcg56zLX9fXf16so226FtPcseEY61hHQUViGoMdoc9m3IXYC+6kSwp2l5CpI2Y18Vq5FL6i6vgGsK8fAhJ6ycZou3LslNNArviu/4+tKm6WqD9qBS0Dn8X4vqR4bZbsw6ViiRXV3em8GQIUXPGa7d5PaIaCTkEEqfJHQ0dZH+Ln5bHtdJ4vxqTwXJjBGHxwt6cM3L3Tjz6opZgtebuDQQepph8dHrBARn1nGotbSOE+xfPDXf0fJuWMh/vN9iWStXei6bKnJ95U7K//WL+BNmFywKFlY3aQYn/2MfwhNw6n9LMppVw9zAOAl63SAqSMXeqxiIz5AE+oO29WVdWXX4VEF+4ZNOxZwLsV+mBRW/LA3RGr4urfb0lENWujNjqJb4PeGZwKDDnn0dX8Nyu4O5XCKx0PdhTCtZcXtB4d5PEWn3FG3R6LQZhi5ZW64raJkj0GZIZCzJ+4SKNPM84VK23YQl2DEcp9L7e/5ovrJCYDBkY X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On 29.07.24 18:39, Peter Xu wrote: > On Mon, Jul 29, 2024 at 12:26:00PM -0400, Peter Xu wrote: >> On Fri, Jul 26, 2024 at 11:48:01PM +0200, David Hildenbrand wrote: >>> On 26.07.24 23:28, Peter Xu wrote: >>>> On Fri, Jul 26, 2024 at 06:02:17PM +0200, David Hildenbrand wrote: >>>>> On 26.07.24 17:36, Peter Xu wrote: >>>>>> On Thu, Jul 25, 2024 at 08:39:54PM +0200, David Hildenbrand wrote: >>>>>>> pte_lockptr() is the only *_lockptr() function that doesn't consume >>>>>>> what would be expected: it consumes a pmd_t pointer instead of a pte_t >>>>>>> pointer. >>>>>>> >>>>>>> Let's change that. The two callers in pgtable-generic.c are easily >>>>>>> adjusted. Adjust khugepaged.c:retract_page_tables() to simply do a >>>>>>> pte_offset_map_nolock() to obtain the lock, even though we won't actually >>>>>>> be traversing the page table. >>>>>>> >>>>>>> This makes the code more similar to the other variants and avoids other >>>>>>> hacks to make the new pte_lockptr() version happy. pte_lockptr() users >>>>>>> reside now only in pgtable-generic.c. >>>>>>> >>>>>>> Maybe, using pte_offset_map_nolock() is the right thing to do because >>>>>>> the PTE table could have been removed in the meantime? At least it sounds >>>>>>> more future proof if we ever have other means of page table reclaim. >>>>>> >>>>>> I think it can't change, because anyone who wants to race against this >>>>>> should try to take the pmd lock first (which was held already)? >>>>> >>>>> That doesn't explain why it is safe for us to assume that after we took the >>>>> PMD lock that the PMD actually still points at a completely empty page >>>>> table. Likely it currently works by accident, because we only have a single >>>>> such user that makes this assumption. It might certainly be a different once >>>>> we asynchronously reclaim page tables. >>>> >>>> I think it's safe because find_pmd_or_thp_or_none() returned SUCCEED, and >>>> we're holding i_mmap lock for read. I don't see any way that this pmd can >>>> become a non-pgtable-page. >>>> >>>> I meant, AFAIU tearing down pgtable in whatever sane way will need to at >>>> least take both mmap write lock and i_mmap write lock (in this case, a file >>>> mapping), no? >>> >>> Skimming over [1] where I still owe a review I think we can now do it now >>> purely under the read locks, with the PMD lock held. >> >> Err, how I missed that.. yeah you're definitely right, and that's the >> context here where we're collapsing. I think I somehow forgot all Hugh's >> work when I replied there, sorry. >> >>> >>> I think this is also what collapse_pte_mapped_thp() ends up doing: replace a >>> PTE table that maps a folio by a PMD (present or none, depends) that maps a >>> folio only while holding the mmap lock in read mode. Of course, here the >>> table is not empty but we need similar ways of making PT walkers aware of >>> concurrent page table retraction. >>> >>> IIRC, that was the magic added to __pte_offset_map(), such that >>> pte_offset_map_nolock/pte_offset_map_lock can fail on races. >> >> Said that, I still think current code (before this patch) is safe, same to >> a hard-coded line to lock the pte pgtable lock. Again, I'm fine if you >> prefer pte_offset_map_nolock() but I just think the rcu read lock and stuff >> can be avoided. >> >> I think it's because such collapse so far can only happen in such path >> where we need to hold the large folio (PMD-level) lock first. It means >> anyone who could change this pmd entry to be not a pte pgtable is blocked >> already, hence it must keeping being a pte pgtable page even if we don't >> take any rcu. >> >>> >>> >>> But if we hold the PMD lock, nothing should actually change (so far my >>> understanding) -- we cannot suddenly rip out a page table. >>> >>> [1] >>> https://lkml.kernel.org/r/cover.1719570849.git.zhengqi.arch@bytedance.com >>> >>>> >>>>> >>>>> But yes, the PMD cannot get modified while we hold the PMD lock, otherwise >>>>> we'd be in trouble >>>>> >>>>>> >>>>>> I wonder an open coded "ptlock_ptr(page_ptdesc(pmd_page(*pmd)))" would be >>>>>> nicer here, but only if my understanding is correct. >>>>> >>>>> I really don't like open-coding that. Fortunately we were able to limit the >>>>> use of ptlock_ptr to a single user outside of arch/x86/xen/mmu_pv.c so far. >>>> >>>> I'm fine if you prefer like that; I don't see it a huge deal to me. >>> >>> Let's keep it like that, unless we can come up with something neater. At >>> least it makes the code also more consistent with similar code in that file >>> and the overhead should be minimal. >>> >>> I was briefly thinking about actually testing if the PT is full of >>> pte_none(), either as a debugging check or to also handle what is currently >>> handled via: >>> >>> if (likely(!vma->anon_vma && !userfaultfd_wp(vma))) { >>> >>> Seems wasteful just because some part of a VMA might have a private page >>> mapped / uffd-wp active to let all other parts suffer. >>> >>> Will think about if that is really worth it. >>> >>> ... also because I still want to understand why the PTL of the PMD table is >>> required at all. What if we lock it first and somebody else wants to lock it >>> after us while we already ripped it out? Sure there must be some reason for >>> the lock, I just don't understand it yet :/. >> >> IIUC the pte pgtable lock will be needed for checking anon_vma safely. >> >> e.g., consider if we don't take the pte pgtable lock, I think it's possible >> some thread tries to inject a private pte (and prepare anon_vma before >> that) concurrently with this thread trying to collapse the pgtable into a >> huge pmd. I mean, when without the pte pgtable lock held, I think it's >> racy to check this line: >> >> if (unlikely(vma->anon_vma || userfaultfd_wp(vma))) { >> ... >> } >> >> On the 1st condition. > > Hmm, right after I replied, I think it also guarantees safety on the 2nd > condition.. > > Note that one thing I still prefer a hard-coded line over > pte_offset_map_nolock() is that, the new code seems to say we can treat the > pte pgtable page differently from the pte pgtable lock. But I think > they're really in the same realm. > > In short, AFAIU the rcu lock not only protects the pte pgtable's existance, > but also protects the pte lock. > > From that POV, below new code in this patch: > > - ptl = pte_lockptr(mm, pmd); > + > + /* > + * No need to check the PTE table content, but we'll grab the > + * PTE table lock while we zap it. > + */ > + pte = pte_offset_map_nolock(mm, pmd, addr, &ptl); > + if (!pte) > + goto unlock_pmd; > if (ptl != pml) > spin_lock_nested(ptl, SINGLE_DEPTH_NESTING); > + pte_unmap(pte); > > Could be very misleading, where it seems to say that it's fine to use the > pte pgtable lock even after rcu unlock. It might make the code harder to > understand. I see what you mean but this is a very similar pattern as used in collapse_pte_mapped_thp(), no? There we have start_pte = pte_offset_map_nolock(mm, pmd, haddr, &ptl); ... if (!pml) spin_lock(ptl); ... pte_unmap(start_pte); if (!pml) spin_unlock(ptl); Again, I don't have a strong opinion on this, but doing it more similar to collapse_pte_mapped_thp() to obtain locks makes it clearer to me. But if I am missing something obvious please shout and I'll change it. -- Cheers, David / dhildenb