From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id BBC62C3DA4A for ; Mon, 29 Jul 2024 06:19:37 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 5781F6B00A4; Mon, 29 Jul 2024 02:19:37 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 528956B00A6; Mon, 29 Jul 2024 02:19:37 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 3F0526B00A7; Mon, 29 Jul 2024 02:19:37 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10]) by kanga.kvack.org (Postfix) with ESMTP id 1DAFC6B00A4 for ; Mon, 29 Jul 2024 02:19:37 -0400 (EDT) Received: from smtpin17.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay09.hostedemail.com (Postfix) with ESMTP id 9FB518010F for ; Mon, 29 Jul 2024 06:19:36 +0000 (UTC) X-FDA: 82391788752.17.4C15547 Received: from mail-pj1-f43.google.com (mail-pj1-f43.google.com [209.85.216.43]) by imf21.hostedemail.com (Postfix) with ESMTP id 556AB1C0015 for ; Mon, 29 Jul 2024 06:19:34 +0000 (UTC) Authentication-Results: imf21.hostedemail.com; dkim=pass header.d=bytedance.com header.s=google header.b=EzuyvVyD; spf=pass (imf21.hostedemail.com: domain of zhengqi.arch@bytedance.com designates 209.85.216.43 as permitted sender) smtp.mailfrom=zhengqi.arch@bytedance.com; dmarc=pass (policy=quarantine) header.from=bytedance.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1722233921; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=o+ej+mYre5gtHO05efMaQEUIHF+HSHHhoHKJ6E1Z3jg=; b=ZPoYoDmrrqVKEu8oKmCtAkqhTuLl5ullvwgTbiMHJJ36nU7BdRCeAaX58GdrnlXR1o+9Fx x0Lo82v5PA+4wFUsXT9MkxcwKMfZFZ45rMOIWvwgbqUqn3Q708RqW3KA4iucGFZ3JYqtcV qlqQjJejbex7O32DESkttZaDWedITqg= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1722233921; a=rsa-sha256; cv=none; b=QjvqWkcL/4g5qiPxEDYUAZ/bEgkjOb2abqtQYuAZHKe/YyM7/Bf1WBAA74bHR7VwomtqSr m/J7L0sECe7ZWO28qPHlMYWmAwXVwpboqQ8sU6YO9KkluupIfNy3eRfg3K4ZK335Cdx3We KTr7Oa7QfT7FklYzLN/nfFVs4YgRZ3s= ARC-Authentication-Results: i=1; imf21.hostedemail.com; dkim=pass header.d=bytedance.com header.s=google header.b=EzuyvVyD; spf=pass (imf21.hostedemail.com: domain of zhengqi.arch@bytedance.com designates 209.85.216.43 as permitted sender) smtp.mailfrom=zhengqi.arch@bytedance.com; dmarc=pass (policy=quarantine) header.from=bytedance.com Received: by mail-pj1-f43.google.com with SMTP id 98e67ed59e1d1-2cb4c2d13a0so404898a91.0 for ; Sun, 28 Jul 2024 23:19:33 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance.com; s=google; t=1722233973; x=1722838773; darn=kvack.org; h=content-transfer-encoding:in-reply-to:from:references:cc:to :content-language:subject:user-agent:mime-version:date:message-id :from:to:cc:subject:date:message-id:reply-to; bh=o+ej+mYre5gtHO05efMaQEUIHF+HSHHhoHKJ6E1Z3jg=; b=EzuyvVyDIWTLwNMQKMx6JXCLhlVNV79Rk1Yo4J6vfYPXorZgbE/yM6G4jTDxoY/MVT 8oHFcVTF1ZE385OCmbZZBFnT3vQFnNc4Y09l+TxWqz8mu3iAaqsfyJ9Frka9yYk4/10o 9k+HuG7EvQ/00Zca9EH3iGa9OK4qv/6Pt7q15jjOXVOk7Zz9vJvDojNI8w0G/U0NQo2o S/L5NTSEd4scaEB6N0jt2pNxCoOFK1KIqHCsMz7cNAiSmiFFNJ4yx1NyQtIVyDew026f jYrI51IM9wc/ZvMs3/8hS/nc1prz+gV3+lbHc3QxinqqF7S5qXRN7cUeBsmFJJ+7xFUA d5Bw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1722233973; x=1722838773; h=content-transfer-encoding:in-reply-to:from:references:cc:to :content-language:subject:user-agent:mime-version:date:message-id :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=o+ej+mYre5gtHO05efMaQEUIHF+HSHHhoHKJ6E1Z3jg=; b=EmPoQcyOnJpFIbwwSKpBPYEc1ddD9k6lTy+mC7GlVUbAud1Ocqemdu0MU2LibPkDCU 1DXgy/vpMQrq1JVCYG3RWNc+LTewWxx0WZB6nljdzzPGiE47ZY4K+AFxnhg0U5nBfy+L Gk7vh1dr31fJRb9SYvPRECDHfJ/Pbj9Safas9KdXDzyVwYBi3A/2ejq9P3bhFnFLEXIF jlxYMLpydUK1FHf0fVKBSQpTxoSFG1c79BiRxqFQPQagGO2Y0H/lR1IuJ/vxvo7KK2xF PtxgAce1UvkBd1GVWX2xAt+rI6aUVEkQ+Dl730w79wKBC3pO3Fm0B7hf+MRt4Qg6uKrC mUGg== X-Forwarded-Encrypted: i=1; AJvYcCWZp2EWs8JHAB5ldpXsNC2Gx62/m5flEr7iCOHiGYqPHOS3218XF9EWbnvvPVWdvUV67h5RpsvEVA==@kvack.org X-Gm-Message-State: AOJu0YyGuDJt4+VLGE2v4BBZQ5/EJ/lyJQnQf6AvvwdcbBMr3S1hPv5v jKMoEwTa25xRf/a+HHcVqMdEHO63u7MqM/2i1fjVSFM6yzZXKt/MbP8WMinUeIk= X-Google-Smtp-Source: AGHT+IHP9yQ73+VVJMjlOmUNfggjrjH+6V00vvrQUWBo+pyJtk3PpYjxww3k2xmitRoahrST5LcUAQ== X-Received: by 2002:a17:902:bf04:b0:1fc:611a:bca with SMTP id d9443c01a7336-1fed6cc7ac1mr90923705ad.8.1722233972829; Sun, 28 Jul 2024 23:19:32 -0700 (PDT) Received: from [10.255.168.175] ([139.177.225.232]) by smtp.gmail.com with ESMTPSA id d9443c01a7336-1fed7f3eb85sm74433105ad.214.2024.07.28.23.19.29 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Sun, 28 Jul 2024 23:19:32 -0700 (PDT) Message-ID: <12bae4c3-5dda-4798-9f6a-3ac040111551@bytedance.com> Date: Mon, 29 Jul 2024 14:19:26 +0800 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH v1 1/2] mm: let pte_lockptr() consume a pte_t pointer Content-Language: en-US To: David Hildenbrand Cc: Peter Xu , linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andrew Morton , Muchun Song , Oscar Salvador References: <20240725183955.2268884-1-david@redhat.com> <20240725183955.2268884-2-david@redhat.com> <9e671388-a5c6-4de1-8c85-b7af8aee7f44@redhat.com> From: Qi Zheng In-Reply-To: <9e671388-a5c6-4de1-8c85-b7af8aee7f44@redhat.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Rspam-User: X-Rspamd-Server: rspam04 X-Rspamd-Queue-Id: 556AB1C0015 X-Stat-Signature: j9n7ids9dh1xmq4g8ewhe367kkc5oyd9 X-HE-Tag: 1722233974-246141 X-HE-Meta: U2FsdGVkX1+bJh1adMdvicJN3S88GsWGb6Li3cqzCxzcT6t6j7m5bmuEpIsZAn8zmeTMms6vxa+bM6JSVduPfiRZaJk2TAQf6x5D4OvJq9yFykdiYUoESLrCWzBnUCfmjmrYlcTjAfag0QtRIPjBK93hsUtTjOTDW/nQbSq8WMAO+Ju4HcdhDD16efxYelHSLxmdy8GN3o7NrWg/yPBm/4RKFjvAmaH3jiyMTDmFD1fGVtT3bogoYT/tT0aJyBB1Nh+kfUs8MB9qErvIxB9vbd/aaB651kqvBvFtC4vFVsks2H0YunKkB7ThInBR8kS6gg9kh1tCMioTgtVoSPIhgDZmBev6WE1xoyfJ/Diyio5RFfUcDcn5uo+7vkprLDZddWyqBfhy+QnMKngRpb+MyKbBmg1RY7AE51YMlxgAPs9D0tCiBNQUijC2/rO1+5WTFNycWLcot2l9j7QHfuuLskpIF2uUn1co1XTB03ishHLBb/azx8diq5zISbkiNfjsvVE4yGv9+V1LqGg2aqEFP32FXbk8Y7/eH7axM31kCkIHy6pmq6cO4SKMC1Gb49HlxLvFNTYJp9dubzJZ1JdSQYA5GMgcVu3+SJuhz2VYh1u+Zu+u22H0U9TvyiPQg5/OiT0pwtt0Us+/Ox2DNhpAx3lk0uu4WEUl33TWKwAkuR53v10GiwcIO1edilYdy5BJViAzq19Cz2L3JKzX75GQLVyKIlWuCuiSenXVsY0MnuzU+MJ1SYPD4ohN/bCZk00CDwjZUJvDje0mlaN2gccnjp7Mys+0xK840bfDIusM47GT/f46lhyACZh+j2LY/3lNnrWY1kRCtwveic6x/pPEI3JukKTrhrCcum+jQTqnEB5isrJ6QP6BI0VRR3Zkygq08Um6nvOc/eezPaOCz94I3npPct31g0MPzAAKkAxunzCX8Y6jDLwoUiE8sALhI19z2ICfG+a60vwXbVAAJQ3 1ld2XC1O nV14qDo8cXtrz2xrcHvq3YkNCYh8XzCnw1+yH7RhVAIPNh6vHwryR0kQLseG2hwZttRdxi7dO9HiJCgVE9yfXd64ZeBAkUxDuWGI/GgkH2V0vIiaQmL26HFcjIZclPk+XPN+sCEHB/9gQHH35wvORYQ5WCUQl3+C94cs3N57dCPuKcWJ9bjL0m/61lBpMknrklLypWgfbFrj0BdhVjvZld8qQ8V1gwcKPyBj/wZZEPx4uQmkABGiBoTyvO/TwJYPWcBMhyOtIopY5kJDq3IO+/Fkfbks1Ns5IN5DN0xGxMU+3gNMv2JHuVGAHLAJ8qCtJTA/y75kQlNHypNFM1Y38ibAwY3WGoOYFBiMeMuqzw015ql539OlEh6t/18Hc8JVUpLfwW0DiENKjXFEuVTlX3iWbDFp/a8u1PfSpSnaJJJ9+CaQ4DsDyotTSQDNn7D8H+IdGUG0mNe1ND4KR16HVGJMrrulxlZtzKYWUH/vSlEl22QtW8m1qqfC7X90BQjz6xOzsnx1FrLV71lT66nXE3wniewNxh8B7pYoo X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: Hi David, On 2024/7/27 05:48, David Hildenbrand wrote: > On 26.07.24 23:28, Peter Xu wrote: >> On Fri, Jul 26, 2024 at 06:02:17PM +0200, David Hildenbrand wrote: >>> On 26.07.24 17:36, Peter Xu wrote: >>>> On Thu, Jul 25, 2024 at 08:39:54PM +0200, David Hildenbrand wrote: >>>>> pte_lockptr() is the only *_lockptr() function that doesn't consume >>>>> what would be expected: it consumes a pmd_t pointer instead of a pte_t >>>>> pointer. >>>>> >>>>> Let's change that. The two callers in pgtable-generic.c are easily >>>>> adjusted. Adjust khugepaged.c:retract_page_tables() to simply do a >>>>> pte_offset_map_nolock() to obtain the lock, even though we won't >>>>> actually >>>>> be traversing the page table. >>>>> >>>>> This makes the code more similar to the other variants and avoids >>>>> other >>>>> hacks to make the new pte_lockptr() version happy. pte_lockptr() users >>>>> reside now only in  pgtable-generic.c. >>>>> >>>>> Maybe, using pte_offset_map_nolock() is the right thing to do because >>>>> the PTE table could have been removed in the meantime? At least it >>>>> sounds >>>>> more future proof if we ever have other means of page table reclaim. >>>> >>>> I think it can't change, because anyone who wants to race against this >>>> should try to take the pmd lock first (which was held already)? >>> >>> That doesn't explain why it is safe for us to assume that after we >>> took the >>> PMD lock that the PMD actually still points at a completely empty page >>> table. Likely it currently works by accident, because we only have a >>> single >>> such user that makes this assumption. It might certainly be a >>> different once >>> we asynchronously reclaim page tables. >> >> I think it's safe because find_pmd_or_thp_or_none() returned SUCCEED, and >> we're holding i_mmap lock for read.  I don't see any way that this pmd >> can >> become a non-pgtable-page. >> >> I meant, AFAIU tearing down pgtable in whatever sane way will need to at >> least take both mmap write lock and i_mmap write lock (in this case, a >> file >> mapping), no? > > Skimming over [1] where I still owe a review I think we can now do it > now purely under the read locks, with the PMD lock held. Yes. > > I think this is also what collapse_pte_mapped_thp() ends up doing: > replace a PTE table that maps a folio by a PMD (present or none, > depends) that maps a folio only while holding the mmap lock in read > mode. Of course, here the table is not empty but we need similar ways of > making PT walkers aware of concurrent page table retraction. > > IIRC, that was the magic added to __pte_offset_map(), such that > pte_offset_map_nolock/pte_offset_map_lock can fail on races. > > > But if we hold the PMD lock, nothing should actually change (so far my > understanding) -- we cannot suddenly rip out a page table. > > [1] > https://lkml.kernel.org/r/cover.1719570849.git.zhengqi.arch@bytedance.com > >> >>> >>> But yes, the PMD cannot get modified while we hold the PMD lock, >>> otherwise >>> we'd be in trouble >>> >>>> >>>> I wonder an open coded "ptlock_ptr(page_ptdesc(pmd_page(*pmd)))" >>>> would be >>>> nicer here, but only if my understanding is correct. >>> >>> I really don't like open-coding that. Fortunately we were able to >>> limit the >>> use of ptlock_ptr to a single user outside of arch/x86/xen/mmu_pv.c >>> so far. >> >> I'm fine if you prefer like that; I don't see it a huge deal to me. > > Let's keep it like that, unless we can come up with something neater. At > least it makes the code also more consistent with similar code in that > file and the overhead should be  minimal. > > I was briefly thinking about actually testing if the PT is full of > pte_none(), either as a debugging check or to also handle what is > currently handled via: > > if (likely(!vma->anon_vma && !userfaultfd_wp(vma))) { > > Seems wasteful just because some part of a VMA might have a private page > mapped / uffd-wp active to let all other parts suffer. > > Will think about if that is really worth it. > > ... also because I still want to understand why the PTL of the PMD table > is required at all. What if we lock it first and somebody else wants to > lock it after us while we already ripped it out? Sure there must be some > reason for the lock, I just don't understand it yet :/. For pmd lock, I think this is needed to clear the pmd entry (pmdp_collapse_flush()). For pte lock, there should be the following two reasons: 1. release it after clearing pmd entry, then we can capture the changed pmd in pte_offset_map_lock() etc after holding this pte lock. (This is also what I did in my patchset) 2. As mentioned in the comments, we may be concurrent with userfaultfd_ioctl(), but we do not hold the read lock of mmap (or read lock of vma), so the VM_UFFD_WP may be set. Therefore, we need to hold the pte lock to check whether a new pte entry has been inserted. (See commit[1] for more details) [1]. https://github.com/torvalds/linux/commit/a98460494b16db9c377e55bc13e5407a0eb79fe8 Thanks, Qi >