From: Yu Zhao
Date: Thu, 11 Jul 2024 11:38:40 -0600
Subject: Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
To: Catalin Marinas
Cc: Nanyong Sun, will@kernel.org, mike.kravetz@oracle.com, muchun.song@linux.dev, akpm@linux-foundation.org, anshuman.khandual@arm.com, willy@infradead.org, wangkefeng.wang@huawei.com, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org

On Thu, Jul 11, 2024 at 5:39 AM Catalin Marinas wrote:
>
> On Thu, Jul 11, 2024 at 02:31:25AM -0600, Yu Zhao wrote:
> > On Wed, Jul 10, 2024 at 5:07 PM Yu Zhao wrote:
> > > On Wed, Jul 10, 2024 at 4:29 PM Catalin Marinas wrote:
> > > > The Arm ARM states that we need a BBM if we change the output address
> > > > and: the old or new mappings are RW *or* the content of the page
> > > > changes.
> > > > Ignoring the latter (page content), we can turn the PTEs RO
> > > > first without changing the pfn followed by changing the pfn while they
> > > > are RO. Once that's done, we make entry 0 RW and, of course, with
> > > > additional TLBIs between all these steps.
> > >
> > > Aha! This is easy to do -- I just made the RO guaranteed, as I
> > > mentioned earlier.
> > >
> > > Just to make sure I fully understand the workflow:
> > >
> > > 1. Split a RW PMD into 512 RO PTEs, pointing to the same 2MB `struct page` area.
>
> I don't think we can turn all of them RO here since some of those 512
> PTEs are not related to the hugetlb page. So you'd need to keep them RW
> but preserving the pfn so that there's no actual translation change. I
> think that's covered by FEAT_BBM level 2. Basically this step should be
> only about breaking up a PMD block entry into a table entry.

Ack.

> > > 2. TLBI once, after pmd_populate_kernel()
> > > 3. Remap PTE 1-7 to the 4KB `struct page` area of PTE 0, for every 8
> > > PTEs, while they remain RO.
>
> You may need some intermediate step to turn these PTEs read-only since
> step 1 should leave them RW. Also, if we want to free an order-3 page
> here, it might be better to allocate an order 0 even for PTE entry 0 (I
> had the impression that's what the core code does, I haven't checked).

Ack.

> > > 4. TLBI once, after set_pte_at() on PTE 1-7.
> > > 5. Change PTE 0 from RO to RW, pointing to the same 4KB `struct page` area.
> > > 6. TLBI once, after set_pte_at() on PTE 0.
> > >
> > > No BBM required, regardless of FEAT_BBM level 2.
> >
> > I just studied D8.16.1 from the reference manual, and it seems to me:
> > 1. We still need either FEAT_BBM or BBM to split PMD.
>
> Yes.

Also, I want to confirm my understanding of "changing table size" from
the reference manual: in our case, it means splitting a PMD into 512
PTEs with the same permission and OA.
If we change the permission *or* OA, we still need to do BBM even with
FEAT_BBM level 2. Is this correct?

> > 2. We still need BBM when we change PTE 1-7, because even if they
> > remain RO, the content of the `struct page` page at the new location
> > does not match that at the old location.
>
> Yes, in theory, the data at the new pfn should be the same. We could try
> to get clarification from the architects on what could go wrong but I
> suspect some atomicity is not guaranteed if you read the data (the
> CPU getting confused whether to read from the old or the new page).
>
> Otherwise, since after all these steps PTEs 1-7 point to the same data
> as PTE 0, before step 3 we could copy the data in page 0 over to the
> other 7 pages while entries 1-7 are still RO. The remapping afterwards
> would be fully compliant.

Correct -- we do need to copy to make it fully compliant because the
core MM doesn't guarantee that. The core MM only guarantees that the
fields (of struct page) required for speculative PFN walkers to
function correctly have the same value for all tail pages within a
compound page. Fields not related to correctness can, in theory, have
different values for those tail pages.

> > > > Can we leave entry 0 RO? This would save an additional TLBI.
> > >
> > > Unfortunately we can't. Otherwise we wouldn't be able to, e.g., grab a
> > > refcnt on any hugeTLB pages.
>
> OK, fair enough.
>
> > > > Now, I wonder if all this is worth it. What are the scenarios where the
> > > > 8 PTEs will be accessed? The vmemmap range corresponding to a 2MB
> > > > hugetlb page for example is pretty well defined - 8 x 4K pages, aligned.
> >
> > One of the fundamental assumptions in core MM is that anyone can
> > read or try to grab (write) a refcnt from any `struct page`. Those
> > speculative PFN walkers include memory compaction, etc.
>
> But how does this work if PTEs 1-7 are RO? Do those walkers detect it's
> a tail page and skip it.

Correct.
> Actually, if they all point to the same vmemmap
> page, how can one distinguish a tail page via PTE 1 from the head page
> via PTE 0?

Two of the correctness related fields are page->_refcount and
page->compound_head:
1. _refcount is the only one that can be speculatively updated.
Speculative walkers are not allowed to update other fields unless they
can grab a refcount. All tail pages must have zero refcount.
2. compound_head speculatively indicates whether a page is head or
tail, and if it's tail, its head can be extracted by compound_head().

Since a head can have non-zero refcount, after PTEs 1-7 are remapped
to PTE 0, we need a way to prevent speculative walkers from mistaking
the first tail for each PTE 1-7 for the head and trying to grab their
refcount. This is done by page_is_fake_head() returning true, which
relies on the following sequence on the writer side:
2a. init compound_head
2b. reset _refcount to 0
2c. synchronize_rcu()
2d. remap PTEs 1-7 to PTE 0
2e. inc _refcount

Speculative readers of the first tails respectively at PTEs 1-7 either
see refcount being 0 or page_is_fake_head() being true.

> BTW, I'll be on holiday from tomorrow for two weeks and won't be able to
> follow up on this thread (and likely to forget all the discussion by the
> time I get back ;)).

Thanks for the heads-up!
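
For reference, the index arithmetic behind step 3 of the workflow quoted
above (remap PTE 1-7 to the 4KB `struct page` area of PTE 0, for every 8
PTEs) can be sketched as a small userspace C model. This is only an
illustration, not kernel code: remap_target(), stays_read_only() and
pages_no_longer_mapped() are hypothetical helpers, and the model assumes
that every group of 8 PTEs in the split table backs a 2MB hugetlb page,
which, as noted earlier in the thread, is not true in general.

```c
#include <stdbool.h>

#define PTES_PER_PMD   512u /* a split PMD yields 512 PTEs (from the thread) */
#define PTES_PER_GROUP 8u   /* 8 x 4K vmemmap pages per 2MB hugetlb page */

/*
 * Hypothetical helper: index of the PTE whose pfn entry `idx` shares
 * after step 3. Entry 0 of each aligned group of 8 maps to itself.
 */
static unsigned int remap_target(unsigned int idx)
{
	return idx & ~(PTES_PER_GROUP - 1u);
}

/*
 * Hypothetical helper: entries 1-7 of a group stay RO, while entry 0
 * goes back to RW in step 5.
 */
static bool stays_read_only(unsigned int idx)
{
	return (idx % PTES_PER_GROUP) != 0;
}

/*
 * Count the vmemmap pages that are no longer referenced once every
 * group in the table has been remapped (assumption: all groups qualify).
 */
static unsigned int pages_no_longer_mapped(void)
{
	unsigned int count = 0;

	for (unsigned int idx = 0; idx < PTES_PER_PMD; idx++)
		if (remap_target(idx) != idx)
			count++;
	return count;
}
```

Under these assumptions, remap_target(13) evaluates to 8, and
pages_no_longer_mapped() to 448, i.e. 7 of every 8 entries in the table.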