From mboxrd@z Thu Jan  1 00:00:00 1970
From: Yu Zhao <yuzhao@google.com>
Date: Thu, 11 Jul 2024 02:31:25 -0600
Subject: Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
To: Catalin Marinas
Cc: Nanyong Sun <sunnanyong@huawei.com>, will@kernel.org,
	mike.kravetz@oracle.com, muchun.song@linux.dev,
	akpm@linux-foundation.org, anshuman.khandual@arm.com,
	willy@infradead.org, wangkefeng.wang@huawei.com,
	linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org
References: <20240113094436.2506396-1-sunnanyong@huawei.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"

On Wed, Jul 10, 2024 at 5:07 PM Yu Zhao wrote:
>
> On Wed, Jul 10, 2024 at 4:29 PM Catalin Marinas wrote:
> >
> > On Wed, Jul 10, 2024 at 11:12:01AM -0600, Yu Zhao wrote:
> > > On Wed, Jul 10, 2024 at 10:51 AM Catalin Marinas wrote:
> > > > On Fri, Jul 05, 2024 at 11:41:34AM -0600, Yu Zhao wrote:
> > > > > On Fri, Jul 5, 2024 at 9:49 AM Catalin Marinas wrote:
> > > > > > If I did the maths right, for a 2MB hugetlb page, we have about 8
> > > > > > vmemmap pages (32K). Once we split a 2MB vmemmap range,
> > > > >
> > > > > Correct.
> > > > >
> > > > > > whatever else needs to be touched in this range won't require a
> > > > > > stop_machine().
> > > > >
> > > > > There might be some misunderstandings here.
> > > > >
> > > > > To do HVO:
> > > > > 1. we split a PMD into 512 PTEs;
> > > > > 2. for every 8 PTEs:
> > > > >    2a. we allocate an order-0 page for PTE #0;
> > > > >    2b. we remap PTE #0 *RW* to this page;
> > > > >    2c. we remap PTEs #1-7 *RO* to this page;
> > > > >    2d. we free the original order-3 page.
> > > >
> > > > Thanks. I now remember why we reverted such support in 060a2c92d1b6
> > > > ("arm64: mm: hugetlb: Disable HUGETLB_PAGE_OPTIMIZE_VMEMMAP"). The
> > > > main problem is that point 2c also changes the output address of the
> > > > PTE (and the content of the page slightly). The architecture requires
> > > > a break-before-make in such a scenario, though it would have been
> > > > nice if it was more specific on what could go wrong.
> > > >
> > > > We can do point 1 safely if we have FEAT_BBM level 2. For point 2, I
> > > > assume these 8 vmemmap pages may be accessed, and that's why we can't
> > > > do a break-before-make safely.
> > >
> > > Correct.
> > >
> > > > I was wondering whether we could make the PTEs RO first and then
> > > > change the output address, but we have another rule that the content
> > > > of the page should be the same. I don't think entries 1-7 are
> > > > identical to entry 0 (though we could ask the architects for
> > > > clarification here). Also, can we guarantee that nothing writes to
> > > > entry 0 while we would do such remapping?
> > >
> > > Yes, it's already guaranteed.
> > >
> > > > We know entries 1-7 won't be written as we mapped them as RO, but
> > > > entry 0 contains the head page. Maybe it's OK to map it RO
> > > > temporarily until the newly allocated hugetlb page is returned.
> > >
> > > We can do that. I don't understand how this could elide BBM. After the
> > > above, we would still need to do:
> > > 3. remap entry 0 from RO to RW, mapping the `struct page` page that
> > >    will be shared with entries 1-7;
> > > 4. remap entries 1-7 from their respective `struct page` pages to that
> > >    of entry 0, while they remain RO.
> >
> > The Arm ARM states that we need a BBM if we change the output address
> > and: the old or new mappings are RW, *or* the content of the page
> > changes. Ignoring the latter (page content), we can turn the PTEs RO
> > first without changing the pfn, followed by changing the pfn while they
> > are RO. Once that's done, we make entry 0 RW, with additional TLBIs
> > between all these steps, of course.
>
> Aha! This is easy to do -- the RO is already guaranteed, as I
> mentioned earlier.
>
> Just to make sure I fully understand the workflow:
>
> 1. Split a RW PMD into 512 RO PTEs, pointing to the same 2MB
>    `struct page` area.
> 2. TLBI once, after pmd_populate_kernel().
> 3. For every 8 PTEs, remap PTEs 1-7 to the 4KB `struct page` area of
>    PTE 0, while they remain RO.
> 4. TLBI once, after set_pte_at() on PTEs 1-7.
> 5. Change PTE 0 from RO to RW, pointing to the same 4KB `struct page`
>    area.
> 6. TLBI once, after set_pte_at() on PTE 0.
>
> No BBM required, regardless of FEAT_BBM level 2.

I just studied D8.16.1 from the reference manual, and it seems to me:
1. We still need either FEAT_BBM or BBM to split the PMD.
2. We still need BBM when we change PTEs 1-7, because even if they
   remain RO, the content of the `struct page` page at the new location
   does not match that at the old location.
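
To make sure we are talking about the same sequence, here is a rough
sketch of steps 3-6 for one group of 8 vmemmap PTEs. Untested, generic
helpers only; hvo_remap_group() is a made-up name. It assumes the PMD
has already been split into 512 RO PTEs and TLBI'd (steps 1-2), leaves
the split/BBM question from point 1 aside, and elides freeing the
now-unused pages behind old PTEs 1-7:

/* Sketch only: remap one group of 8 vmemmap PTEs (steps 3-6 above). */
static void hvo_remap_group(pte_t *ptep, unsigned long addr)
{
	unsigned long pfn = pte_pfn(ptep_get(ptep));	/* PTE 0's page */
	int i;

	/* Step 3: remap PTEs 1-7 to PTE 0's page, keeping them RO. */
	for (i = 1; i < 8; i++)
		set_pte_at(&init_mm, addr + i * PAGE_SIZE, ptep + i,
			   pfn_pte(pfn, PAGE_KERNEL_RO));

	/* Step 4: TLBI for the remapped, still-RO entries. */
	flush_tlb_kernel_range(addr + PAGE_SIZE, addr + 8 * PAGE_SIZE);

	/* Step 5: switch PTE 0 from RO back to RW, same pfn. */
	set_pte_at(&init_mm, addr, ptep, pfn_pte(pfn, PAGE_KERNEL));

	/* Step 6: TLBI for PTE 0's permission change. */
	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
}

(Per point 2 above, the set_pte_at() in step 3 is exactly where the
'content' rule bites: the page behind the new pfn does not match the
one behind the old pfn.)
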
> Is this correct?
>
> > Can we leave entry 0 RO? This would save an additional TLBI.
>
> Unfortunately we can't. Otherwise we wouldn't be able to, e.g., grab a
> refcnt on any hugetlb pages.
>
> > Now, I wonder if all this is worth it. What are the scenarios where
> > the 8 PTEs will be accessed? The vmemmap range corresponding to a 2MB
> > hugetlb page, for example, is pretty well defined - 8 x 4K pages,
> > aligned.

One of the fundamental assumptions in core MM is that anyone can read,
or try to grab (write) a refcnt from, any `struct page`. Such
speculative PFN walkers include memory compaction, etc.

> > > > If we could get the above to work, it would be a lot simpler than
> > > > thinking of stop_machine() or other locks to wait for such
> > > > remapping.
> > >
> > > Steps 3/4 would not require BBM somehow?
> >
> > If we ignore the 'content' requirement, I think we could skip the BBM,
> > but we need to make sure we don't change the permission and the pfn at
> > the same time.
>
> Gotcha.
>
> > > > > To do de-HVO:
> > > > > 1. for every 8 PTEs:
> > > > >    1a. we allocate 7 order-0 pages;
> > > > >    1b. we remap PTEs #1-7 *RW* to those pages, respectively.
> > > >
> > > > Similar problem in 1b, changing the output address. Here we could
> > > > force the content to be the same.
> > >
> > > I don't follow the "content to be the same" part. After HVO, we have:
> > >
> > > Entry 0 -> `struct page` page A, RW
> > > Entry 1 -> `struct page` page A, RO
> > > ...
> > > Entry 7 -> `struct page` page A, RO
> > >
> > > To de-HVO, we need to make them:
> > >
> > > Entry 0 -> `struct page` page A, RW
> > > Entry 1 -> `struct page` page B, RW
> > > ...
> > > Entry 7 -> `struct page` page H, RW
> > >
> > > I assume "the same content" means PTE_0 == PTE_1/.../7?
> >
> > That's the content of the page at the corresponding pfn before and
> > after the pte change. I'm pretty sure the Arm ARM states this so that,
> > in case the hardware starts a load (e.g. unaligned) from one page and
> > completes it from another, the software does not see any difference.
> > But for the fields we care about in struct page, I assume they'd be
> > the same (or that we just don't care about inconsistencies during this
> > transient period).
>
> Thanks for the explanation. I'll cook up something if my understanding
> above is correct.
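
For the de-HVO direction (the "1a/1b" steps quoted above), what I have
in mind is roughly the following. Again untested; hvo_restore_group()
is a made-up name, error unwinding is elided, and the same
BBM/'content' question applies to the pfn change in 1b:

/* Sketch only: give PTEs 1-7 of one group their own pages again. */
static int hvo_restore_group(pte_t *ptep, unsigned long addr)
{
	int i;

	for (i = 1; i < 8; i++) {
		/* 1a: allocate a separate page for PTE #i. */
		struct page *page = alloc_page(GFP_KERNEL);

		if (!page)
			return -ENOMEM;	/* unwinding elided */

		/*
		 * Fill the new page from the shared one, so the content
		 * at the new location matches the old location.
		 */
		copy_page(page_address(page),
			  (void *)(addr + i * PAGE_SIZE));

		/* 1b: remap PTE #i RW to its own page. */
		set_pte_at(&init_mm, addr + i * PAGE_SIZE, ptep + i,
			   mk_pte(page, PAGE_KERNEL));
	}

	flush_tlb_kernel_range(addr + PAGE_SIZE, addr + 8 * PAGE_SIZE);
	return 0;
}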