From mboxrd@z Thu Jan  1 00:00:00 1970
From: Yu Zhao <yuzhao@google.com>
Date: Thu, 11 Jul 2024 02:31:25 -0600
Subject: Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
To: Catalin Marinas
Cc: Nanyong Sun <sunnanyong@huawei.com>, will@kernel.org,
	mike.kravetz@oracle.com, muchun.song@linux.dev,
	akpm@linux-foundation.org, anshuman.khandual@arm.com,
	willy@infradead.org, wangkefeng.wang@huawei.com,
	linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org
References: <20240113094436.2506396-1-sunnanyong@huawei.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"

On Wed, Jul 10, 2024 at 5:07 PM Yu Zhao wrote:
>
> On Wed, Jul 10, 2024 at 4:29 PM Catalin Marinas wrote:
> >
> > On Wed, Jul 10, 2024 at 11:12:01AM -0600, Yu Zhao wrote:
> > > On Wed, Jul 10, 2024 at 10:51 AM Catalin Marinas wrote:
> > > > On Fri, Jul 05, 2024 at 11:41:34AM -0600, Yu Zhao wrote:
> > > > > On Fri, Jul 5, 2024 at 9:49 AM Catalin Marinas wrote:
> > > > > > If I did the maths right, for a 2MB hugetlb page, we have about 8
> > > > > > vmemmap pages (32K). Once we split a 2MB vmemmap range,
> > > > >
> > > > > Correct.
> > > > >
> > > > > > whatever else needs to be touched in this range won't require a
> > > > > > stop_machine().
> > > > >
> > > > > There might be some misunderstandings here.
> > > > >
> > > > > To do HVO:
> > > > > 1. we split a PMD into 512 PTEs;
> > > > > 2. for every 8 PTEs:
> > > > >    2a. we allocate an order-0 page for PTE #0;
> > > > >    2b. we remap PTE #0 *RW* to this page;
> > > > >    2c. we remap PTEs #1-7 *RO* to this page;
> > > > >    2d. we free the original order-3 page.
> > > >
> > > > Thanks. I now remember why we reverted such support in 060a2c92d1b6
> > > > ("arm64: mm: hugetlb: Disable HUGETLB_PAGE_OPTIMIZE_VMEMMAP"). The
> > > > main problem is that point 2c also changes the output address of the
> > > > PTE (and the content of the page slightly). The architecture requires
> > > > a break-before-make in such a scenario, though it would have been
> > > > nice if it was more specific on what could go wrong.
> > > >
> > > > We can do point 1 safely if we have FEAT_BBM level 2. For point 2, I
> > > > assume these 8 vmemmap pages may be accessed, and that's why we can't
> > > > do a break-before-make safely.
> > >
> > > Correct.
> > >
> > > > I was wondering whether we could make the PTEs RO first and then
> > > > change the output address, but we have another rule that the content
> > > > of the page should be the same. I don't think entries 1-7 are
> > > > identical to entry 0 (though we could ask the architects for
> > > > clarification here). Also, can we guarantee that nothing writes to
> > > > entry 0 while we would do such remapping?
> > >
> > > Yes, it's already guaranteed.
> > >
> > > > We know entries 1-7 won't be written as we mapped them as RO, but
> > > > entry 0 contains the head page. Maybe it's OK to map it RO
> > > > temporarily until the newly allocated hugetlb page is returned.
> > >
> > > We can do that. I don't understand how this could elide BBM. After the
> > > above, we would still need to do:
> > > 3. remap entry 0 from RO to RW, mapping the `struct page` page that
> > >    will be shared with entries 1-7;
> > > 4. remap entries 1-7 from their respective `struct page` pages to that
> > >    of entry 0, while they remain RO.
> >
> > The Arm ARM states that we need a BBM if we change the output address
> > and: the old or new mappings are RW, *or* the content of the page
> > changes. Ignoring the latter (page content), we can turn the PTEs RO
> > first without changing the pfn, followed by changing the pfn while they
> > are RO. Once that's done, we make entry 0 RW, with additional TLBIs
> > between all these steps, of course.
>
> Aha! This is easy to do -- the RO is already guaranteed, as I
> mentioned earlier.
>
> Just to make sure I fully understand the workflow:
>
> 1. Split a RW PMD into 512 RO PTEs, pointing to the same 2MB
>    `struct page` area.
> 2. TLBI once, after pmd_populate_kernel().
> 3. For every 8 PTEs, remap PTEs 1-7 to the 4KB `struct page` area of
>    PTE 0, while they remain RO.
> 4. TLBI once, after set_pte_at() on PTEs 1-7.
> 5. Change PTE 0 from RO to RW, pointing to the same 4KB `struct page`
>    area.
> 6. TLBI once, after set_pte_at() on PTE 0.
>
> No BBM required, regardless of FEAT_BBM level 2.

I just studied D8.16.1 from the reference manual, and it seems to me:
1. We still need either FEAT_BBM or BBM to split the PMD.
2. We still need BBM when we change PTEs 1-7, because even if they
   remain RO, the content of the `struct page` page at the new location
   does not match that at the old location.
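
To make sure we are talking about the same sequence, here is a rough
sketch of steps 3-6 for one group of 8 vmemmap PTEs. Untested, generic
helpers only; hvo_remap_group() is a made-up name. It assumes the PMD
has already been split into 512 RO PTEs and TLBI'd (steps 1-2), leaves
the split/BBM question from point 1 aside, and elides freeing the
now-unused pages behind old PTEs 1-7:

/* Sketch only: remap one group of 8 vmemmap PTEs (steps 3-6 above). */
static void hvo_remap_group(pte_t *ptep, unsigned long addr)
{
	unsigned long pfn = pte_pfn(ptep_get(ptep));	/* PTE 0's page */
	int i;

	/* Step 3: remap PTEs 1-7 to PTE 0's page, keeping them RO. */
	for (i = 1; i < 8; i++)
		set_pte_at(&init_mm, addr + i * PAGE_SIZE, ptep + i,
			   pfn_pte(pfn, PAGE_KERNEL_RO));

	/* Step 4: TLBI for the remapped, still-RO entries. */
	flush_tlb_kernel_range(addr + PAGE_SIZE, addr + 8 * PAGE_SIZE);

	/* Step 5: switch PTE 0 from RO back to RW, same pfn. */
	set_pte_at(&init_mm, addr, ptep, pfn_pte(pfn, PAGE_KERNEL));

	/* Step 6: TLBI for PTE 0's permission change. */
	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
}

(Per point 2 above, the set_pte_at() in step 3 is exactly where the
'content' rule bites: the page behind the new pfn does not match the
one behind the old pfn.)
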
> Is this correct?
>
> > Can we leave entry 0 RO? This would save an additional TLBI.
>
> Unfortunately we can't. Otherwise we wouldn't be able to, e.g., grab a
> refcnt on any hugetlb pages.
>
> > Now, I wonder if all this is worth it. What are the scenarios where
> > the 8 PTEs will be accessed? The vmemmap range corresponding to a 2MB
> > hugetlb page, for example, is pretty well defined - 8 x 4K pages,
> > aligned.

One of the fundamental assumptions in core MM is that anyone can read,
or try to grab (write) a refcnt from, any `struct page`. Such
speculative PFN walkers include memory compaction, etc.

> > > > If we could get the above to work, it would be a lot simpler than
> > > > thinking of stop_machine() or other locks to wait for such
> > > > remapping.
> > >
> > > Steps 3/4 would not require BBM somehow?
> >
> > If we ignore the 'content' requirement, I think we could skip the BBM,
> > but we need to make sure we don't change the permission and the pfn at
> > the same time.
>
> Gotcha.
>
> > > > > To do de-HVO:
> > > > > 1. for every 8 PTEs:
> > > > >    1a. we allocate 7 order-0 pages;
> > > > >    1b. we remap PTEs #1-7 *RW* to those pages, respectively.
> > > >
> > > > Similar problem in 1b, changing the output address. Here we could
> > > > force the content to be the same.
> > >
> > > I don't follow the "content to be the same" part. After HVO, we have:
> > >
> > > Entry 0 -> `struct page` page A, RW
> > > Entry 1 -> `struct page` page A, RO
> > > ...
> > > Entry 7 -> `struct page` page A, RO
> > >
> > > To de-HVO, we need to make them:
> > >
> > > Entry 0 -> `struct page` page A, RW
> > > Entry 1 -> `struct page` page B, RW
> > > ...
> > > Entry 7 -> `struct page` page H, RW
> > >
> > > I assume "the same content" means PTE_0 == PTE_1/.../7?
> >
> > That's the content of the page at the corresponding pfn before and
> > after the pte change. I'm pretty sure the Arm ARM states this so that,
> > in case the hardware starts a load (e.g. unaligned) from one page and
> > completes it from another, the software does not see any difference.
> > But for the fields we care about in struct page, I assume they'd be
> > the same (or that we just don't care about inconsistencies during this
> > transient period).
>
> Thanks for the explanation. I'll cook up something if my understanding
> above is correct.
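
For the de-HVO direction (the "1a/1b" steps quoted above), what I have
in mind is roughly the following. Again untested; hvo_restore_group()
is a made-up name, error unwinding is elided, and the same
BBM/'content' question applies to the pfn change in 1b:

/* Sketch only: give PTEs 1-7 of one group their own pages again. */
static int hvo_restore_group(pte_t *ptep, unsigned long addr)
{
	int i;

	for (i = 1; i < 8; i++) {
		/* 1a: allocate a separate page for PTE #i. */
		struct page *page = alloc_page(GFP_KERNEL);

		if (!page)
			return -ENOMEM;	/* unwinding elided */

		/*
		 * Fill the new page from the shared one, so the content
		 * at the new location matches the old location.
		 */
		copy_page(page_address(page),
			  (void *)(addr + i * PAGE_SIZE));

		/* 1b: remap PTE #i RW to its own page. */
		set_pte_at(&init_mm, addr + i * PAGE_SIZE, ptep + i,
			   mk_pte(page, PAGE_KERNEL));
	}

	flush_tlb_kernel_range(addr + PAGE_SIZE, addr + 8 * PAGE_SIZE);
	return 0;
}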