From mboxrd@z Thu Jan  1 00:00:00 1970
MIME-Version: 1.0
References: <20240113094436.2506396-1-sunnanyong@huawei.com>
From: Yu Zhao <yuzhao@google.com>
Date: Fri, 5 Jul 2024 11:41:34 -0600
Subject: Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
To: Catalin Marinas
Cc: Nanyong Sun, will@kernel.org, mike.kravetz@oracle.com, muchun.song@linux.dev, akpm@linux-foundation.org, anshuman.khandual@arm.com, willy@infradead.org, wangkefeng.wang@huawei.com, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Content-Type: text/plain; charset="UTF-8"
On Fri, Jul 5, 2024 at 9:49 AM Catalin Marinas wrote:
>
> On Thu, Jun 27, 2024 at 03:19:55PM -0600, Yu Zhao wrote:
> > On Wed, Feb 7, 2024 at 5:44 AM Catalin Marinas wrote:
> > > On Sat, Jan 27, 2024 at 01:04:15PM +0800, Nanyong Sun wrote:
> > > > On 2024/1/26 2:06, Catalin Marinas wrote:
> > > > > On Sat, Jan 13, 2024 at 05:44:33PM +0800, Nanyong Sun wrote:
> > > > > > HVO was previously disabled on arm64 [1] due to the lack of the necessary
> > > > > > BBM (break-before-make) logic when changing page tables.
> > > > > > This set of patches fixes this by adding the necessary BBM sequence when
> > > > > > changing page tables, and by supporting vmemmap page fault handling to
> > > > > > fix up kernel address translation faults if the vmemmap is concurrently accessed.
> > > [...]
> > > > > How often is this code path called? I wonder whether a stop_machine()
> > > > > approach would be simpler.
> > > >
> > > > Whenever hugetlb pages are allocated or released. We cannot limit
> > > > users to allocating or releasing hugetlb pages only at boot or while
> > > > no workload is running on the other CPUs, so with stop_machine() it
> > > > would be triggered 8 times for every 2MB page and 4096 times for
> > > > every 1GB page, which is probably too expensive.
> > >
> > > I'm hoping this can be batched somehow and not do a stop_machine() (or
> > > 8) for every 2MB huge page.
> >
> > Theoretically, all hugeTLB vmemmap operations from a single user
> > request can be done in one batch. This would require preallocating
> > the new copy of the vmemmap so that the old copy can be replaced with
> > one BBM.
>
> Do we ever re-create PMD block entries for the vmemmap range that
> was split, or do they remain PMD table + PTE entries? If the latter, I
> guess we could do a stop_machine() only for a PMD; it should be
> self-limiting after a while.

It's the latter for now, but that can change in the future: we do want
to restore the original mapping at the PMD level. Instead, we do it at
the PTE level, because the high-order pages backing PMD entries are not
as easy to allocate as the order-0 pages backing PTEs.

> I don't want user-space to DoS the system by
> triggering stop_machine() when mapping/unmapping hugetlbfs pages.

The operations are privileged, and each HVO or de-HVO request would
require at least one stop_machine(). So in theory, a privileged user
could still cause a DoS.

> If I did the maths right, for a 2MB hugetlb page, we have about 8
> vmemmap pages (32K). Once we split a 2MB vmemmap range,

Correct.
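(As a sanity check on the numbers quoted in this thread — assuming 4 KiB
base pages and a 64-byte struct page, the common configuration:)

```python
# Vmemmap sizes discussed above: 8 base pages per 2 MiB hugetlb page,
# 4096 per 1 GiB hugetlb page.
PAGE_SIZE = 4096          # assumed 4 KiB base pages
STRUCT_PAGE_SIZE = 64     # assumed 64-byte struct page

def vmemmap_pages(hugepage_size: int) -> int:
    """Base pages of vmemmap needed to hold struct page[] for one hugetlb page."""
    nr_struct_pages = hugepage_size // PAGE_SIZE
    return nr_struct_pages * STRUCT_PAGE_SIZE // PAGE_SIZE

print(vmemmap_pages(2 * 1024 * 1024))     # 8   (32 KiB per 2 MiB hugepage)
print(vmemmap_pages(1024 * 1024 * 1024))  # 4096 (16 MiB per 1 GiB hugepage)
```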
> whatever else
> needs to be touched in this range won't require a stop_machine().

There might be some misunderstanding here.

To do HVO:
1. We split a PMD into 512 PTEs.
2. For every 8 PTEs:
   2a. We allocate an order-0 page for PTE #0.
   2b. We remap PTE #0 *RW* to this page.
   2c. We remap PTEs #1-7 *RO* to this page.
   2d. We free the original order-3 page.

To do de-HVO:
1. For every 8 PTEs:
   1a. We allocate 7 order-0 pages.
   1b. We remap PTEs #1-7 *RW* to those pages, respectively.

We could in theory restore the original PTE or even PMD mappings at an
acceptable success rate by making changes on the MM side, e.g., by
allowing only movable allocations in the area backing the original PMD.
Again, we don't do this for now because high-order pages are not as easy
to allocate.

> > > Just to make sure I understand - is the goal to be able to free struct
> > > pages corresponding to hugetlbfs pages?
> >
> > Correct, if you are referring to the pages holding struct page[].
> >
> > > Can we not leave the vmemmap in
> > > place and just release that memory to the page allocator?
> >
> > We cannot, since the goal is to reuse those pages for something else,
> > i.e., reduce the metadata overhead for hugeTLB.
>
> What I meant is that we can leave the vmemmap alias in place and just
> reuse those pages via the linear map etc. The kernel shouldn't touch those
> struct pages to corrupt the data. The only problem would be if we
> physically unplug those pages, but I don't think that's the case here.

Setting the repercussions of memory corruption aside, we still can't do
this, because PTEs #1-7 need to map meaningful data, hence step 2c above.
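(To make the HVO steps above concrete, here is a toy model — plain
Python, not kernel code; "pages" are just integer IDs and RW/RO are
labels — of the remapping for the 8 vmemmap PTEs backing one 2 MiB
hugetlb page:)

```python
# Toy model of HVO: each group of 8 vmemmap PTEs is remapped to a single
# shared order-0 page, saving 7 base pages (28 KiB) per 2 MiB hugetlb page.
from itertools import count

page_ids = count()  # stand-in for the page allocator

def hvo(ptes):
    """Apply steps 2a-2d to every group of 8 PTEs; return (new_ptes, pages_saved)."""
    new_ptes, saved = [], 0
    for i in range(0, len(ptes), 8):
        shared = next(page_ids)           # 2a: allocate one order-0 page
        new_ptes.append((shared, "RW"))   # 2b: PTE #0 maps it read-write
        new_ptes += [(shared, "RO")] * 7  # 2c: PTEs #1-7 are read-only aliases
        saved += 8 - 1                    # 2d: free the order-3 page (8), minus 1 allocated
    return new_ptes, saved

# vmemmap of one 2 MiB hugepage: 8 PTEs, each backed by its own page
ptes = [(next(page_ids), "RW") for _ in range(8)]
ptes, saved = hvo(ptes)
print(saved)  # 7
```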