From: Yu Zhao <yuzhao@google.com>
Date: Thu, 27 Jun 2024 15:03:36 -0600
Subject: Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
To: Nanyong Sun
Cc: David Rientjes, Will Deacon, Catalin Marinas, Matthew Wilcox,
 muchun.song@linux.dev, Andrew Morton, anshuman.khandual@arm.com,
 wangkefeng.wang@huawei.com, linux-arm-kernel@lists.infradead.org,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org, Yosry Ahmed, Sourav Panda
In-Reply-To: <17232655-553d-7d48-8ba1-5425e8ab0f8b@huawei.com>
References: <20240113094436.2506396-1-sunnanyong@huawei.com>
 <20240207111252.GA22167@willie-the-truck>
 <44075bc2-ac5f-ffcd-0d2f-4093351a6151@huawei.com>
 <20240208131734.GA23428@willie-the-truck>
 <22c14513-af78-0f1d-5647-384ff9cb5993@huawei.com>
 <17232655-553d-7d48-8ba1-5425e8ab0f8b@huawei.com>
On Thu, Jun 27, 2024 at 8:34 AM Nanyong Sun wrote:
>
>
> On 2024/6/24 13:39, Yu Zhao wrote:
> > On Mon, Mar 25, 2024 at 11:24:34PM +0800, Nanyong Sun wrote:
> >> On 2024/3/14 7:32, David Rientjes wrote:
> >>
> >>> On Thu, 8 Feb 2024, Will Deacon wrote:
> >>>
> >>>>> How about taking a new lock with IRQs disabled during BBM, like:
> >>>>>
> >>>>> +void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte)
> >>>>> +{
> >>>>> +	spin_lock_irq(NEW_LOCK);
> >>>>> +	pte_clear(&init_mm, addr, ptep);
> >>>>> +	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
> >>>>> +	set_pte_at(&init_mm, addr, ptep, pte);
> >>>>> +	spin_unlock_irq(NEW_LOCK);
> >>>>> +}
> >>>> I really think the only maintainable way to achieve this is to avoid the
> >>>> possibility of a fault altogether.
> >>>>
> >>>> Will
> >>>>
> >>>>
> >>> Nanyong, are you still actively working on making HVO possible on arm64?
> >>>
> >>> This would yield a substantial memory savings on hosts that are largely
> >>> configured with hugetlbfs. In our case, the size of this hugetlbfs pool
> >>> is actually never changed after boot, but it sounds from the thread that
> >>> there was an idea to make HVO conditional on FEAT_BBM. Is this being
> >>> pursued?
> >>>
> >>> If so, any testing help needed?
> >> I'm afraid that FEAT_BBM may not solve the problem here
> > I think so too -- I came across this while working on TAO [1].
> >
> > [1] https://lore.kernel.org/20240229183436.4110845-4-yuzhao@google.com/
> >
> >> because, from the Arm ARM, I see that FEAT_BBM only covers changing the
> >> block size. In this HVO feature, it can therefore help in the split-PMD
> >> stage, i.e., BBM can be avoided in vmemmap_split_pmd; but in the
> >> subsequent vmemmap_remap_pte, the output address of the PTE still needs
> >> to change, and I'm afraid FEAT_BBM cannot cover that stage. Perhaps my
> >> understanding of FEAT_BBM is wrong, and I hope someone can correct me.
> >> Actually, the solution I first considered was the stop_machine()
> >> method, but we have products that rely on
> >> /proc/sys/vm/nr_overcommit_hugepages to use hugepages dynamically, so I
> >> have to consider performance. If your product does not change the number
> >> of huge pages after booting, using stop_machine() may be feasible.
> >> So far, I still haven't come up with a good solution.
> > I do have a patch that's similar to stop_machine() -- it uses NMI IPIs
> > to pause/resume remote CPUs while the local one is doing BBM.
> >
> > Note that the problem of updating vmemmap for struct page[], as I see
> > it, goes beyond hugeTLB HVO. I think it impacts virtio-mem and memory
> > hot removal in general [2]. On arm64, we would need to support BBM on
> > vmemmap so that we can fix the problem with offlining memory (or, to be
> > precise, unmapping offlined struct page[]) by mapping offlined struct
> > page[] to a read-only page of dummy struct page[], similar to
> > ZERO_PAGE(). (Otherwise we would have to make extremely invasive
> > changes to the reader side, i.e., all speculative PFN walkers.)
> >
> > In case you are interested in testing my approach, you can swap your
> > patch 2 with the following:
> I don't have an NMI-IPI-capable ARM machine on hand, so I think this
> feature depends on a newer version of the ARM CPU.

(Pseudo) NMI does require GICv3 (released in 2015), but that's independent
of CPU versions. Just to double-check: you don't have GICv3 (rather than
not having CONFIG_ARM64_PSEUDO_NMI=y or irqchip.gicv3_pseudo_nmi=1), is
that correct?

Even without GICv3, IPIs can be masked but still work, with a less bounded
latency.

> What I worried about was that other cores would occasionally be
> interrupted frequently (8 times every 2M and 4096 times every 1G) and
> then wait for the update of the page table to complete before resuming.

Catalin has suggested batching, and to echo what he said [1]: it's
possible to combine all vmemmap changes from a single HVO/de-HVO operation
into *one batch*.

[1] https://lore.kernel.org/linux-mm/ZcN7P0CGUOOgki71@arm.com/

> If there are workloads running on other cores, performance may be
> affected. This implementation speeds up stopping and resuming the other
> cores, but they still have to wait for the update to finish.

How often does your use case trigger HVO/de-HVO operations?
For our VM use case, it's generally correlated with VM lifetimes, i.e.,
how often VM bin-packing happens. For our THP use case, it can be more
frequent, but I still don't think we would trigger HVO/de-HVO every
minute. So with NMI IPIs, IMO, the performance impact would be acceptable
for our use cases.