Subject: Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
From: Nanyong Sun <sunnanyong@huawei.com>
To: Yu Zhao
Cc: David Rientjes, Will Deacon, Catalin Marinas, Matthew Wilcox,
 Andrew Morton, Yosry Ahmed, Sourav Panda, linux-mm@kvack.org
Date: Thu, 4 Jul 2024 19:47:01 +0800
Message-ID: <06252b78-2b61-73d1-ddf8-920dd744c756@huawei.com>
References: <20240113094436.2506396-1-sunnanyong@huawei.com>
 <20240207111252.GA22167@willie-the-truck>
 <44075bc2-ac5f-ffcd-0d2f-4093351a6151@huawei.com>
 <20240208131734.GA23428@willie-the-truck>
 <22c14513-af78-0f1d-5647-384ff9cb5993@huawei.com>
 <17232655-553d-7d48-8ba1-5425e8ab0f8b@huawei.com>
On 2024/6/28 5:03, Yu Zhao wrote:
> On Thu, Jun 27, 2024 at 8:34 AM Nanyong Sun <sunnanyong@huawei.com> wrote:
>>
>> On 2024/6/24 13:39, Yu Zhao wrote:
>>> On Mon, Mar 25, 2024 at 11:24:34PM +0800, Nanyong Sun wrote:
>>>> On 2024/3/14 7:32, David Rientjes wrote:
>>>>
>>>>> On Thu, 8 Feb 2024, Will Deacon wrote:
>>>>>
>>>>>>> How about taking a new lock with IRQs disabled during BBM, like:
>>>>>>>
>>>>>>> +void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte)
>>>>>>> +{
>>>>>>> +	spin_lock_irq(NEW_LOCK);
>>>>>>> +	pte_clear(&init_mm, addr, ptep);
>>>>>>> +	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
>>>>>>> +	set_pte_at(&init_mm, addr, ptep, pte);
>>>>>>> +	spin_unlock_irq(NEW_LOCK);
>>>>>>> +}
>>>>>> I really think the only maintainable way to achieve this is to avoid the
>>>>>> possibility of a fault altogether.
>>>>>>
>>>>>> Will
>>>>>>
>>>>> Nanyong, are you still actively working on making HVO possible on arm64?
>>>>>
>>>>> This would yield a substantial memory savings on hosts that are largely
>>>>> configured with hugetlbfs. In our case, the size of this hugetlbfs pool
>>>>> is actually never changed after boot, but it sounds from the thread that
>>>>> there was an idea to make HVO conditional on FEAT_BBM. Is this being
>>>>> pursued?
>>>>>
>>>>> If so, any testing help needed?
>>>> I'm afraid that FEAT_BBM may not solve the problem here
>>> I think so too -- I came across this while working on TAO [1].
>>>
>>> [1] https://lore.kernel.org/20240229183436.4110845-4-yuzhao@google.com/
>>>
>>>> because from the Arm ARM, I see that FEAT_BBM is only used for changing
>>>> block size. Therefore, in this HVO feature, it can work in the split-PMD
>>>> stage, that is, BBM can be avoided in vmemmap_split_pmd, but in the
>>>> subsequent vmemmap_remap_pte, the output address of the PTE still needs
>>>> to be changed. I'm afraid FEAT_BBM cannot cover this stage. Perhaps my
>>>> understanding of Arm FEAT_BBM is wrong, and I hope someone can correct me.
>>>> Actually, the solution I first considered was the stop_machine()
>>>> method, but we have products that rely on
>>>> /proc/sys/vm/nr_overcommit_hugepages to use hugepages dynamically, so I
>>>> have to consider the performance impact. If your product does not change
>>>> the number of huge pages after booting, using stop_machine() may be a
>>>> feasible way. So far, I still haven't come up with a good solution.
>>> I do have a patch that's similar to stop_machine() -- it uses NMI IPIs
>>> to pause/resume remote CPUs while the local one is doing BBM.
>>>
>>> Note that the problem of updating vmemmap for struct page[], as I see
>>> it, is beyond hugeTLB HVO. I think it impacts virtio-mem and memory
>>> hot removal in general [2]. On arm64, we would need to support BBM on
>>> vmemmap so that we can fix the problem with offlining memory (or, to be
>>> precise, unmapping offlined struct page[]) by mapping offlined struct
>>> page[] to a read-only page of dummy struct page[], similar to
>>> ZERO_PAGE(). (Or we would have to make extremely invasive changes to
>>> the reader side, i.e., all speculative PFN walkers.)
>>>
>>> In case you are interested in testing my approach, you can swap your
>>> patch 2 with the following:
>> I don't have an NMI-IPI-capable ARM machine on hand, so I think this
>> feature depends on a newer version of the ARM CPU.
> (Pseudo) NMI does require GICv3 (released in 2015), but that's
> independent of CPU versions. Just to double-check: you don't have
> GICv3 (rather than not having CONFIG_ARM64_PSEUDO_NMI=y or
> irqchip.gicv3_pseudo_nmi=1), is that correct?
>
> Even without GICv3, IPIs can be masked but still work, with a less
> bounded latency.
Oh, I misunderstood. Pseudo-NMI is available. We have
CONFIG_ARM64_PSEUDO_NMI=y but do not set irqchip.gicv3_pseudo_nmi=1 by
default, so I can test this solution after enabling it on the kernel
command line.
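For reference, the two knobs named above combine as follows (a sketch only;
where the command line is set depends on the bootloader and distribution):

```
# Build-time: kernel configured with pseudo-NMI support
CONFIG_ARM64_PSEUDO_NMI=y

# Boot-time: append to the kernel command line to actually enable it
irqchip.gicv3_pseudo_nmi=1
```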
>> What I worried about was that other cores would occasionally be
>> interrupted frequently (8 times every 2M and 4096 times every 1G) and
>> then wait for the page-table update to complete before resuming.
> Catalin has suggested batching, and to echo what he said [1]: it's
> possible to make all vmemmap changes from a single HVO/de-HVO
> operation into *one batch*.
>
> [1] https://lore.kernel.org/linux-mm/ZcN7P0CGUOOgki71@arm.com/
>
>> If there are workloads running on other cores, performance may be
>> affected. This implementation speeds up stopping and resuming other
>> cores, but they still have to wait for the update to finish.
> How often does your use case trigger HVO/de-HVO operations?
>
> For our VM use case, it's generally correlated with VM lifetimes, i.e.,
> how often VM bin-packing happens. For our THP use case, it can be more
> often, but I still don't think we would trigger HVO/de-HVO every
> minute. So with NMI IPIs, IMO, the performance impact would be
> acceptable to our use cases.
We have many use cases, so I'm not thinking about a specific one but
rather a generic one. I will test the performance impact of different HVO
trigger frequencies, such as triggering HVO while running redis.