Subject: Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
From: Nanyong Sun <sunnanyong@huawei.com>
To: Yu Zhao
Cc: David Rientjes, Will Deacon, Catalin Marinas, Matthew Wilcox,
 Andrew Morton, Yosry Ahmed, Sourav Panda, linux-mm@kvack.org
Date: Thu, 4 Jul 2024 19:47:01 +0800
Message-ID: <06252b78-2b61-73d1-ddf8-920dd744c756@huawei.com>
References: <20240113094436.2506396-1-sunnanyong@huawei.com>
 <20240207111252.GA22167@willie-the-truck>
 <44075bc2-ac5f-ffcd-0d2f-4093351a6151@huawei.com>
 <20240208131734.GA23428@willie-the-truck>
 <22c14513-af78-0f1d-5647-384ff9cb5993@huawei.com>
 <17232655-553d-7d48-8ba1-5425e8ab0f8b@huawei.com>
On 2024/6/28 5:03, Yu Zhao wrote:
> On Thu, Jun 27, 2024 at 8:34 AM Nanyong Sun <sunnanyong@huawei.com> wrote:
>>
>> On 2024/6/24 13:39, Yu Zhao wrote:
>>> On Mon, Mar 25, 2024 at 11:24:34PM +0800, Nanyong Sun wrote:
>>>> On 2024/3/14 7:32, David Rientjes wrote:
>>>>
>>>>> On Thu, 8 Feb 2024, Will Deacon wrote:
>>>>>
>>>>>>> How about taking a new lock with IRQs disabled during BBM, like:
>>>>>>>
>>>>>>> +void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte)
>>>>>>> +{
>>>>>>> +	spin_lock_irq(NEW_LOCK);
>>>>>>> +	pte_clear(&init_mm, addr, ptep);
>>>>>>> +	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
>>>>>>> +	set_pte_at(&init_mm, addr, ptep, pte);
>>>>>>> +	spin_unlock_irq(NEW_LOCK);
>>>>>>> +}
>>>>>> I really think the only maintainable way to achieve this is to avoid the
>>>>>> possibility of a fault altogether.
>>>>>>
>>>>>> Will
>>>>>>
>>>>> Nanyong, are you still actively working on making HVO possible on arm64?
>>>>>
>>>>> This would yield a substantial memory savings on hosts that are largely
>>>>> configured with hugetlbfs. In our case, the size of this hugetlbfs pool
>>>>> is actually never changed after boot, but it sounds from the thread that
>>>>> there was an idea to make HVO conditional on FEAT_BBM. Is this being
>>>>> pursued?
>>>>>
>>>>> If so, any testing help needed?
>>>> I'm afraid that FEAT_BBM may not solve the problem here
>>> I think so too -- I came across this while working on TAO [1].
>>>
>>> [1] https://lore.kernel.org/20240229183436.4110845-4-yuzhao@google.com/
>>>
>>>> because from the Arm ARM, I see that FEAT_BBM is only used for changing
>>>> block size. Therefore, in this HVO feature, it can work in the split-PMD
>>>> stage, that is, BBM can be avoided in vmemmap_split_pmd, but in the
>>>> subsequent vmemmap_remap_pte, the output address of the PTE still needs
>>>> to be changed. I'm afraid FEAT_BBM cannot cover this stage. Perhaps my
>>>> understanding of Arm FEAT_BBM is wrong, and I hope someone can correct me.
>>>> Actually, the solution I first considered was the stop_machine()
>>>> method, but we have products that rely on
>>>> /proc/sys/vm/nr_overcommit_hugepages to use hugepages dynamically, so I
>>>> have to consider the performance impact. If your product does not change
>>>> the number of huge pages after booting, using stop_machine() may be a
>>>> feasible way. So far, I still haven't come up with a good solution.
>>> I do have a patch that's similar to stop_machine() -- it uses NMI IPIs
>>> to pause/resume remote CPUs while the local one is doing BBM.
>>>
>>> Note that the problem of updating vmemmap for struct page[], as I see
>>> it, is beyond hugeTLB HVO. I think it impacts virtio-mem and memory
>>> hot removal in general [2]. On arm64, we would need to support BBM on
>>> vmemmap so that we can fix the problem with offlining memory (or, to be
>>> precise, unmapping offlined struct page[]) by mapping offlined struct
>>> page[] to a read-only page of dummy struct page[], similar to
>>> ZERO_PAGE(). (Or we would have to make extremely invasive changes to
>>> the reader side, i.e., all speculative PFN walkers.)
>>>
>>> In case you are interested in testing my approach, you can swap your
>>> patch 2 with the following:
>> I don't have an NMI-IPI-capable ARM machine on hand, so I think this
>> feature depends on a newer version of the ARM CPU.
> (Pseudo) NMI does require GICv3 (released in 2015), but that's
> independent of CPU versions. Just to double-check: you don't have
> GICv3 (rather than not having CONFIG_ARM64_PSEUDO_NMI=y or
> irqchip.gicv3_pseudo_nmi=1), is that correct?
>
> Even without GICv3, IPIs can be masked but still work, with a less
> bounded latency.
Oh, I misunderstood. Pseudo-NMI is available. We have
CONFIG_ARM64_PSEUDO_NMI=y but do not set irqchip.gicv3_pseudo_nmi=1 by
default, so I can test this solution after enabling it on the kernel
command line.
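For reference, the two knobs named above combine as follows (a sketch only;
where the command line is set depends on the bootloader and distribution):

```
# Build-time: kernel configured with pseudo-NMI support
CONFIG_ARM64_PSEUDO_NMI=y

# Boot-time: append to the kernel command line to actually enable it
irqchip.gicv3_pseudo_nmi=1
```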
>> What I worried about was that other cores would occasionally be
>> interrupted frequently (8 times every 2M and 4096 times every 1G) and
>> then wait for the page-table update to complete before resuming.
> Catalin has suggested batching, and to echo what he said [1]: it's
> possible to make all vmemmap changes from a single HVO/de-HVO
> operation into *one batch*.
>
> [1] https://lore.kernel.org/linux-mm/ZcN7P0CGUOOgki71@arm.com/
>
>> If there are workloads running on other cores, performance may be
>> affected. This implementation speeds up stopping and resuming other
>> cores, but they still have to wait for the update to finish.
> How often does your use case trigger HVO/de-HVO operations?
>
> For our VM use case, it's generally correlated with VM lifetimes, i.e.,
> how often VM bin-packing happens. For our THP use case, it can be more
> often, but I still don't think we would trigger HVO/de-HVO every
> minute. So with NMI IPIs, IMO, the performance impact would be
> acceptable to our use cases.
We have many use cases, so I'm not thinking about a specific one but
rather a generic one. I will test the performance impact of different HVO
trigger frequencies, such as triggering HVO while running redis.