From: Yu Zhao <yuzhao@google.com>
Date: Thu, 27 Jun 2024 15:03:36 -0600
Subject: Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
To: Nanyong Sun
Cc: David Rientjes, Will Deacon, Catalin Marinas, Matthew Wilcox,
 muchun.song@linux.dev, Andrew Morton, anshuman.khandual@arm.com,
 wangkefeng.wang@huawei.com, linux-arm-kernel@lists.infradead.org,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org, Yosry Ahmed, Sourav Panda
In-Reply-To: <17232655-553d-7d48-8ba1-5425e8ab0f8b@huawei.com>
References: <20240113094436.2506396-1-sunnanyong@huawei.com>
 <20240207111252.GA22167@willie-the-truck>
 <44075bc2-ac5f-ffcd-0d2f-4093351a6151@huawei.com>
 <20240208131734.GA23428@willie-the-truck>
 <22c14513-af78-0f1d-5647-384ff9cb5993@huawei.com>
 <17232655-553d-7d48-8ba1-5425e8ab0f8b@huawei.com>
On Thu, Jun 27, 2024 at 8:34 AM Nanyong Sun wrote:
>
>
> On 2024/6/24 13:39, Yu Zhao wrote:
> > On Mon, Mar 25, 2024 at 11:24:34PM +0800, Nanyong Sun wrote:
> >> On 2024/3/14 7:32, David Rientjes wrote:
> >>
> >>> On Thu, 8 Feb 2024, Will Deacon wrote:
> >>>
> >>>>> How about taking a new lock with IRQs disabled during BBM, like:
> >>>>>
> >>>>> +void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte)
> >>>>> +{
> >>>>> +	spin_lock_irq(NEW_LOCK);
> >>>>> +	pte_clear(&init_mm, addr, ptep);
> >>>>> +	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
> >>>>> +	set_pte_at(&init_mm, addr, ptep, pte);
> >>>>> +	spin_unlock_irq(NEW_LOCK);
> >>>>> +}
> >>>> I really think the only maintainable way to achieve this is to avoid the
> >>>> possibility of a fault altogether.
> >>>>
> >>>> Will
> >>>>
> >>>>
> >>> Nanyong, are you still actively working on making HVO possible on arm64?
> >>>
> >>> This would yield a substantial memory savings on hosts that are largely
> >>> configured with hugetlbfs. In our case, the size of this hugetlbfs pool
> >>> is actually never changed after boot, but it sounds from the thread that
> >>> there was an idea to make HVO conditional on FEAT_BBM. Is this being
> >>> pursued?
> >>>
> >>> If so, any testing help needed?
> >> I'm afraid that FEAT_BBM may not solve the problem here
> > I think so too -- I came across this while working on TAO [1].
> >
> > [1] https://lore.kernel.org/20240229183436.4110845-4-yuzhao@google.com/
> >
> >> because, from the Arm ARM, I see that FEAT_BBM only covers changing the
> >> block size. In this HVO feature, it can therefore help in the split-PMD
> >> stage, i.e., BBM can be avoided in vmemmap_split_pmd; but in the
> >> subsequent vmemmap_remap_pte, the output address of the PTE still needs
> >> to change, and I'm afraid FEAT_BBM cannot cover that stage. Perhaps my
> >> understanding of FEAT_BBM is wrong, and I hope someone can correct me.
> >> Actually, the solution I first considered was the stop_machine()
> >> method, but we have products that rely on
> >> /proc/sys/vm/nr_overcommit_hugepages to use hugepages dynamically, so I
> >> have to consider performance. If your product does not change the number
> >> of huge pages after booting, using stop_machine() may be feasible.
> >> So far, I still haven't come up with a good solution.
> > I do have a patch that's similar to stop_machine() -- it uses NMI IPIs
> > to pause/resume remote CPUs while the local one is doing BBM.
> >
> > Note that the problem of updating vmemmap for struct page[], as I see
> > it, goes beyond hugeTLB HVO. I think it impacts virtio-mem and memory
> > hot removal in general [2]. On arm64, we would need to support BBM on
> > vmemmap so that we can fix the problem with offlining memory (or, to be
> > precise, unmapping offlined struct page[]) by mapping offlined struct
> > page[] to a read-only page of dummy struct page[], similar to
> > ZERO_PAGE(). (Otherwise we would have to make extremely invasive
> > changes to the reader side, i.e., all speculative PFN walkers.)
> >
> > In case you are interested in testing my approach, you can swap your
> > patch 2 with the following:
> I don't have an NMI-IPI-capable ARM machine on hand, so I think this
> feature depends on a newer version of the ARM CPU.

(Pseudo) NMI does require GICv3 (released in 2015), but that's independent
of CPU versions. Just to double-check: you don't have GICv3 (rather than
not having CONFIG_ARM64_PSEUDO_NMI=y or irqchip.gicv3_pseudo_nmi=1), is
that correct?

Even without GICv3, IPIs can be masked but still work, with a less bounded
latency.

> What I worried about was that other cores would occasionally be
> interrupted frequently (8 times every 2M and 4096 times every 1G) and
> then wait for the update of the page table to complete before resuming.

Catalin has suggested batching, and to echo what he said [1]: it's
possible to combine all vmemmap changes from a single HVO/de-HVO operation
into *one batch*.

[1] https://lore.kernel.org/linux-mm/ZcN7P0CGUOOgki71@arm.com/

> If there are workloads running on other cores, performance may be
> affected. This implementation speeds up stopping and resuming the other
> cores, but they still have to wait for the update to finish.

How often does your use case trigger HVO/de-HVO operations?
For our VM use case, it's generally correlated with VM lifetimes, i.e.,
how often VM bin-packing happens. For our THP use case, it can be more
frequent, but I still don't think we would trigger HVO/de-HVO every
minute. So with NMI IPIs, IMO, the performance impact would be acceptable
for our use cases.