From: Yu Zhao <yuzhao@google.com>
Date: Thu, 4 Jul 2024 13:45:31 -0600
Subject: Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
To: Nanyong Sun
Cc: David Rientjes, Will Deacon, Catalin Marinas, Matthew Wilcox,
    muchun.song@linux.dev, Andrew Morton, anshuman.khandual@arm.com,
    wangkefeng.wang@huawei.com, linux-arm-kernel@lists.infradead.org,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org, Yosry Ahmed,
    Sourav Panda
On Thu, Jul 4, 2024 at 5:47 AM Nanyong Sun wrote:
>
> On 2024/6/28 5:03, Yu Zhao wrote:
> > On Thu, Jun 27, 2024 at 8:34 AM Nanyong Sun wrote:
> >>
> >> On 2024/6/24 13:39, Yu Zhao wrote:
> >>> On Mon, Mar 25, 2024 at 11:24:34PM +0800, Nanyong Sun wrote:
> >>>> On 2024/3/14 7:32, David Rientjes wrote:
> >>>>
> >>>>> On Thu, 8 Feb 2024, Will Deacon wrote:
> >>>>>
> >>>>>>> How about taking a new lock with IRQs disabled during BBM, like:
> >>>>>>>
> >>>>>>> +void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte)
> >>>>>>> +{
> >>>>>>> +	spin_lock_irq(NEW_LOCK);
> >>>>>>> +	pte_clear(&init_mm, addr, ptep);
> >>>>>>> +	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
> >>>>>>> +	set_pte_at(&init_mm, addr, ptep, pte);
> >>>>>>> +	spin_unlock_irq(NEW_LOCK);
> >>>>>>> +}
> >>>>>>
> >>>>>> I really think the only maintainable way to achieve this is to
> >>>>>> avoid the possibility of a fault altogether.
> >>>>>>
> >>>>>> Will
> >>>>>>
> >>>>> Nanyong, are you still actively working on making HVO possible on
> >>>>> arm64?
> >>>>>
> >>>>> This would yield substantial memory savings on hosts that are
> >>>>> largely configured with hugetlbfs. In our case, the size of this
> >>>>> hugetlbfs pool is actually never changed after boot, but it sounds
> >>>>> from the thread that there was an idea to make HVO conditional on
> >>>>> FEAT_BBM. Is this being pursued?
> >>>>>
> >>>>> If so, any testing help needed?
> >>>> I'm afraid that FEAT_BBM may not solve the problem here,
> >>> I think so too -- I came across this while working on TAO [1].
> >>>
> >>> [1] https://lore.kernel.org/20240229183436.4110845-4-yuzhao@google.com/
> >>>
> >>>> because from the Arm ARM (Architecture Reference Manual), I see that
> >>>> FEAT_BBM is only used for changing the block size. Therefore, in
> >>>> this HVO feature, it can work in the split-PMD stage, that is, BBM
> >>>> can be avoided in vmemmap_split_pmd; but in the subsequent
> >>>> vmemmap_remap_pte, the output address of the PTE still needs to be
> >>>> changed, and I'm afraid FEAT_BBM cannot cover that stage. Perhaps my
> >>>> understanding of Arm FEAT_BBM is wrong, and I hope someone can
> >>>> correct me.
> >>>> Actually, the solution I first considered was the stop_machine()
> >>>> method, but we have products that rely on
> >>>> /proc/sys/vm/nr_overcommit_hugepages to use hugepages dynamically,
> >>>> so I have to consider performance. If your product does not change
> >>>> the amount of huge pages after booting, using stop_machine() may be
> >>>> a feasible way.
> >>>> So far, I still haven't come up with a good solution.
> >>> I do have a patch that's similar to stop_machine() -- it uses NMI
> >>> IPIs to pause/resume remote CPUs while the local one is doing BBM.
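As a rough illustration of that pause/resume idea, here is a minimal
sketch built on the stock stop_machine() API rather than NMI IPIs -- it
is not the actual patch, and the argument struct and helper names
(__vmemmap_update_pte(), vmemmap_update_pte_stopped()) are made up for
illustration:

#include <linux/mm.h>
#include <linux/pgtable.h>
#include <linux/stop_machine.h>
#include <asm/tlbflush.h>

/* Hypothetical argument block for the callback run by stop_machine(). */
struct vmemmap_update_args {
	unsigned long addr;
	pte_t *ptep;
	pte_t pte;
};

static int __vmemmap_update_pte(void *data)
{
	struct vmemmap_update_args *args = data;

	/* Break: tear down the old translation and flush the TLB. */
	pte_clear(&init_mm, args->addr, args->ptep);
	flush_tlb_kernel_range(args->addr, args->addr + PAGE_SIZE);

	/* Make: install the new translation. */
	set_pte_at(&init_mm, args->addr, args->ptep, args->pte);
	return 0;
}

/*
 * All other CPUs spin with IRQs disabled while the callback runs, so
 * none of them can take a fault on the transiently invalid PTE.
 */
static void vmemmap_update_pte_stopped(unsigned long addr, pte_t *ptep,
				       pte_t pte)
{
	struct vmemmap_update_args args = {
		.addr = addr,
		.ptep = ptep,
		.pte  = pte,
	};

	stop_machine(__vmemmap_update_pte, &args, NULL);
}

The trade-off is exactly the one discussed below: every online CPU is
held for the duration of each update, which is why the frequency of
HVO/de-HVO operations matters.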
> >>> Note that the problem of updating the vmemmap for struct page[], as
> >>> I see it, goes beyond hugeTLB HVO. I think it impacts virtio-mem and
> >>> memory hot removal in general [2]. On arm64, we would need to
> >>> support BBM on the vmemmap so that we can fix the problem with
> >>> offlining memory (or, to be precise, unmapping offlined struct
> >>> page[]) by mapping offlined struct page[] to a read-only page of
> >>> dummy struct page[], similar to ZERO_PAGE(). (Or we would have to
> >>> make extremely invasive changes to the reader side, i.e., all
> >>> speculative PFN walkers.)
> >>>
> >>> In case you are interested in testing my approach, you can swap your
> >>> patch 2 with the following:
> >> I don't have an NMI-IPI-capable Arm machine on hand, so I think this
> >> feature depends on a newer Arm CPU.
> > (Pseudo) NMI does require GICv3 (released in 2015), but that's
> > independent of the CPU version. Just to double-check: you don't have
> > GICv3 (rather than not having CONFIG_ARM64_PSEUDO_NMI=y or
> > irqchip.gicv3_pseudo_nmi=1), is that correct?
> >
> > Even without GICv3, IPIs can be masked but still work, with a less
> > bounded latency.
> Oh, I misunderstood. Pseudo-NMI is available: we have
> CONFIG_ARM64_PSEUDO_NMI=y but did not set irqchip.gicv3_pseudo_nmi=1 by
> default, so I can test this solution after enabling it on the kernel
> command line.
>
> >> What I worried about was that other cores would be interrupted
> >> frequently (8 times for every 2MB and 4096 times for every 1GB) and
> >> then have to wait for the page table update to complete before
> >> resuming.
> > Catalin has suggested batching, and to echo what he said [1]: it's
> > possible to make all vmemmap changes from a single HVO/de-HVO
> > operation into *one batch*.
> >
> > [1] https://lore.kernel.org/linux-mm/ZcN7P0CGUOOgki71@arm.com/
> >
> >> If there are workloads running on the other cores, performance may
> >> be affected. This implementation speeds up stopping and resuming the
> >> other cores, but they still have to wait for the update to finish.
> > How often does your use case trigger HVO/de-HVO operations?
> >
> > For our VM use case, it's generally correlated with VM lifetimes,
> > i.e., how often VM bin-packing happens. For our THP use case, it can
> > be more frequent, but I still don't think we would trigger HVO/de-HVO
> > every minute. So with NMI IPIs, IMO, the performance impact would be
> > acceptable for our use cases.
> We have many use cases, so I'm not thinking about a specific use case
> but rather a generic one. I will test the performance impact of
> different HVO trigger frequencies, such as triggering HVO while running
> Redis.

Thanks, and if it's not good enough for whatever you are going to test,
we can batch the updates at least at the PTE level, or even at the PMD
level.
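For concreteness, PTE-level batching could look roughly like the sketch
below: one pause and one TLB flush cover all the PTEs remapped by a
single HVO/de-HVO operation, instead of one pause per PTE. This is only
an illustration -- the batch descriptor and the function names are
hypothetical, and it assumes the PTEs covering [start, end) sit
consecutively in one page table (true within a 2MB-aligned range):

#include <linux/mm.h>
#include <linux/pgtable.h>
#include <linux/stop_machine.h>
#include <asm/tlbflush.h>

/* Hypothetical batch descriptor covering [start, end) of the vmemmap. */
struct vmemmap_remap_batch {
	unsigned long start;
	unsigned long end;
	pte_t *ptep;	/* consecutive PTEs covering [start, end) */
	pte_t pte;	/* new mapping, e.g. the shared vmemmap page */
};

static int __vmemmap_remap_batch(void *data)
{
	struct vmemmap_remap_batch *b = data;
	pte_t *ptep = b->ptep;
	unsigned long addr;

	/* Break every PTE in the range, then flush the TLB once. */
	for (addr = b->start; addr < b->end; addr += PAGE_SIZE)
		pte_clear(&init_mm, addr, ptep++);
	flush_tlb_kernel_range(b->start, b->end);

	/* Make: point every PTE in the range at the new page. */
	ptep = b->ptep;
	for (addr = b->start; addr < b->end; addr += PAGE_SIZE)
		set_pte_at(&init_mm, addr, ptep++, b->pte);
	return 0;
}

/* Remote CPUs are paused once for the whole batch. */
static void vmemmap_remap_batch_stopped(struct vmemmap_remap_batch *b)
{
	stop_machine(__vmemmap_remap_batch, b, NULL);
}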