From: Yu Zhao <yuzhao@google.com>
To: Nanyong Sun <sunnanyong@huawei.com>
Cc: David Rientjes <rientjes@google.com>,
	Will Deacon <will@kernel.org>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Matthew Wilcox <willy@infradead.org>,
	muchun.song@linux.dev, Andrew Morton <akpm@linux-foundation.org>,
	anshuman.khandual@arm.com, wangkefeng.wang@huawei.com,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Yosry Ahmed <yosryahmed@google.com>,
	Sourav Panda <souravpanda@google.com>
Subject: Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
Date: Thu, 4 Jul 2024 13:45:31 -0600	[thread overview]
Message-ID: <CAOUHufbq5LN+crcZ0hV1uKNCvSa2Z9fmc55SvpEqb1CTqDRSTw@mail.gmail.com> (raw)
In-Reply-To: <06252b78-2b61-73d1-ddf8-920dd744c756@huawei.com>

On Thu, Jul 4, 2024 at 5:47 AM Nanyong Sun <sunnanyong@huawei.com> wrote:
>
> On 2024/6/28 5:03, Yu Zhao wrote:
> > On Thu, Jun 27, 2024 at 8:34 AM Nanyong Sun <sunnanyong@huawei.com> wrote:
> >>
> >> On 2024/6/24 13:39, Yu Zhao wrote:
> >>> On Mon, Mar 25, 2024 at 11:24:34PM +0800, Nanyong Sun wrote:
> >>>> On 2024/3/14 7:32, David Rientjes wrote:
> >>>>
> >>>>> On Thu, 8 Feb 2024, Will Deacon wrote:
> >>>>>
> >>>>>>> How about taking a new lock with irq disabled during BBM, like:
> >>>>>>>
> >>>>>>> +void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte)
> >>>>>>> +{
> >>>>>>> +    spin_lock_irq(NEW_LOCK);
> >>>>>>> +    pte_clear(&init_mm, addr, ptep);
> >>>>>>> +    flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
> >>>>>>> +    set_pte_at(&init_mm, addr, ptep, pte);
> >>>>>>> +    spin_unlock_irq(NEW_LOCK);
> >>>>>>> +}
> >>>>>> I really think the only maintainable way to achieve this is to avoid the
> >>>>>> possibility of a fault altogether.
> >>>>>>
> >>>>>> Will
> >>>>>>
> >>>>>>
> >>>>> Nanyong, are you still actively working on making HVO possible on arm64?
> >>>>>
> >>>>> This would yield a substantial memory savings on hosts that are largely
> >>>>> configured with hugetlbfs.  In our case, the size of this hugetlbfs pool
> >>>>> is actually never changed after boot, but it sounds from the thread that
> >>>>> there was an idea to make HVO conditional on FEAT_BBM.  Is this being
> >>>>> pursued?
> >>>>>
> >>>>> If so, any testing help needed?
> >>>> I'm afraid that FEAT_BBM may not solve the problem here
> >>> I think so too -- I came cross this while working on TAO [1].
> >>>
> >>> [1] https://lore.kernel.org/20240229183436.4110845-4-yuzhao@google.com/
> >>>
> >>>> because from the Arm ARM, I see that FEAT_BBM is only used for changing
> >>>> the block size. Therefore, in this HVO feature it can help in the
> >>>> split-PMD stage, that is, BBM can be avoided in vmemmap_split_pmd, but in
> >>>> the subsequent vmemmap_remap_pte the output address of the PTE still
> >>>> needs to be changed, and I'm afraid FEAT_BBM cannot handle that stage.
> >>>> Perhaps my understanding of the Arm FEAT_BBM is wrong, and I hope someone
> >>>> can correct me.
> >>>> Actually, the solution I first considered was to use the stop_machine()
> >>>> method, but we have products that rely on
> >>>> /proc/sys/vm/nr_overcommit_hugepages to dynamically use hugepages, so I
> >>>> have to consider the performance impact. If your product does not change
> >>>> the number of huge pages after booting, using stop_machine() may be a
> >>>> feasible way.
> >>>> So far, I still haven't come up with a good solution.
> >>> I do have a patch that's similar to stop_machine() -- it uses NMI IPIs
> >>> to pause/resume remote CPUs while the local one is doing BBM.
> >>>
> >>> Note that the problem of updating vmemmap for struct page[], as I see
> >>> it, is beyond hugeTLB HVO. I think it impacts virtio-mem and memory
> >>> hot removal in general [2]. On arm64, we would need to support BBM on
> >>> vmemmap so that we can fix the problem with offlining memory (or to be
> >>> precise, unmapping offlined struct page[]), by mapping offlined struct
> >>> page[] to a read-only page of dummy struct page[], similar to
> >>> ZERO_PAGE(). (Or we would have to make extremely invasive changes to
> >>> the reader side, i.e., all speculative PFN walkers.)
> >>>
> >>> In case you are interested in testing my approach, you can swap your
> >>> patch 2 with the following:
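
(The patch itself is trimmed from the quote above. Purely as an
illustration of the pause/resume idea, and not the referenced patch, a
rough sketch using a plain-IPI rendezvous and made-up names could look
like the code below; the real approach needs NMI/pseudo-NMI delivery so
that CPUs running with IRQs disabled are paused too, plus CPU-hotplug
and reentrancy handling.)

#include <linux/atomic.h>
#include <linux/mm.h>
#include <linux/smp.h>
#include <asm/tlbflush.h>

static atomic_t bbm_cpus_paused;	/* remote CPUs currently spinning */
static atomic_t bbm_in_progress;	/* gate holding them there */

/* Runs on every other CPU; spin until the local CPU finishes BBM. */
static void bbm_pause_ipi(void *unused)
{
	atomic_inc(&bbm_cpus_paused);
	while (atomic_read(&bbm_in_progress))
		cpu_relax();
	atomic_dec(&bbm_cpus_paused);
}

void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte)
{
	int others = num_online_cpus() - 1;

	atomic_set(&bbm_in_progress, 1);
	smp_call_function(bbm_pause_ipi, NULL, 0);

	/* Wait until every other CPU has entered the IPI handler. */
	while (atomic_read(&bbm_cpus_paused) != others)
		cpu_relax();

	/* BBM window: remote CPUs cannot fault on vmemmap here. */
	pte_clear(&init_mm, addr, ptep);
	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
	set_pte_at(&init_mm, addr, ptep, pte);

	/* Release the remote CPUs and wait for them to leave the handler. */
	atomic_set(&bbm_in_progress, 0);
	while (atomic_read(&bbm_cpus_paused))
		cpu_relax();
}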
> >> I don't have an NMI-IPI-capable ARM machine on hand, so I think this
> >> feature depends on a newer version of the ARM CPU.
> > (Pseudo) NMI does require GICv3 (released in 2015), but that's
> > independent of CPU versions. Just to double-check: you don't have
> > GICv3 (rather than not having CONFIG_ARM64_PSEUDO_NMI=y or
> > irqchip.gicv3_pseudo_nmi=1), is that correct?
> >
> > Even without GICv3, IPIs can be masked but still work, with a less
> > bounded latency.
> Oh, I misunderstood. Pseudo NMI is available. We have
> CONFIG_ARM64_PSEUDO_NMI=y but did not set irqchip.gicv3_pseudo_nmi=1 by
> default, so I can test this solution after enabling it on the kernel
> command line.
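
(For reference, enabling it is just a matter of appending
irqchip.gicv3_pseudo_nmi=1 to the kernel command line, e.g. in the
bootloader configuration; an illustrative GRUB example:

  GRUB_CMDLINE_LINUX="... irqchip.gicv3_pseudo_nmi=1"

followed by the usual update-grub/grub2-mkconfig step.)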
>
> >> What I worried about was that other cores would be interrupted frequently
> >> (8 times for every 2M hugepage and 4096 times for every 1G hugepage) and
> >> then have to wait for the page table update to complete before resuming.
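
(For reference, assuming 4K base pages and a 64-byte struct page, those
counts correspond to one BBM window per vmemmap page backing the
hugepage's struct pages: 2M / 4K = 512 struct pages * 64 B = 32 KB = 8
vmemmap pages, and 1G / 4K = 262144 struct pages * 64 B = 16 MB = 4096
vmemmap pages.)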
> > Catalin has suggested batching, and to echo what he said [1]: it's
> > possible to make all vmemmap changes from a single HVO/de-HVO
> > operation into *one batch*.
> >
> > [1] https://lore.kernel.org/linux-mm/ZcN7P0CGUOOgki71@arm.com/
> >
> >> If there are workloads
> >> running on other cores, performance may be affected. This implementation
> >> speeds up stopping and resuming other cores, but they still have to wait
> >> for the update to finish.
> > How often does your use case trigger HVO/de-HVO operations?
> >
> > For our VM use case, it's generally correlated to VM lifetimes, i.e.,
> > how often VM bin-packing happens. For our THP use case, it can be more
> > often, but I still don't think we would trigger HVO/de-HVO every
> > minute. So with NMI IPIs, IMO, the performance impact would be
> > acceptable to our use cases.
> >
> We have many use cases, so I'm not thinking about a specific use case but
> rather a generic one. I will test the performance impact of different HVO
> trigger frequencies, such as triggering HVO while running Redis.

Thanks, and if it's not good enough for whatever you are going to
test, we can batch the updates at least at the PTE level, or even at
the PMD level.
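
For instance, a PTE-level batch could, as a rough sketch with made-up
names, clear a whole run of vmemmap PTEs, issue one
flush_tlb_kernel_range() over the range, and then write the new entries,
all inside a single paused window:

/*
 * Hypothetical sketch only: one pause/resume window and one TLB flush
 * for a run of vmemmap PTEs, instead of one per PTE.
 */
void vmemmap_update_pte_range(unsigned long start, unsigned long end,
			      pte_t *ptep, pte_t *new_ptes)
{
	unsigned long addr;
	int i;

	/* pause remote CPUs, as in the earlier sketch */

	for (addr = start, i = 0; addr < end; addr += PAGE_SIZE, i++)
		pte_clear(&init_mm, addr, ptep + i);

	flush_tlb_kernel_range(start, end);

	for (addr = start, i = 0; addr < end; addr += PAGE_SIZE, i++)
		set_pte_at(&init_mm, addr, ptep + i, new_ptes[i]);

	/* resume remote CPUs */
}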


Thread overview: 43+ messages
2024-01-13  9:44 Nanyong Sun
2024-01-13  9:44 ` [PATCH v3 1/3] mm: HVO: introduce helper function to update and flush pgtable Nanyong Sun
2024-01-13  9:44 ` [PATCH v3 2/3] arm64: mm: HVO: support BBM of vmemmap pgtable safely Nanyong Sun
2024-01-15  2:38   ` Muchun Song
2024-02-07 12:21   ` Mark Rutland
2024-02-08  9:30     ` Nanyong Sun
2024-01-13  9:44 ` [PATCH v3 3/3] arm64: mm: Re-enable OPTIMIZE_HUGETLB_VMEMMAP Nanyong Sun
2024-01-25 18:06 ` [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize Catalin Marinas
2024-01-27  5:04   ` Nanyong Sun
2024-02-07 11:12     ` Will Deacon
2024-02-07 11:21       ` Matthew Wilcox
2024-02-07 12:11         ` Will Deacon
2024-02-07 12:24           ` Mark Rutland
2024-02-07 14:17           ` Matthew Wilcox
2024-02-08  2:24             ` Jane Chu
2024-02-08 15:49               ` Matthew Wilcox
2024-02-08 19:21                 ` Jane Chu
2024-02-11 11:59                 ` Muchun Song
2024-06-05 20:50                   ` Yu Zhao
2024-06-06  8:30                     ` David Hildenbrand
2024-06-07 16:55                       ` Frank van der Linden
2024-02-07 12:20         ` Catalin Marinas
2024-02-08  9:44           ` Nanyong Sun
2024-02-08 13:17             ` Will Deacon
2024-03-13 23:32               ` David Rientjes
2024-03-25 15:24                 ` Nanyong Sun
2024-03-26 12:54                   ` Will Deacon
2024-06-24  5:39                   ` Yu Zhao
2024-06-27 14:33                     ` Nanyong Sun
2024-06-27 21:03                       ` Yu Zhao
2024-07-04 11:47                         ` Nanyong Sun
2024-07-04 19:45                           ` Yu Zhao [this message]
2024-02-07 12:44     ` Catalin Marinas
2024-06-27 21:19       ` Yu Zhao
2024-07-05 15:49         ` Catalin Marinas
2024-07-05 17:41           ` Yu Zhao
2024-07-10 16:51             ` Catalin Marinas
2024-07-10 17:12               ` Yu Zhao
2024-07-10 22:29                 ` Catalin Marinas
2024-07-10 23:07                   ` Yu Zhao
2024-07-11  8:31                     ` Yu Zhao
2024-07-11 11:39                       ` Catalin Marinas
2024-07-11 17:38                         ` Yu Zhao
