* [PATCH v1 1/6] mm/hugetlb_vmemmap: batch update PTEs
2024-10-21 4:22 [PATCH v1 0/6] mm/arm64: re-enable HVO Yu Zhao
@ 2024-10-21 4:22 ` Yu Zhao
2024-10-21 4:22 ` [PATCH v1 2/6] mm/hugetlb_vmemmap: add arch-independent helpers Yu Zhao
` (4 subsequent siblings)
5 siblings, 0 replies; 18+ messages in thread
From: Yu Zhao @ 2024-10-21 4:22 UTC (permalink / raw)
To: Andrew Morton, Catalin Marinas, Marc Zyngier, Muchun Song,
Thomas Gleixner, Will Deacon
Cc: Douglas Anderson, Mark Rutland, Nanyong Sun, linux-arm-kernel,
linux-kernel, linux-mm, Yu Zhao
Convert vmemmap_remap_walk->remap_pte to ->remap_pte_range so that
vmemmap remap walks can batch update PTEs.
The goal of this conversion is to allow architectures to implement
their own optimizations where possible, e.g., stopping remote CPUs only
once per batch when updating the vmemmap on arm64. It is not intended
to change the remap workflow, nor should it by itself have any side
effects on performance.
Signed-off-by: Yu Zhao <yuzhao@google.com>
---
mm/hugetlb_vmemmap.c | 163 ++++++++++++++++++++++++-------------------
1 file changed, 91 insertions(+), 72 deletions(-)
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 57b7f591eee8..46befab48d41 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -22,7 +22,7 @@
/**
* struct vmemmap_remap_walk - walk vmemmap page table
*
- * @remap_pte: called for each lowest-level entry (PTE).
+ * @remap_pte_range: called on a range of PTEs.
* @nr_walked: the number of walked pte.
* @reuse_page: the page which is reused for the tail vmemmap pages.
* @reuse_addr: the virtual address of the @reuse_page page.
@@ -32,8 +32,8 @@
* operations.
*/
struct vmemmap_remap_walk {
- void (*remap_pte)(pte_t *pte, unsigned long addr,
- struct vmemmap_remap_walk *walk);
+ void (*remap_pte_range)(pte_t *pte, unsigned long start,
+ unsigned long end, struct vmemmap_remap_walk *walk);
unsigned long nr_walked;
struct page *reuse_page;
unsigned long reuse_addr;
@@ -101,10 +101,6 @@ static int vmemmap_pmd_entry(pmd_t *pmd, unsigned long addr,
struct page *head;
struct vmemmap_remap_walk *vmemmap_walk = walk->private;
- /* Only splitting, not remapping the vmemmap pages. */
- if (!vmemmap_walk->remap_pte)
- walk->action = ACTION_CONTINUE;
-
spin_lock(&init_mm.page_table_lock);
head = pmd_leaf(*pmd) ? pmd_page(*pmd) : NULL;
/*
@@ -129,33 +125,36 @@ static int vmemmap_pmd_entry(pmd_t *pmd, unsigned long addr,
ret = -ENOTSUPP;
}
spin_unlock(&init_mm.page_table_lock);
- if (!head || ret)
+ if (ret)
return ret;
- return vmemmap_split_pmd(pmd, head, addr & PMD_MASK, vmemmap_walk);
-}
+ if (head) {
+ ret = vmemmap_split_pmd(pmd, head, addr & PMD_MASK, vmemmap_walk);
+ if (ret)
+ return ret;
+ }
-static int vmemmap_pte_entry(pte_t *pte, unsigned long addr,
- unsigned long next, struct mm_walk *walk)
-{
- struct vmemmap_remap_walk *vmemmap_walk = walk->private;
+ if (vmemmap_walk->remap_pte_range) {
+ pte_t *pte = pte_offset_kernel(pmd, addr);
- /*
- * The reuse_page is found 'first' in page table walking before
- * starting remapping.
- */
- if (!vmemmap_walk->reuse_page)
- vmemmap_walk->reuse_page = pte_page(ptep_get(pte));
- else
- vmemmap_walk->remap_pte(pte, addr, vmemmap_walk);
- vmemmap_walk->nr_walked++;
+ vmemmap_walk->nr_walked += (next - addr) / PAGE_SIZE;
+ /*
+ * The reuse_page is found 'first' in page table walking before
+ * starting remapping.
+ */
+ if (!vmemmap_walk->reuse_page) {
+ vmemmap_walk->reuse_page = pte_page(ptep_get(pte));
+ pte++;
+ addr += PAGE_SIZE;
+ }
+ vmemmap_walk->remap_pte_range(pte, addr, next, vmemmap_walk);
+ }
return 0;
}
static const struct mm_walk_ops vmemmap_remap_ops = {
.pmd_entry = vmemmap_pmd_entry,
- .pte_entry = vmemmap_pte_entry,
};
static int vmemmap_remap_range(unsigned long start, unsigned long end,
@@ -172,7 +171,7 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
if (ret)
return ret;
- if (walk->remap_pte && !(walk->flags & VMEMMAP_REMAP_NO_TLB_FLUSH))
+ if (walk->remap_pte_range && !(walk->flags & VMEMMAP_REMAP_NO_TLB_FLUSH))
flush_tlb_kernel_range(start, end);
return 0;
@@ -204,33 +203,45 @@ static void free_vmemmap_page_list(struct list_head *list)
free_vmemmap_page(page);
}
-static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
- struct vmemmap_remap_walk *walk)
+static void vmemmap_remap_pte_range(pte_t *pte, unsigned long start, unsigned long end,
+ struct vmemmap_remap_walk *walk)
{
- /*
- * Remap the tail pages as read-only to catch illegal write operation
- * to the tail pages.
- */
- pgprot_t pgprot = PAGE_KERNEL_RO;
- struct page *page = pte_page(ptep_get(pte));
- pte_t entry;
-
- /* Remapping the head page requires r/w */
- if (unlikely(addr == walk->reuse_addr)) {
- pgprot = PAGE_KERNEL;
- list_del(&walk->reuse_page->lru);
+ int i;
+ struct page *page;
+ int nr_pages = (end - start) / PAGE_SIZE;
+ for (i = 0; i < nr_pages; i++) {
+ page = pte_page(ptep_get(pte + i));
+
+ list_add(&page->lru, walk->vmemmap_pages);
+ }
+
+ page = walk->reuse_page;
+
+ if (start == walk->reuse_addr) {
+ list_del(&page->lru);
+ copy_page(page_to_virt(page), (void *)walk->reuse_addr);
/*
- * Makes sure that preceding stores to the page contents from
- * vmemmap_remap_free() become visible before the set_pte_at()
- * write.
+ * Makes sure that preceding stores to the page contents become
+ * visible before set_pte_at().
*/
smp_wmb();
}
- entry = mk_pte(walk->reuse_page, pgprot);
- list_add(&page->lru, walk->vmemmap_pages);
- set_pte_at(&init_mm, addr, pte, entry);
+ for (i = 0; i < nr_pages; i++) {
+ pte_t val;
+
+ /*
+ * The head page must be mapped read-write; the tail pages are
+ * mapped read-only to catch illegal modifications.
+ */
+ if (!i && start == walk->reuse_addr)
+ val = mk_pte(page, PAGE_KERNEL);
+ else
+ val = mk_pte(page, PAGE_KERNEL_RO);
+
+ set_pte_at(&init_mm, start + PAGE_SIZE * i, pte + i, val);
+ }
}
/*
@@ -252,27 +263,39 @@ static inline void reset_struct_pages(struct page *start)
memcpy(start, from, sizeof(*from) * NR_RESET_STRUCT_PAGE);
}
-static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
- struct vmemmap_remap_walk *walk)
+static void vmemmap_restore_pte_range(pte_t *pte, unsigned long start, unsigned long end,
+ struct vmemmap_remap_walk *walk)
{
- pgprot_t pgprot = PAGE_KERNEL;
+ int i;
struct page *page;
- void *to;
-
- BUG_ON(pte_page(ptep_get(pte)) != walk->reuse_page);
+ int nr_pages = (end - start) / PAGE_SIZE;
page = list_first_entry(walk->vmemmap_pages, struct page, lru);
- list_del(&page->lru);
- to = page_to_virt(page);
- copy_page(to, (void *)walk->reuse_addr);
- reset_struct_pages(to);
+
+ for (i = 0; i < nr_pages; i++) {
+ BUG_ON(pte_page(ptep_get(pte + i)) != walk->reuse_page);
+
+ copy_page(page_to_virt(page), (void *)walk->reuse_addr);
+ reset_struct_pages(page_to_virt(page));
+
+ page = list_next_entry(page, lru);
+ }
/*
* Makes sure that preceding stores to the page contents become visible
- * before the set_pte_at() write.
+ * before set_pte_at().
*/
smp_wmb();
- set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
+
+ for (i = 0; i < nr_pages; i++) {
+ pte_t val;
+
+ page = list_first_entry(walk->vmemmap_pages, struct page, lru);
+ list_del(&page->lru);
+
+ val = mk_pte(page, PAGE_KERNEL);
+ set_pte_at(&init_mm, start + PAGE_SIZE * i, pte + i, val);
+ }
}
/**
@@ -290,7 +313,6 @@ static int vmemmap_remap_split(unsigned long start, unsigned long end,
unsigned long reuse)
{
struct vmemmap_remap_walk walk = {
- .remap_pte = NULL,
.flags = VMEMMAP_SPLIT_NO_TLB_FLUSH,
};
@@ -322,10 +344,10 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
{
int ret;
struct vmemmap_remap_walk walk = {
- .remap_pte = vmemmap_remap_pte,
- .reuse_addr = reuse,
- .vmemmap_pages = vmemmap_pages,
- .flags = flags,
+ .remap_pte_range = vmemmap_remap_pte_range,
+ .reuse_addr = reuse,
+ .vmemmap_pages = vmemmap_pages,
+ .flags = flags,
};
int nid = page_to_nid((struct page *)reuse);
gfp_t gfp_mask = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
@@ -340,8 +362,6 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
*/
walk.reuse_page = alloc_pages_node(nid, gfp_mask, 0);
if (walk.reuse_page) {
- copy_page(page_to_virt(walk.reuse_page),
- (void *)walk.reuse_addr);
list_add(&walk.reuse_page->lru, vmemmap_pages);
memmap_pages_add(1);
}
@@ -371,10 +391,9 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
* They will be restored in the following call.
*/
walk = (struct vmemmap_remap_walk) {
- .remap_pte = vmemmap_restore_pte,
- .reuse_addr = reuse,
- .vmemmap_pages = vmemmap_pages,
- .flags = 0,
+ .remap_pte_range = vmemmap_restore_pte_range,
+ .reuse_addr = reuse,
+ .vmemmap_pages = vmemmap_pages,
};
vmemmap_remap_range(reuse, end, &walk);
@@ -425,10 +444,10 @@ static int vmemmap_remap_alloc(unsigned long start, unsigned long end,
{
LIST_HEAD(vmemmap_pages);
struct vmemmap_remap_walk walk = {
- .remap_pte = vmemmap_restore_pte,
- .reuse_addr = reuse,
- .vmemmap_pages = &vmemmap_pages,
- .flags = flags,
+ .remap_pte_range = vmemmap_restore_pte_range,
+ .reuse_addr = reuse,
+ .vmemmap_pages = &vmemmap_pages,
+ .flags = flags,
};
/* See the comment in the vmemmap_remap_free(). */
--
2.47.0.rc1.288.g06298d1525-goog
* [PATCH v1 2/6] mm/hugetlb_vmemmap: add arch-independent helpers
2024-10-21 4:22 [PATCH v1 0/6] mm/arm64: re-enable HVO Yu Zhao
2024-10-21 4:22 ` [PATCH v1 1/6] mm/hugetlb_vmemmap: batch update PTEs Yu Zhao
@ 2024-10-21 4:22 ` Yu Zhao
2024-10-21 4:22 ` [PATCH v1 3/6] irqchip/gic-v3: support SGI broadcast Yu Zhao
` (3 subsequent siblings)
5 siblings, 0 replies; 18+ messages in thread
From: Yu Zhao @ 2024-10-21 4:22 UTC (permalink / raw)
To: Andrew Morton, Catalin Marinas, Marc Zyngier, Muchun Song,
Thomas Gleixner, Will Deacon
Cc: Douglas Anderson, Mark Rutland, Nanyong Sun, linux-arm-kernel,
linux-kernel, linux-mm, Yu Zhao
Add architecture-independent helpers to allow individual architectures
to work around their own limitations when updating vmemmap.
Specifically, the current remap workflow requires break-before-make
(BBM) on arm64. By overriding the default helpers later in this
series, arm64 will be able to support the current HVO implementation.
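For illustration only (this is not the arm64 code added later in the
series): an architecture would typically override the no-op defaults
from an arch header that mm/hugetlb_vmemmap.c ends up including, using
the usual define-to-itself pattern. A hypothetical sketch:

	#define vmemmap_update_lock vmemmap_update_lock
	static inline void vmemmap_update_lock(void)
	{
		/* e.g. serialize vmemmap updaters against each other */
	}

	#define vmemmap_update_pte_range_start vmemmap_update_pte_range_start
	static inline void vmemmap_update_pte_range_start(pte_t *pte,
				unsigned long start, unsigned long end)
	{
		/* e.g. pause remote CPUs, then clear and flush the old PTEs (BBM) */
	}

	#define vmemmap_update_pte_range_end vmemmap_update_pte_range_end
	static inline void vmemmap_update_pte_range_end(void)
	{
		/* e.g. resume remote CPUs */
	}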
Signed-off-by: Yu Zhao <yuzhao@google.com>
---
include/linux/mm_types.h | 7 +++
mm/hugetlb_vmemmap.c | 99 ++++++++++++++++++++++++++++++++++------
2 files changed, 92 insertions(+), 14 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6e3bdf8e38bc..0f3ae6e173f6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1499,4 +1499,11 @@ enum {
/* See also internal only FOLL flags in mm/internal.h */
};
+/* Skip the TLB flush when we split the PMD */
+#define VMEMMAP_SPLIT_NO_TLB_FLUSH BIT(0)
+/* Skip the TLB flush when we remap the PTE */
+#define VMEMMAP_REMAP_NO_TLB_FLUSH BIT(1)
+/* synchronize_rcu() to avoid writes from page_ref_add_unless() */
+#define VMEMMAP_SYNCHRONIZE_RCU BIT(2)
+
#endif /* _LINUX_MM_TYPES_H */
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 46befab48d41..e50a196399f5 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -38,16 +38,56 @@ struct vmemmap_remap_walk {
struct page *reuse_page;
unsigned long reuse_addr;
struct list_head *vmemmap_pages;
-
-/* Skip the TLB flush when we split the PMD */
-#define VMEMMAP_SPLIT_NO_TLB_FLUSH BIT(0)
-/* Skip the TLB flush when we remap the PTE */
-#define VMEMMAP_REMAP_NO_TLB_FLUSH BIT(1)
-/* synchronize_rcu() to avoid writes from page_ref_add_unless() */
-#define VMEMMAP_SYNCHRONIZE_RCU BIT(2)
unsigned long flags;
};
+#ifndef VMEMMAP_ARCH_TLB_FLUSH_FLAGS
+#define VMEMMAP_ARCH_TLB_FLUSH_FLAGS 0
+#endif
+
+#ifndef vmemmap_update_supported
+static bool vmemmap_update_supported(void)
+{
+ return true;
+}
+#endif
+
+#ifndef vmemmap_update_lock
+static void vmemmap_update_lock(void)
+{
+}
+#endif
+
+#ifndef vmemmap_update_unlock
+static void vmemmap_update_unlock(void)
+{
+}
+#endif
+
+#ifndef vmemmap_update_pte_range_start
+static void vmemmap_update_pte_range_start(pte_t *pte, unsigned long start, unsigned long end)
+{
+}
+#endif
+
+#ifndef vmemmap_update_pte_range_end
+static void vmemmap_update_pte_range_end(void)
+{
+}
+#endif
+
+#ifndef vmemmap_update_pmd_range_start
+static void vmemmap_update_pmd_range_start(pmd_t *pmd, unsigned long start, unsigned long end)
+{
+}
+#endif
+
+#ifndef vmemmap_update_pmd_range_end
+static void vmemmap_update_pmd_range_end(void)
+{
+}
+#endif
+
static int vmemmap_split_pmd(pmd_t *pmd, struct page *head, unsigned long start,
struct vmemmap_remap_walk *walk)
{
@@ -83,7 +123,9 @@ static int vmemmap_split_pmd(pmd_t *pmd, struct page *head, unsigned long start,
/* Make pte visible before pmd. See comment in pmd_install(). */
smp_wmb();
+ vmemmap_update_pmd_range_start(pmd, start, start + PMD_SIZE);
pmd_populate_kernel(&init_mm, pmd, pgtable);
+ vmemmap_update_pmd_range_end();
if (!(walk->flags & VMEMMAP_SPLIT_NO_TLB_FLUSH))
flush_tlb_kernel_range(start, start + PMD_SIZE);
} else {
@@ -164,10 +206,12 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
VM_BUG_ON(!PAGE_ALIGNED(start | end));
+ vmemmap_update_lock();
mmap_read_lock(&init_mm);
ret = walk_page_range_novma(&init_mm, start, end, &vmemmap_remap_ops,
NULL, walk);
mmap_read_unlock(&init_mm);
+ vmemmap_update_unlock();
if (ret)
return ret;
@@ -228,6 +272,8 @@ static void vmemmap_remap_pte_range(pte_t *pte, unsigned long start, unsigned lo
smp_wmb();
}
+ vmemmap_update_pte_range_start(pte, start, end);
+
for (i = 0; i < nr_pages; i++) {
pte_t val;
@@ -242,6 +288,8 @@ static void vmemmap_remap_pte_range(pte_t *pte, unsigned long start, unsigned lo
set_pte_at(&init_mm, start + PAGE_SIZE * i, pte + i, val);
}
+
+ vmemmap_update_pte_range_end();
}
/*
@@ -287,6 +335,8 @@ static void vmemmap_restore_pte_range(pte_t *pte, unsigned long start, unsigned
*/
smp_wmb();
+ vmemmap_update_pte_range_start(pte, start, end);
+
for (i = 0; i < nr_pages; i++) {
pte_t val;
@@ -296,6 +346,8 @@ static void vmemmap_restore_pte_range(pte_t *pte, unsigned long start, unsigned
val = mk_pte(page, PAGE_KERNEL);
set_pte_at(&init_mm, start + PAGE_SIZE * i, pte + i, val);
}
+
+ vmemmap_update_pte_range_end();
}
/**
@@ -513,7 +565,8 @@ static int __hugetlb_vmemmap_restore_folio(const struct hstate *h,
*/
int hugetlb_vmemmap_restore_folio(const struct hstate *h, struct folio *folio)
{
- return __hugetlb_vmemmap_restore_folio(h, folio, VMEMMAP_SYNCHRONIZE_RCU);
+ return __hugetlb_vmemmap_restore_folio(h, folio,
+ VMEMMAP_SYNCHRONIZE_RCU | VMEMMAP_ARCH_TLB_FLUSH_FLAGS);
}
/**
@@ -553,7 +606,7 @@ long hugetlb_vmemmap_restore_folios(const struct hstate *h,
list_move(&folio->lru, non_hvo_folios);
}
- if (restored)
+ if (restored && !(VMEMMAP_ARCH_TLB_FLUSH_FLAGS & VMEMMAP_REMAP_NO_TLB_FLUSH))
flush_tlb_all();
if (!ret)
ret = restored;
@@ -641,7 +694,8 @@ void hugetlb_vmemmap_optimize_folio(const struct hstate *h, struct folio *folio)
{
LIST_HEAD(vmemmap_pages);
- __hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages, VMEMMAP_SYNCHRONIZE_RCU);
+ __hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages,
+ VMEMMAP_SYNCHRONIZE_RCU | VMEMMAP_ARCH_TLB_FLUSH_FLAGS);
free_vmemmap_page_list(&vmemmap_pages);
}
@@ -683,7 +737,8 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_l
break;
}
- flush_tlb_all();
+ if (!(VMEMMAP_ARCH_TLB_FLUSH_FLAGS & VMEMMAP_SPLIT_NO_TLB_FLUSH))
+ flush_tlb_all();
list_for_each_entry(folio, folio_list, lru) {
int ret;
@@ -701,24 +756,35 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_l
* allowing more vmemmap remaps to occur.
*/
if (ret == -ENOMEM && !list_empty(&vmemmap_pages)) {
- flush_tlb_all();
+ if (!(VMEMMAP_ARCH_TLB_FLUSH_FLAGS & VMEMMAP_REMAP_NO_TLB_FLUSH))
+ flush_tlb_all();
free_vmemmap_page_list(&vmemmap_pages);
INIT_LIST_HEAD(&vmemmap_pages);
__hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages, flags);
}
}
- flush_tlb_all();
+ if (!(VMEMMAP_ARCH_TLB_FLUSH_FLAGS & VMEMMAP_REMAP_NO_TLB_FLUSH))
+ flush_tlb_all();
free_vmemmap_page_list(&vmemmap_pages);
}
+static int hugetlb_vmemmap_sysctl(const struct ctl_table *ctl, int write,
+ void *buffer, size_t *lenp, loff_t *ppos)
+{
+ if (!vmemmap_update_supported())
+ return -ENODEV;
+
+ return proc_dobool(ctl, write, buffer, lenp, ppos);
+}
+
static struct ctl_table hugetlb_vmemmap_sysctls[] = {
{
.procname = "hugetlb_optimize_vmemmap",
.data = &vmemmap_optimize_enabled,
.maxlen = sizeof(vmemmap_optimize_enabled),
.mode = 0644,
- .proc_handler = proc_dobool,
+ .proc_handler = hugetlb_vmemmap_sysctl,
},
};
@@ -729,6 +795,11 @@ static int __init hugetlb_vmemmap_init(void)
/* HUGETLB_VMEMMAP_RESERVE_SIZE should cover all used struct pages */
BUILD_BUG_ON(__NR_USED_SUBPAGE > HUGETLB_VMEMMAP_RESERVE_PAGES);
+ if (READ_ONCE(vmemmap_optimize_enabled) && !vmemmap_update_supported()) {
+ pr_warn("HugeTLB: disabling HVO due to missing support.\n");
+ WRITE_ONCE(vmemmap_optimize_enabled, false);
+ }
+
for_each_hstate(h) {
if (hugetlb_vmemmap_optimizable(h)) {
register_sysctl_init("vm", hugetlb_vmemmap_sysctls);
--
2.47.0.rc1.288.g06298d1525-goog
* [PATCH v1 3/6] irqchip/gic-v3: support SGI broadcast
2024-10-21 4:22 [PATCH v1 0/6] mm/arm64: re-enable HVO Yu Zhao
2024-10-21 4:22 ` [PATCH v1 1/6] mm/hugetlb_vmemmap: batch update PTEs Yu Zhao
2024-10-21 4:22 ` [PATCH v1 2/6] mm/hugetlb_vmemmap: add arch-independent helpers Yu Zhao
@ 2024-10-21 4:22 ` Yu Zhao
2024-10-22 0:24 ` kernel test robot
2024-10-22 15:03 ` Marc Zyngier
2024-10-21 4:22 ` [PATCH v1 4/6] arm64: broadcast IPIs to pause remote CPUs Yu Zhao
` (2 subsequent siblings)
5 siblings, 2 replies; 18+ messages in thread
From: Yu Zhao @ 2024-10-21 4:22 UTC (permalink / raw)
To: Andrew Morton, Catalin Marinas, Marc Zyngier, Muchun Song,
Thomas Gleixner, Will Deacon
Cc: Douglas Anderson, Mark Rutland, Nanyong Sun, linux-arm-kernel,
linux-kernel, linux-mm, Yu Zhao
GIC v3 and later support SGI broadcast, i.e., the mode that routes
interrupts to all PEs in the system excluding the local CPU.
Supporting this mode can avoid looping through all the remote CPUs
when broadcasting SGIs, especially for systems with 200+ CPUs. The
performance improvement can be measured with the rest of this series
booted with "hugetlb_free_vmemmap=on irqchip.gicv3_pseudo_nmi=1":
cd /sys/kernel/mm/hugepages/
echo 600 >hugepages-1048576kB/nr_hugepages
echo 2048kB >hugepages-1048576kB/demote_size
perf record -g -- bash -c "echo 600 >hugepages-1048576kB/demote"
          gic_ipi_send_mask()    bash sys time
  Before: 38.14%                 0m10.513s
  After:   0.20%                 0m5.132s
Signed-off-by: Yu Zhao <yuzhao@google.com>
---
drivers/irqchip/irq-gic-v3.c | 20 +++++++++++++++++++-
1 file changed, 19 insertions(+), 1 deletion(-)
diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
index ce87205e3e82..42c39385e1b9 100644
--- a/drivers/irqchip/irq-gic-v3.c
+++ b/drivers/irqchip/irq-gic-v3.c
@@ -1394,9 +1394,20 @@ static void gic_send_sgi(u64 cluster_id, u16 tlist, unsigned int irq)
gic_write_sgi1r(val);
}
+static void gic_broadcast_sgi(unsigned int irq)
+{
+ u64 val;
+
+ val = BIT(ICC_SGI1R_IRQ_ROUTING_MODE_BIT) | (irq << ICC_SGI1R_SGI_ID_SHIFT);
+
+ pr_devel("CPU %d: broadcasting SGI %u\n", smp_processor_id(), irq);
+ gic_write_sgi1r(val);
+}
+
static void gic_ipi_send_mask(struct irq_data *d, const struct cpumask *mask)
{
int cpu;
+ cpumask_t broadcast;
if (WARN_ON(d->hwirq >= 16))
return;
@@ -1407,6 +1418,13 @@ static void gic_ipi_send_mask(struct irq_data *d, const struct cpumask *mask)
*/
dsb(ishst);
+ cpumask_copy(&broadcast, cpu_present_mask);
+ cpumask_clear_cpu(smp_processor_id(), &broadcast);
+ if (cpumask_equal(&broadcast, mask)) {
+ gic_broadcast_sgi(d->hwirq);
+ goto done;
+ }
+
for_each_cpu(cpu, mask) {
u64 cluster_id = MPIDR_TO_SGI_CLUSTER_ID(gic_cpu_to_affinity(cpu));
u16 tlist;
@@ -1414,7 +1432,7 @@ static void gic_ipi_send_mask(struct irq_data *d, const struct cpumask *mask)
tlist = gic_compute_target_list(&cpu, mask, cluster_id);
gic_send_sgi(cluster_id, tlist, d->hwirq);
}
-
+done:
/* Force the above writes to ICC_SGI1R_EL1 to be executed */
isb();
}
--
2.47.0.rc1.288.g06298d1525-goog
* Re: [PATCH v1 3/6] irqchip/gic-v3: support SGI broadcast
2024-10-21 4:22 ` [PATCH v1 3/6] irqchip/gic-v3: support SGI broadcast Yu Zhao
@ 2024-10-22 0:24 ` kernel test robot
2024-10-22 15:03 ` Marc Zyngier
1 sibling, 0 replies; 18+ messages in thread
From: kernel test robot @ 2024-10-22 0:24 UTC (permalink / raw)
To: Yu Zhao, Andrew Morton, Catalin Marinas, Marc Zyngier,
Muchun Song, Thomas Gleixner, Will Deacon
Cc: llvm, oe-kbuild-all, Linux Memory Management List,
Douglas Anderson, Mark Rutland, Nanyong Sun, linux-arm-kernel,
linux-kernel, Yu Zhao
Hi Yu,
kernel test robot noticed the following build warnings:
[auto build test WARNING on 42f7652d3eb527d03665b09edac47f85fb600924]
url: https://github.com/intel-lab-lkp/linux/commits/Yu-Zhao/mm-hugetlb_vmemmap-batch-update-PTEs/20241021-122330
base: 42f7652d3eb527d03665b09edac47f85fb600924
patch link: https://lore.kernel.org/r/20241021042218.746659-4-yuzhao%40google.com
patch subject: [PATCH v1 3/6] irqchip/gic-v3: support SGI broadcast
config: arm-defconfig (https://download.01.org/0day-ci/archive/20241022/202410221026.yJKHaGWA-lkp@intel.com/config)
compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project f28c006a5895fc0e329fe15fead81e37457cb1d1)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241022/202410221026.yJKHaGWA-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202410221026.yJKHaGWA-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> drivers/irqchip/irq-gic-v3.c:1401:8: warning: shift count >= width of type [-Wshift-count-overflow]
val = BIT(ICC_SGI1R_IRQ_ROUTING_MODE_BIT) | (irq << ICC_SGI1R_SGI_ID_SHIFT);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/vdso/bits.h:7:26: note: expanded from macro 'BIT'
#define BIT(nr) (UL(1) << (nr))
^ ~~~~
1 warning generated.
vim +1401 drivers/irqchip/irq-gic-v3.c
1396
1397 static void gic_broadcast_sgi(unsigned int irq)
1398 {
1399 u64 val;
1400
> 1401 val = BIT(ICC_SGI1R_IRQ_ROUTING_MODE_BIT) | (irq << ICC_SGI1R_SGI_ID_SHIFT);
1402
1403 pr_devel("CPU %d: broadcasting SGI %u\n", smp_processor_id(), irq);
1404 gic_write_sgi1r(val);
1405 }
1406
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH v1 3/6] irqchip/gic-v3: support SGI broadcast
2024-10-21 4:22 ` [PATCH v1 3/6] irqchip/gic-v3: support SGI broadcast Yu Zhao
2024-10-22 0:24 ` kernel test robot
@ 2024-10-22 15:03 ` Marc Zyngier
2024-10-25 5:07 ` Yu Zhao
1 sibling, 1 reply; 18+ messages in thread
From: Marc Zyngier @ 2024-10-22 15:03 UTC (permalink / raw)
To: Yu Zhao
Cc: Andrew Morton, Catalin Marinas, Muchun Song, Thomas Gleixner,
Will Deacon, Douglas Anderson, Mark Rutland, Nanyong Sun,
linux-arm-kernel, linux-kernel, linux-mm
On Mon, 21 Oct 2024 05:22:15 +0100,
Yu Zhao <yuzhao@google.com> wrote:
>
> GIC v3 and later support SGI broadcast, i.e., the mode that routes
> interrupts to all PEs in the system excluding the local CPU.
>
> Supporting this mode can avoid looping through all the remote CPUs
> when broadcasting SGIs, especially for systems with 200+ CPUs. The
> performance improvement can be measured with the rest of this series
> booted with "hugetlb_free_vmemmap=on irqchip.gicv3_pseudo_nmi=1":
>
> cd /sys/kernel/mm/hugepages/
> echo 600 >hugepages-1048576kB/nr_hugepages
> echo 2048kB >hugepages-1048576kB/demote_size
> perf record -g -- bash -c "echo 600 >hugepages-1048576kB/demote"
>
> gic_ipi_send_mask() bash sys time
> Before: 38.14% 0m10.513s
> After: 0.20% 0m5.132s
>
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> ---
> drivers/irqchip/irq-gic-v3.c | 20 +++++++++++++++++++-
> 1 file changed, 19 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
> index ce87205e3e82..42c39385e1b9 100644
> --- a/drivers/irqchip/irq-gic-v3.c
> +++ b/drivers/irqchip/irq-gic-v3.c
> @@ -1394,9 +1394,20 @@ static void gic_send_sgi(u64 cluster_id, u16 tlist, unsigned int irq)
> gic_write_sgi1r(val);
> }
>
> +static void gic_broadcast_sgi(unsigned int irq)
> +{
> + u64 val;
> +
> + val = BIT(ICC_SGI1R_IRQ_ROUTING_MODE_BIT) | (irq << ICC_SGI1R_SGI_ID_SHIFT);
As picked up by the test bot, please fix the 32bit build.
> +
> + pr_devel("CPU %d: broadcasting SGI %u\n", smp_processor_id(), irq);
> + gic_write_sgi1r(val);
> +}
> +
> static void gic_ipi_send_mask(struct irq_data *d, const struct cpumask *mask)
> {
> int cpu;
> + cpumask_t broadcast;
>
> if (WARN_ON(d->hwirq >= 16))
> return;
> @@ -1407,6 +1418,13 @@ static void gic_ipi_send_mask(struct irq_data *d, const struct cpumask *mask)
> */
> dsb(ishst);
>
> + cpumask_copy(&broadcast, cpu_present_mask);
Why cpu_present_mask? I'd expect that cpu_online_mask should be the
correct mask to use -- we don't IPI offline CPUs, in general.
> + cpumask_clear_cpu(smp_processor_id(), &broadcast);
> + if (cpumask_equal(&broadcast, mask)) {
> + gic_broadcast_sgi(d->hwirq);
> + goto done;
> + }
So the (valid) case where you would IPI *everyone* is not handled as a
fast path? That seems a missed opportunity.
This also seems like an expensive way to do it. How about something
like:
int mcnt = cpumask_weight(mask);
int ocnt = cpumask_weight(cpu_online_mask);
if (mcnt == ocnt) {
/* Broadcast to all CPUs including self */
} else if (mcnt == (ocnt - 1) &&
!cpumask_test_cpu(smp_processor_id(), mask)) {
/* Broadcast to all but self */
}
which avoids the copy+update_full compare.
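For illustration only (not code from the series or from this review): a
rough sketch of how those checks might slot into gic_ipi_send_mask(),
reusing gic_broadcast_sgi() from the patch; sending the extra SGI to the
local CPU for the "including self" case is an assumption:

	int this_cpu = smp_processor_id();
	unsigned int mcnt = cpumask_weight(mask);
	unsigned int ocnt = cpumask_weight(cpu_online_mask);

	if (mcnt == ocnt ||
	    (mcnt == ocnt - 1 && !cpumask_test_cpu(this_cpu, mask))) {
		/* Everyone but self gets a single broadcast SGI. */
		gic_broadcast_sgi(d->hwirq);
		if (mcnt == ocnt) {
			/* ... plus one targeted SGI for the local CPU. */
			u64 mpidr = gic_cpu_to_affinity(this_cpu);

			gic_send_sgi(MPIDR_TO_SGI_CLUSTER_ID(mpidr),
				     1 << (mpidr & 0xf), d->hwirq);
		}
		goto done;
	}
	/* Otherwise fall back to the existing per-cluster loop. */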
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
* Re: [PATCH v1 3/6] irqchip/gic-v3: support SGI broadcast
2024-10-22 15:03 ` Marc Zyngier
@ 2024-10-25 5:07 ` Yu Zhao
2024-10-25 16:14 ` Marc Zyngier
0 siblings, 1 reply; 18+ messages in thread
From: Yu Zhao @ 2024-10-25 5:07 UTC (permalink / raw)
To: Marc Zyngier
Cc: Andrew Morton, Catalin Marinas, Muchun Song, Thomas Gleixner,
Will Deacon, Douglas Anderson, Mark Rutland, Nanyong Sun,
linux-arm-kernel, linux-kernel, linux-mm
Hi Marc,
On Tue, Oct 22, 2024 at 9:03 AM Marc Zyngier <maz@kernel.org> wrote:
>
> On Mon, 21 Oct 2024 05:22:15 +0100,
> Yu Zhao <yuzhao@google.com> wrote:
> >
> > GIC v3 and later support SGI broadcast, i.e., the mode that routes
> > interrupts to all PEs in the system excluding the local CPU.
> >
> > Supporting this mode can avoid looping through all the remote CPUs
> > when broadcasting SGIs, especially for systems with 200+ CPUs. The
> > performance improvement can be measured with the rest of this series
> > booted with "hugetlb_free_vmemmap=on irqchip.gicv3_pseudo_nmi=1":
> >
> > cd /sys/kernel/mm/hugepages/
> > echo 600 >hugepages-1048576kB/nr_hugepages
> > echo 2048kB >hugepages-1048576kB/demote_size
> > perf record -g -- bash -c "echo 600 >hugepages-1048576kB/demote"
> >
> > gic_ipi_send_mask() bash sys time
> > Before: 38.14% 0m10.513s
> > After: 0.20% 0m5.132s
> >
> > Signed-off-by: Yu Zhao <yuzhao@google.com>
> > ---
> > drivers/irqchip/irq-gic-v3.c | 20 +++++++++++++++++++-
> > 1 file changed, 19 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
> > index ce87205e3e82..42c39385e1b9 100644
> > --- a/drivers/irqchip/irq-gic-v3.c
> > +++ b/drivers/irqchip/irq-gic-v3.c
> > @@ -1394,9 +1394,20 @@ static void gic_send_sgi(u64 cluster_id, u16 tlist, unsigned int irq)
> > gic_write_sgi1r(val);
> > }
> >
> > +static void gic_broadcast_sgi(unsigned int irq)
> > +{
> > + u64 val;
> > +
> > + val = BIT(ICC_SGI1R_IRQ_ROUTING_MODE_BIT) | (irq << ICC_SGI1R_SGI_ID_SHIFT);
>
> As picked up by the test bot, please fix the 32bit build.
Will do.
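(Illustrative only: one way to avoid the 32-bit overflow is to build the
value with the 64-bit helper, since ICC_SGI1R_IRQ_ROUTING_MODE_BIT is
bit 40 and BIT() expands to an unsigned long shift; the actual fix in a
later revision may differ:

	val = BIT_ULL(ICC_SGI1R_IRQ_ROUTING_MODE_BIT) |
	      (irq << ICC_SGI1R_SGI_ID_SHIFT);
)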
> > +
> > + pr_devel("CPU %d: broadcasting SGI %u\n", smp_processor_id(), irq);
> > + gic_write_sgi1r(val);
> > +}
> > +
> > static void gic_ipi_send_mask(struct irq_data *d, const struct cpumask *mask)
> > {
> > int cpu;
> > + cpumask_t broadcast;
> >
> > if (WARN_ON(d->hwirq >= 16))
> > return;
> > @@ -1407,6 +1418,13 @@ static void gic_ipi_send_mask(struct irq_data *d, const struct cpumask *mask)
> > */
> > dsb(ishst);
> >
> > + cpumask_copy(&broadcast, cpu_present_mask);
>
> Why cpu_present_mask? I'd expect that cpu_online_mask should be the
> correct mask to use -- we don't IPI offline CPUs, in general.
This is exactly because "we don't IPI offline CPUs, in general",
assuming "we" means the kernel, not GIC.
My interpretation of what the GIC spec says ("0b1: Interrupts routed
to all PEs in the system, excluding self") is that it broadcasts IPIs to
"cpu_present_mask" (minus the local one). So if the kernel uses
"cpu_online_mask" here, GIC would send IPIs to offline CPUs
(cpu_present_mask ^ cpu_online_mask), which I don't know whether it's
a defined behavior.
But if you actually meant GIC doesn't IPI offline CPUs, then yes, here
the kernel should use "cpu_online_mask".
> > + cpumask_clear_cpu(smp_processor_id(), &broadcast);
> > + if (cpumask_equal(&broadcast, mask)) {
> > + gic_broadcast_sgi(d->hwirq);
> > + goto done;
> > + }
>
> So the (valid) case where you would IPI *everyone* is not handled as a
> fast path? That seems a missed opportunity.
You are right: it should handle that case.
> This also seem an like expensive way to do it. How about something
> like:
>
> int mcnt = cpumask_weight(mask);
> int ocnt = cpumask_weight(cpu_online_mask);
> if (mcnt == ocnt) {
> /* Broadcast to all CPUs including self */
Does the comment mean the following two steps?
1. Broadcasting to everyone else.
2. Sending to self.
My understanding of the "Interrupt Routing Mode" is that it can't
broadcast to all CPUs including self, and therefore we need the above
two steps, which still can be a lot faster. Is my understanding
correct?
> } else if (mcnt == (ocnt - 1) &&
> !cpumask_test_cpu(smp_processor_id(), mask)) {
> /* Broadcast to all but self */
> }
>
> which avoids the copy+update_full compare.
Thank you.
* Re: [PATCH v1 3/6] irqchip/gic-v3: support SGI broadcast
2024-10-25 5:07 ` Yu Zhao
@ 2024-10-25 16:14 ` Marc Zyngier
2024-10-25 17:31 ` Yu Zhao
0 siblings, 1 reply; 18+ messages in thread
From: Marc Zyngier @ 2024-10-25 16:14 UTC (permalink / raw)
To: Yu Zhao
Cc: Andrew Morton, Catalin Marinas, Muchun Song, Thomas Gleixner,
Will Deacon, Douglas Anderson, Mark Rutland, Nanyong Sun,
linux-arm-kernel, linux-kernel, linux-mm
On Fri, 25 Oct 2024 06:07:45 +0100,
Yu Zhao <yuzhao@google.com> wrote:
>
> Hi Marc,
>
> On Tue, Oct 22, 2024 at 9:03 AM Marc Zyngier <maz@kernel.org> wrote:
> >
> > On Mon, 21 Oct 2024 05:22:15 +0100,
> > Yu Zhao <yuzhao@google.com> wrote:
> > >
> > > @@ -1407,6 +1418,13 @@ static void gic_ipi_send_mask(struct irq_data *d, const struct cpumask *mask)
> > > */
> > > dsb(ishst);
> > >
> > > + cpumask_copy(&broadcast, cpu_present_mask);
> >
> > Why cpu_present_mask? I'd expect that cpu_online_mask should be the
> > correct mask to use -- we don't IPI offline CPUs, in general.
>
> This is exactly because "we don't IPI offline CPUs, in general",
> assuming "we" means the kernel, not GIC.
>
> My interpretation of what the GIC spec says ("0b1: Interrupts routed
> to all PEs in the system, excluding self") is that it broadcasts IPIs to
> "cpu_present_mask" (minus the local one). So if the kernel uses
> "cpu_online_mask" here, GIC would send IPIs to offline CPUs
> (cpu_present_mask ^ cpu_online_mask), which I don't know whether it's
> a defined behavior.
Offline CPUs are not known to the kernel. Most likely, they are either
powered off, or spending quality time in Secure or Realm mode. Either
way, this is none of our business.
Your approach would also make the kernel perform pretty inconsistently
depending on whether CPUs are offline or not.
>
> But if you actually meant GIC doesn't IPI offline CPUs, then yes, here
> the kernel should use "cpu_online_mask".
>
> > > + cpumask_clear_cpu(smp_processor_id(), &broadcast);
> > > + if (cpumask_equal(&broadcast, mask)) {
> > > + gic_broadcast_sgi(d->hwirq);
> > > + goto done;
> > > + }
> >
> > So the (valid) case where you would IPI *everyone* is not handled as a
> > fast path? That seems a missed opportunity.
>
> You are right: it should handle that case.
>
> > This also seem an like expensive way to do it. How about something
> > like:
> >
> > int mcnt = cpumask_weight(mask);
> > int ocnt = cpumask_weight(cpu_online_mask);
> > if (mcnt == ocnt) {
> > /* Broadcast to all CPUs including self */
>
> Does the comment mean the following two steps?
> 1. Broadcasting to everyone else.
> 2. Sending to self.
Correct.
> My understanding of the "Interrupt Routing Mode" is that it can't
> broadcast to all CPUs including self, and therefore we need the above
> two steps, which still can be a lot faster. Is my understanding
> correct?
Yes.
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
* Re: [PATCH v1 3/6] irqchip/gic-v3: support SGI broadcast
2024-10-25 16:14 ` Marc Zyngier
@ 2024-10-25 17:31 ` Yu Zhao
2024-10-29 19:02 ` Marc Zyngier
0 siblings, 1 reply; 18+ messages in thread
From: Yu Zhao @ 2024-10-25 17:31 UTC (permalink / raw)
To: Marc Zyngier
Cc: Andrew Morton, Catalin Marinas, Muchun Song, Thomas Gleixner,
Will Deacon, Douglas Anderson, Mark Rutland, Nanyong Sun,
linux-arm-kernel, linux-kernel, linux-mm
On Fri, Oct 25, 2024 at 10:15 AM Marc Zyngier <maz@kernel.org> wrote:
>
> On Fri, 25 Oct 2024 06:07:45 +0100,
> Yu Zhao <yuzhao@google.com> wrote:
> >
> > Hi Marc,
> >
> > On Tue, Oct 22, 2024 at 9:03 AM Marc Zyngier <maz@kernel.org> wrote:
> > >
> > > On Mon, 21 Oct 2024 05:22:15 +0100,
> > > Yu Zhao <yuzhao@google.com> wrote:
> > > >
> > > > @@ -1407,6 +1418,13 @@ static void gic_ipi_send_mask(struct irq_data *d, const struct cpumask *mask)
> > > > */
> > > > dsb(ishst);
> > > >
> > > > + cpumask_copy(&broadcast, cpu_present_mask);
> > >
> > > Why cpu_present_mask? I'd expect that cpu_online_mask should be the
> > > correct mask to use -- we don't IPI offline CPUs, in general.
> >
> > This is exactly because "we don't IPI offline CPUs, in general",
> > assuming "we" means the kernel, not GIC.
> >
> > My interpretation of what the GIC spec says ("0b1: Interrupts routed
> > to all PEs in the system, excluding self") is that it broadcasts IPIs to
> > "cpu_present_mask" (minus the local one). So if the kernel uses
> > "cpu_online_mask" here, GIC would send IPIs to offline CPUs
> > (cpu_present_mask ^ cpu_online_mask), which I don't know whether it's
> > a defined behavior.
Thanks for clarifying.
> Offline CPUs are not known to the kernel.
I assume it wouldn't matter to firmware either, correct? IOW, we
wouldn't cause firmware any trouble by letting GIC send IPIs to
(cpu_present_mask ^ cpu_online_mask), assuming those two masks can be
different on arm64 when hotplug is enabled?
> Most likely, they are either
> powered off, or spending quality time in Secure or Realm mode. Either
> way, this is none of our business.
>
> Your approach would make also the kernel perform pretty inconsistently
> depending on whether CPUs are offline and not.
>
> >
> > But if you actually meant GIC doesn't IPI offline CPUs, then yes, here
> > the kernel should use "cpu_online_mask".
> >
> > > > + cpumask_clear_cpu(smp_processor_id(), &broadcast);
> > > > + if (cpumask_equal(&broadcast, mask)) {
> > > > + gic_broadcast_sgi(d->hwirq);
> > > > + goto done;
> > > > + }
> > >
> > > So the (valid) case where you would IPI *everyone* is not handled as a
> > > fast path? That seems a missed opportunity.
> >
> > You are right: it should handle that case.
> >
> > > This also seem an like expensive way to do it. How about something
> > > like:
> > >
> > > int mcnt = cpumask_weight(mask);
> > > int ocnt = cpumask_weight(cpu_online_mask);
> > > if (mcnt == ocnt) {
> > > /* Broadcast to all CPUs including self */
> >
> > Does the comment mean the following two steps?
> > 1. Broadcasting to everyone else.
> > 2. Sending to self.
>
> Correct.
>
> > My understanding of the "Interrupt Routing Mode" is that it can't
> > broadcast to all CPUs including self, and therefore we need the above
> > two steps, which still can be a lot faster. Is my understanding
> > correct?
>
> Yes.
>
> Thanks,
>
> M.
>
> --
> Without deviation from the norm, progress is not possible.
* Re: [PATCH v1 3/6] irqchip/gic-v3: support SGI broadcast
2024-10-25 17:31 ` Yu Zhao
@ 2024-10-29 19:02 ` Marc Zyngier
2024-10-29 19:53 ` Yu Zhao
0 siblings, 1 reply; 18+ messages in thread
From: Marc Zyngier @ 2024-10-29 19:02 UTC (permalink / raw)
To: Yu Zhao
Cc: Andrew Morton, Catalin Marinas, Muchun Song, Thomas Gleixner,
Will Deacon, Douglas Anderson, Mark Rutland, Nanyong Sun,
linux-arm-kernel, linux-kernel, linux-mm
On Fri, 25 Oct 2024 18:31:01 +0100,
Yu Zhao <yuzhao@google.com> wrote:
>
> On Fri, Oct 25, 2024 at 10:15 AM Marc Zyngier <maz@kernel.org> wrote:
> >
> > On Fri, 25 Oct 2024 06:07:45 +0100,
> > Yu Zhao <yuzhao@google.com> wrote:
> > >
> > > Hi Marc,
> > >
> > > On Tue, Oct 22, 2024 at 9:03 AM Marc Zyngier <maz@kernel.org> wrote:
> > > >
> > > > On Mon, 21 Oct 2024 05:22:15 +0100,
> > > > Yu Zhao <yuzhao@google.com> wrote:
> > > > >
> > > > > @@ -1407,6 +1418,13 @@ static void gic_ipi_send_mask(struct irq_data *d, const struct cpumask *mask)
> > > > > */
> > > > > dsb(ishst);
> > > > >
> > > > > + cpumask_copy(&broadcast, cpu_present_mask);
> > > >
> > > > Why cpu_present_mask? I'd expect that cpu_online_mask should be the
> > > > correct mask to use -- we don't IPI offline CPUs, in general.
> > >
> > > This is exactly because "we don't IPI offline CPUs, in general",
> > > assuming "we" means the kernel, not GIC.
> > >
> > > My interpretation of what the GIC spec says ("0b1: Interrupts routed
> > > to all PEs in the system, excluding self") is that it broadcasts IPIs to
> > > "cpu_present_mask" (minus the local one). So if the kernel uses
> > > "cpu_online_mask" here, GIC would send IPIs to offline CPUs
> > > (cpu_present_mask ^ cpu_online_mask), which I don't know whether it's
> > > a defined behavior.
>
> Thanks for clarifying.
>
> > Offline CPUs are not known to the kernel.
>
> I assume it wouldn't matter to firmware either, correct? IOW, we
Firmware is on the secure side of the stack.
> wouldn't cause firmware any trouble by letting GIC send IPIs to
> (cpu_present_mask ^ cpu_online_mask), assuming those two masks can be
> different on arm64 when hotplug is enabled?
You can't send SGIs from non-secure to secure using ICC_SGI1R_EL1. You
would need to use ICC_ASGI1R_EL1, and have secure to allow such
brokenness via a configuration of GICR_NSACR. Linux doesn't use the
former, and no sane software touches the latter.
M.
--
Without deviation from the norm, progress is not possible.
* Re: [PATCH v1 3/6] irqchip/gic-v3: support SGI broadcast
2024-10-29 19:02 ` Marc Zyngier
@ 2024-10-29 19:53 ` Yu Zhao
0 siblings, 0 replies; 18+ messages in thread
From: Yu Zhao @ 2024-10-29 19:53 UTC (permalink / raw)
To: Marc Zyngier
Cc: Andrew Morton, Catalin Marinas, Muchun Song, Thomas Gleixner,
Will Deacon, Douglas Anderson, Mark Rutland, Nanyong Sun,
linux-arm-kernel, linux-kernel, linux-mm
On Tue, Oct 29, 2024 at 1:02 PM Marc Zyngier <maz@kernel.org> wrote:
>
> On Fri, 25 Oct 2024 18:31:01 +0100,
> Yu Zhao <yuzhao@google.com> wrote:
> >
> > On Fri, Oct 25, 2024 at 10:15 AM Marc Zyngier <maz@kernel.org> wrote:
> > >
> > > On Fri, 25 Oct 2024 06:07:45 +0100,
> > > Yu Zhao <yuzhao@google.com> wrote:
> > > >
> > > > Hi Marc,
> > > >
> > > > On Tue, Oct 22, 2024 at 9:03 AM Marc Zyngier <maz@kernel.org> wrote:
> > > > >
> > > > > On Mon, 21 Oct 2024 05:22:15 +0100,
> > > > > Yu Zhao <yuzhao@google.com> wrote:
> > > > > >
> > > > > > @@ -1407,6 +1418,13 @@ static void gic_ipi_send_mask(struct irq_data *d, const struct cpumask *mask)
> > > > > > */
> > > > > > dsb(ishst);
> > > > > >
> > > > > > + cpumask_copy(&broadcast, cpu_present_mask);
> > > > >
> > > > > Why cpu_present_mask? I'd expect that cpu_online_mask should be the
> > > > > correct mask to use -- we don't IPI offline CPUs, in general.
> > > >
> > > > This is exactly because "we don't IPI offline CPUs, in general",
> > > > assuming "we" means the kernel, not GIC.
> > > >
> > > > My interpretation of what the GIC spec says ("0b1: Interrupts routed
> > > > to all PEs in the system, excluding self") is that it broadcasts IPIs to
> > > > "cpu_present_mask" (minus the local one). So if the kernel uses
> > > > "cpu_online_mask" here, GIC would send IPIs to offline CPUs
> > > > (cpu_present_mask ^ cpu_online_mask), which I don't know whether it's
> > > > a defined behavior.
> >
> > Thanks for clarifying.
> >
> > > Offline CPUs are not known to the kernel.
> >
> > I assume it wouldn't matter to firmware either, correct? IOW, we
>
> Firmware is on the secure side of the stack.
>
> > wouldn't cause firmware any trouble by letting GIC send IPIs to
> > (cpu_present_mask ^ cpu_online_mask), assuming those two masks can be
> > different on arm64 when hotplug is enabled?
>
> You can't send SGIs from non-secure to secure using ICC_SGI1R_EL1.
Right, this makes sense now.
> You
> would need to use ICC_ASGI1R_EL1, and have secure to allow such
> brokenness via a configuration of GICR_NSACR. Linux doesn't use the
> former, and no sane software touches the latter.
>
> M.
>
> --
> Without deviation from the norm, progress is not possible.
* [PATCH v1 4/6] arm64: broadcast IPIs to pause remote CPUs
2024-10-21 4:22 [PATCH v1 0/6] mm/arm64: re-enable HVO Yu Zhao
` (2 preceding siblings ...)
2024-10-21 4:22 ` [PATCH v1 3/6] irqchip/gic-v3: support SGI broadcast Yu Zhao
@ 2024-10-21 4:22 ` Yu Zhao
2024-10-22 16:15 ` Marc Zyngier
2024-10-21 4:22 ` [PATCH v1 5/6] arm64: pause remote CPUs to update vmemmap Yu Zhao
2024-10-21 4:22 ` [PATCH v1 6/6] arm64: select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP Yu Zhao
5 siblings, 1 reply; 18+ messages in thread
From: Yu Zhao @ 2024-10-21 4:22 UTC (permalink / raw)
To: Andrew Morton, Catalin Marinas, Marc Zyngier, Muchun Song,
Thomas Gleixner, Will Deacon
Cc: Douglas Anderson, Mark Rutland, Nanyong Sun, linux-arm-kernel,
linux-kernel, linux-mm, Yu Zhao
Broadcast pseudo-NMI IPIs to pause remote CPUs for a short period of
time, and then reliably resume them when the local CPU exits critical
sections that preclude the execution of remote CPUs.
A typical example of such critical sections is BBM on kernel PTEs.
HugeTLB Vmemmap Optimization (HVO) on arm64 was disabled by
commit 060a2c92d1b6 ("arm64: mm: hugetlb: Disable
HUGETLB_PAGE_OPTIMIZE_VMEMMAP") due to the folllowing reason:
This is deemed UNPREDICTABLE by the Arm architecture without a
break-before-make sequence (make the PTE invalid, TLBI, write the
new valid PTE). However, such sequence is not possible since the
vmemmap may be concurrently accessed by the kernel.
Supporting BBM on kernel PTEs is one of the approaches that can make
HVO theoretically safe on arm64.
Note that it is still possible for the paused CPUs to perform
speculative translations. Such translations would cause spurious
kernel PFs, which should be properly handled by
is_spurious_el1_translation_fault().
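A rough sketch, for illustration only, of how a BBM sequence on kernel
PTEs could be bracketed by these helpers (the real user is the HVO remap
path later in this series; names like addr, nr_pages and new[] are just
placeholders):

	cpus_read_lock();
	preempt_disable();
	pause_remote_cpus();

	/* Break: invalidate the old entries, then flush the TLB. */
	for (i = 0; i < nr_pages; i++)
		pte_clear(&init_mm, addr + i * PAGE_SIZE, pte + i);
	flush_tlb_kernel_range(addr, addr + nr_pages * PAGE_SIZE);

	/* Make: install the new entries. */
	for (i = 0; i < nr_pages; i++)
		set_pte_at(&init_mm, addr + i * PAGE_SIZE, pte + i, new[i]);

	resume_remote_cpus();
	preempt_enable();
	cpus_read_unlock();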
Signed-off-by: Yu Zhao <yuzhao@google.com>
---
arch/arm64/include/asm/smp.h | 3 ++
arch/arm64/kernel/smp.c | 92 +++++++++++++++++++++++++++++++++---
2 files changed, 88 insertions(+), 7 deletions(-)
diff --git a/arch/arm64/include/asm/smp.h b/arch/arm64/include/asm/smp.h
index 2510eec026f7..cffb0cfed961 100644
--- a/arch/arm64/include/asm/smp.h
+++ b/arch/arm64/include/asm/smp.h
@@ -133,6 +133,9 @@ bool cpus_are_stuck_in_kernel(void);
extern void crash_smp_send_stop(void);
extern bool smp_crash_stop_failed(void);
+void pause_remote_cpus(void);
+void resume_remote_cpus(void);
+
#endif /* ifndef __ASSEMBLY__ */
#endif /* ifndef __ASM_SMP_H */
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 3b3f6b56e733..68829c6de1b1 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -85,7 +85,12 @@ static int ipi_irq_base __ro_after_init;
static int nr_ipi __ro_after_init = NR_IPI;
static struct irq_desc *ipi_desc[MAX_IPI] __ro_after_init;
-static bool crash_stop;
+enum {
+ SEND_STOP = BIT(0),
+ CRASH_STOP = BIT(1),
+};
+
+static unsigned long stop_in_progress;
static void ipi_setup(int cpu);
@@ -917,6 +922,79 @@ static void __noreturn ipi_cpu_crash_stop(unsigned int cpu, struct pt_regs *regs
#endif
}
+static DEFINE_SPINLOCK(cpu_pause_lock);
+static cpumask_t paused_cpus;
+static cpumask_t resumed_cpus;
+
+static void pause_local_cpu(void)
+{
+ int cpu = smp_processor_id();
+
+ cpumask_clear_cpu(cpu, &resumed_cpus);
+ /*
+ * Paired with pause_remote_cpus() to confirm that this CPU not only
+ * will be paused but also can be reliably resumed.
+ */
+ smp_wmb();
+ cpumask_set_cpu(cpu, &paused_cpus);
+ /* paused_cpus must be set before waiting on resumed_cpus. */
+ barrier();
+ while (!cpumask_test_cpu(cpu, &resumed_cpus))
+ cpu_relax();
+ /* A typical example for sleep and wake-up functions. */
+ smp_mb();
+ cpumask_clear_cpu(cpu, &paused_cpus);
+}
+
+void pause_remote_cpus(void)
+{
+ cpumask_t cpus_to_pause;
+
+ lockdep_assert_cpus_held();
+ lockdep_assert_preemption_disabled();
+
+ cpumask_copy(&cpus_to_pause, cpu_online_mask);
+ cpumask_clear_cpu(smp_processor_id(), &cpus_to_pause);
+
+ spin_lock(&cpu_pause_lock);
+
+ WARN_ON_ONCE(!cpumask_empty(&paused_cpus));
+
+ smp_cross_call(&cpus_to_pause, IPI_CPU_STOP_NMI);
+
+ while (!cpumask_equal(&cpus_to_pause, &paused_cpus))
+ cpu_relax();
+ /*
+ * Paired with pause_local_cpu() to confirm that all CPUs not only will
+ * be paused but also can be reliably resumed.
+ */
+ smp_rmb();
+ WARN_ON_ONCE(cpumask_intersects(&cpus_to_pause, &resumed_cpus));
+
+ spin_unlock(&cpu_pause_lock);
+}
+
+void resume_remote_cpus(void)
+{
+ cpumask_t cpus_to_resume;
+
+ lockdep_assert_cpus_held();
+ lockdep_assert_preemption_disabled();
+
+ cpumask_copy(&cpus_to_resume, cpu_online_mask);
+ cpumask_clear_cpu(smp_processor_id(), &cpus_to_resume);
+
+ spin_lock(&cpu_pause_lock);
+
+ cpumask_setall(&resumed_cpus);
+ /* A typical example for sleep and wake-up functions. */
+ smp_mb();
+ while (cpumask_intersects(&cpus_to_resume, &paused_cpus))
+ cpu_relax();
+
+ spin_unlock(&cpu_pause_lock);
+}
+
static void arm64_backtrace_ipi(cpumask_t *mask)
{
__ipi_send_mask(ipi_desc[IPI_CPU_BACKTRACE], mask);
@@ -970,7 +1048,9 @@ static void do_handle_IPI(int ipinr)
case IPI_CPU_STOP:
case IPI_CPU_STOP_NMI:
- if (IS_ENABLED(CONFIG_KEXEC_CORE) && crash_stop) {
+ if (!test_bit(SEND_STOP, &stop_in_progress)) {
+ pause_local_cpu();
+ } else if (test_bit(CRASH_STOP, &stop_in_progress)) {
ipi_cpu_crash_stop(cpu, get_irq_regs());
unreachable();
} else {
@@ -1142,7 +1222,6 @@ static inline unsigned int num_other_online_cpus(void)
void smp_send_stop(void)
{
- static unsigned long stop_in_progress;
cpumask_t mask;
unsigned long timeout;
@@ -1154,7 +1233,7 @@ void smp_send_stop(void)
goto skip_ipi;
/* Only proceed if this is the first CPU to reach this code */
- if (test_and_set_bit(0, &stop_in_progress))
+ if (test_and_set_bit(SEND_STOP, &stop_in_progress))
return;
/*
@@ -1230,12 +1309,11 @@ void crash_smp_send_stop(void)
* This function can be called twice in panic path, but obviously
* we execute this only once.
*
- * We use this same boolean to tell whether the IPI we send was a
+ * We use the CRASH_STOP bit to tell whether the IPI we send was a
* stop or a "crash stop".
*/
- if (crash_stop)
+ if (test_and_set_bit(CRASH_STOP, &stop_in_progress))
return;
- crash_stop = 1;
smp_send_stop();
--
2.47.0.rc1.288.g06298d1525-goog
* Re: [PATCH v1 4/6] arm64: broadcast IPIs to pause remote CPUs
2024-10-21 4:22 ` [PATCH v1 4/6] arm64: broadcast IPIs to pause remote CPUs Yu Zhao
@ 2024-10-22 16:15 ` Marc Zyngier
2024-10-28 22:11 ` Yu Zhao
0 siblings, 1 reply; 18+ messages in thread
From: Marc Zyngier @ 2024-10-22 16:15 UTC (permalink / raw)
To: Yu Zhao
Cc: Andrew Morton, Catalin Marinas, Muchun Song, Thomas Gleixner,
Will Deacon, Douglas Anderson, Mark Rutland, Nanyong Sun,
linux-arm-kernel, linux-kernel, linux-mm
On Mon, 21 Oct 2024 05:22:16 +0100,
Yu Zhao <yuzhao@google.com> wrote:
>
> Broadcast pseudo-NMI IPIs to pause remote CPUs for a short period of
> time, and then reliably resume them when the local CPU exits critical
> sections that preclude the execution of remote CPUs.
>
> A typical example of such critical sections is BBM on kernel PTEs.
> HugeTLB Vmemmap Optimization (HVO) on arm64 was disabled by
> commit 060a2c92d1b6 ("arm64: mm: hugetlb: Disable
> HUGETLB_PAGE_OPTIMIZE_VMEMMAP") due to the folllowing reason:
>
> This is deemed UNPREDICTABLE by the Arm architecture without a
> break-before-make sequence (make the PTE invalid, TLBI, write the
> new valid PTE). However, such sequence is not possible since the
> vmemmap may be concurrently accessed by the kernel.
>
> Supporting BBM on kernel PTEs is one of the approaches that can make
> HVO theoretically safe on arm64.
Is the safety only theoretical? I would have expected that we'd use an
approach that is absolutely rock-solid.
>
> Note that it is still possible for the paused CPUs to perform
> speculative translations. Such translations would cause spurious
> kernel PFs, which should be properly handled by
> is_spurious_el1_translation_fault().
Speculative translation faults are never reported, that'd be a CPU
bug. *Spurious* translation faults can be reported if the CPU doesn't
implement FEAT_ETS2, for example, and that has to do with the ordering
of memory access wrt page-table walking for the purpose of translations.
>
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> ---
> arch/arm64/include/asm/smp.h | 3 ++
> arch/arm64/kernel/smp.c | 92 +++++++++++++++++++++++++++++++++---
> 2 files changed, 88 insertions(+), 7 deletions(-)
>
> diff --git a/arch/arm64/include/asm/smp.h b/arch/arm64/include/asm/smp.h
> index 2510eec026f7..cffb0cfed961 100644
> --- a/arch/arm64/include/asm/smp.h
> +++ b/arch/arm64/include/asm/smp.h
> @@ -133,6 +133,9 @@ bool cpus_are_stuck_in_kernel(void);
> extern void crash_smp_send_stop(void);
> extern bool smp_crash_stop_failed(void);
>
> +void pause_remote_cpus(void);
> +void resume_remote_cpus(void);
> +
> #endif /* ifndef __ASSEMBLY__ */
>
> #endif /* ifndef __ASM_SMP_H */
> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> index 3b3f6b56e733..68829c6de1b1 100644
> --- a/arch/arm64/kernel/smp.c
> +++ b/arch/arm64/kernel/smp.c
> @@ -85,7 +85,12 @@ static int ipi_irq_base __ro_after_init;
> static int nr_ipi __ro_after_init = NR_IPI;
> static struct irq_desc *ipi_desc[MAX_IPI] __ro_after_init;
>
> -static bool crash_stop;
> +enum {
> + SEND_STOP = BIT(0),
> + CRASH_STOP = BIT(1),
> +};
> +
> +static unsigned long stop_in_progress;
>
> static void ipi_setup(int cpu);
>
> @@ -917,6 +922,79 @@ static void __noreturn ipi_cpu_crash_stop(unsigned int cpu, struct pt_regs *regs
> #endif
> }
>
> +static DEFINE_SPINLOCK(cpu_pause_lock);
PREEMPT_RT will turn this into a sleeping lock. Is it safe to sleep as
you are dealing with kernel mappings?
> +static cpumask_t paused_cpus;
> +static cpumask_t resumed_cpus;
> +
> +static void pause_local_cpu(void)
> +{
> + int cpu = smp_processor_id();
> +
> + cpumask_clear_cpu(cpu, &resumed_cpus);
> + /*
> + * Paired with pause_remote_cpus() to confirm that this CPU not only
> + * will be paused but also can be reliably resumed.
> + */
> + smp_wmb();
> + cpumask_set_cpu(cpu, &paused_cpus);
> + /* paused_cpus must be set before waiting on resumed_cpus. */
> + barrier();
I'm not sure what this is trying to enforce. Yes, the compiler won't
reorder the set and the test. But your comment seems to indicate that you
also need to make sure the CPU preserves that ordering, and short of a
DMB, the test below could be reordered.
> + while (!cpumask_test_cpu(cpu, &resumed_cpus))
> + cpu_relax();
> + /* A typical example for sleep and wake-up functions. */
I'm not sure this is "typical",...
> + smp_mb();
> + cpumask_clear_cpu(cpu, &paused_cpus);
> +}
> +
> +void pause_remote_cpus(void)
> +{
> + cpumask_t cpus_to_pause;
> +
> + lockdep_assert_cpus_held();
> + lockdep_assert_preemption_disabled();
> +
> + cpumask_copy(&cpus_to_pause, cpu_online_mask);
> + cpumask_clear_cpu(smp_processor_id(), &cpus_to_pause);
This bitmap is manipulated outside of your cpu_pause_lock. What
guarantees you can't have two CPUs stepping on each other here?
> +
> + spin_lock(&cpu_pause_lock);
> +
> + WARN_ON_ONCE(!cpumask_empty(&paused_cpus));
> +
> + smp_cross_call(&cpus_to_pause, IPI_CPU_STOP_NMI);
> +
> + while (!cpumask_equal(&cpus_to_pause, &paused_cpus))
> + cpu_relax();
This can be a lot of things to compare, especially since you are
explicitly mentioning large systems. Why can't this be implemented as
a counter instead?
Overall, this looks like stop_machine() in disguise. Why can't this
use the existing infrastructure?
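For illustration, a minimal sketch of the counter-based wait suggested
above; this is not code from the series, the names are made up, and the
memory-ordering details are omitted:

	static atomic_t nr_paused;

	/* IPI handler */
	atomic_inc(&nr_paused);
	while (!READ_ONCE(resume_requested))
		cpu_relax();
	atomic_dec(&nr_paused);

	/* initiator, after smp_cross_call() */
	while (atomic_read(&nr_paused) != nr_cpus_to_pause)
		cpu_relax();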
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
* Re: [PATCH v1 4/6] arm64: broadcast IPIs to pause remote CPUs
2024-10-22 16:15 ` Marc Zyngier
@ 2024-10-28 22:11 ` Yu Zhao
2024-10-29 19:36 ` Marc Zyngier
0 siblings, 1 reply; 18+ messages in thread
From: Yu Zhao @ 2024-10-28 22:11 UTC (permalink / raw)
To: Marc Zyngier
Cc: Andrew Morton, Catalin Marinas, Muchun Song, Thomas Gleixner,
Will Deacon, Douglas Anderson, Mark Rutland, Nanyong Sun,
linux-arm-kernel, linux-kernel, linux-mm
On Tue, Oct 22, 2024 at 10:15 AM Marc Zyngier <maz@kernel.org> wrote:
>
> On Mon, 21 Oct 2024 05:22:16 +0100,
> Yu Zhao <yuzhao@google.com> wrote:
> >
> > Broadcast pseudo-NMI IPIs to pause remote CPUs for a short period of
> > time, and then reliably resume them when the local CPU exits critical
> > sections that preclude the execution of remote CPUs.
> >
> > A typical example of such critical sections is BBM on kernel PTEs.
> > HugeTLB Vmemmap Optimization (HVO) on arm64 was disabled by
> > commit 060a2c92d1b6 ("arm64: mm: hugetlb: Disable
> > HUGETLB_PAGE_OPTIMIZE_VMEMMAP") due to the folllowing reason:
> >
> > This is deemed UNPREDICTABLE by the Arm architecture without a
> > break-before-make sequence (make the PTE invalid, TLBI, write the
> > new valid PTE). However, such sequence is not possible since the
> > vmemmap may be concurrently accessed by the kernel.
> >
> > Supporting BBM on kernel PTEs is one of the approaches that can make
> > HVO theoretically safe on arm64.
>
> Is the safety only theoretical? I would have expected that we'd use an
> approach that is absolutely rock-solid.
We've been trying to construct a repro against the original HVO
(missing BBM), but so far no success. Hopefully a repro does exist,
and then we'd be able to demonstrate the effectiveness of this series,
which is only theoretical at the moment.
> > Note that it is still possible for the paused CPUs to perform
> > speculative translations. Such translations would cause spurious
> > kernel PFs, which should be properly handled by
> > is_spurious_el1_translation_fault().
>
> Speculative translation faults are never reported, that'd be a CPU
> bug.
Right, I meant to say "speculative accesses that cause translations".
> *Spurious* translation faults can be reported if the CPU doesn't
> implement FEAT_ETS2, for example, and that has to do with the ordering
> of memory access wrt page-table walking for the purpose of translations.
Just want to make sure I fully understand: after the local CPU sends
TLBI (followed by DSB & ISB), FEAT_ETS2 would act like DSB & ISB on
remote CPUs when they perform the invalidation, and therefore
speculative accesses are ordered as well on remote CPUs.
Also I'm assuming IPIs on remote CPUs don't act like a full barrier,
and whatever they interrupt still can be speculatively executed even
though the IPI handler itself doesn't access the vmemmap area
undergoing BBM. Is this correct?
> > Signed-off-by: Yu Zhao <yuzhao@google.com>
> > ---
> > arch/arm64/include/asm/smp.h | 3 ++
> > arch/arm64/kernel/smp.c | 92 +++++++++++++++++++++++++++++++++---
> > 2 files changed, 88 insertions(+), 7 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/smp.h b/arch/arm64/include/asm/smp.h
> > index 2510eec026f7..cffb0cfed961 100644
> > --- a/arch/arm64/include/asm/smp.h
> > +++ b/arch/arm64/include/asm/smp.h
> > @@ -133,6 +133,9 @@ bool cpus_are_stuck_in_kernel(void);
> > extern void crash_smp_send_stop(void);
> > extern bool smp_crash_stop_failed(void);
> >
> > +void pause_remote_cpus(void);
> > +void resume_remote_cpus(void);
> > +
> > #endif /* ifndef __ASSEMBLY__ */
> >
> > #endif /* ifndef __ASM_SMP_H */
> > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> > index 3b3f6b56e733..68829c6de1b1 100644
> > --- a/arch/arm64/kernel/smp.c
> > +++ b/arch/arm64/kernel/smp.c
> > @@ -85,7 +85,12 @@ static int ipi_irq_base __ro_after_init;
> > static int nr_ipi __ro_after_init = NR_IPI;
> > static struct irq_desc *ipi_desc[MAX_IPI] __ro_after_init;
> >
> > -static bool crash_stop;
> > +enum {
> > + SEND_STOP = BIT(0),
> > + CRASH_STOP = BIT(1),
> > +};
> > +
> > +static unsigned long stop_in_progress;
> >
> > static void ipi_setup(int cpu);
> >
> > @@ -917,6 +922,79 @@ static void __noreturn ipi_cpu_crash_stop(unsigned int cpu, struct pt_regs *regs
> > #endif
> > }
> >
> > +static DEFINE_SPINLOCK(cpu_pause_lock);
>
> PREEMPT_RT will turn this into a sleeping lock. Is it safe to sleep as
> you are dealing with kernel mappings?
Right, it should be a raw spinlock -- the caller disabled preemption,
which as you said is required when dealing with the kernel mappings.
> > +static cpumask_t paused_cpus;
> > +static cpumask_t resumed_cpus;
> > +
> > +static void pause_local_cpu(void)
> > +{
> > + int cpu = smp_processor_id();
> > +
> > + cpumask_clear_cpu(cpu, &resumed_cpus);
> > + /*
> > + * Paired with pause_remote_cpus() to confirm that this CPU not only
> > + * will be paused but also can be reliably resumed.
> > + */
> > + smp_wmb();
> > + cpumask_set_cpu(cpu, &paused_cpus);
> > + /* paused_cpus must be set before waiting on resumed_cpus. */
> > + barrier();
>
> I'm not sure what this is trying to enforce. Yes, the compiler won't
> reorder the set and the test.
Sorry I don't follow: does cpumask_set_cpu(), i.e., set_bit(), already
contain a compiler barrier?
My understanding is that the compiler is free to reorder the set and
test on those two independent variables, and make it like this:
	while (!cpumask_test_cpu(cpu, &resumed_cpus))
		cpu_relax();
	cpumask_set_cpu(cpu, &paused_cpus);
So the CPU that sent the IPI would keep waiting on paused_cpus being set,
and this CPU would keep waiting on resumed_cpus being set, which would
end up with a deadlock.
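For comparison, the intended order (what the patch has, with barrier()
keeping the set ahead of the test) is:

	cpumask_set_cpu(cpu, &paused_cpus);
	/* paused_cpus must be set before waiting on resumed_cpus. */
	barrier();
	while (!cpumask_test_cpu(cpu, &resumed_cpus))
		cpu_relax();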
> But your comment seems to indicate that
> also need to make sure the CPU preserves that ordering
> and short of a
> DMB, the test below could be reordered.
If this CPU reorders the set and test like above, it wouldn't be a
problem because the set would eventually appear on the other CPU that
sent the IPI.
> > + while (!cpumask_test_cpu(cpu, &resumed_cpus))
> > + cpu_relax();
> > + /* A typical example for sleep and wake-up functions. */
>
> I'm not sure this is "typical",...
Sorry, this full barrier isn't needed. Apparently I didn't properly
fix this from the previous attempt to use wfe()/sev() to make this
function the sleeper for resume_remote_cpus() to wake up.
> > + smp_mb();
> > + cpumask_clear_cpu(cpu, &paused_cpus);
> > +}
> > +
> > +void pause_remote_cpus(void)
> > +{
> > + cpumask_t cpus_to_pause;
> > +
> > + lockdep_assert_cpus_held();
> > + lockdep_assert_preemption_disabled();
> > +
> > + cpumask_copy(&cpus_to_pause, cpu_online_mask);
> > + cpumask_clear_cpu(smp_processor_id(), &cpus_to_pause);
>
> This bitmap is manipulated outside of your cpu_pause_lock. What
> guarantees you can't have two CPUs stepping on each other here?
Do you mean cpus_to_pause? If so, that's a local bitmap.
> > +
> > + spin_lock(&cpu_pause_lock);
> > +
> > + WARN_ON_ONCE(!cpumask_empty(&paused_cpus));
> > +
> > + smp_cross_call(&cpus_to_pause, IPI_CPU_STOP_NMI);
> > +
> > + while (!cpumask_equal(&cpus_to_pause, &paused_cpus))
> > + cpu_relax();
>
> This can be a lot of things to compare, specially that you are
> explicitly mentioning large systems. Why can't this be implemented as
> a counter instead?
Agreed - that'd be sufficient and simpler.
> Overall, this looks like stop_machine() in disguise. Why can't this
> use the existing infrastructure?
This came up during the previous discussion [1]. There are
similarities. The main concern with reusing stop_machine() (or part of
its current implementation) is that it may not meet the performance
requirements. Refactoring might be possible; however, it seems to me
(after checking the code again) that such a refactoring is unlikely to
end up with a cleaner or simpler codebase, especially after I got rid
of the CPU masks, as you suggested.
[1] https://lore.kernel.org/all/ZbKjHHeEdFYY1xR5@arm.com/

For reference, here is roughly what the counter-based version would look like:
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 3b3f6b56e733..b672af2441a3 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -917,6 +922,64 @@ static void __noreturn ipi_cpu_crash_stop(unsigned int cpu, struct pt_regs *regs
#endif
}
+static DEFINE_RAW_SPINLOCK(cpu_pause_lock);
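+/*
+ * cpu_paused tells remote CPUs to keep spinning in pause_local_cpu();
+ * nr_cpus_paused counts how many of them are currently spinning there.
+ */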
+static bool __cacheline_aligned_in_smp cpu_paused;
+static atomic_t __cacheline_aligned_in_smp nr_cpus_paused;
+
+static void pause_local_cpu(void)
+{
+ atomic_inc(&nr_cpus_paused);
+
+ while (READ_ONCE(cpu_paused))
+ cpu_relax();
+
+ atomic_dec(&nr_cpus_paused);
+}
+
+void pause_remote_cpus(void)
+{
+ cpumask_t cpus_to_pause;
+ int nr_cpus_to_pause = num_online_cpus() - 1;
+
+ lockdep_assert_cpus_held();
+ lockdep_assert_preemption_disabled();
+
+ if (!nr_cpus_to_pause)
+ return;
+
+ cpumask_copy(&cpus_to_pause, cpu_online_mask);
+ cpumask_clear_cpu(smp_processor_id(), &cpus_to_pause);
+
+ raw_spin_lock(&cpu_pause_lock);
+
+ WARN_ON_ONCE(cpu_paused);
+ WARN_ON_ONCE(atomic_read(&nr_cpus_paused));
+
+ cpu_paused = true;
+
+ smp_cross_call(&cpus_to_pause, IPI_CPU_STOP_NMI);
+
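+ /* Wait until every IPI'd CPU has checked in to pause_local_cpu(). */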
+ while (atomic_read(&nr_cpus_paused) != nr_cpus_to_pause)
+ cpu_relax();
+
+ raw_spin_unlock(&cpu_pause_lock);
+}
+
+void resume_remote_cpus(void)
+{
+ if (!cpu_paused)
+ return;
+
+ raw_spin_lock(&cpu_pause_lock);
+
+ WRITE_ONCE(cpu_paused, false);
+
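+ /* Wait for all paused CPUs to leave the pause loop before returning. */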
+ while (atomic_read(&nr_cpus_paused))
+ cpu_relax();
+
+ raw_spin_unlock(&cpu_pause_lock);
+}
+
static void arm64_backtrace_ipi(cpumask_t *mask)
{
__ipi_send_mask(ipi_desc[IPI_CPU_BACKTRACE], mask);
* Re: [PATCH v1 4/6] arm64: broadcast IPIs to pause remote CPUs
2024-10-28 22:11 ` Yu Zhao
@ 2024-10-29 19:36 ` Marc Zyngier
2024-10-31 18:10 ` Yu Zhao
0 siblings, 1 reply; 18+ messages in thread
From: Marc Zyngier @ 2024-10-29 19:36 UTC (permalink / raw)
To: Yu Zhao
Cc: Andrew Morton, Catalin Marinas, Muchun Song, Thomas Gleixner,
Will Deacon, Douglas Anderson, Mark Rutland, Nanyong Sun,
linux-arm-kernel, linux-kernel, linux-mm
On Mon, 28 Oct 2024 22:11:52 +0000,
Yu Zhao <yuzhao@google.com> wrote:
>
> On Tue, Oct 22, 2024 at 10:15 AM Marc Zyngier <maz@kernel.org> wrote:
> >
> > On Mon, 21 Oct 2024 05:22:16 +0100,
> > Yu Zhao <yuzhao@google.com> wrote:
> > >
> > > Broadcast pseudo-NMI IPIs to pause remote CPUs for a short period of
> > > time, and then reliably resume them when the local CPU exits critical
> > > sections that preclude the execution of remote CPUs.
> > >
> > > A typical example of such critical sections is BBM on kernel PTEs.
> > > HugeTLB Vmemmap Optimization (HVO) on arm64 was disabled by
> > > commit 060a2c92d1b6 ("arm64: mm: hugetlb: Disable
> > > HUGETLB_PAGE_OPTIMIZE_VMEMMAP") due to the following reason:
> > >
> > > This is deemed UNPREDICTABLE by the Arm architecture without a
> > > break-before-make sequence (make the PTE invalid, TLBI, write the
> > > new valid PTE). However, such sequence is not possible since the
> > > vmemmap may be concurrently accessed by the kernel.
> > >
> > > Supporting BBM on kernel PTEs is one of the approaches that can make
> > > HVO theoretically safe on arm64.
> >
> > Is the safety only theoretical? I would have expected that we'd use an
> > approach that is absolutely rock-solid.
>
> We've been trying to construct a repro against the original HVO
> (missing BBM), but so far no success. Hopefully a repro does exist,
> and then we'd be able to demonstrate the effectiveness of this series,
> which is only theoretical at the moment.
That wasn't my question.
Just because your HW doesn't show you a failure mode doesn't mean the
issue doesn't exist. Or that someone will eventually make use of the
relaxed aspects of the architecture and turn the behaviour you are
relying on into the perfect memory corruption. You absolutely cannot
rely on an implementation-defined behaviour for this stuff.
Hence my question: is your current approach actually safe? In a
non-theoretical manner?
>
> > > Note that it is still possible for the paused CPUs to perform
> > > speculative translations. Such translations would cause spurious
> > > kernel PFs, which should be properly handled by
> > > is_spurious_el1_translation_fault().
> >
> > Speculative translation faults are never reported, that'd be a CPU
> > bug.
>
> Right, I meant to say "speculative accesses that cause translations".
>
> > *Spurious* translation faults can be reported if the CPU doesn't
> > implement FEAT_ETS2, for example, and that has to do with the ordering
> > of memory access wrt page-table walking for the purpose of translations.
>
> Just want to make sure I fully understand: after the local CPU sends
> TLBI (followed by DSB & ISB), FEAT_ETS2 would act like DSB & ISB on
> remote CPUs when they perform the invalidation, and therefore
> speculative accesses are ordered as well on remote CPUs.
ETS2 has nothing to do with TLBI. It has to do with non-cacheable
(translation, access and address size) faults, and whether an older
update to a translation is visible to a younger memory access without
extra synchronisation. It definitely has nothing to do with remote
CPUs (and the idea of a remote ISB, while cute, doesn't exist).
> Also I'm assuming IPIs on remote CPUs don't act like a full barrier,
> and whatever they interrupt still can be speculatively executed even
> though the IPI handler itself doesn't access the vmemmap area
> undergoing BBM. Is this correct?
Full barrier *for what*? An interrupt is a CSE, which has the effects
described in the ARM ARM, but that's about it.
>
> > > Signed-off-by: Yu Zhao <yuzhao@google.com>
> > > ---
> > > arch/arm64/include/asm/smp.h | 3 ++
> > > arch/arm64/kernel/smp.c | 92 +++++++++++++++++++++++++++++++++---
> > > 2 files changed, 88 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/arch/arm64/include/asm/smp.h b/arch/arm64/include/asm/smp.h
> > > index 2510eec026f7..cffb0cfed961 100644
> > > --- a/arch/arm64/include/asm/smp.h
> > > +++ b/arch/arm64/include/asm/smp.h
> > > @@ -133,6 +133,9 @@ bool cpus_are_stuck_in_kernel(void);
> > > extern void crash_smp_send_stop(void);
> > > extern bool smp_crash_stop_failed(void);
> > >
> > > +void pause_remote_cpus(void);
> > > +void resume_remote_cpus(void);
> > > +
> > > #endif /* ifndef __ASSEMBLY__ */
> > >
> > > #endif /* ifndef __ASM_SMP_H */
> > > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> > > index 3b3f6b56e733..68829c6de1b1 100644
> > > --- a/arch/arm64/kernel/smp.c
> > > +++ b/arch/arm64/kernel/smp.c
> > > @@ -85,7 +85,12 @@ static int ipi_irq_base __ro_after_init;
> > > static int nr_ipi __ro_after_init = NR_IPI;
> > > static struct irq_desc *ipi_desc[MAX_IPI] __ro_after_init;
> > >
> > > -static bool crash_stop;
> > > +enum {
> > > + SEND_STOP = BIT(0),
> > > + CRASH_STOP = BIT(1),
> > > +};
> > > +
> > > +static unsigned long stop_in_progress;
> > >
> > > static void ipi_setup(int cpu);
> > >
> > > @@ -917,6 +922,79 @@ static void __noreturn ipi_cpu_crash_stop(unsigned int cpu, struct pt_regs *regs
> > > #endif
> > > }
> > >
> > > +static DEFINE_SPINLOCK(cpu_pause_lock);
> >
> > PREEMPT_RT will turn this into a sleeping lock. Is it safe to sleep as
> > you are dealing with kernel mappings?
>
> Right, it should be a raw spinlock -- the caller disabled preemption,
> which as you said is required when dealing with the kernel mappings.
>
> > > +static cpumask_t paused_cpus;
> > > +static cpumask_t resumed_cpus;
> > > +
> > > +static void pause_local_cpu(void)
> > > +{
> > > + int cpu = smp_processor_id();
> > > +
> > > + cpumask_clear_cpu(cpu, &resumed_cpus);
> > > + /*
> > > + * Paired with pause_remote_cpus() to confirm that this CPU not only
> > > + * will be paused but also can be reliably resumed.
> > > + */
> > > + smp_wmb();
> > > + cpumask_set_cpu(cpu, &paused_cpus);
> > > + /* paused_cpus must be set before waiting on resumed_cpus. */
> > > + barrier();
> >
> > I'm not sure what this is trying to enforce. Yes, the compiler won't
> > reorder the set and the test.
>
> Sorry I don't follow: does cpumask_set_cpu(), i.e., set_bit(), already
> contain a compiler barrier?
>
> My understanding is that the compiler is free to reorder the set and
> test on those two independent variables, and make it like this:
>
> while (!cpumask_test_cpu(cpu, &resumed_cpus))
> cpu_relax();
> cpumask_set_cpu(cpu, &paused_cpus);
>
> So the CPU that sent the IPI would keep waiting on paused_cpus being set,
> and this CPU would keep waiting on resumed_cpus being set, which would
> end up with a deadlock.
>
> > But your comment seems to indicate that
> > also need to make sure the CPU preserves that ordering
> > and short of a
> > DMB, the test below could be reordered.
>
> If this CPU reorders the set and test like above, it wouldn't be a
> problem because the set would eventually appear on the other CPU that
> sent the IPI.
I don't get it. You are requiring that the compiler doesn't reorder
things, but you're happy that the CPU reorders the same things. Surely
this leads to the same outcome...
>
> > > + while (!cpumask_test_cpu(cpu, &resumed_cpus))
> > > + cpu_relax();
> > > + /* A typical example for sleep and wake-up functions. */
> >
> > I'm not sure this is "typical",...
>
> Sorry, this full barrier isn't needed. Apparently I didn't properly
> fix this from the previous attempt to use wfe()/sev() to make this
> function the sleeper for resume_remote_cpus() to wake up.
>
> > > + smp_mb();
> > > + cpumask_clear_cpu(cpu, &paused_cpus);
> > > +}
> > > +
> > > +void pause_remote_cpus(void)
> > > +{
> > > + cpumask_t cpus_to_pause;
> > > +
> > > + lockdep_assert_cpus_held();
> > > + lockdep_assert_preemption_disabled();
> > > +
> > > + cpumask_copy(&cpus_to_pause, cpu_online_mask);
> > > + cpumask_clear_cpu(smp_processor_id(), &cpus_to_pause);
> >
> > This bitmap is manipulated outside of your cpu_pause_lock. What
> > guarantees you can't have two CPUs stepping on each other here?
>
> Do you mean cpus_to_pause? If so, that's a local bitmap.
Ah yes, sorry.
>
> > > +
> > > + spin_lock(&cpu_pause_lock);
> > > +
> > > + WARN_ON_ONCE(!cpumask_empty(&paused_cpus));
> > > +
> > > + smp_cross_call(&cpus_to_pause, IPI_CPU_STOP_NMI);
> > > +
> > > + while (!cpumask_equal(&cpus_to_pause, &paused_cpus))
> > > + cpu_relax();
> >
> > This can be a lot of things to compare, specially that you are
> > explicitly mentioning large systems. Why can't this be implemented as
> > a counter instead?
>
> Agreed - that'd be sufficient and simpler.
>
> > Overall, this looks like stop_machine() in disguise. Why can't this
> > use the existing infrastructure?
>
> This came up during the previous discussion [1]. There are
> similarities. The main concern with reusing stop_machine() (or part of
> its current implementation) is that it may not meet the performance
> requirements. Refactoring might be possible, however, it seems to me
> (after checking the code again) that such a refactoring unlikely ends
> up cleaner or simpler codebase, especially after I got rid of CPU
> masks, as you suggested.
I would have expected to see an implementation handling faults during
the break window. IPIs are slow, virtualised extremely badly, and
pNMIs are rarely enabled, due to the long history of issues with it.
At this stage, I'm not sure what you have here is any better than
stop_machine(). On the other hand, faults should be the rarest thing,
M.
--
Without deviation from the norm, progress is not possible.
* Re: [PATCH v1 4/6] arm64: broadcast IPIs to pause remote CPUs
2024-10-29 19:36 ` Marc Zyngier
@ 2024-10-31 18:10 ` Yu Zhao
0 siblings, 0 replies; 18+ messages in thread
From: Yu Zhao @ 2024-10-31 18:10 UTC (permalink / raw)
To: Marc Zyngier
Cc: Andrew Morton, Catalin Marinas, Muchun Song, Thomas Gleixner,
Will Deacon, Douglas Anderson, Mark Rutland, Nanyong Sun,
linux-arm-kernel, linux-kernel, linux-mm
On Tue, Oct 29, 2024 at 1:36 PM Marc Zyngier <maz@kernel.org> wrote:
>
> On Mon, 28 Oct 2024 22:11:52 +0000,
> Yu Zhao <yuzhao@google.com> wrote:
> >
> > On Tue, Oct 22, 2024 at 10:15 AM Marc Zyngier <maz@kernel.org> wrote:
> > >
> > > On Mon, 21 Oct 2024 05:22:16 +0100,
> > > Yu Zhao <yuzhao@google.com> wrote:
> > > >
> > > > Broadcast pseudo-NMI IPIs to pause remote CPUs for a short period of
> > > > time, and then reliably resume them when the local CPU exits critical
> > > > sections that preclude the execution of remote CPUs.
> > > >
> > > > A typical example of such critical sections is BBM on kernel PTEs.
> > > > HugeTLB Vmemmap Optimization (HVO) on arm64 was disabled by
> > > > commit 060a2c92d1b6 ("arm64: mm: hugetlb: Disable
> > > > HUGETLB_PAGE_OPTIMIZE_VMEMMAP") due to the following reason:
> > > >
> > > > This is deemed UNPREDICTABLE by the Arm architecture without a
> > > > break-before-make sequence (make the PTE invalid, TLBI, write the
> > > > new valid PTE). However, such sequence is not possible since the
> > > > vmemmap may be concurrently accessed by the kernel.
> > > >
> > > > Supporting BBM on kernel PTEs is one of the approaches that can make
> > > > HVO theoretically safe on arm64.
> > >
> > > Is the safety only theoretical? I would have expected that we'd use an
> > > approach that is absolutely rock-solid.
> >
> > We've been trying to construct a repro against the original HVO
> > (missing BBM), but so far no success. Hopefully a repro does exist,
> > and then we'd be able to demonstrate the effectiveness of this series,
> > which is only theoretical at the moment.
>
> That wasn't my question.
>
> Just because your HW doesn't show you a failure mode doesn't mean the
> issue doesn't exist. Or that someone will eventually make use of the
> relaxed aspects of the architecture and turn the behaviour you are
> relying on into the perfect memory corruption. You absolutely cannot
> rely on an implementation-defined behaviour for this stuff.
I agree.
> Hence my question: is your current approach actually safe?
AFAIK, yes.
> In a
> non-theoretical manner?
We are confident enough to try it in our production, but it doesn't
mean it's bug free. It would take more eyeballs and soak time before
it can prove itself.
> > > > Note that it is still possible for the paused CPUs to perform
> > > > speculative translations. Such translations would cause spurious
> > > > kernel PFs, which should be properly handled by
> > > > is_spurious_el1_translation_fault().
> > >
> > > Speculative translation faults are never reported, that'd be a CPU
> > > bug.
> >
> > Right, I meant to say "speculative accesses that cause translations".
> >
> > > *Spurious* translation faults can be reported if the CPU doesn't
> > > implement FEAT_ETS2, for example, and that has to do with the ordering
> > > of memory access wrt page-table walking for the purpose of translations.
> >
> > Just want to make sure I fully understand: after the local CPU sends
> > TLBI (followed by DSB & ISB), FEAT_ETS2 would act like DSB & ISB on
> > remote CPUs when they perform the invalidation, and therefore
> > speculative accesses are ordered as well on remote CPUs.
>
> ETS2 has nothing to do with TLBI. It has to do with non-cacheable
> (translation, access and address size) faults, and whether an older
> update to a translation is visible to a younger memory access without
> extra synchronisation.
Is it correct to say that *without* ETS2, we also wouldn't see
spurious faults on the local CPU from the following BBM sequence?
	clear_pte(); dsb(); isb();
	tlbi(); dsb(); isb();
	set_valid_pte(); dsb(); isb();
	...
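In kernel terms, for a single PTE, that sequence would look roughly like
this (just a sketch to make the mapping to the patch explicit, not the
exact patch code; addr, ptep and new_pte are placeholders):

	pte_clear(&init_mm, addr, ptep);		/* break: invalidate the PTE */
	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);	/* dsb; tlbi; dsb; isb */
	set_pte_at(&init_mm, addr, ptep, new_pte);	/* make: write the new valid PTE */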
> It definitely has nothing to do with remote
> CPUs (and the idea of a remote ISB, while cute, doesn't exist).
>
> > Also I'm assuming IPIs on remote CPUs don't act like a full barrier,
> > and whatever they interrupt still can be speculatively executed even
> > though the IPI handler itself doesn't access the vmemmap area
> > undergoing BBM. Is this correct?
>
> Full barrier *for what*? An interrupt is a CSE, which has the effects
> described in the ARM ARM, but that's about it.
Sorry for the confusion. I hope the following could explain what I have in mind:
local cpu                           remote cpu

send ipi
                                    cse (= dsb(); isb() ?)
                                    enters ipi handler
                                    sets cpu_paused
sees cpu_paused
clear_pte()
dsb();
tlbi();
dsb(); isb()
set_valid_pte()
dsb();
sets cpu_resumed
                                    sees cpu_resumed
                                    No isb()
                                    exits ipi handler
My current understanding is:
Because exception exit doesn't have the CSE and there are no isb()s
within the IPI handler running on the remote CPU, it's possible for
the remote CPU to speculatively execute the next instruction before it
sees the valid PTE and trigger a spurious fault.
> > > > Signed-off-by: Yu Zhao <yuzhao@google.com>
> > > > ---
> > > > arch/arm64/include/asm/smp.h | 3 ++
> > > > arch/arm64/kernel/smp.c | 92 +++++++++++++++++++++++++++++++++---
> > > > 2 files changed, 88 insertions(+), 7 deletions(-)
> > > >
> > > > diff --git a/arch/arm64/include/asm/smp.h b/arch/arm64/include/asm/smp.h
> > > > index 2510eec026f7..cffb0cfed961 100644
> > > > --- a/arch/arm64/include/asm/smp.h
> > > > +++ b/arch/arm64/include/asm/smp.h
> > > > @@ -133,6 +133,9 @@ bool cpus_are_stuck_in_kernel(void);
> > > > extern void crash_smp_send_stop(void);
> > > > extern bool smp_crash_stop_failed(void);
> > > >
> > > > +void pause_remote_cpus(void);
> > > > +void resume_remote_cpus(void);
> > > > +
> > > > #endif /* ifndef __ASSEMBLY__ */
> > > >
> > > > #endif /* ifndef __ASM_SMP_H */
> > > > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> > > > index 3b3f6b56e733..68829c6de1b1 100644
> > > > --- a/arch/arm64/kernel/smp.c
> > > > +++ b/arch/arm64/kernel/smp.c
> > > > @@ -85,7 +85,12 @@ static int ipi_irq_base __ro_after_init;
> > > > static int nr_ipi __ro_after_init = NR_IPI;
> > > > static struct irq_desc *ipi_desc[MAX_IPI] __ro_after_init;
> > > >
> > > > -static bool crash_stop;
> > > > +enum {
> > > > + SEND_STOP = BIT(0),
> > > > + CRASH_STOP = BIT(1),
> > > > +};
> > > > +
> > > > +static unsigned long stop_in_progress;
> > > >
> > > > static void ipi_setup(int cpu);
> > > >
> > > > @@ -917,6 +922,79 @@ static void __noreturn ipi_cpu_crash_stop(unsigned int cpu, struct pt_regs *regs
> > > > #endif
> > > > }
> > > >
> > > > +static DEFINE_SPINLOCK(cpu_pause_lock);
> > >
> > > PREEMPT_RT will turn this into a sleeping lock. Is it safe to sleep as
> > > you are dealing with kernel mappings?
> >
> > Right, it should be a raw spinlock -- the caller disabled preemption,
> > which as you said is required when dealing with the kernel mappings.
> >
> > > > +static cpumask_t paused_cpus;
> > > > +static cpumask_t resumed_cpus;
> > > > +
> > > > +static void pause_local_cpu(void)
> > > > +{
> > > > + int cpu = smp_processor_id();
> > > > +
> > > > + cpumask_clear_cpu(cpu, &resumed_cpus);
> > > > + /*
> > > > + * Paired with pause_remote_cpus() to confirm that this CPU not only
> > > > + * will be paused but also can be reliably resumed.
> > > > + */
> > > > + smp_wmb();
> > > > + cpumask_set_cpu(cpu, &paused_cpus);
> > > > + /* paused_cpus must be set before waiting on resumed_cpus. */
> > > > + barrier();
> > >
> > > I'm not sure what this is trying to enforce. Yes, the compiler won't
> > > reorder the set and the test.
> >
> > Sorry I don't follow: does cpumask_set_cpu(), i.e., set_bit(), already
> > contain a compiler barrier?
> >
> > My understanding is that the compiler is free to reorder the set and
> > test on those two independent variables, and make it like this:
> >
> > while (!cpumask_test_cpu(cpu, &resumed_cpus))
> > cpu_relax();
> > cpumask_set_cpu(cpu, &paused_cpus);
> >
> > So the CPU that sent the IPI would keep waiting on paused_cpus being set,
> > and this CPU would keep waiting on resumed_cpus being set, which would
> > end up with a deadlock.
> >
> > > But your comment seems to indicate that
> > > also need to make sure the CPU preserves that ordering
> > > and short of a
> > > DMB, the test below could be reordered.
> >
> > If this CPU reorders the set and test like above, it wouldn't be a
> > problem because the set would eventually appear on the other CPU that
> > sent the IPI.
>
> I don't get it. You are requiring that the compiler doesn't reorder
> things, but you're happy that the CPU reorders the same things. Surely
> this leads to the same outcome...
I was thinking that the compiler might be able to reorder even if it
can't prove the loop actually terminates. But I now think that
shouldn't happen. So yes, I agree with what you said earlier that "the
compiler won't reorder the set and the test".
> > > > + while (!cpumask_test_cpu(cpu, &resumed_cpus))
> > > > + cpu_relax();
> > > > + /* A typical example for sleep and wake-up functions. */
> > >
> > > I'm not sure this is "typical",...
> >
> > Sorry, this full barrier isn't needed. Apparently I didn't properly
> > fix this from the previous attempt to use wfe()/sev() to make this
> > function the sleeper for resume_remote_cpus() to wake up.
> >
> > > > + smp_mb();
> > > > + cpumask_clear_cpu(cpu, &paused_cpus);
> > > > +}
> > > > +
> > > > +void pause_remote_cpus(void)
> > > > +{
> > > > + cpumask_t cpus_to_pause;
> > > > +
> > > > + lockdep_assert_cpus_held();
> > > > + lockdep_assert_preemption_disabled();
> > > > +
> > > > + cpumask_copy(&cpus_to_pause, cpu_online_mask);
> > > > + cpumask_clear_cpu(smp_processor_id(), &cpus_to_pause);
> > >
> > > This bitmap is manipulated outside of your cpu_pause_lock. What
> > > guarantees you can't have two CPUs stepping on each other here?
> >
> > Do you mean cpus_to_pause? If so, that's a local bitmap.
>
> Ah yes, sorry.
>
> >
> > > > +
> > > > + spin_lock(&cpu_pause_lock);
> > > > +
> > > > + WARN_ON_ONCE(!cpumask_empty(&paused_cpus));
> > > > +
> > > > + smp_cross_call(&cpus_to_pause, IPI_CPU_STOP_NMI);
> > > > +
> > > > + while (!cpumask_equal(&cpus_to_pause, &paused_cpus))
> > > > + cpu_relax();
> > >
> > > This can be a lot of things to compare, specially that you are
> > > explicitly mentioning large systems. Why can't this be implemented as
> > > a counter instead?
> >
> > Agreed - that'd be sufficient and simpler.
> >
> > > Overall, this looks like stop_machine() in disguise. Why can't this
> > > use the existing infrastructure?
> >
> > This came up during the previous discussion [1]. There are
> > similarities. The main concern with reusing stop_machine() (or part of
> > its current implementation) is that it may not meet the performance
> > requirements. Refactoring might be possible, however, it seems to me
> > (after checking the code again) that such a refactoring unlikely ends
> > up cleaner or simpler codebase, especially after I got rid of CPU
> > masks, as you suggested.
>
> I would have expected to see an implementation handling faults during
> the break window. IPIs are slow, virtualised extremely badly, and
> pNMIs are rarely enabled, due to the long history of issues with it.
> At this stage, I'm not sure what you have here is any better than
> stop_machine(). On the other hand, faults should be the rarest thing,
Let me measure stop_machine() and see how that works for our
production workloads. Meanwhile, would you mind if we keep the IPI
approach, along with the kernel PF approach, on the table?
Thanks.
* [PATCH v1 5/6] arm64: pause remote CPUs to update vmemmap
2024-10-21 4:22 [PATCH v1 0/6] mm/arm64: re-enable HVO Yu Zhao
` (3 preceding siblings ...)
2024-10-21 4:22 ` [PATCH v1 4/6] arm64: broadcast IPIs to pause remote CPUs Yu Zhao
@ 2024-10-21 4:22 ` Yu Zhao
2024-10-21 4:22 ` [PATCH v1 6/6] arm64: select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP Yu Zhao
5 siblings, 0 replies; 18+ messages in thread
From: Yu Zhao @ 2024-10-21 4:22 UTC (permalink / raw)
To: Andrew Morton, Catalin Marinas, Marc Zyngier, Muchun Song,
Thomas Gleixner, Will Deacon
Cc: Douglas Anderson, Mark Rutland, Nanyong Sun, linux-arm-kernel,
linux-kernel, linux-mm, Yu Zhao
Pause remote CPUs so that the local CPU can follow the proper BBM
sequence to safely update the vmemmap mapping `struct page` areas.
While updating the vmemmap, it is guaranteed that neither the local
CPU nor the remote ones will access the `struct page` area being
updated, and therefore they should not trigger (non-spurious) kernel
PFs.
Signed-off-by: Yu Zhao <yuzhao@google.com>
---
arch/arm64/include/asm/pgalloc.h | 69 ++++++++++++++++++++++++++++++++
1 file changed, 69 insertions(+)
diff --git a/arch/arm64/include/asm/pgalloc.h b/arch/arm64/include/asm/pgalloc.h
index 8ff5f2a2579e..f50f79f57c1e 100644
--- a/arch/arm64/include/asm/pgalloc.h
+++ b/arch/arm64/include/asm/pgalloc.h
@@ -12,6 +12,7 @@
#include <asm/processor.h>
#include <asm/cacheflush.h>
#include <asm/tlbflush.h>
+#include <asm/cpu.h>
#define __HAVE_ARCH_PGD_FREE
#define __HAVE_ARCH_PUD_FREE
@@ -137,4 +138,72 @@ pmd_populate(struct mm_struct *mm, pmd_t *pmdp, pgtable_t ptep)
__pmd_populate(pmdp, page_to_phys(ptep), PMD_TYPE_TABLE | PMD_TABLE_PXN);
}
+#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
+
+#define VMEMMAP_ARCH_TLB_FLUSH_FLAGS (VMEMMAP_SPLIT_NO_TLB_FLUSH | VMEMMAP_REMAP_NO_TLB_FLUSH)
+
+#define vmemmap_update_supported vmemmap_update_supported
+static inline bool vmemmap_update_supported(void)
+{
+ return system_uses_irq_prio_masking();
+}
+
+#define vmemmap_update_lock vmemmap_update_lock
+static inline void vmemmap_update_lock(void)
+{
+ cpus_read_lock();
+}
+
+#define vmemmap_update_unlock vmemmap_update_unlock
+static inline void vmemmap_update_unlock(void)
+{
+ cpus_read_unlock();
+}
+
+#define vmemmap_update_pte_range_start vmemmap_update_pte_range_start
+static inline void vmemmap_update_pte_range_start(pte_t *pte,
+ unsigned long start, unsigned long end)
+{
+ unsigned long addr;
+
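+ /*
+ * Quiesce all other CPUs and mask local IRQs so that nothing can
+ * access this vmemmap range while its PTEs are transiently invalid.
+ */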
+ local_irq_disable();
+ pause_remote_cpus();
+
+ for (addr = start; addr != end; addr += PAGE_SIZE, pte++)
+ pte_clear(&init_mm, addr, pte);
+
+ flush_tlb_kernel_range(start, end);
+}
+
+#define vmemmap_update_pte_range_end vmemmap_update_pte_range_end
+static inline void vmemmap_update_pte_range_end(void)
+{
+ resume_remote_cpus();
+ local_irq_enable();
+}
+
+#define vmemmap_update_pmd_range_start vmemmap_update_pmd_range_start
+static inline void vmemmap_update_pmd_range_start(pmd_t *pmd,
+ unsigned long start, unsigned long end)
+{
+ unsigned long addr;
+
+ local_irq_disable();
+ pause_remote_cpus();
+
+ for (addr = start; addr != end; addr += PMD_SIZE, pmd++)
+ pmd_clear(pmd);
+
+ flush_tlb_kernel_range(start, end);
+}
+
+#define vmemmap_update_pmd_range_end vmemmap_update_pmd_range_end
+static inline void vmemmap_update_pmd_range_end(void)
+{
+ resume_remote_cpus();
+ local_irq_enable();
+}
+
+#endif /* CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP */
+
#endif
--
2.47.0.rc1.288.g06298d1525-goog
* [PATCH v1 6/6] arm64: select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
2024-10-21 4:22 [PATCH v1 0/6] mm/arm64: re-enable HVO Yu Zhao
` (4 preceding siblings ...)
2024-10-21 4:22 ` [PATCH v1 5/6] arm64: pause remote CPUs to update vmemmap Yu Zhao
@ 2024-10-21 4:22 ` Yu Zhao
5 siblings, 0 replies; 18+ messages in thread
From: Yu Zhao @ 2024-10-21 4:22 UTC (permalink / raw)
To: Andrew Morton, Catalin Marinas, Marc Zyngier, Muchun Song,
Thomas Gleixner, Will Deacon
Cc: Douglas Anderson, Mark Rutland, Nanyong Sun, linux-arm-kernel,
linux-kernel, linux-mm, Yu Zhao
To use HVO, make sure that the kernel is booted with pseudo-NMI
enabled by "irqchip.gicv3_pseudo_nmi=1", as well as
"hugetlb_free_vmemmap=on" unless HVO is enabled by default.
Note that HVO checks the pseudo-NMI capability and is disabled at
runtime if the capability turns out to be unsupported. When HVO is
successfully enabled, the following should be seen:
# dmesg | grep NMI
GICv3: Pseudo-NMIs enabled using ...
# sysctl vm.hugetlb_optimize_vmemmap
vm.hugetlb_optimize_vmemmap = 1
Signed-off-by: Yu Zhao <yuzhao@google.com>
---
arch/arm64/Kconfig | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index fd9df6dcc593..e93745f819d9 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -109,6 +109,7 @@ config ARM64
select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
select ARCH_WANT_FRAME_POINTERS
select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)
+ select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
select ARCH_WANT_LD_ORPHAN_WARN
select ARCH_WANTS_EXECMEM_LATE if EXECMEM
select ARCH_WANTS_NO_INSTR
--
2.47.0.rc1.288.g06298d1525-goog