linux-mm.kvack.org archive mirror
* [PATCH RFC v2 0/4] Svvptc extension to remove preventive sfence.vma
@ 2024-01-31 15:59 Alexandre Ghiti
  2024-01-31 15:59 ` [PATCH RFC/RFT v2 1/4] riscv: Add ISA extension parsing for Svvptc Alexandre Ghiti
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Alexandre Ghiti @ 2024-01-31 15:59 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Thomas Bogendoerfer,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Andrew Morton,
	Ved Shanbhogue, Matt Evans, Dylan Jhong, linux-arm-kernel,
	linux-kernel, linux-mips, linuxppc-dev, linux-riscv, linux-mm
  Cc: Alexandre Ghiti

In RISC-V, after a new mapping is established, a sfence.vma needs to be
emitted for different reasons:

- if the uarch caches invalid entries, we need to invalidate them, otherwise
  we would trap on a stale invalid entry,
- if the uarch does not cache invalid entries, a reordered access could fail
  to see the new mapping and then trap (sfence.vma acts as a fence).

We can actually avoid emitting those (mostly) useless and costly sfence.vma
by handling the traps instead:

- for new kernel mappings: only vmalloc mappings need to be taken care of,
  other new mappings are rare and already emit the required sfence.vma if
  needed.
  The trap handling must be done very early in the exception path, as
  explained in patch 3, and this also fixes our fragile way of dealing with
  vmalloc faults.

- for new user mappings: Svvptc makes update_mmu_cache() a no-op and no
  traps can happen since xRET instructions now act as fences.

Patches 1 and 2 introduce Svvptc extension probing.

It's still an RFC because Svvptc is not ratified yet.

On our uarch, which does not cache invalid entries, with a 6.5 kernel, the
gains are measurable:

* Kernel boot:                  6%
* ltp - mmapstress01:           8%
* lmbench - lat_pagefault:      20%
* lmbench - lat_mmap:           5%

Thanks to Ved and Matt Evans for triggering the discussion that led to
this patchset!

Any feedback, tests or relevant benchmarks are welcome :)

Changes in v2:
- Rebase on top of 6.8-rc1
- Remove patch with runtime detection of tlb caching and debugfs patch
- Add patch that probes Svvptc
- Add patch that defines the new Svvptc dt-binding
- Leave the behaviour as-is for uarchs that cache invalid TLB entries since
  I don't have any good perf numbers
- Address comments from Christoph on v1
- Fix a race condition in new_vmalloc update:

       ld      a2, 0(a0) <= this could load something which is != -1
       not     a1, a1    <= here or in the instruction after, flush_cache_vmap()
                            could set the whole bitmap to 1
       and     a1, a2, a1
       sd      a1, 0(a0) <= here we would clear bits that should not be cleared!

   Instead, replace the whole sequence with:
       amoxor.w        a0, a1, (a0)
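
   For reference, here is a C-level sketch of that fix. This is only an
   illustration, not part of the patch: the helper name is hypothetical, it
   assumes the kernel's BIT_WORD()/BIT_MASK() helpers, and it uses the GCC
   __atomic_fetch_xor() builtin as a stand-in for the amoxor emitted in
   new_vmalloc_check:

       #include <linux/bits.h>
       #include <linux/types.h>

       extern u64 new_vmalloc[];

       /* Hypothetical illustration only, not the code in the patch. */
       static inline void clear_new_vmalloc_bit(unsigned int cpu)
       {
               /*
                * A plain load/and/store sequence could write back a value
                * read before a concurrent flush_cache_vmap() set the word
                * to -1, losing the bits that were just set. Atomically
                * xoring our bit (known to be set at this point) only flips
                * that single bit and preserves concurrent updates to the
                * other bits of the word.
                */
               __atomic_fetch_xor(&new_vmalloc[BIT_WORD(cpu)], BIT_MASK(cpu),
                                  __ATOMIC_RELAXED);
       }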

Alexandre Ghiti (4):
  riscv: Add ISA extension parsing for Svvptc
  dt-bindings: riscv: Add Svvptc ISA extension description
  riscv: Stop emitting preventive sfence.vma for new vmalloc mappings
  riscv: Stop emitting preventive sfence.vma for new userspace mappings
    with Svvptc

 .../devicetree/bindings/riscv/extensions.yaml |  7 ++
 arch/riscv/include/asm/cacheflush.h           | 18 +++-
 arch/riscv/include/asm/hwcap.h                |  1 +
 arch/riscv/include/asm/pgtable.h              | 16 +++-
 arch/riscv/include/asm/thread_info.h          |  5 ++
 arch/riscv/kernel/asm-offsets.c               |  5 ++
 arch/riscv/kernel/cpufeature.c                |  1 +
 arch/riscv/kernel/entry.S                     | 84 +++++++++++++++++++
 arch/riscv/mm/init.c                          |  2 +
 arch/riscv/mm/pgtable.c                       | 13 +++
 10 files changed, 150 insertions(+), 2 deletions(-)

-- 
2.39.2



^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH RFC/RFT v2 1/4] riscv: Add ISA extension parsing for Svvptc
  2024-01-31 15:59 [PATCH RFC v2 0/4] Svvptc extension to remove preventive sfence.vma Alexandre Ghiti
@ 2024-01-31 15:59 ` Alexandre Ghiti
  2024-01-31 15:59 ` [PATCH RFC/RFT v2 2/4] dt-bindings: riscv: Add Svvptc ISA extension description Alexandre Ghiti
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 18+ messages in thread
From: Alexandre Ghiti @ 2024-01-31 15:59 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Thomas Bogendoerfer,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Andrew Morton,
	Ved Shanbhogue, Matt Evans, Dylan Jhong, linux-arm-kernel,
	linux-kernel, linux-mips, linuxppc-dev, linux-riscv, linux-mm
  Cc: Alexandre Ghiti

Add support for parsing the Svvptc extension in the riscv,isa string.

Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
---
 arch/riscv/include/asm/hwcap.h | 1 +
 arch/riscv/kernel/cpufeature.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/riscv/include/asm/hwcap.h b/arch/riscv/include/asm/hwcap.h
index 5340f818746b..2e15192135fb 100644
--- a/arch/riscv/include/asm/hwcap.h
+++ b/arch/riscv/include/asm/hwcap.h
@@ -80,6 +80,7 @@
 #define RISCV_ISA_EXT_ZFA		71
 #define RISCV_ISA_EXT_ZTSO		72
 #define RISCV_ISA_EXT_ZACAS		73
+#define RISCV_ISA_EXT_SVVPTC		74
 
 #define RISCV_ISA_EXT_MAX		128
 #define RISCV_ISA_EXT_INVALID		U32_MAX
diff --git a/arch/riscv/kernel/cpufeature.c b/arch/riscv/kernel/cpufeature.c
index 89920f84d0a3..4a8f14bfa0f2 100644
--- a/arch/riscv/kernel/cpufeature.c
+++ b/arch/riscv/kernel/cpufeature.c
@@ -307,6 +307,7 @@ const struct riscv_isa_ext_data riscv_isa_ext[] = {
 	__RISCV_ISA_EXT_DATA(svinval, RISCV_ISA_EXT_SVINVAL),
 	__RISCV_ISA_EXT_DATA(svnapot, RISCV_ISA_EXT_SVNAPOT),
 	__RISCV_ISA_EXT_DATA(svpbmt, RISCV_ISA_EXT_SVPBMT),
+	__RISCV_ISA_EXT_DATA(svvptc, RISCV_ISA_EXT_SVVPTC),
 };
 
 const size_t riscv_isa_ext_count = ARRAY_SIZE(riscv_isa_ext);
-- 
2.39.2



^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH RFC/RFT v2 2/4] dt-bindings: riscv: Add Svvptc ISA extension description
  2024-01-31 15:59 [PATCH RFC v2 0/4] Svvptc extension to remove preventive sfence.vma Alexandre Ghiti
  2024-01-31 15:59 ` [PATCH RFC/RFT v2 1/4] riscv: Add ISA extension parsing for Svvptc Alexandre Ghiti
@ 2024-01-31 15:59 ` Alexandre Ghiti
  2024-02-01  9:22   ` Krzysztof Kozlowski
  2024-01-31 15:59 ` [PATCH RFC/RFT v2 3/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings Alexandre Ghiti
  2024-01-31 15:59 ` [PATCH RFC/RFT v2 4/4] riscv: Stop emitting preventive sfence.vma for new userspace mappings with Svvptc Alexandre Ghiti
  3 siblings, 1 reply; 18+ messages in thread
From: Alexandre Ghiti @ 2024-01-31 15:59 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Thomas Bogendoerfer,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Andrew Morton,
	Ved Shanbhogue, Matt Evans, Dylan Jhong, linux-arm-kernel,
	linux-kernel, linux-mips, linuxppc-dev, linux-riscv, linux-mm
  Cc: Alexandre Ghiti

Add a description for the Svvptc ISA extension.

Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
---
 Documentation/devicetree/bindings/riscv/extensions.yaml | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/Documentation/devicetree/bindings/riscv/extensions.yaml b/Documentation/devicetree/bindings/riscv/extensions.yaml
index 63d81dc895e5..59bf14d2c1eb 100644
--- a/Documentation/devicetree/bindings/riscv/extensions.yaml
+++ b/Documentation/devicetree/bindings/riscv/extensions.yaml
@@ -171,6 +171,13 @@ properties:
             memory types as ratified in the 20191213 version of the privileged
             ISA specification.
 
+        - const: svvptc
+          description:
+            The standard Svvptc supervisor-level extension for
+            address-translation cache behaviour with respect to invalid entries
+            as ratified in the XXXXXXXX version of the privileged ISA
+            specification.
+
         - const: zacas
           description: |
             The Zacas extension for Atomic Compare-and-Swap (CAS) instructions
-- 
2.39.2



^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH RFC/RFT v2 3/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings
  2024-01-31 15:59 [PATCH RFC v2 0/4] Svvptc extension to remove preventive sfence.vma Alexandre Ghiti
  2024-01-31 15:59 ` [PATCH RFC/RFT v2 1/4] riscv: Add ISA extension parsing for Svvptc Alexandre Ghiti
  2024-01-31 15:59 ` [PATCH RFC/RFT v2 2/4] dt-bindings: riscv: Add Svvptc ISA extension description Alexandre Ghiti
@ 2024-01-31 15:59 ` Alexandre Ghiti
  2024-06-03  2:26   ` [External] " yunhui cui
  2024-01-31 15:59 ` [PATCH RFC/RFT v2 4/4] riscv: Stop emitting preventive sfence.vma for new userspace mappings with Svvptc Alexandre Ghiti
  3 siblings, 1 reply; 18+ messages in thread
From: Alexandre Ghiti @ 2024-01-31 15:59 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Thomas Bogendoerfer,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Andrew Morton,
	Ved Shanbhogue, Matt Evans, Dylan Jhong, linux-arm-kernel,
	linux-kernel, linux-mips, linuxppc-dev, linux-riscv, linux-mm
  Cc: Alexandre Ghiti

In 6.5, we removed the vmalloc fault path because that can't work (see
[1] [2]). Then, in order to make sure that new page table entries were
seen by the page table walker, we had to preventively emit a sfence.vma
on all harts [3], but this solution is very costly since it relies on IPIs.

And even there, we could end up in a loop of vmalloc faults if a vmalloc
allocation is done in the IPI path (for example if it is traced, see
[4]), which could result in a kernel stack overflow.

Those preventive sfence.vma needed to be emitted because:

- if the uarch caches invalid entries, the new mapping may not be
  observed by the page table walker and an invalidation may be needed.
- if the uarch does not cache invalid entries, a reordered access
  could "miss" the new mapping and trap: in that case, we would actually
  only need to retry the access, no sfence.vma is required.

So this patch removes those preventive sfence.vma and actually handles
the possible (and unlikely) exceptions. And since the kernel stack
mappings lie in the vmalloc area, this handling must be done very early
when the trap is taken, at the very beginning of handle_exception: this
also rules out vmalloc allocations in the fault path.

Link: https://lore.kernel.org/linux-riscv/20230531093817.665799-1-bjorn@kernel.org/ [1]
Link: https://lore.kernel.org/linux-riscv/20230801090927.2018653-1-dylan@andestech.com [2]
Link: https://lore.kernel.org/linux-riscv/20230725132246.817726-1-alexghiti@rivosinc.com/ [3]
Link: https://lore.kernel.org/lkml/20200508144043.13893-1-joro@8bytes.org/ [4]
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
---
 arch/riscv/include/asm/cacheflush.h  | 18 +++++-
 arch/riscv/include/asm/thread_info.h |  5 ++
 arch/riscv/kernel/asm-offsets.c      |  5 ++
 arch/riscv/kernel/entry.S            | 84 ++++++++++++++++++++++++++++
 arch/riscv/mm/init.c                 |  2 +
 5 files changed, 113 insertions(+), 1 deletion(-)

diff --git a/arch/riscv/include/asm/cacheflush.h b/arch/riscv/include/asm/cacheflush.h
index a129dac4521d..b0d631701757 100644
--- a/arch/riscv/include/asm/cacheflush.h
+++ b/arch/riscv/include/asm/cacheflush.h
@@ -37,7 +37,23 @@ static inline void flush_dcache_page(struct page *page)
 	flush_icache_mm(vma->vm_mm, 0)
 
 #ifdef CONFIG_64BIT
-#define flush_cache_vmap(start, end)		flush_tlb_kernel_range(start, end)
+extern u64 new_vmalloc[NR_CPUS / sizeof(u64) + 1];
+extern char _end[];
+#define flush_cache_vmap flush_cache_vmap
+static inline void flush_cache_vmap(unsigned long start, unsigned long end)
+{
+	if (is_vmalloc_or_module_addr((void *)start)) {
+		int i;
+
+		/*
+		 * We don't care if concurrently a cpu resets this value since
+		 * the only place this can happen is in handle_exception() where
+		 * an sfence.vma is emitted.
+		 */
+		for (i = 0; i < ARRAY_SIZE(new_vmalloc); ++i)
+			new_vmalloc[i] = -1ULL;
+	}
+}
 #define flush_cache_vmap_early(start, end)	local_flush_tlb_kernel_range(start, end)
 #endif
 
diff --git a/arch/riscv/include/asm/thread_info.h b/arch/riscv/include/asm/thread_info.h
index 5d473343634b..32631acdcdd4 100644
--- a/arch/riscv/include/asm/thread_info.h
+++ b/arch/riscv/include/asm/thread_info.h
@@ -60,6 +60,11 @@ struct thread_info {
 	void			*scs_base;
 	void			*scs_sp;
 #endif
+	/*
+	 * Used in handle_exception() to save a0, a1 and a2 before knowing if we
+	 * can access the kernel stack.
+	 */
+	unsigned long		a0, a1, a2;
 };
 
 #ifdef CONFIG_SHADOW_CALL_STACK
diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c
index a03129f40c46..939ddc0e3c6e 100644
--- a/arch/riscv/kernel/asm-offsets.c
+++ b/arch/riscv/kernel/asm-offsets.c
@@ -35,6 +35,8 @@ void asm_offsets(void)
 	OFFSET(TASK_THREAD_S9, task_struct, thread.s[9]);
 	OFFSET(TASK_THREAD_S10, task_struct, thread.s[10]);
 	OFFSET(TASK_THREAD_S11, task_struct, thread.s[11]);
+
+	OFFSET(TASK_TI_CPU, task_struct, thread_info.cpu);
 	OFFSET(TASK_TI_FLAGS, task_struct, thread_info.flags);
 	OFFSET(TASK_TI_PREEMPT_COUNT, task_struct, thread_info.preempt_count);
 	OFFSET(TASK_TI_KERNEL_SP, task_struct, thread_info.kernel_sp);
@@ -42,6 +44,9 @@ void asm_offsets(void)
 #ifdef CONFIG_SHADOW_CALL_STACK
 	OFFSET(TASK_TI_SCS_SP, task_struct, thread_info.scs_sp);
 #endif
+	OFFSET(TASK_TI_A0, task_struct, thread_info.a0);
+	OFFSET(TASK_TI_A1, task_struct, thread_info.a1);
+	OFFSET(TASK_TI_A2, task_struct, thread_info.a2);
 
 	OFFSET(TASK_TI_CPU_NUM, task_struct, thread_info.cpu);
 	OFFSET(TASK_THREAD_F0,  task_struct, thread.fstate.f[0]);
diff --git a/arch/riscv/kernel/entry.S b/arch/riscv/kernel/entry.S
index 9d1a305d5508..c1ffaeaba7aa 100644
--- a/arch/riscv/kernel/entry.S
+++ b/arch/riscv/kernel/entry.S
@@ -19,6 +19,78 @@
 
 	.section .irqentry.text, "ax"
 
+.macro new_vmalloc_check
+	REG_S 	a0, TASK_TI_A0(tp)
+	REG_S 	a1, TASK_TI_A1(tp)
+	REG_S	a2, TASK_TI_A2(tp)
+
+	csrr 	a0, CSR_CAUSE
+	/* Exclude IRQs */
+	blt  	a0, zero, _new_vmalloc_restore_context
+	/* Only check new_vmalloc if we are in page/protection fault */
+	li   	a1, EXC_LOAD_PAGE_FAULT
+	beq  	a0, a1, _new_vmalloc_kernel_address
+	li   	a1, EXC_STORE_PAGE_FAULT
+	beq  	a0, a1, _new_vmalloc_kernel_address
+	li   	a1, EXC_INST_PAGE_FAULT
+	bne  	a0, a1, _new_vmalloc_restore_context
+
+_new_vmalloc_kernel_address:
+	/* Is it a kernel address? */
+	csrr 	a0, CSR_TVAL
+	bge 	a0, zero, _new_vmalloc_restore_context
+
+	/* Check if a new vmalloc mapping appeared that could explain the trap */
+
+	/*
+	 * Computes:
+	 * a0 = &new_vmalloc[BIT_WORD(cpu)]
+	 * a1 = BIT_MASK(cpu)
+	 */
+	REG_L 	a2, TASK_TI_CPU(tp)
+	/*
+	 * Compute the new_vmalloc element position:
+	 * (cpu / 64) * 8 = (cpu >> 6) << 3
+	 */
+	srli	a1, a2, 6
+	slli	a1, a1, 3
+	la	a0, new_vmalloc
+	add	a0, a0, a1
+	/*
+	 * Compute the bit position in the new_vmalloc element:
+	 * bit_pos = cpu % 64 = cpu - (cpu / 64) * 64 = cpu - ((cpu >> 6) << 6)
+	 * 	   = cpu - (((cpu >> 6) << 3) << 3)
+	 */
+	slli	a1, a1, 3
+	sub	a1, a2, a1
+	/* Compute the "get mask": 1 << bit_pos */
+	li	a2, 1
+	sll	a1, a2, a1
+
+	/* Check the value of new_vmalloc for this cpu */
+	REG_L	a2, 0(a0)
+	and	a2, a2, a1
+	beq	a2, zero, _new_vmalloc_restore_context
+
+	/* Atomically reset the current cpu bit in new_vmalloc */
+	amoxor.w	a0, a1, (a0)
+
+	/* Only emit a sfence.vma if the uarch caches invalid entries */
+	ALTERNATIVE("sfence.vma", "nop", 0, RISCV_ISA_EXT_SVVPTC, 1)
+
+	REG_L	a0, TASK_TI_A0(tp)
+	REG_L	a1, TASK_TI_A1(tp)
+	REG_L	a2, TASK_TI_A2(tp)
+	csrw	CSR_SCRATCH, x0
+	sret
+
+_new_vmalloc_restore_context:
+	REG_L	a0, TASK_TI_A0(tp)
+	REG_L 	a1, TASK_TI_A1(tp)
+	REG_L 	a2, TASK_TI_A2(tp)
+.endm
+
+
 SYM_CODE_START(handle_exception)
 	/*
 	 * If coming from userspace, preserve the user thread pointer and load
@@ -30,6 +102,18 @@ SYM_CODE_START(handle_exception)
 
 .Lrestore_kernel_tpsp:
 	csrr tp, CSR_SCRATCH
+
+	/*
+	 * The RISC-V kernel does not eagerly emit a sfence.vma after each
+	 * new vmalloc mapping, which may result in exceptions:
+	 * - if the uarch caches invalid entries, the new mapping would not be
+	 *   observed by the page table walker and an invalidation is needed.
+	 * - if the uarch does not cache invalid entries, a reordered access
+	 *   could "miss" the new mapping and trap: in that case, we only need
+	 *   to retry the access, no sfence.vma is required.
+	 */
+	new_vmalloc_check
+
 	REG_S sp, TASK_TI_KERNEL_SP(tp)
 
 #ifdef CONFIG_VMAP_STACK
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index eafc4c2200f2..54c9fdeda11e 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -36,6 +36,8 @@
 
 #include "../kernel/head.h"
 
+u64 new_vmalloc[NR_CPUS / sizeof(u64) + 1];
+
 struct kernel_mapping kernel_map __ro_after_init;
 EXPORT_SYMBOL(kernel_map);
 #ifdef CONFIG_XIP_KERNEL
-- 
2.39.2



^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH RFC/RFT v2 4/4] riscv: Stop emitting preventive sfence.vma for new userspace mappings with Svvptc
  2024-01-31 15:59 [PATCH RFC v2 0/4] Svvptc extension to remove preventive sfence.vma Alexandre Ghiti
                   ` (2 preceding siblings ...)
  2024-01-31 15:59 ` [PATCH RFC/RFT v2 3/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings Alexandre Ghiti
@ 2024-01-31 15:59 ` Alexandre Ghiti
  2024-02-01 15:03   ` Andrea Parri
  2024-05-30  9:35   ` [External] " yunhui cui
  3 siblings, 2 replies; 18+ messages in thread
From: Alexandre Ghiti @ 2024-01-31 15:59 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Thomas Bogendoerfer,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Andrew Morton,
	Ved Shanbhogue, Matt Evans, Dylan Jhong, linux-arm-kernel,
	linux-kernel, linux-mips, linuxppc-dev, linux-riscv, linux-mm
  Cc: Alexandre Ghiti

The preventive sfence.vma were emitted because new mappings must be made
visible to the page table walker, but Svvptc guarantees that xRET acts as
a fence, so no sfence.vma is needed for the uarchs that implement this
extension.

This allows us to drastically reduce the number of sfence.vma emitted:

* Ubuntu boot to login:
Before: ~630k sfence.vma
After:  ~200k sfence.vma

* ltp - mmapstress01
Before: ~45k
After:  ~6.3k

* lmbench - lat_pagefault
Before: ~665k
After:   832 (!)

* lmbench - lat_mmap
Before: ~546k
After:   718 (!)

Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
---
 arch/riscv/include/asm/pgtable.h | 16 +++++++++++++++-
 arch/riscv/mm/pgtable.c          | 13 +++++++++++++
 2 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 0c94260b5d0c..50986e4c4601 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -473,6 +473,9 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf,
 		struct vm_area_struct *vma, unsigned long address,
 		pte_t *ptep, unsigned int nr)
 {
+	asm_volatile_goto(ALTERNATIVE("nop", "j %l[svvptc]", 0, RISCV_ISA_EXT_SVVPTC, 1)
+			  : : : : svvptc);
+
 	/*
 	 * The kernel assumes that TLBs don't cache invalid entries, but
 	 * in RISC-V, SFENCE.VMA specifies an ordering constraint, not a
@@ -482,12 +485,23 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf,
 	 */
 	while (nr--)
 		local_flush_tlb_page(address + nr * PAGE_SIZE);
+
+svvptc:
+	/*
+	 * Svvptc guarantees that xRET acts as a fence, so when the uarch does
+	 * not cache invalid entries, we don't have to do anything.
+	 */
+	;
 }
 #define update_mmu_cache(vma, addr, ptep) \
 	update_mmu_cache_range(NULL, vma, addr, ptep, 1)
 
 #define __HAVE_ARCH_UPDATE_MMU_TLB
-#define update_mmu_tlb update_mmu_cache
+static inline void update_mmu_tlb(struct vm_area_struct *vma,
+				  unsigned long address, pte_t *ptep)
+{
+	flush_tlb_range(vma, address, address + PAGE_SIZE);
+}
 
 static inline void update_mmu_cache_pmd(struct vm_area_struct *vma,
 		unsigned long address, pmd_t *pmdp)
diff --git a/arch/riscv/mm/pgtable.c b/arch/riscv/mm/pgtable.c
index ef887efcb679..99ed389e4c8a 100644
--- a/arch/riscv/mm/pgtable.c
+++ b/arch/riscv/mm/pgtable.c
@@ -9,6 +9,9 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
 			  unsigned long address, pte_t *ptep,
 			  pte_t entry, int dirty)
 {
+	asm_volatile_goto(ALTERNATIVE("nop", "j %l[svvptc]", 0, RISCV_ISA_EXT_SVVPTC, 1)
+			  : : : : svvptc);
+
 	if (!pte_same(ptep_get(ptep), entry))
 		__set_pte_at(ptep, entry);
 	/*
@@ -16,6 +19,16 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
 	 * the case that the PTE changed and the spurious fault case.
 	 */
 	return true;
+
+svvptc:
+	if (!pte_same(ptep_get(ptep), entry)) {
+		__set_pte_at(ptep, entry);
+		/* Only uarchs that do not implement Svadu are impacted here */
+		flush_tlb_page(vma, address);
+		return true;
+	}
+
+	return false;
 }
 
 int ptep_test_and_clear_young(struct vm_area_struct *vma,
-- 
2.39.2



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC/RFT v2 2/4] dt-bindings: riscv: Add Svvptc ISA extension description
  2024-01-31 15:59 ` [PATCH RFC/RFT v2 2/4] dt-bindings: riscv: Add Svvptc ISA extension description Alexandre Ghiti
@ 2024-02-01  9:22   ` Krzysztof Kozlowski
  0 siblings, 0 replies; 18+ messages in thread
From: Krzysztof Kozlowski @ 2024-02-01  9:22 UTC (permalink / raw)
  To: Alexandre Ghiti, Catalin Marinas, Will Deacon,
	Thomas Bogendoerfer, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Morton, Ved Shanbhogue, Matt Evans, Dylan Jhong,
	linux-arm-kernel, linux-kernel, linux-mips, linuxppc-dev,
	linux-riscv, linux-mm

On 31/01/2024 16:59, Alexandre Ghiti wrote:
> Add description for the Svvptc ISA extension which was ratified recently.
> 
> Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
> ---

Please use scripts/get_maintainer.pl to get the list of necessary people
and lists to CC. It might happen that the command, when run on an older
kernel, gives you outdated entries. Therefore please be sure you base
your patches on a recent Linux kernel.

Tools like b4 or scripts/get_maintainer.pl provide you with the proper list
of people, so fix your workflow. Tools might also fail if you work on some
ancient tree (don't, use mainline), work on a fork of the kernel (don't, use
mainline) or you ignore some maintainers (really don't). Just use b4 and
everything should be fine, although remember about `b4 prep
--auto-to-cc` if you added new patches to the patchset.

You missed at least the devicetree list (maybe more), so this won't be
tested by automated tooling. Performing review on untested code might be
a waste of time, thus I will skip this patch entirely until you follow
the process allowing the patch to be tested.

Please kindly resend and include all necessary To/Cc entries.


Best regards,
Krzysztof



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC/RFT v2 4/4] riscv: Stop emitting preventive sfence.vma for new userspace mappings with Svvptc
  2024-01-31 15:59 ` [PATCH RFC/RFT v2 4/4] riscv: Stop emitting preventive sfence.vma for new userspace mappings with Svvptc Alexandre Ghiti
@ 2024-02-01 15:03   ` Andrea Parri
  2024-02-02 15:42     ` Alexandre Ghiti
  2024-05-30  9:35   ` [External] " yunhui cui
  1 sibling, 1 reply; 18+ messages in thread
From: Andrea Parri @ 2024-02-01 15:03 UTC (permalink / raw)
  To: Alexandre Ghiti
  Cc: Catalin Marinas, Will Deacon, Thomas Bogendoerfer,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Andrew Morton,
	Ved Shanbhogue, Matt Evans, Dylan Jhong, linux-arm-kernel,
	linux-kernel, linux-mips, linuxppc-dev, linux-riscv, linux-mm

On Wed, Jan 31, 2024 at 04:59:29PM +0100, Alexandre Ghiti wrote:
> The preventive sfence.vma were emitted because new mappings must be made
> visible to the page table walker but Svvptc guarantees that xRET act as
> a fence, so no need to sfence.vma for the uarchs that implement this
> extension.

AFAIU, your first submission shows that you don't need that xRET property.
Similarly for other archs.  What was the rationale behind this Svvptc change?


> This allows to drastically reduce the number of sfence.vma emitted:
> 
> * Ubuntu boot to login:
> Before: ~630k sfence.vma
> After:  ~200k sfence.vma
> 
> * ltp - mmapstress01
> Before: ~45k
> After:  ~6.3k
> 
> * lmbench - lat_pagefault
> Before: ~665k
> After:   832 (!)
> 
> * lmbench - lat_mmap
> Before: ~546k
> After:   718 (!)

This Svvptc change seems to move/add the "burden" of the synchronization to
xRET: perhaps integrate the above counts w/ the perf gains in the cover letter?

  Andrea


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC/RFT v2 4/4] riscv: Stop emitting preventive sfence.vma for new userspace mappings with Svvptc
  2024-02-01 15:03   ` Andrea Parri
@ 2024-02-02 15:42     ` Alexandre Ghiti
  2024-02-02 22:05       ` Alexandre Ghiti
  0 siblings, 1 reply; 18+ messages in thread
From: Alexandre Ghiti @ 2024-02-02 15:42 UTC (permalink / raw)
  To: Andrea Parri
  Cc: Catalin Marinas, Will Deacon, Thomas Bogendoerfer,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Andrew Morton,
	Ved Shanbhogue, Matt Evans, Dylan Jhong, linux-arm-kernel,
	linux-kernel, linux-mips, linuxppc-dev, linux-riscv, linux-mm

Hi Andrea,

On Thu, Feb 1, 2024 at 4:03 PM Andrea Parri <parri.andrea@gmail.com> wrote:
>
> On Wed, Jan 31, 2024 at 04:59:29PM +0100, Alexandre Ghiti wrote:
> > The preventive sfence.vma were emitted because new mappings must be made
> > visible to the page table walker but Svvptc guarantees that xRET act as
> > a fence, so no need to sfence.vma for the uarchs that implement this
> > extension.
>
> AFAIU, your first submission shows that you don't need that xRET property.
> Similarly for other archs.  What was rationale behind this Svvptc change?

Actually, the ARC has just changed its mind and removed this new
behaviour from the Svvptc extension, so we will take some gratuitous
page faults (but those should be outliers), which makes riscv similar
to x86 and arm64.

>
>
> > This allows to drastically reduce the number of sfence.vma emitted:
> >
> > * Ubuntu boot to login:
> > Before: ~630k sfence.vma
> > After:  ~200k sfence.vma
> >
> > * ltp - mmapstress01
> > Before: ~45k
> > After:  ~6.3k
> >
> > * lmbench - lat_pagefault
> > Before: ~665k
> > After:   832 (!)
> >
> > * lmbench - lat_mmap
> > Before: ~546k
> > After:   718 (!)
>
> This Svvptc seems to move/add the "burden" of the synchronization to xRET:
> Perhaps integrate the above counts w/ the perf gains in the cover letter?

Yes, I'll copy that to the cover letter.

Thanks for your interest!

Alex

>
>   Andrea


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC/RFT v2 4/4] riscv: Stop emitting preventive sfence.vma for new userspace mappings with Svvptc
  2024-02-02 15:42     ` Alexandre Ghiti
@ 2024-02-02 22:05       ` Alexandre Ghiti
  0 siblings, 0 replies; 18+ messages in thread
From: Alexandre Ghiti @ 2024-02-02 22:05 UTC (permalink / raw)
  To: Andrea Parri
  Cc: Catalin Marinas, Will Deacon, Thomas Bogendoerfer,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Andrew Morton,
	Ved Shanbhogue, Matt Evans, Dylan Jhong, linux-arm-kernel,
	linux-kernel, linux-mips, linuxppc-dev, linux-riscv, linux-mm

On Fri, Feb 2, 2024 at 4:42 PM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
>
> Hi Andrea,
>
> On Thu, Feb 1, 2024 at 4:03 PM Andrea Parri <parri.andrea@gmail.com> wrote:
> >
> > On Wed, Jan 31, 2024 at 04:59:29PM +0100, Alexandre Ghiti wrote:
> > > The preventive sfence.vma were emitted because new mappings must be made
> > > visible to the page table walker but Svvptc guarantees that xRET act as
> > > a fence, so no need to sfence.vma for the uarchs that implement this
> > > extension.
> >
> > AFAIU, your first submission shows that you don't need that xRET property.
> > Similarly for other archs.  What was rationale behind this Svvptc change?
>
> Actually, the ARC has just changed its mind and removed this new

The wording was incorrect here: the ARC did not state anything. The
author of Svvptc proposed an amended version of the spec that removes
this behaviour, and that's under discussion.

> behaviour from the Svvptc extension, so we will take some gratuitous
> page faults (but that should be outliners), which makes riscv similar
> to x86 and arm64.
>
> >
> >
> > > This allows to drastically reduce the number of sfence.vma emitted:
> > >
> > > * Ubuntu boot to login:
> > > Before: ~630k sfence.vma
> > > After:  ~200k sfence.vma
> > >
> > > * ltp - mmapstress01
> > > Before: ~45k
> > > After:  ~6.3k
> > >
> > > * lmbench - lat_pagefault
> > > Before: ~665k
> > > After:   832 (!)
> > >
> > > * lmbench - lat_mmap
> > > Before: ~546k
> > > After:   718 (!)
> >
> > This Svvptc seems to move/add the "burden" of the synchronization to xRET:
> > Perhaps integrate the above counts w/ the perf gains in the cover letter?
>
> Yes, I'll copy that to the cover letter.
>
> Thanks for your interest!
>
> Alex
>
> >
> >   Andrea


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [External] [PATCH RFC/RFT v2 4/4] riscv: Stop emitting preventive sfence.vma for new userspace mappings with Svvptc
  2024-01-31 15:59 ` [PATCH RFC/RFT v2 4/4] riscv: Stop emitting preventive sfence.vma for new userspace mappings with Svvptc Alexandre Ghiti
  2024-02-01 15:03   ` Andrea Parri
@ 2024-05-30  9:35   ` yunhui cui
  1 sibling, 0 replies; 18+ messages in thread
From: yunhui cui @ 2024-05-30  9:35 UTC (permalink / raw)
  To: Alexandre Ghiti
  Cc: Catalin Marinas, Will Deacon, Thomas Bogendoerfer,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Andrew Morton,
	Ved Shanbhogue, Matt Evans, Dylan Jhong, linux-arm-kernel,
	linux-kernel, linux-mips, linuxppc-dev, linux-riscv, linux-mm

Hi Alex,

On Thu, Feb 1, 2024 at 12:04 AM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
>
> The preventive sfence.vma were emitted because new mappings must be made
> visible to the page table walker but Svvptc guarantees that xRET act as
> a fence, so no need to sfence.vma for the uarchs that implement this
> extension.
>
> This allows to drastically reduce the number of sfence.vma emitted:
>
> * Ubuntu boot to login:
> Before: ~630k sfence.vma
> After:  ~200k sfence.vma
>
> * ltp - mmapstress01
> Before: ~45k
> After:  ~6.3k
>
> * lmbench - lat_pagefault
> Before: ~665k
> After:   832 (!)
>
> * lmbench - lat_mmap
> Before: ~546k
> After:   718 (!)
>
> Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
> ---
>  arch/riscv/include/asm/pgtable.h | 16 +++++++++++++++-
>  arch/riscv/mm/pgtable.c          | 13 +++++++++++++
>  2 files changed, 28 insertions(+), 1 deletion(-)
>
> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
> index 0c94260b5d0c..50986e4c4601 100644
> --- a/arch/riscv/include/asm/pgtable.h
> +++ b/arch/riscv/include/asm/pgtable.h
> @@ -473,6 +473,9 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf,
>                 struct vm_area_struct *vma, unsigned long address,
>                 pte_t *ptep, unsigned int nr)
>  {
> +       asm_volatile_goto(ALTERNATIVE("nop", "j %l[svvptc]", 0, RISCV_ISA_EXT_SVVPTC, 1)
> +                         : : : : svvptc);
> +
>         /*
>          * The kernel assumes that TLBs don't cache invalid entries, but
>          * in RISC-V, SFENCE.VMA specifies an ordering constraint, not a
> @@ -482,12 +485,23 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf,
>          */
>         while (nr--)
>                 local_flush_tlb_page(address + nr * PAGE_SIZE);
> +
> +svvptc:
> +       /*
> +        * Svvptc guarantees that xRET act as a fence, so when the uarch does
> +        * not cache invalid entries, we don't have to do anything.
> +        */
> +       ;
>  }

From the perspective of the RISC-V arch, the logic of this patch is
reasonable. The common mm code may be missing calls to
update_mmu_cache_range(); for example, there is no TLB flush in
remap_pte_range() after updating the pte.
I will send a patch to mm/ to fix this problem next.


Thanks,
Yunhui


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [External] [PATCH RFC/RFT v2 3/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings
  2024-01-31 15:59 ` [PATCH RFC/RFT v2 3/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings Alexandre Ghiti
@ 2024-06-03  2:26   ` yunhui cui
  2024-06-03 12:02     ` Alexandre Ghiti
  0 siblings, 1 reply; 18+ messages in thread
From: yunhui cui @ 2024-06-03  2:26 UTC (permalink / raw)
  To: Alexandre Ghiti
  Cc: Catalin Marinas, Will Deacon, Thomas Bogendoerfer,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Andrew Morton,
	Ved Shanbhogue, Matt Evans, Dylan Jhong, linux-arm-kernel,
	linux-kernel, linux-mips, linuxppc-dev, linux-riscv, linux-mm

Hi Alexandre,

On Thu, Feb 1, 2024 at 12:03 AM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
>
> In 6.5, we removed the vmalloc fault path because that can't work (see
> [1] [2]). Then in order to make sure that new page table entries were
> seen by the page table walker, we had to preventively emit a sfence.vma
> on all harts [3] but this solution is very costly since it relies on IPI.
>
> And even there, we could end up in a loop of vmalloc faults if a vmalloc
> allocation is done in the IPI path (for example if it is traced, see
> [4]), which could result in a kernel stack overflow.
>
> Those preventive sfence.vma needed to be emitted because:
>
> - if the uarch caches invalid entries, the new mapping may not be
>   observed by the page table walker and an invalidation may be needed.
> - if the uarch does not cache invalid entries, a reordered access
>   could "miss" the new mapping and traps: in that case, we would actually
>   only need to retry the access, no sfence.vma is required.
>
> So this patch removes those preventive sfence.vma and actually handles
> the possible (and unlikely) exceptions. And since the kernel stacks
> mappings lie in the vmalloc area, this handling must be done very early
> when the trap is taken, at the very beginning of handle_exception: this
> also rules out the vmalloc allocations in the fault path.
>
> Link: https://lore.kernel.org/linux-riscv/20230531093817.665799-1-bjorn@kernel.org/ [1]
> Link: https://lore.kernel.org/linux-riscv/20230801090927.2018653-1-dylan@andestech.com [2]
> Link: https://lore.kernel.org/linux-riscv/20230725132246.817726-1-alexghiti@rivosinc.com/ [3]
> Link: https://lore.kernel.org/lkml/20200508144043.13893-1-joro@8bytes.org/ [4]
> Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
> ---
>  arch/riscv/include/asm/cacheflush.h  | 18 +++++-
>  arch/riscv/include/asm/thread_info.h |  5 ++
>  arch/riscv/kernel/asm-offsets.c      |  5 ++
>  arch/riscv/kernel/entry.S            | 84 ++++++++++++++++++++++++++++
>  arch/riscv/mm/init.c                 |  2 +
>  5 files changed, 113 insertions(+), 1 deletion(-)
>
> diff --git a/arch/riscv/include/asm/cacheflush.h b/arch/riscv/include/asm/cacheflush.h
> index a129dac4521d..b0d631701757 100644
> --- a/arch/riscv/include/asm/cacheflush.h
> +++ b/arch/riscv/include/asm/cacheflush.h
> @@ -37,7 +37,23 @@ static inline void flush_dcache_page(struct page *page)
>         flush_icache_mm(vma->vm_mm, 0)
>
>  #ifdef CONFIG_64BIT
> -#define flush_cache_vmap(start, end)           flush_tlb_kernel_range(start, end)
> +extern u64 new_vmalloc[NR_CPUS / sizeof(u64) + 1];
> +extern char _end[];
> +#define flush_cache_vmap flush_cache_vmap
> +static inline void flush_cache_vmap(unsigned long start, unsigned long end)
> +{
> +       if (is_vmalloc_or_module_addr((void *)start)) {
> +               int i;
> +
> +               /*
> +                * We don't care if concurrently a cpu resets this value since
> +                * the only place this can happen is in handle_exception() where
> +                * an sfence.vma is emitted.
> +                */
> +               for (i = 0; i < ARRAY_SIZE(new_vmalloc); ++i)
> +                       new_vmalloc[i] = -1ULL;
> +       }
> +}
>  #define flush_cache_vmap_early(start, end)     local_flush_tlb_kernel_range(start, end)
>  #endif
>
> diff --git a/arch/riscv/include/asm/thread_info.h b/arch/riscv/include/asm/thread_info.h
> index 5d473343634b..32631acdcdd4 100644
> --- a/arch/riscv/include/asm/thread_info.h
> +++ b/arch/riscv/include/asm/thread_info.h
> @@ -60,6 +60,11 @@ struct thread_info {
>         void                    *scs_base;
>         void                    *scs_sp;
>  #endif
> +       /*
> +        * Used in handle_exception() to save a0, a1 and a2 before knowing if we
> +        * can access the kernel stack.
> +        */
> +       unsigned long           a0, a1, a2;
>  };
>
>  #ifdef CONFIG_SHADOW_CALL_STACK
> diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c
> index a03129f40c46..939ddc0e3c6e 100644
> --- a/arch/riscv/kernel/asm-offsets.c
> +++ b/arch/riscv/kernel/asm-offsets.c
> @@ -35,6 +35,8 @@ void asm_offsets(void)
>         OFFSET(TASK_THREAD_S9, task_struct, thread.s[9]);
>         OFFSET(TASK_THREAD_S10, task_struct, thread.s[10]);
>         OFFSET(TASK_THREAD_S11, task_struct, thread.s[11]);
> +
> +       OFFSET(TASK_TI_CPU, task_struct, thread_info.cpu);
>         OFFSET(TASK_TI_FLAGS, task_struct, thread_info.flags);
>         OFFSET(TASK_TI_PREEMPT_COUNT, task_struct, thread_info.preempt_count);
>         OFFSET(TASK_TI_KERNEL_SP, task_struct, thread_info.kernel_sp);
> @@ -42,6 +44,9 @@ void asm_offsets(void)
>  #ifdef CONFIG_SHADOW_CALL_STACK
>         OFFSET(TASK_TI_SCS_SP, task_struct, thread_info.scs_sp);
>  #endif
> +       OFFSET(TASK_TI_A0, task_struct, thread_info.a0);
> +       OFFSET(TASK_TI_A1, task_struct, thread_info.a1);
> +       OFFSET(TASK_TI_A2, task_struct, thread_info.a2);
>
>         OFFSET(TASK_TI_CPU_NUM, task_struct, thread_info.cpu);
>         OFFSET(TASK_THREAD_F0,  task_struct, thread.fstate.f[0]);
> diff --git a/arch/riscv/kernel/entry.S b/arch/riscv/kernel/entry.S
> index 9d1a305d5508..c1ffaeaba7aa 100644
> --- a/arch/riscv/kernel/entry.S
> +++ b/arch/riscv/kernel/entry.S
> @@ -19,6 +19,78 @@
>
>         .section .irqentry.text, "ax"
>
> +.macro new_vmalloc_check
> +       REG_S   a0, TASK_TI_A0(tp)
> +       REG_S   a1, TASK_TI_A1(tp)
> +       REG_S   a2, TASK_TI_A2(tp)
> +
> +       csrr    a0, CSR_CAUSE
> +       /* Exclude IRQs */
> +       blt     a0, zero, _new_vmalloc_restore_context
> +       /* Only check new_vmalloc if we are in page/protection fault */
> +       li      a1, EXC_LOAD_PAGE_FAULT
> +       beq     a0, a1, _new_vmalloc_kernel_address
> +       li      a1, EXC_STORE_PAGE_FAULT
> +       beq     a0, a1, _new_vmalloc_kernel_address
> +       li      a1, EXC_INST_PAGE_FAULT
> +       bne     a0, a1, _new_vmalloc_restore_context
> +
> +_new_vmalloc_kernel_address:
> +       /* Is it a kernel address? */
> +       csrr    a0, CSR_TVAL
> +       bge     a0, zero, _new_vmalloc_restore_context
> +
> +       /* Check if a new vmalloc mapping appeared that could explain the trap */
> +
> +       /*
> +        * Computes:
> +        * a0 = &new_vmalloc[BIT_WORD(cpu)]
> +        * a1 = BIT_MASK(cpu)
> +        */
> +       REG_L   a2, TASK_TI_CPU(tp)
> +       /*
> +        * Compute the new_vmalloc element position:
> +        * (cpu / 64) * 8 = (cpu >> 6) << 3
> +        */
> +       srli    a1, a2, 6
> +       slli    a1, a1, 3
> +       la      a0, new_vmalloc
> +       add     a0, a0, a1
> +       /*
> +        * Compute the bit position in the new_vmalloc element:
> +        * bit_pos = cpu % 64 = cpu - (cpu / 64) * 64 = cpu - (cpu >> 6) << 6
> +        *         = cpu - ((cpu >> 6) << 3) << 3
> +        */
> +       slli    a1, a1, 3
> +       sub     a1, a2, a1
> +       /* Compute the "get mask": 1 << bit_pos */
> +       li      a2, 1
> +       sll     a1, a2, a1
> +
> +       /* Check the value of new_vmalloc for this cpu */
> +       REG_L   a2, 0(a0)
> +       and     a2, a2, a1
> +       beq     a2, zero, _new_vmalloc_restore_context
> +
> +       /* Atomically reset the current cpu bit in new_vmalloc */
> +       amoxor.w        a0, a1, (a0)
> +
> +       /* Only emit a sfence.vma if the uarch caches invalid entries */
> +       ALTERNATIVE("sfence.vma", "nop", 0, RISCV_ISA_EXT_SVVPTC, 1)
> +
> +       REG_L   a0, TASK_TI_A0(tp)
> +       REG_L   a1, TASK_TI_A1(tp)
> +       REG_L   a2, TASK_TI_A2(tp)
> +       csrw    CSR_SCRATCH, x0
> +       sret
> +
> +_new_vmalloc_restore_context:
> +       REG_L   a0, TASK_TI_A0(tp)
> +       REG_L   a1, TASK_TI_A1(tp)
> +       REG_L   a2, TASK_TI_A2(tp)
> +.endm
> +
> +
>  SYM_CODE_START(handle_exception)
>         /*
>          * If coming from userspace, preserve the user thread pointer and load
> @@ -30,6 +102,18 @@ SYM_CODE_START(handle_exception)
>
>  .Lrestore_kernel_tpsp:
>         csrr tp, CSR_SCRATCH
> +
> +       /*
> +        * The RISC-V kernel does not eagerly emit a sfence.vma after each
> +        * new vmalloc mapping, which may result in exceptions:
> +        * - if the uarch caches invalid entries, the new mapping would not be
> +        *   observed by the page table walker and an invalidation is needed.
> +        * - if the uarch does not cache invalid entries, a reordered access
> +        *   could "miss" the new mapping and traps: in that case, we only need
> +        *   to retry the access, no sfence.vma is required.
> +        */
> +       new_vmalloc_check
> +
>         REG_S sp, TASK_TI_KERNEL_SP(tp)
>
>  #ifdef CONFIG_VMAP_STACK
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index eafc4c2200f2..54c9fdeda11e 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -36,6 +36,8 @@
>
>  #include "../kernel/head.h"
>
> +u64 new_vmalloc[NR_CPUS / sizeof(u64) + 1];
> +
>  struct kernel_mapping kernel_map __ro_after_init;
>  EXPORT_SYMBOL(kernel_map);
>  #ifdef CONFIG_XIP_KERNEL
> --
> 2.39.2
>
>

Can we consider using new_vmalloc as a percpu variable, so that we
don't need to add a0/a1/a2 to thread_info? Also, try not to do too much
calculation logic in new_vmalloc_check; after all, handle_exception is
a high-frequency path. In that case, can we consider writing
new_vmalloc_check in C to improve readability?

Thanks,
Yunhui


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [External] [PATCH RFC/RFT v2 3/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings
  2024-06-03  2:26   ` [External] " yunhui cui
@ 2024-06-03 12:02     ` Alexandre Ghiti
  2024-06-04  6:21       ` yunhui cui
  0 siblings, 1 reply; 18+ messages in thread
From: Alexandre Ghiti @ 2024-06-03 12:02 UTC (permalink / raw)
  To: yunhui cui
  Cc: Catalin Marinas, Will Deacon, Thomas Bogendoerfer,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Andrew Morton,
	Ved Shanbhogue, Matt Evans, Dylan Jhong, linux-arm-kernel,
	linux-kernel, linux-mips, linuxppc-dev, linux-riscv, linux-mm

Hi Yunhui,

On Mon, Jun 3, 2024 at 4:26 AM yunhui cui <cuiyunhui@bytedance.com> wrote:
>
> Hi Alexandre,
>
> On Thu, Feb 1, 2024 at 12:03 AM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> >
> > In 6.5, we removed the vmalloc fault path because that can't work (see
> > [1] [2]). Then in order to make sure that new page table entries were
> > seen by the page table walker, we had to preventively emit a sfence.vma
> > on all harts [3] but this solution is very costly since it relies on IPI.
> >
> > And even there, we could end up in a loop of vmalloc faults if a vmalloc
> > allocation is done in the IPI path (for example if it is traced, see
> > [4]), which could result in a kernel stack overflow.
> >
> > Those preventive sfence.vma needed to be emitted because:
> >
> > - if the uarch caches invalid entries, the new mapping may not be
> >   observed by the page table walker and an invalidation may be needed.
> > - if the uarch does not cache invalid entries, a reordered access
> >   could "miss" the new mapping and traps: in that case, we would actually
> >   only need to retry the access, no sfence.vma is required.
> >
> > So this patch removes those preventive sfence.vma and actually handles
> > the possible (and unlikely) exceptions. And since the kernel stacks
> > mappings lie in the vmalloc area, this handling must be done very early
> > when the trap is taken, at the very beginning of handle_exception: this
> > also rules out the vmalloc allocations in the fault path.
> >
> > Link: https://lore.kernel.org/linux-riscv/20230531093817.665799-1-bjorn@kernel.org/ [1]
> > Link: https://lore.kernel.org/linux-riscv/20230801090927.2018653-1-dylan@andestech.com [2]
> > Link: https://lore.kernel.org/linux-riscv/20230725132246.817726-1-alexghiti@rivosinc.com/ [3]
> > Link: https://lore.kernel.org/lkml/20200508144043.13893-1-joro@8bytes.org/ [4]
> > Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
> > ---
> >  arch/riscv/include/asm/cacheflush.h  | 18 +++++-
> >  arch/riscv/include/asm/thread_info.h |  5 ++
> >  arch/riscv/kernel/asm-offsets.c      |  5 ++
> >  arch/riscv/kernel/entry.S            | 84 ++++++++++++++++++++++++++++
> >  arch/riscv/mm/init.c                 |  2 +
> >  5 files changed, 113 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/riscv/include/asm/cacheflush.h b/arch/riscv/include/asm/cacheflush.h
> > index a129dac4521d..b0d631701757 100644
> > --- a/arch/riscv/include/asm/cacheflush.h
> > +++ b/arch/riscv/include/asm/cacheflush.h
> > @@ -37,7 +37,23 @@ static inline void flush_dcache_page(struct page *page)
> >         flush_icache_mm(vma->vm_mm, 0)
> >
> >  #ifdef CONFIG_64BIT
> > -#define flush_cache_vmap(start, end)           flush_tlb_kernel_range(start, end)
> > +extern u64 new_vmalloc[NR_CPUS / sizeof(u64) + 1];
> > +extern char _end[];
> > +#define flush_cache_vmap flush_cache_vmap
> > +static inline void flush_cache_vmap(unsigned long start, unsigned long end)
> > +{
> > +       if (is_vmalloc_or_module_addr((void *)start)) {
> > +               int i;
> > +
> > +               /*
> > +                * We don't care if concurrently a cpu resets this value since
> > +                * the only place this can happen is in handle_exception() where
> > +                * an sfence.vma is emitted.
> > +                */
> > +               for (i = 0; i < ARRAY_SIZE(new_vmalloc); ++i)
> > +                       new_vmalloc[i] = -1ULL;
> > +       }
> > +}
> >  #define flush_cache_vmap_early(start, end)     local_flush_tlb_kernel_range(start, end)
> >  #endif
> >
> > diff --git a/arch/riscv/include/asm/thread_info.h b/arch/riscv/include/asm/thread_info.h
> > index 5d473343634b..32631acdcdd4 100644
> > --- a/arch/riscv/include/asm/thread_info.h
> > +++ b/arch/riscv/include/asm/thread_info.h
> > @@ -60,6 +60,11 @@ struct thread_info {
> >         void                    *scs_base;
> >         void                    *scs_sp;
> >  #endif
> > +       /*
> > +        * Used in handle_exception() to save a0, a1 and a2 before knowing if we
> > +        * can access the kernel stack.
> > +        */
> > +       unsigned long           a0, a1, a2;
> >  };
> >
> >  #ifdef CONFIG_SHADOW_CALL_STACK
> > diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c
> > index a03129f40c46..939ddc0e3c6e 100644
> > --- a/arch/riscv/kernel/asm-offsets.c
> > +++ b/arch/riscv/kernel/asm-offsets.c
> > @@ -35,6 +35,8 @@ void asm_offsets(void)
> >         OFFSET(TASK_THREAD_S9, task_struct, thread.s[9]);
> >         OFFSET(TASK_THREAD_S10, task_struct, thread.s[10]);
> >         OFFSET(TASK_THREAD_S11, task_struct, thread.s[11]);
> > +
> > +       OFFSET(TASK_TI_CPU, task_struct, thread_info.cpu);
> >         OFFSET(TASK_TI_FLAGS, task_struct, thread_info.flags);
> >         OFFSET(TASK_TI_PREEMPT_COUNT, task_struct, thread_info.preempt_count);
> >         OFFSET(TASK_TI_KERNEL_SP, task_struct, thread_info.kernel_sp);
> > @@ -42,6 +44,9 @@ void asm_offsets(void)
> >  #ifdef CONFIG_SHADOW_CALL_STACK
> >         OFFSET(TASK_TI_SCS_SP, task_struct, thread_info.scs_sp);
> >  #endif
> > +       OFFSET(TASK_TI_A0, task_struct, thread_info.a0);
> > +       OFFSET(TASK_TI_A1, task_struct, thread_info.a1);
> > +       OFFSET(TASK_TI_A2, task_struct, thread_info.a2);
> >
> >         OFFSET(TASK_TI_CPU_NUM, task_struct, thread_info.cpu);
> >         OFFSET(TASK_THREAD_F0,  task_struct, thread.fstate.f[0]);
> > diff --git a/arch/riscv/kernel/entry.S b/arch/riscv/kernel/entry.S
> > index 9d1a305d5508..c1ffaeaba7aa 100644
> > --- a/arch/riscv/kernel/entry.S
> > +++ b/arch/riscv/kernel/entry.S
> > @@ -19,6 +19,78 @@
> >
> >         .section .irqentry.text, "ax"
> >
> > +.macro new_vmalloc_check
> > +       REG_S   a0, TASK_TI_A0(tp)
> > +       REG_S   a1, TASK_TI_A1(tp)
> > +       REG_S   a2, TASK_TI_A2(tp)
> > +
> > +       csrr    a0, CSR_CAUSE
> > +       /* Exclude IRQs */
> > +       blt     a0, zero, _new_vmalloc_restore_context
> > +       /* Only check new_vmalloc if we are in page/protection fault */
> > +       li      a1, EXC_LOAD_PAGE_FAULT
> > +       beq     a0, a1, _new_vmalloc_kernel_address
> > +       li      a1, EXC_STORE_PAGE_FAULT
> > +       beq     a0, a1, _new_vmalloc_kernel_address
> > +       li      a1, EXC_INST_PAGE_FAULT
> > +       bne     a0, a1, _new_vmalloc_restore_context
> > +
> > +_new_vmalloc_kernel_address:
> > +       /* Is it a kernel address? */
> > +       csrr    a0, CSR_TVAL
> > +       bge     a0, zero, _new_vmalloc_restore_context
> > +
> > +       /* Check if a new vmalloc mapping appeared that could explain the trap */
> > +
> > +       /*
> > +        * Computes:
> > +        * a0 = &new_vmalloc[BIT_WORD(cpu)]
> > +        * a1 = BIT_MASK(cpu)
> > +        */
> > +       REG_L   a2, TASK_TI_CPU(tp)
> > +       /*
> > +        * Compute the new_vmalloc element position:
> > +        * (cpu / 64) * 8 = (cpu >> 6) << 3
> > +        */
> > +       srli    a1, a2, 6
> > +       slli    a1, a1, 3
> > +       la      a0, new_vmalloc
> > +       add     a0, a0, a1
> > +       /*
> > +        * Compute the bit position in the new_vmalloc element:
> > +        * bit_pos = cpu % 64 = cpu - (cpu / 64) * 64 = cpu - (cpu >> 6) << 6
> > +        *         = cpu - ((cpu >> 6) << 3) << 3
> > +        */
> > +       slli    a1, a1, 3
> > +       sub     a1, a2, a1
> > +       /* Compute the "get mask": 1 << bit_pos */
> > +       li      a2, 1
> > +       sll     a1, a2, a1
> > +
> > +       /* Check the value of new_vmalloc for this cpu */
> > +       REG_L   a2, 0(a0)
> > +       and     a2, a2, a1
> > +       beq     a2, zero, _new_vmalloc_restore_context
> > +
> > +       /* Atomically reset the current cpu bit in new_vmalloc */
> > +       amoxor.w        a0, a1, (a0)
> > +
> > +       /* Only emit a sfence.vma if the uarch caches invalid entries */
> > +       ALTERNATIVE("sfence.vma", "nop", 0, RISCV_ISA_EXT_SVVPTC, 1)
> > +
> > +       REG_L   a0, TASK_TI_A0(tp)
> > +       REG_L   a1, TASK_TI_A1(tp)
> > +       REG_L   a2, TASK_TI_A2(tp)
> > +       csrw    CSR_SCRATCH, x0
> > +       sret
> > +
> > +_new_vmalloc_restore_context:
> > +       REG_L   a0, TASK_TI_A0(tp)
> > +       REG_L   a1, TASK_TI_A1(tp)
> > +       REG_L   a2, TASK_TI_A2(tp)
> > +.endm
> > +
> > +
> >  SYM_CODE_START(handle_exception)
> >         /*
> >          * If coming from userspace, preserve the user thread pointer and load
> > @@ -30,6 +102,18 @@ SYM_CODE_START(handle_exception)
> >
> >  .Lrestore_kernel_tpsp:
> >         csrr tp, CSR_SCRATCH
> > +
> > +       /*
> > +        * The RISC-V kernel does not eagerly emit a sfence.vma after each
> > +        * new vmalloc mapping, which may result in exceptions:
> > +        * - if the uarch caches invalid entries, the new mapping would not be
> > +        *   observed by the page table walker and an invalidation is needed.
> > +        * - if the uarch does not cache invalid entries, a reordered access
> > +        *   could "miss" the new mapping and traps: in that case, we only need
> > +        *   to retry the access, no sfence.vma is required.
> > +        */
> > +       new_vmalloc_check
> > +
> >         REG_S sp, TASK_TI_KERNEL_SP(tp)
> >
> >  #ifdef CONFIG_VMAP_STACK
> > diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> > index eafc4c2200f2..54c9fdeda11e 100644
> > --- a/arch/riscv/mm/init.c
> > +++ b/arch/riscv/mm/init.c
> > @@ -36,6 +36,8 @@
> >
> >  #include "../kernel/head.h"
> >
> > +u64 new_vmalloc[NR_CPUS / sizeof(u64) + 1];
> > +
> >  struct kernel_mapping kernel_map __ro_after_init;
> >  EXPORT_SYMBOL(kernel_map);
> >  #ifdef CONFIG_XIP_KERNEL
> > --
> > 2.39.2
> >
> >
>
> Can we consider using new_vmalloc as a percpu variable, so that we
> don't need to add a0/1/2 in thread_info?

At first, I used percpu variables. But then I realized that percpu
areas are allocated in the vmalloc area, so if somehow we take a trap
when accessing the new_vmalloc percpu variable, we could not recover
from this as we would trap forever in new_vmalloc_check. But
admittedly, I'm not sure that can happen.

And how would that remove a0, a1 and a2 from thread_info? We'd still
need to save some registers somewhere to access the percpu variable
right?

> Also, try not to do too much
> calculation logic in new_vmalloc_check, after all, handle_exception is
> a high-frequency path. In this case, can we consider writing
> new_vmalloc_check in C language to increase readability?

If we write that in C, we don't have control over the allocated
registers and then we can't correctly save the context.

Thanks for your interest in this patchset :)

Alex

>
> Thanks,
> Yunhui


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [External] [PATCH RFC/RFT v2 3/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings
  2024-06-03 12:02     ` Alexandre Ghiti
@ 2024-06-04  6:21       ` yunhui cui
  2024-06-04  7:15         ` Alexandre Ghiti
  0 siblings, 1 reply; 18+ messages in thread
From: yunhui cui @ 2024-06-04  6:21 UTC (permalink / raw)
  To: Alexandre Ghiti
  Cc: Catalin Marinas, Will Deacon, Thomas Bogendoerfer,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Andrew Morton,
	Ved Shanbhogue, Matt Evans, Dylan Jhong, linux-arm-kernel,
	linux-kernel, linux-mips, linuxppc-dev, linux-riscv, linux-mm

Hi Alexandre,

On Mon, Jun 3, 2024 at 8:02 PM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
>
> Hi Yunhui,
>
> On Mon, Jun 3, 2024 at 4:26 AM yunhui cui <cuiyunhui@bytedance.com> wrote:
> >
> > Hi Alexandre,
> >
> > On Thu, Feb 1, 2024 at 12:03 AM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> > >
> > > In 6.5, we removed the vmalloc fault path because that can't work (see
> > > [1] [2]). Then in order to make sure that new page table entries were
> > > seen by the page table walker, we had to preventively emit a sfence.vma
> > > on all harts [3] but this solution is very costly since it relies on IPI.
> > >
> > > And even there, we could end up in a loop of vmalloc faults if a vmalloc
> > > allocation is done in the IPI path (for example if it is traced, see
> > > [4]), which could result in a kernel stack overflow.
> > >
> > > Those preventive sfence.vma needed to be emitted because:
> > >
> > > - if the uarch caches invalid entries, the new mapping may not be
> > >   observed by the page table walker and an invalidation may be needed.
> > > - if the uarch does not cache invalid entries, a reordered access
> > >   could "miss" the new mapping and traps: in that case, we would actually
> > >   only need to retry the access, no sfence.vma is required.
> > >
> > > So this patch removes those preventive sfence.vma and actually handles
> > > the possible (and unlikely) exceptions. And since the kernel stacks
> > > mappings lie in the vmalloc area, this handling must be done very early
> > > when the trap is taken, at the very beginning of handle_exception: this
> > > also rules out the vmalloc allocations in the fault path.
> > >
> > > Link: https://lore.kernel.org/linux-riscv/20230531093817.665799-1-bjorn@kernel.org/ [1]
> > > Link: https://lore.kernel.org/linux-riscv/20230801090927.2018653-1-dylan@andestech.com [2]
> > > Link: https://lore.kernel.org/linux-riscv/20230725132246.817726-1-alexghiti@rivosinc.com/ [3]
> > > Link: https://lore.kernel.org/lkml/20200508144043.13893-1-joro@8bytes.org/ [4]
> > > Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
> > > ---
> > >  arch/riscv/include/asm/cacheflush.h  | 18 +++++-
> > >  arch/riscv/include/asm/thread_info.h |  5 ++
> > >  arch/riscv/kernel/asm-offsets.c      |  5 ++
> > >  arch/riscv/kernel/entry.S            | 84 ++++++++++++++++++++++++++++
> > >  arch/riscv/mm/init.c                 |  2 +
> > >  5 files changed, 113 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/arch/riscv/include/asm/cacheflush.h b/arch/riscv/include/asm/cacheflush.h
> > > index a129dac4521d..b0d631701757 100644
> > > --- a/arch/riscv/include/asm/cacheflush.h
> > > +++ b/arch/riscv/include/asm/cacheflush.h
> > > @@ -37,7 +37,23 @@ static inline void flush_dcache_page(struct page *page)
> > >         flush_icache_mm(vma->vm_mm, 0)
> > >
> > >  #ifdef CONFIG_64BIT
> > > -#define flush_cache_vmap(start, end)           flush_tlb_kernel_range(start, end)
> > > +extern u64 new_vmalloc[NR_CPUS / sizeof(u64) + 1];
> > > +extern char _end[];
> > > +#define flush_cache_vmap flush_cache_vmap
> > > +static inline void flush_cache_vmap(unsigned long start, unsigned long end)
> > > +{
> > > +       if (is_vmalloc_or_module_addr((void *)start)) {
> > > +               int i;
> > > +
> > > +               /*
> > > +                * We don't care if concurrently a cpu resets this value since
> > > +                * the only place this can happen is in handle_exception() where
> > > +                * an sfence.vma is emitted.
> > > +                */
> > > +               for (i = 0; i < ARRAY_SIZE(new_vmalloc); ++i)
> > > +                       new_vmalloc[i] = -1ULL;
> > > +       }
> > > +}
> > >  #define flush_cache_vmap_early(start, end)     local_flush_tlb_kernel_range(start, end)
> > >  #endif
> > >
> > > diff --git a/arch/riscv/include/asm/thread_info.h b/arch/riscv/include/asm/thread_info.h
> > > index 5d473343634b..32631acdcdd4 100644
> > > --- a/arch/riscv/include/asm/thread_info.h
> > > +++ b/arch/riscv/include/asm/thread_info.h
> > > @@ -60,6 +60,11 @@ struct thread_info {
> > >         void                    *scs_base;
> > >         void                    *scs_sp;
> > >  #endif
> > > +       /*
> > > +        * Used in handle_exception() to save a0, a1 and a2 before knowing if we
> > > +        * can access the kernel stack.
> > > +        */
> > > +       unsigned long           a0, a1, a2;
> > >  };
> > >
> > >  #ifdef CONFIG_SHADOW_CALL_STACK
> > > diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c
> > > index a03129f40c46..939ddc0e3c6e 100644
> > > --- a/arch/riscv/kernel/asm-offsets.c
> > > +++ b/arch/riscv/kernel/asm-offsets.c
> > > @@ -35,6 +35,8 @@ void asm_offsets(void)
> > >         OFFSET(TASK_THREAD_S9, task_struct, thread.s[9]);
> > >         OFFSET(TASK_THREAD_S10, task_struct, thread.s[10]);
> > >         OFFSET(TASK_THREAD_S11, task_struct, thread.s[11]);
> > > +
> > > +       OFFSET(TASK_TI_CPU, task_struct, thread_info.cpu);
> > >         OFFSET(TASK_TI_FLAGS, task_struct, thread_info.flags);
> > >         OFFSET(TASK_TI_PREEMPT_COUNT, task_struct, thread_info.preempt_count);
> > >         OFFSET(TASK_TI_KERNEL_SP, task_struct, thread_info.kernel_sp);
> > > @@ -42,6 +44,9 @@ void asm_offsets(void)
> > >  #ifdef CONFIG_SHADOW_CALL_STACK
> > >         OFFSET(TASK_TI_SCS_SP, task_struct, thread_info.scs_sp);
> > >  #endif
> > > +       OFFSET(TASK_TI_A0, task_struct, thread_info.a0);
> > > +       OFFSET(TASK_TI_A1, task_struct, thread_info.a1);
> > > +       OFFSET(TASK_TI_A2, task_struct, thread_info.a2);
> > >
> > >         OFFSET(TASK_TI_CPU_NUM, task_struct, thread_info.cpu);
> > >         OFFSET(TASK_THREAD_F0,  task_struct, thread.fstate.f[0]);
> > > diff --git a/arch/riscv/kernel/entry.S b/arch/riscv/kernel/entry.S
> > > index 9d1a305d5508..c1ffaeaba7aa 100644
> > > --- a/arch/riscv/kernel/entry.S
> > > +++ b/arch/riscv/kernel/entry.S
> > > @@ -19,6 +19,78 @@
> > >
> > >         .section .irqentry.text, "ax"
> > >
> > > +.macro new_vmalloc_check
> > > +       REG_S   a0, TASK_TI_A0(tp)
> > > +       REG_S   a1, TASK_TI_A1(tp)
> > > +       REG_S   a2, TASK_TI_A2(tp)
> > > +
> > > +       csrr    a0, CSR_CAUSE
> > > +       /* Exclude IRQs */
> > > +       blt     a0, zero, _new_vmalloc_restore_context
> > > +       /* Only check new_vmalloc if we are in page/protection fault */
> > > +       li      a1, EXC_LOAD_PAGE_FAULT
> > > +       beq     a0, a1, _new_vmalloc_kernel_address
> > > +       li      a1, EXC_STORE_PAGE_FAULT
> > > +       beq     a0, a1, _new_vmalloc_kernel_address
> > > +       li      a1, EXC_INST_PAGE_FAULT
> > > +       bne     a0, a1, _new_vmalloc_restore_context
> > > +
> > > +_new_vmalloc_kernel_address:
> > > +       /* Is it a kernel address? */
> > > +       csrr    a0, CSR_TVAL
> > > +       bge     a0, zero, _new_vmalloc_restore_context
> > > +
> > > +       /* Check if a new vmalloc mapping appeared that could explain the trap */
> > > +
> > > +       /*
> > > +        * Computes:
> > > +        * a0 = &new_vmalloc[BIT_WORD(cpu)]
> > > +        * a1 = BIT_MASK(cpu)
> > > +        */
> > > +       REG_L   a2, TASK_TI_CPU(tp)
> > > +       /*
> > > +        * Compute the new_vmalloc element position:
> > > +        * (cpu / 64) * 8 = (cpu >> 6) << 3
> > > +        */
> > > +       srli    a1, a2, 6
> > > +       slli    a1, a1, 3
> > > +       la      a0, new_vmalloc
> > > +       add     a0, a0, a1
> > > +       /*
> > > +        * Compute the bit position in the new_vmalloc element:
> > > +        * bit_pos = cpu % 64 = cpu - (cpu / 64) * 64 = cpu - (cpu >> 6) << 6
> > > +        *         = cpu - ((cpu >> 6) << 3) << 3
> > > +        */
> > > +       slli    a1, a1, 3
> > > +       sub     a1, a2, a1
> > > +       /* Compute the "get mask": 1 << bit_pos */
> > > +       li      a2, 1
> > > +       sll     a1, a2, a1
> > > +
> > > +       /* Check the value of new_vmalloc for this cpu */
> > > +       REG_L   a2, 0(a0)
> > > +       and     a2, a2, a1
> > > +       beq     a2, zero, _new_vmalloc_restore_context
> > > +
> > > +       /* Atomically reset the current cpu bit in new_vmalloc */
> > > +       amoxor.w        a0, a1, (a0)
> > > +
> > > +       /* Only emit a sfence.vma if the uarch caches invalid entries */
> > > +       ALTERNATIVE("sfence.vma", "nop", 0, RISCV_ISA_EXT_SVVPTC, 1)
> > > +
> > > +       REG_L   a0, TASK_TI_A0(tp)
> > > +       REG_L   a1, TASK_TI_A1(tp)
> > > +       REG_L   a2, TASK_TI_A2(tp)
> > > +       csrw    CSR_SCRATCH, x0
> > > +       sret
> > > +
> > > +_new_vmalloc_restore_context:
> > > +       REG_L   a0, TASK_TI_A0(tp)
> > > +       REG_L   a1, TASK_TI_A1(tp)
> > > +       REG_L   a2, TASK_TI_A2(tp)
> > > +.endm
> > > +
> > > +
> > >  SYM_CODE_START(handle_exception)
> > >         /*
> > >          * If coming from userspace, preserve the user thread pointer and load
> > > @@ -30,6 +102,18 @@ SYM_CODE_START(handle_exception)
> > >
> > >  .Lrestore_kernel_tpsp:
> > >         csrr tp, CSR_SCRATCH
> > > +
> > > +       /*
> > > +        * The RISC-V kernel does not eagerly emit a sfence.vma after each
> > > +        * new vmalloc mapping, which may result in exceptions:
> > > +        * - if the uarch caches invalid entries, the new mapping would not be
> > > +        *   observed by the page table walker and an invalidation is needed.
> > > +        * - if the uarch does not cache invalid entries, a reordered access
> > > +        *   could "miss" the new mapping and traps: in that case, we only need
> > > +        *   to retry the access, no sfence.vma is required.
> > > +        */
> > > +       new_vmalloc_check
> > > +
> > >         REG_S sp, TASK_TI_KERNEL_SP(tp)
> > >
> > >  #ifdef CONFIG_VMAP_STACK
> > > diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> > > index eafc4c2200f2..54c9fdeda11e 100644
> > > --- a/arch/riscv/mm/init.c
> > > +++ b/arch/riscv/mm/init.c
> > > @@ -36,6 +36,8 @@
> > >
> > >  #include "../kernel/head.h"
> > >
> > > +u64 new_vmalloc[NR_CPUS / sizeof(u64) + 1];
> > > +
> > >  struct kernel_mapping kernel_map __ro_after_init;
> > >  EXPORT_SYMBOL(kernel_map);
> > >  #ifdef CONFIG_XIP_KERNEL
> > > --
> > > 2.39.2
> > >
> > >
> >
> > Can we consider using new_vmalloc as a percpu variable, so that we
> > don't need to add a0/1/2 in thread_info?
>
> At first, I used percpu variables. But then I realized that percpu
> areas are allocated in the vmalloc area, so if somehow we take a trap
> when accessing the new_vmalloc percpu variable, we could not recover
> from this as we would trap forever in new_vmalloc_check. But
> admittedly, not sure that can happen.
>
> And how would that remove a0, a1 and a2 from thread_info? We'd still
> need to save some registers somewhere to access the percpu variable
> right?
>
> > Also, try not to do too much
> > calculation logic in new_vmalloc_check, after all, handle_exception is
> > a high-frequency path. In this case, can we consider writing
> > new_vmalloc_check in C language to increase readability?
>
> If we write that in C, we don't have the control over the allocated
> registers and then we can't correctly save the context.

If we write it in C, new_vmalloc_check ends up structured just like
do_irq(), which means we need _save_context first. For
new_vmalloc_check that is not worth the cost, since exceptions coming
from user mode do not need new_vmalloc_check at all, which also shows
that it is reasonable to place new_vmalloc_check after
_restore_kernel_tpsp.

Saving is necessary, but we can save a0, a1 and a2 without using
thread_info: we can save them on the kernel stack of the current tp,
at the cost of the following extra instructions:
REG_S sp, TASK_TI_USER_SP(tp)
REG_L sp, TASK_TI_KERNEL_SP(tp)
addi sp, sp, -(PT_SIZE_ON_STACK)
It seems that saving directly in thread_info is more direct, but
saving on the kernel stack is more logically consistent, and there is
no need to increase the size of thread_info.
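
For reference, the kernel-stack variant would look roughly like this
(a sketch only; the PT_A0/PT_A1/PT_A2 and TASK_TI_*_SP offsets are the
usual asm-offsets values and are assumed here, this is not code from
the series):

        REG_S   sp, TASK_TI_USER_SP(tp)         /* stash the trapped sp */
        REG_L   sp, TASK_TI_KERNEL_SP(tp)       /* switch to the kernel stack */
        addi    sp, sp, -(PT_SIZE_ON_STACK)
        REG_S   a0, PT_A0(sp)
        REG_S   a1, PT_A1(sp)
        REG_S   a2, PT_A2(sp)

        /* ... new_vmalloc check ... */

        REG_L   a0, PT_A0(sp)
        REG_L   a1, PT_A1(sp)
        REG_L   a2, PT_A2(sp)
        REG_L   sp, TASK_TI_USER_SP(tp)         /* back to the trapped sp */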

As for the current status of the patch, there are two points that
could be optimized:
1. Some hardware implementations may not cache invalid TLB entries,
in which case it does not matter whether Svvptc is advertised or not.
Can we consider adding a CONFIG_RISCV_SVVPTC option to control this?

2. In .macro new_vmalloc_check:
REG_S a0, TASK_TI_A0(tp)
REG_S a1, TASK_TI_A1(tp)
REG_S a2, TASK_TI_A2(tp)
Since the early "blt a0, zero, _new_vmalloc_restore_context" branch
only needs a0, there is no need to save a1 and a2 first; they can be
saved later, once they are actually clobbered (see the sketch below).
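
A sketch of what that reordering could look like (illustrative only;
the extra _a0/_a1 restore labels are made up here, and the bitmap
check itself is unchanged from the macro quoted above):

.macro new_vmalloc_check
        REG_S   a0, TASK_TI_A0(tp)
        csrr    a0, CSR_CAUSE
        /* Exclude IRQs: only a0 has been clobbered so far */
        blt     a0, zero, _new_vmalloc_restore_context_a0

        REG_S   a1, TASK_TI_A1(tp)
        /* Only check new_vmalloc if we are in page/protection fault */
        li      a1, EXC_LOAD_PAGE_FAULT
        beq     a0, a1, _new_vmalloc_kernel_address
        li      a1, EXC_STORE_PAGE_FAULT
        beq     a0, a1, _new_vmalloc_kernel_address
        li      a1, EXC_INST_PAGE_FAULT
        bne     a0, a1, _new_vmalloc_restore_context_a1

_new_vmalloc_kernel_address:
        /* Is it a kernel address? */
        csrr    a0, CSR_TVAL
        bge     a0, zero, _new_vmalloc_restore_context_a1

        /* a2 is only needed from here on */
        REG_S   a2, TASK_TI_A2(tp)

        /* ... same new_vmalloc bitmap check as in the patch above ... */

_new_vmalloc_restore_context:
        REG_L   a2, TASK_TI_A2(tp)
_new_vmalloc_restore_context_a1:
        REG_L   a1, TASK_TI_A1(tp)
_new_vmalloc_restore_context_a0:
        REG_L   a0, TASK_TI_A0(tp)
.endm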

>
> Thanks for your interest in this patchset :)
>
> Alex
>
> >
> > Thanks,
> > Yunhui

Thanks,
Yunhui


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [External] [PATCH RFC/RFT v2 3/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings
  2024-06-04  6:21       ` yunhui cui
@ 2024-06-04  7:15         ` Alexandre Ghiti
  2024-06-04  7:17           ` Alexandre Ghiti
  0 siblings, 1 reply; 18+ messages in thread
From: Alexandre Ghiti @ 2024-06-04  7:15 UTC (permalink / raw)
  To: yunhui cui
  Cc: Catalin Marinas, Will Deacon, Thomas Bogendoerfer,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Andrew Morton,
	Ved Shanbhogue, Matt Evans, Dylan Jhong, linux-arm-kernel,
	linux-kernel, linux-mips, linuxppc-dev, linux-riscv, linux-mm

Hi Yunhui,

On Tue, Jun 4, 2024 at 8:21 AM yunhui cui <cuiyunhui@bytedance.com> wrote:
>
> Hi Alexandre,
>
> On Mon, Jun 3, 2024 at 8:02 PM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> >
> > Hi Yunhui,
> >
> > On Mon, Jun 3, 2024 at 4:26 AM yunhui cui <cuiyunhui@bytedance.com> wrote:
> > >
> > > Hi Alexandre,
> > >
> > > On Thu, Feb 1, 2024 at 12:03 AM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> > > >
> > > > In 6.5, we removed the vmalloc fault path because that can't work (see
> > > > [1] [2]). Then in order to make sure that new page table entries were
> > > > seen by the page table walker, we had to preventively emit a sfence.vma
> > > > on all harts [3] but this solution is very costly since it relies on IPI.
> > > >
> > > > And even there, we could end up in a loop of vmalloc faults if a vmalloc
> > > > allocation is done in the IPI path (for example if it is traced, see
> > > > [4]), which could result in a kernel stack overflow.
> > > >
> > > > Those preventive sfence.vma needed to be emitted because:
> > > >
> > > > - if the uarch caches invalid entries, the new mapping may not be
> > > >   observed by the page table walker and an invalidation may be needed.
> > > > - if the uarch does not cache invalid entries, a reordered access
> > > >   could "miss" the new mapping and traps: in that case, we would actually
> > > >   only need to retry the access, no sfence.vma is required.
> > > >
> > > > So this patch removes those preventive sfence.vma and actually handles
> > > > the possible (and unlikely) exceptions. And since the kernel stacks
> > > > mappings lie in the vmalloc area, this handling must be done very early
> > > > when the trap is taken, at the very beginning of handle_exception: this
> > > > also rules out the vmalloc allocations in the fault path.
> > > >
> > > > Link: https://lore.kernel.org/linux-riscv/20230531093817.665799-1-bjorn@kernel.org/ [1]
> > > > Link: https://lore.kernel.org/linux-riscv/20230801090927.2018653-1-dylan@andestech.com [2]
> > > > Link: https://lore.kernel.org/linux-riscv/20230725132246.817726-1-alexghiti@rivosinc.com/ [3]
> > > > Link: https://lore.kernel.org/lkml/20200508144043.13893-1-joro@8bytes.org/ [4]
> > > > Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
> > > > ---
> > > >  arch/riscv/include/asm/cacheflush.h  | 18 +++++-
> > > >  arch/riscv/include/asm/thread_info.h |  5 ++
> > > >  arch/riscv/kernel/asm-offsets.c      |  5 ++
> > > >  arch/riscv/kernel/entry.S            | 84 ++++++++++++++++++++++++++++
> > > >  arch/riscv/mm/init.c                 |  2 +
> > > >  5 files changed, 113 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/arch/riscv/include/asm/cacheflush.h b/arch/riscv/include/asm/cacheflush.h
> > > > index a129dac4521d..b0d631701757 100644
> > > > --- a/arch/riscv/include/asm/cacheflush.h
> > > > +++ b/arch/riscv/include/asm/cacheflush.h
> > > > @@ -37,7 +37,23 @@ static inline void flush_dcache_page(struct page *page)
> > > >         flush_icache_mm(vma->vm_mm, 0)
> > > >
> > > >  #ifdef CONFIG_64BIT
> > > > -#define flush_cache_vmap(start, end)           flush_tlb_kernel_range(start, end)
> > > > +extern u64 new_vmalloc[NR_CPUS / sizeof(u64) + 1];
> > > > +extern char _end[];
> > > > +#define flush_cache_vmap flush_cache_vmap
> > > > +static inline void flush_cache_vmap(unsigned long start, unsigned long end)
> > > > +{
> > > > +       if (is_vmalloc_or_module_addr((void *)start)) {
> > > > +               int i;
> > > > +
> > > > +               /*
> > > > +                * We don't care if concurrently a cpu resets this value since
> > > > +                * the only place this can happen is in handle_exception() where
> > > > +                * an sfence.vma is emitted.
> > > > +                */
> > > > +               for (i = 0; i < ARRAY_SIZE(new_vmalloc); ++i)
> > > > +                       new_vmalloc[i] = -1ULL;
> > > > +       }
> > > > +}
> > > >  #define flush_cache_vmap_early(start, end)     local_flush_tlb_kernel_range(start, end)
> > > >  #endif
> > > >
> > > > diff --git a/arch/riscv/include/asm/thread_info.h b/arch/riscv/include/asm/thread_info.h
> > > > index 5d473343634b..32631acdcdd4 100644
> > > > --- a/arch/riscv/include/asm/thread_info.h
> > > > +++ b/arch/riscv/include/asm/thread_info.h
> > > > @@ -60,6 +60,11 @@ struct thread_info {
> > > >         void                    *scs_base;
> > > >         void                    *scs_sp;
> > > >  #endif
> > > > +       /*
> > > > +        * Used in handle_exception() to save a0, a1 and a2 before knowing if we
> > > > +        * can access the kernel stack.
> > > > +        */
> > > > +       unsigned long           a0, a1, a2;
> > > >  };
> > > >
> > > >  #ifdef CONFIG_SHADOW_CALL_STACK
> > > > diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c
> > > > index a03129f40c46..939ddc0e3c6e 100644
> > > > --- a/arch/riscv/kernel/asm-offsets.c
> > > > +++ b/arch/riscv/kernel/asm-offsets.c
> > > > @@ -35,6 +35,8 @@ void asm_offsets(void)
> > > >         OFFSET(TASK_THREAD_S9, task_struct, thread.s[9]);
> > > >         OFFSET(TASK_THREAD_S10, task_struct, thread.s[10]);
> > > >         OFFSET(TASK_THREAD_S11, task_struct, thread.s[11]);
> > > > +
> > > > +       OFFSET(TASK_TI_CPU, task_struct, thread_info.cpu);
> > > >         OFFSET(TASK_TI_FLAGS, task_struct, thread_info.flags);
> > > >         OFFSET(TASK_TI_PREEMPT_COUNT, task_struct, thread_info.preempt_count);
> > > >         OFFSET(TASK_TI_KERNEL_SP, task_struct, thread_info.kernel_sp);
> > > > @@ -42,6 +44,9 @@ void asm_offsets(void)
> > > >  #ifdef CONFIG_SHADOW_CALL_STACK
> > > >         OFFSET(TASK_TI_SCS_SP, task_struct, thread_info.scs_sp);
> > > >  #endif
> > > > +       OFFSET(TASK_TI_A0, task_struct, thread_info.a0);
> > > > +       OFFSET(TASK_TI_A1, task_struct, thread_info.a1);
> > > > +       OFFSET(TASK_TI_A2, task_struct, thread_info.a2);
> > > >
> > > >         OFFSET(TASK_TI_CPU_NUM, task_struct, thread_info.cpu);
> > > >         OFFSET(TASK_THREAD_F0,  task_struct, thread.fstate.f[0]);
> > > > diff --git a/arch/riscv/kernel/entry.S b/arch/riscv/kernel/entry.S
> > > > index 9d1a305d5508..c1ffaeaba7aa 100644
> > > > --- a/arch/riscv/kernel/entry.S
> > > > +++ b/arch/riscv/kernel/entry.S
> > > > @@ -19,6 +19,78 @@
> > > >
> > > >         .section .irqentry.text, "ax"
> > > >
> > > > +.macro new_vmalloc_check
> > > > +       REG_S   a0, TASK_TI_A0(tp)
> > > > +       REG_S   a1, TASK_TI_A1(tp)
> > > > +       REG_S   a2, TASK_TI_A2(tp)
> > > > +
> > > > +       csrr    a0, CSR_CAUSE
> > > > +       /* Exclude IRQs */
> > > > +       blt     a0, zero, _new_vmalloc_restore_context
> > > > +       /* Only check new_vmalloc if we are in page/protection fault */
> > > > +       li      a1, EXC_LOAD_PAGE_FAULT
> > > > +       beq     a0, a1, _new_vmalloc_kernel_address
> > > > +       li      a1, EXC_STORE_PAGE_FAULT
> > > > +       beq     a0, a1, _new_vmalloc_kernel_address
> > > > +       li      a1, EXC_INST_PAGE_FAULT
> > > > +       bne     a0, a1, _new_vmalloc_restore_context
> > > > +
> > > > +_new_vmalloc_kernel_address:
> > > > +       /* Is it a kernel address? */
> > > > +       csrr    a0, CSR_TVAL
> > > > +       bge     a0, zero, _new_vmalloc_restore_context
> > > > +
> > > > +       /* Check if a new vmalloc mapping appeared that could explain the trap */
> > > > +
> > > > +       /*
> > > > +        * Computes:
> > > > +        * a0 = &new_vmalloc[BIT_WORD(cpu)]
> > > > +        * a1 = BIT_MASK(cpu)
> > > > +        */
> > > > +       REG_L   a2, TASK_TI_CPU(tp)
> > > > +       /*
> > > > +        * Compute the new_vmalloc element position:
> > > > +        * (cpu / 64) * 8 = (cpu >> 6) << 3
> > > > +        */
> > > > +       srli    a1, a2, 6
> > > > +       slli    a1, a1, 3
> > > > +       la      a0, new_vmalloc
> > > > +       add     a0, a0, a1
> > > > +       /*
> > > > +        * Compute the bit position in the new_vmalloc element:
> > > > +        * bit_pos = cpu % 64 = cpu - (cpu / 64) * 64 = cpu - (cpu >> 6) << 6
> > > > +        *         = cpu - ((cpu >> 6) << 3) << 3
> > > > +        */
> > > > +       slli    a1, a1, 3
> > > > +       sub     a1, a2, a1
> > > > +       /* Compute the "get mask": 1 << bit_pos */
> > > > +       li      a2, 1
> > > > +       sll     a1, a2, a1
> > > > +
> > > > +       /* Check the value of new_vmalloc for this cpu */
> > > > +       REG_L   a2, 0(a0)
> > > > +       and     a2, a2, a1
> > > > +       beq     a2, zero, _new_vmalloc_restore_context
> > > > +
> > > > +       /* Atomically reset the current cpu bit in new_vmalloc */
> > > > +       amoxor.w        a0, a1, (a0)
> > > > +
> > > > +       /* Only emit a sfence.vma if the uarch caches invalid entries */
> > > > +       ALTERNATIVE("sfence.vma", "nop", 0, RISCV_ISA_EXT_SVVPTC, 1)
> > > > +
> > > > +       REG_L   a0, TASK_TI_A0(tp)
> > > > +       REG_L   a1, TASK_TI_A1(tp)
> > > > +       REG_L   a2, TASK_TI_A2(tp)
> > > > +       csrw    CSR_SCRATCH, x0
> > > > +       sret
> > > > +
> > > > +_new_vmalloc_restore_context:
> > > > +       REG_L   a0, TASK_TI_A0(tp)
> > > > +       REG_L   a1, TASK_TI_A1(tp)
> > > > +       REG_L   a2, TASK_TI_A2(tp)
> > > > +.endm
> > > > +
> > > > +
> > > >  SYM_CODE_START(handle_exception)
> > > >         /*
> > > >          * If coming from userspace, preserve the user thread pointer and load
> > > > @@ -30,6 +102,18 @@ SYM_CODE_START(handle_exception)
> > > >
> > > >  .Lrestore_kernel_tpsp:
> > > >         csrr tp, CSR_SCRATCH
> > > > +
> > > > +       /*
> > > > +        * The RISC-V kernel does not eagerly emit a sfence.vma after each
> > > > +        * new vmalloc mapping, which may result in exceptions:
> > > > +        * - if the uarch caches invalid entries, the new mapping would not be
> > > > +        *   observed by the page table walker and an invalidation is needed.
> > > > +        * - if the uarch does not cache invalid entries, a reordered access
> > > > +        *   could "miss" the new mapping and traps: in that case, we only need
> > > > +        *   to retry the access, no sfence.vma is required.
> > > > +        */
> > > > +       new_vmalloc_check
> > > > +
> > > >         REG_S sp, TASK_TI_KERNEL_SP(tp)
> > > >
> > > >  #ifdef CONFIG_VMAP_STACK
> > > > diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> > > > index eafc4c2200f2..54c9fdeda11e 100644
> > > > --- a/arch/riscv/mm/init.c
> > > > +++ b/arch/riscv/mm/init.c
> > > > @@ -36,6 +36,8 @@
> > > >
> > > >  #include "../kernel/head.h"
> > > >
> > > > +u64 new_vmalloc[NR_CPUS / sizeof(u64) + 1];
> > > > +
> > > >  struct kernel_mapping kernel_map __ro_after_init;
> > > >  EXPORT_SYMBOL(kernel_map);
> > > >  #ifdef CONFIG_XIP_KERNEL
> > > > --
> > > > 2.39.2
> > > >
> > > >
> > >
> > > Can we consider using new_vmalloc as a percpu variable, so that we
> > > don't need to add a0/1/2 in thread_info?
> >
> > At first, I used percpu variables. But then I realized that percpu
> > areas are allocated in the vmalloc area, so if somehow we take a trap
> > when accessing the new_vmalloc percpu variable, we could not recover
> > from this as we would trap forever in new_vmalloc_check. But
> > admittedly, not sure that can happen.
> >
> > And how would that remove a0, a1 and a2 from thread_info? We'd still
> > need to save some registers somewhere to access the percpu variable
> > right?
> >
> > > Also, try not to do too much
> > > calculation logic in new_vmalloc_check, after all, handle_exception is
> > > a high-frequency path. In this case, can we consider writing
> > > new_vmalloc_check in C language to increase readability?
> >
> > If we write that in C, we don't have the control over the allocated
> > registers and then we can't correctly save the context.
>
> If we use C language, new_vmalloc_check is written just like do_irq(),
> then we need _save_context, but for new_vmalloc_check, it is not worth
> the loss, because exceptions from user mode do not need
> new_vmalloc_check, which also shows that it is reasonable to put
> new_vmalloc_check after _restore_kernel_tpsp.
>
> Saving is necessary. We can save a0, a1, a2 without using thread_info.
> We can choose to save on the kernel stack of the current tp, but we
> need to add the following instructions:
> REG_S sp, TASK_TI_USER_SP(tp)
> REG_L sp, TASK_TI_KERNEL_SP(tp)
> addi sp, sp, -(PT_SIZE_ON_STACK)
> It seems that saving directly on thread_info is more direct, but
> saving on the kernel stack is more logically consistent, and there is
> no need to increase the size of thread_info.

You can't save on the kernel stack since kernel stacks are allocated
in the vmalloc area.

>
> As for the current status of the patch, there are two points that can
> be optimized:
> 1. Some chip hardware implementations may not cache TLB invalid
> entries, so it doesn't matter whether svvptc is available or not. Can
> we consider adding a CONFIG_RISCV_SVVPTC to control it?
>
> 2. .macro new_vmalloc_check
> REG_S a0, TASK_TI_A0(tp)
> REG_S a1, TASK_TI_A1(tp)
> REG_S a2, TASK_TI_A2(tp)
> When executing blt a0, zero, _new_vmalloc_restore_context, you can not
> save a1, a2 first

Ok, I can do that :)

Thanks again for your inputs,

Alex

>
> >
> > Thanks for your interest in this patchset :)
> >
> > Alex
> >
> > >
> > > Thanks,
> > > Yunhui
>
> Thanks,
> Yunhui


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [External] [PATCH RFC/RFT v2 3/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings
  2024-06-04  7:15         ` Alexandre Ghiti
@ 2024-06-04  7:17           ` Alexandre Ghiti
  2024-06-04  8:51             ` Conor Dooley
  0 siblings, 1 reply; 18+ messages in thread
From: Alexandre Ghiti @ 2024-06-04  7:17 UTC (permalink / raw)
  To: yunhui cui, Conor Dooley
  Cc: Catalin Marinas, Will Deacon, Thomas Bogendoerfer,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Andrew Morton,
	Ved Shanbhogue, Matt Evans, Dylan Jhong, linux-arm-kernel,
	linux-kernel, linux-mips, linuxppc-dev, linux-riscv, linux-mm

On Tue, Jun 4, 2024 at 9:15 AM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
>
> Hi Yunhui,
>
> On Tue, Jun 4, 2024 at 8:21 AM yunhui cui <cuiyunhui@bytedance.com> wrote:
> >
> > Hi Alexandre,
> >
> > On Mon, Jun 3, 2024 at 8:02 PM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> > >
> > > Hi Yunhui,
> > >
> > > On Mon, Jun 3, 2024 at 4:26 AM yunhui cui <cuiyunhui@bytedance.com> wrote:
> > > >
> > > > Hi Alexandre,
> > > >
> > > > On Thu, Feb 1, 2024 at 12:03 AM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> > > > >
> > > > > In 6.5, we removed the vmalloc fault path because that can't work (see
> > > > > [1] [2]). Then in order to make sure that new page table entries were
> > > > > seen by the page table walker, we had to preventively emit a sfence.vma
> > > > > on all harts [3] but this solution is very costly since it relies on IPI.
> > > > >
> > > > > And even there, we could end up in a loop of vmalloc faults if a vmalloc
> > > > > allocation is done in the IPI path (for example if it is traced, see
> > > > > [4]), which could result in a kernel stack overflow.
> > > > >
> > > > > Those preventive sfence.vma needed to be emitted because:
> > > > >
> > > > > - if the uarch caches invalid entries, the new mapping may not be
> > > > >   observed by the page table walker and an invalidation may be needed.
> > > > > - if the uarch does not cache invalid entries, a reordered access
> > > > >   could "miss" the new mapping and traps: in that case, we would actually
> > > > >   only need to retry the access, no sfence.vma is required.
> > > > >
> > > > > So this patch removes those preventive sfence.vma and actually handles
> > > > > the possible (and unlikely) exceptions. And since the kernel stacks
> > > > > mappings lie in the vmalloc area, this handling must be done very early
> > > > > when the trap is taken, at the very beginning of handle_exception: this
> > > > > also rules out the vmalloc allocations in the fault path.
> > > > >
> > > > > Link: https://lore.kernel.org/linux-riscv/20230531093817.665799-1-bjorn@kernel.org/ [1]
> > > > > Link: https://lore.kernel.org/linux-riscv/20230801090927.2018653-1-dylan@andestech.com [2]
> > > > > Link: https://lore.kernel.org/linux-riscv/20230725132246.817726-1-alexghiti@rivosinc.com/ [3]
> > > > > Link: https://lore.kernel.org/lkml/20200508144043.13893-1-joro@8bytes.org/ [4]
> > > > > Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
> > > > > ---
> > > > >  arch/riscv/include/asm/cacheflush.h  | 18 +++++-
> > > > >  arch/riscv/include/asm/thread_info.h |  5 ++
> > > > >  arch/riscv/kernel/asm-offsets.c      |  5 ++
> > > > >  arch/riscv/kernel/entry.S            | 84 ++++++++++++++++++++++++++++
> > > > >  arch/riscv/mm/init.c                 |  2 +
> > > > >  5 files changed, 113 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/arch/riscv/include/asm/cacheflush.h b/arch/riscv/include/asm/cacheflush.h
> > > > > index a129dac4521d..b0d631701757 100644
> > > > > --- a/arch/riscv/include/asm/cacheflush.h
> > > > > +++ b/arch/riscv/include/asm/cacheflush.h
> > > > > @@ -37,7 +37,23 @@ static inline void flush_dcache_page(struct page *page)
> > > > >         flush_icache_mm(vma->vm_mm, 0)
> > > > >
> > > > >  #ifdef CONFIG_64BIT
> > > > > -#define flush_cache_vmap(start, end)           flush_tlb_kernel_range(start, end)
> > > > > +extern u64 new_vmalloc[NR_CPUS / sizeof(u64) + 1];
> > > > > +extern char _end[];
> > > > > +#define flush_cache_vmap flush_cache_vmap
> > > > > +static inline void flush_cache_vmap(unsigned long start, unsigned long end)
> > > > > +{
> > > > > +       if (is_vmalloc_or_module_addr((void *)start)) {
> > > > > +               int i;
> > > > > +
> > > > > +               /*
> > > > > +                * We don't care if concurrently a cpu resets this value since
> > > > > +                * the only place this can happen is in handle_exception() where
> > > > > +                * an sfence.vma is emitted.
> > > > > +                */
> > > > > +               for (i = 0; i < ARRAY_SIZE(new_vmalloc); ++i)
> > > > > +                       new_vmalloc[i] = -1ULL;
> > > > > +       }
> > > > > +}
> > > > >  #define flush_cache_vmap_early(start, end)     local_flush_tlb_kernel_range(start, end)
> > > > >  #endif
> > > > >
> > > > > diff --git a/arch/riscv/include/asm/thread_info.h b/arch/riscv/include/asm/thread_info.h
> > > > > index 5d473343634b..32631acdcdd4 100644
> > > > > --- a/arch/riscv/include/asm/thread_info.h
> > > > > +++ b/arch/riscv/include/asm/thread_info.h
> > > > > @@ -60,6 +60,11 @@ struct thread_info {
> > > > >         void                    *scs_base;
> > > > >         void                    *scs_sp;
> > > > >  #endif
> > > > > +       /*
> > > > > +        * Used in handle_exception() to save a0, a1 and a2 before knowing if we
> > > > > +        * can access the kernel stack.
> > > > > +        */
> > > > > +       unsigned long           a0, a1, a2;
> > > > >  };
> > > > >
> > > > >  #ifdef CONFIG_SHADOW_CALL_STACK
> > > > > diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c
> > > > > index a03129f40c46..939ddc0e3c6e 100644
> > > > > --- a/arch/riscv/kernel/asm-offsets.c
> > > > > +++ b/arch/riscv/kernel/asm-offsets.c
> > > > > @@ -35,6 +35,8 @@ void asm_offsets(void)
> > > > >         OFFSET(TASK_THREAD_S9, task_struct, thread.s[9]);
> > > > >         OFFSET(TASK_THREAD_S10, task_struct, thread.s[10]);
> > > > >         OFFSET(TASK_THREAD_S11, task_struct, thread.s[11]);
> > > > > +
> > > > > +       OFFSET(TASK_TI_CPU, task_struct, thread_info.cpu);
> > > > >         OFFSET(TASK_TI_FLAGS, task_struct, thread_info.flags);
> > > > >         OFFSET(TASK_TI_PREEMPT_COUNT, task_struct, thread_info.preempt_count);
> > > > >         OFFSET(TASK_TI_KERNEL_SP, task_struct, thread_info.kernel_sp);
> > > > > @@ -42,6 +44,9 @@ void asm_offsets(void)
> > > > >  #ifdef CONFIG_SHADOW_CALL_STACK
> > > > >         OFFSET(TASK_TI_SCS_SP, task_struct, thread_info.scs_sp);
> > > > >  #endif
> > > > > +       OFFSET(TASK_TI_A0, task_struct, thread_info.a0);
> > > > > +       OFFSET(TASK_TI_A1, task_struct, thread_info.a1);
> > > > > +       OFFSET(TASK_TI_A2, task_struct, thread_info.a2);
> > > > >
> > > > >         OFFSET(TASK_TI_CPU_NUM, task_struct, thread_info.cpu);
> > > > >         OFFSET(TASK_THREAD_F0,  task_struct, thread.fstate.f[0]);
> > > > > diff --git a/arch/riscv/kernel/entry.S b/arch/riscv/kernel/entry.S
> > > > > index 9d1a305d5508..c1ffaeaba7aa 100644
> > > > > --- a/arch/riscv/kernel/entry.S
> > > > > +++ b/arch/riscv/kernel/entry.S
> > > > > @@ -19,6 +19,78 @@
> > > > >
> > > > >         .section .irqentry.text, "ax"
> > > > >
> > > > > +.macro new_vmalloc_check
> > > > > +       REG_S   a0, TASK_TI_A0(tp)
> > > > > +       REG_S   a1, TASK_TI_A1(tp)
> > > > > +       REG_S   a2, TASK_TI_A2(tp)
> > > > > +
> > > > > +       csrr    a0, CSR_CAUSE
> > > > > +       /* Exclude IRQs */
> > > > > +       blt     a0, zero, _new_vmalloc_restore_context
> > > > > +       /* Only check new_vmalloc if we are in page/protection fault */
> > > > > +       li      a1, EXC_LOAD_PAGE_FAULT
> > > > > +       beq     a0, a1, _new_vmalloc_kernel_address
> > > > > +       li      a1, EXC_STORE_PAGE_FAULT
> > > > > +       beq     a0, a1, _new_vmalloc_kernel_address
> > > > > +       li      a1, EXC_INST_PAGE_FAULT
> > > > > +       bne     a0, a1, _new_vmalloc_restore_context
> > > > > +
> > > > > +_new_vmalloc_kernel_address:
> > > > > +       /* Is it a kernel address? */
> > > > > +       csrr    a0, CSR_TVAL
> > > > > +       bge     a0, zero, _new_vmalloc_restore_context
> > > > > +
> > > > > +       /* Check if a new vmalloc mapping appeared that could explain the trap */
> > > > > +
> > > > > +       /*
> > > > > +        * Computes:
> > > > > +        * a0 = &new_vmalloc[BIT_WORD(cpu)]
> > > > > +        * a1 = BIT_MASK(cpu)
> > > > > +        */
> > > > > +       REG_L   a2, TASK_TI_CPU(tp)
> > > > > +       /*
> > > > > +        * Compute the new_vmalloc element position:
> > > > > +        * (cpu / 64) * 8 = (cpu >> 6) << 3
> > > > > +        */
> > > > > +       srli    a1, a2, 6
> > > > > +       slli    a1, a1, 3
> > > > > +       la      a0, new_vmalloc
> > > > > +       add     a0, a0, a1
> > > > > +       /*
> > > > > +        * Compute the bit position in the new_vmalloc element:
> > > > > +        * bit_pos = cpu % 64 = cpu - (cpu / 64) * 64 = cpu - (cpu >> 6) << 6
> > > > > +        *         = cpu - ((cpu >> 6) << 3) << 3
> > > > > +        */
> > > > > +       slli    a1, a1, 3
> > > > > +       sub     a1, a2, a1
> > > > > +       /* Compute the "get mask": 1 << bit_pos */
> > > > > +       li      a2, 1
> > > > > +       sll     a1, a2, a1
> > > > > +
> > > > > +       /* Check the value of new_vmalloc for this cpu */
> > > > > +       REG_L   a2, 0(a0)
> > > > > +       and     a2, a2, a1
> > > > > +       beq     a2, zero, _new_vmalloc_restore_context
> > > > > +
> > > > > +       /* Atomically reset the current cpu bit in new_vmalloc */
> > > > > +       amoxor.w        a0, a1, (a0)
> > > > > +
> > > > > +       /* Only emit a sfence.vma if the uarch caches invalid entries */
> > > > > +       ALTERNATIVE("sfence.vma", "nop", 0, RISCV_ISA_EXT_SVVPTC, 1)
> > > > > +
> > > > > +       REG_L   a0, TASK_TI_A0(tp)
> > > > > +       REG_L   a1, TASK_TI_A1(tp)
> > > > > +       REG_L   a2, TASK_TI_A2(tp)
> > > > > +       csrw    CSR_SCRATCH, x0
> > > > > +       sret
> > > > > +
> > > > > +_new_vmalloc_restore_context:
> > > > > +       REG_L   a0, TASK_TI_A0(tp)
> > > > > +       REG_L   a1, TASK_TI_A1(tp)
> > > > > +       REG_L   a2, TASK_TI_A2(tp)
> > > > > +.endm
> > > > > +
> > > > > +
> > > > >  SYM_CODE_START(handle_exception)
> > > > >         /*
> > > > >          * If coming from userspace, preserve the user thread pointer and load
> > > > > @@ -30,6 +102,18 @@ SYM_CODE_START(handle_exception)
> > > > >
> > > > >  .Lrestore_kernel_tpsp:
> > > > >         csrr tp, CSR_SCRATCH
> > > > > +
> > > > > +       /*
> > > > > +        * The RISC-V kernel does not eagerly emit a sfence.vma after each
> > > > > +        * new vmalloc mapping, which may result in exceptions:
> > > > > +        * - if the uarch caches invalid entries, the new mapping would not be
> > > > > +        *   observed by the page table walker and an invalidation is needed.
> > > > > +        * - if the uarch does not cache invalid entries, a reordered access
> > > > > +        *   could "miss" the new mapping and traps: in that case, we only need
> > > > > +        *   to retry the access, no sfence.vma is required.
> > > > > +        */
> > > > > +       new_vmalloc_check
> > > > > +
> > > > >         REG_S sp, TASK_TI_KERNEL_SP(tp)
> > > > >
> > > > >  #ifdef CONFIG_VMAP_STACK
> > > > > diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> > > > > index eafc4c2200f2..54c9fdeda11e 100644
> > > > > --- a/arch/riscv/mm/init.c
> > > > > +++ b/arch/riscv/mm/init.c
> > > > > @@ -36,6 +36,8 @@
> > > > >
> > > > >  #include "../kernel/head.h"
> > > > >
> > > > > +u64 new_vmalloc[NR_CPUS / sizeof(u64) + 1];
> > > > > +
> > > > >  struct kernel_mapping kernel_map __ro_after_init;
> > > > >  EXPORT_SYMBOL(kernel_map);
> > > > >  #ifdef CONFIG_XIP_KERNEL
> > > > > --
> > > > > 2.39.2
> > > > >
> > > > >
> > > >
> > > > Can we consider using new_vmalloc as a percpu variable, so that we
> > > > don't need to add a0/1/2 in thread_info?
> > >
> > > At first, I used percpu variables. But then I realized that percpu
> > > areas are allocated in the vmalloc area, so if somehow we take a trap
> > > when accessing the new_vmalloc percpu variable, we could not recover
> > > from this as we would trap forever in new_vmalloc_check. But
> > > admittedly, not sure that can happen.
> > >
> > > And how would that remove a0, a1 and a2 from thread_info? We'd still
> > > need to save some registers somewhere to access the percpu variable
> > > right?
> > >
> > > > Also, try not to do too much
> > > > calculation logic in new_vmalloc_check, after all, handle_exception is
> > > > a high-frequency path. In this case, can we consider writing
> > > > new_vmalloc_check in C language to increase readability?
> > >
> > > If we write that in C, we don't have the control over the allocated
> > > registers and then we can't correctly save the context.
> >
> > If we use C language, new_vmalloc_check is written just like do_irq(),
> > then we need _save_context, but for new_vmalloc_check, it is not worth
> > the loss, because exceptions from user mode do not need
> > new_vmalloc_check, which also shows that it is reasonable to put
> > new_vmalloc_check after _restore_kernel_tpsp.
> >
> > Saving is necessary. We can save a0, a1, a2 without using thread_info.
> > We can choose to save on the kernel stack of the current tp, but we
> > need to add the following instructions:
> > REG_S sp, TASK_TI_USER_SP(tp)
> > REG_L sp, TASK_TI_KERNEL_SP(tp)
> > addi sp, sp, -(PT_SIZE_ON_STACK)
> > It seems that saving directly on thread_info is more direct, but
> > saving on the kernel stack is more logically consistent, and there is
> > no need to increase the size of thread_info.
>
> You can't save on the kernel stack since kernel stacks are allocated
> in the vmalloc area.
>
> >
> > As for the current status of the patch, there are two points that can
> > be optimized:
> > 1. Some chip hardware implementations may not cache TLB invalid
> > entries, so it doesn't matter whether svvptc is available or not. Can
> > we consider adding a CONFIG_RISCV_SVVPTC to control it?

That would produce a non-portable kernel. But I'm not opposed to that
at all, let me check how we handle other extensions. Maybe @Conor
Dooley has some feedback here?

> >
> > 2. .macro new_vmalloc_check
> > REG_S a0, TASK_TI_A0(tp)
> > REG_S a1, TASK_TI_A1(tp)
> > REG_S a2, TASK_TI_A2(tp)
> > When executing blt a0, zero, _new_vmalloc_restore_context, you can not
> > save a1, a2 first
>
> Ok, I can do that :)
>
> Thanks again for your inputs,
>
> Alex
>
> >
> > >
> > > Thanks for your interest in this patchset :)
> > >
> > > Alex
> > >
> > > >
> > > > Thanks,
> > > > Yunhui
> >
> > Thanks,
> > Yunhui


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [External] [PATCH RFC/RFT v2 3/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings
  2024-06-04  7:17           ` Alexandre Ghiti
@ 2024-06-04  8:51             ` Conor Dooley
  2024-06-04 11:44               ` Alexandre Ghiti
  0 siblings, 1 reply; 18+ messages in thread
From: Conor Dooley @ 2024-06-04  8:51 UTC (permalink / raw)
  To: Alexandre Ghiti
  Cc: yunhui cui, Catalin Marinas, Will Deacon, Thomas Bogendoerfer,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Andrew Morton,
	Ved Shanbhogue, Matt Evans, Dylan Jhong, linux-arm-kernel,
	linux-kernel, linux-mips, linuxppc-dev, linux-riscv, linux-mm

[-- Attachment #1: Type: text/plain, Size: 1269 bytes --]

On Tue, Jun 04, 2024 at 09:17:26AM +0200, Alexandre Ghiti wrote:
> On Tue, Jun 4, 2024 at 9:15 AM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> > On Tue, Jun 4, 2024 at 8:21 AM yunhui cui <cuiyunhui@bytedance.com> wrote:
> > >
> > > As for the current status of the patch, there are two points that can
> > > be optimized:
> > > 1. Some chip hardware implementations may not cache TLB invalid
> > > entries, so it doesn't matter whether svvptc is available or not. Can
> > > we consider adding a CONFIG_RISCV_SVVPTC to control it?
> 
> That would produce a non-portable kernel. But I'm not opposed to that
> at all, let me check how we handle other extensions. Maybe @Conor
> Dooley has some feedback here?

To be honest, not really sure what to give feedback on. Could you
elaborate on exactly what the option is going to do? Given the
portability concern, I guess you were proposing that the option would
remove the preventative fences, rather than your current patch that
removes them via an alternative? I don't think we have any extension
related options that work like that at the moment, and making that an
option will just mean that distros that look to cater for multiple
platforms won't be able to turn it on.

Thanks,
Conor.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [External] [PATCH RFC/RFT v2 3/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings
  2024-06-04  8:51             ` Conor Dooley
@ 2024-06-04 11:44               ` Alexandre Ghiti
  2024-06-04 20:17                 ` Conor Dooley
  0 siblings, 1 reply; 18+ messages in thread
From: Alexandre Ghiti @ 2024-06-04 11:44 UTC (permalink / raw)
  To: Conor Dooley
  Cc: yunhui cui, Catalin Marinas, Will Deacon, Thomas Bogendoerfer,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Andrew Morton,
	Ved Shanbhogue, Matt Evans, Dylan Jhong, linux-arm-kernel,
	linux-kernel, linux-mips, linuxppc-dev, linux-riscv, linux-mm

On Tue, Jun 4, 2024 at 10:52 AM Conor Dooley <conor@kernel.org> wrote:
>
> On Tue, Jun 04, 2024 at 09:17:26AM +0200, Alexandre Ghiti wrote:
> > On Tue, Jun 4, 2024 at 9:15 AM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> > > On Tue, Jun 4, 2024 at 8:21 AM yunhui cui <cuiyunhui@bytedance.com> wrote:
> > > >
> > > > As for the current status of the patch, there are two points that can
> > > > be optimized:
> > > > 1. Some chip hardware implementations may not cache TLB invalid
> > > > entries, so it doesn't matter whether svvptc is available or not. Can
> > > > we consider adding a CONFIG_RISCV_SVVPTC to control it?
> >
> > That would produce a non-portable kernel. But I'm not opposed to that
> > at all, let me check how we handle other extensions. Maybe @Conor
> > Dooley has some feedback here?
>
> To be honest, not really sure what to give feedback on. Could you
> elaborate on exactly what the option is going to do? Given the
> portability concern, I guess you were proposing that the option would
> remove the preventative fences, rather than your current patch that
> removes them via an alternative?

No no, I won't do that, we need a generic kernel for distros so that's
not even a question. What Yunhui was asking about (to me) is: can we
introduce a Kconfig option to always remove the preventive fences,
bypassing the use of alternatives altogether?

To me, it won't make a difference in terms of performance. But if we
already offer such a possibility for other extensions, well I'll do
it. Otherwise, the question is: should we start doing that?

> I don't think we have any extension
> related options that work like that at the moment, and making that an
> option will just mean that distros that look to cater for multiple
> platforms won't be able to turn it on.
>
> Thanks,
> Conor.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [External] [PATCH RFC/RFT v2 3/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings
  2024-06-04 11:44               ` Alexandre Ghiti
@ 2024-06-04 20:17                 ` Conor Dooley
  0 siblings, 0 replies; 18+ messages in thread
From: Conor Dooley @ 2024-06-04 20:17 UTC (permalink / raw)
  To: Alexandre Ghiti
  Cc: yunhui cui, Catalin Marinas, Will Deacon, Thomas Bogendoerfer,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Andrew Morton,
	Ved Shanbhogue, Matt Evans, Dylan Jhong, linux-arm-kernel,
	linux-kernel, linux-mips, linuxppc-dev, linux-riscv, linux-mm

[-- Attachment #1: Type: text/plain, Size: 1982 bytes --]

On Tue, Jun 04, 2024 at 01:44:15PM +0200, Alexandre Ghiti wrote:
> On Tue, Jun 4, 2024 at 10:52 AM Conor Dooley <conor@kernel.org> wrote:
> >
> > On Tue, Jun 04, 2024 at 09:17:26AM +0200, Alexandre Ghiti wrote:
> > > On Tue, Jun 4, 2024 at 9:15 AM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> > > > On Tue, Jun 4, 2024 at 8:21 AM yunhui cui <cuiyunhui@bytedance.com> wrote:
> > > > >
> > > > > As for the current status of the patch, there are two points that can
> > > > > be optimized:
> > > > > 1. Some chip hardware implementations may not cache TLB invalid
> > > > > entries, so it doesn't matter whether svvptc is available or not. Can
> > > > > we consider adding a CONFIG_RISCV_SVVPTC to control it?
> > >
> > > That would produce a non-portable kernel. But I'm not opposed to that
> > > at all, let me check how we handle other extensions. Maybe @Conor
> > > Dooley has some feedback here?
> >
> > To be honest, not really sure what to give feedback on. Could you
> > elaborate on exactly what the option is going to do? Given the
> > portability concern, I guess you were proposing that the option would
> > remove the preventative fences, rather than your current patch that
> > removes them via an alternative?
> 
> No no, I won't do that, we need a generic kernel for distros so that's
> not even a question. What Yunhui was asking about (to me) is: can we
> introduce a Kconfig option to always remove the preventive fences,
> bypassing the use of alternatives altogether?
> 
> To me, it won't make a difference in terms of performance. But if we
> already offer such a possibility for other extensions, well I'll do
> it. Otherwise, the question is: should we start doing that?

We don't do that for other extensions yet, because currently all the
extensions we have options for are additive. There's like 3 alternative
patchsites, and they are all just one nop? I don't see the point of
having a Kconfig knob for that.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2024-06-04 20:18 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-01-31 15:59 [PATCH RFC v2 0/4] Svvptc extension to remove preventive sfence.vma Alexandre Ghiti
2024-01-31 15:59 ` [PATCH RFC/RFT v2 1/4] riscv: Add ISA extension parsing for Svvptc Alexandre Ghiti
2024-01-31 15:59 ` [PATCH RFC/RFT v2 2/4] dt-bindings: riscv: Add Svvptc ISA extension description Alexandre Ghiti
2024-02-01  9:22   ` Krzysztof Kozlowski
2024-01-31 15:59 ` [PATCH RFC/RFT v2 3/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings Alexandre Ghiti
2024-06-03  2:26   ` [External] " yunhui cui
2024-06-03 12:02     ` Alexandre Ghiti
2024-06-04  6:21       ` yunhui cui
2024-06-04  7:15         ` Alexandre Ghiti
2024-06-04  7:17           ` Alexandre Ghiti
2024-06-04  8:51             ` Conor Dooley
2024-06-04 11:44               ` Alexandre Ghiti
2024-06-04 20:17                 ` Conor Dooley
2024-01-31 15:59 ` [PATCH RFC/RFT v2 4/4] riscv: Stop emitting preventive sfence.vma for new userspace mappings with Svvptc Alexandre Ghiti
2024-02-01 15:03   ` Andrea Parri
2024-02-02 15:42     ` Alexandre Ghiti
2024-02-02 22:05       ` Alexandre Ghiti
2024-05-30  9:35   ` [External] " yunhui cui
