[RESEND][PATCH 1/3] x86: Add task_struct flag to force SIGBUS on MCE

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [RESEND][PATCH 1/3] x86: Add task_struct flag to force SIGBUS on MCE
@ 2024-07-23 14:47 Andrew Zaborowski
  2024-07-23 14:47 ` [RESEND][PATCH 2/3] execve: Ensure SIGBUS delivered on memory failure Andrew Zaborowski
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Andrew Zaborowski @ 2024-07-23 14:47 UTC (permalink / raw)
  To: linux-edac, linux-mm
  Cc: Kees Cook, Tony Luck, Eric Biederman, Borislav Petkov

Uncorrected memory errors for user pages are signaled to processes
using SIGBUS or, if the error happens in a syscall, an error retval
from the syscall.  The SIGBUS is documented in
Documentation/mm/hwpoison.rst#failure-recovery-modes

But there are corner cases where we cannot or don't want to return a
plain error from the syscall.  Subsequent commits covers two such cases:
execve and rseq.  Current code, in both places, will kill the task with a
SIGSEGV on error.  While not explicitly stated, it can be argued that it
should be a SIGBUS, for consistency and for the benefit of the userspace
signal handlers.  Even if the process cannot handle the signal, perhaps
the parent process can.  This was the case in the scenario that
motivated this patch.

In both cases, the architecture's exception handler (MCE handler on x86)
will queue a call to memory_failure.  This doesn't work because the
syscall-specific code sees the -EFAULT and terminates the task before
the queued work runs.

To fix this: 1. let pending work run in the error cases in both places.

And 2. on MCE, ensure memory_failure() is passed MF_ACTION_REQUIRED so
that the SIGBUS is queued.  Normally when the MCE is in a syscall,
a fixup of return IP and a call to kill_me_never() are what we want.
But in this case it's necessary to queue kill_me_maybe() which will set
MF_ACTION_REQUIRED which is checked by memory_failure().

To do this the syscall code will set current->kill_on_efault, a new
task_struct flag.  Check that flag in
arch/x86/kernel/cpu/mce/core.c:do_machine_check()

Note: the flag is not x86 specific even if only x86 handling is being
added here.  The definition could be guarded by #ifdef
CONFIG_MEMORY_FAILURE, but it would then need set/clear utilities.

Signed-off-by: Andrew Zaborowski <andrew.zaborowski@intel.com>
---
Resending through an SMTP server that won't add the company footer.

This is a v2 of
https://lore.kernel.org/linux-mm/20240501015340.3014724-1-andrew.zaborowski@intel.com/
In the v1 the existing flag current->in_execve was being reused instead
of adding a new one.  Kees Cook commented in
https://lore.kernel.org/linux-mm/202405010915.465AF19@keescook/ that
current->in_execve is going away.  Lacking a better idea and seeing
that execve() and rseq() would benefit from using a common mechanism, I
decided to add this new flag.

Perhaps with a better name current->kill_on_efault could replace
brpm->point_of_no_return to offset the pain of having this extra flag.
---
 arch/x86/kernel/cpu/mce/core.c | 18 +++++++++++++++++-
 include/linux/sched.h          |  2 ++
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index ad0623b65..13f2ace3d 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1611,7 +1611,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
 			if (p)
 				SetPageHWPoison(p);
 		}
-	} else {
+	} else if (!current->kill_on_efault) {
 		/*
 		 * Handle an MCE which has happened in kernel space but from
 		 * which the kernel can recover: ex_has_fault_handler() has
@@ -1628,6 +1628,22 @@ noinstr void do_machine_check(struct pt_regs *regs)

 		if (m.kflags & MCE_IN_KERNEL_COPYIN)
 			queue_task_work(&m, msg, kill_me_never);
+	} else {
+		/*
+		 * Even with recovery code extra handling is required when
+		 * we're not returning to userspace after error (e.g. in
+		 * execve() beyond the point of no return) to ensure that
+		 * a SIGBUS is delivered.
+		 */
+		if (m.kflags & MCE_IN_KERNEL_RECOV) {
+			if (!fixup_exception(regs, X86_TRAP_MC, 0, 0))
+				mce_panic("Failed kernel mode recovery", &m, msg);
+		}
+
+		if (!mce_usable_address(&m))
+			queue_task_work(&m, msg, kill_me_now);
+		else
+			queue_task_work(&m, msg, kill_me_maybe);
 	}

 out:
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 61591ac6e..0cde1ba11 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -975,6 +975,8 @@ struct task_struct {
 	/* delay due to memory thrashing */
 	unsigned                        in_thrashing:1;
 #endif
+	/* Kill task on user memory access error */
+	unsigned                        kill_on_efault:1;

 	unsigned long			atomic_flags; /* Flags requiring atomic access. */

-- 
2.43.0

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [RESEND][PATCH 2/3] execve: Ensure SIGBUS delivered on memory failure
  2024-07-23 14:47 [RESEND][PATCH 1/3] x86: Add task_struct flag to force SIGBUS on MCE Andrew Zaborowski
@ 2024-07-23 14:47 ` Andrew Zaborowski
  2024-07-23 14:47 ` [RESEND][PATCH 3/3] rseq: " Andrew Zaborowski
  2024-08-06  4:36 ` [RESEND][PATCH 1/3] x86: Add task_struct flag to force SIGBUS on MCE Kees Cook
  2 siblings, 0 replies; 14+ messages in thread
From: Andrew Zaborowski @ 2024-07-23 14:47 UTC (permalink / raw)
  To: linux-edac, linux-mm
  Cc: Kees Cook, Tony Luck, Eric Biederman, Borislav Petkov

Uncorrected memory errors for user pages are signaled to processes
using SIGBUS or, if the error happens in a syscall, an error retval
from the syscall.  The SIGBUS is documented in
Documentation/mm/hwpoison.rst#failure-recovery-modes

In execve() there is a point of no return
(bprm->point_of_no_return) after which the syscall... cannot return.
The binary loading happens after this point so if the loader triggers
a memory error reading user pages, and after control returns to
bprm_execve(), that function reacts by sending a SIGSEGV.

Set the new current->kill_on_efault flag and run pending task work to
ensure that a SIGBUS is queued in memory_failure()

Signed-off-by: Andrew Zaborowski <andrew.zaborowski@intel.com>
---
 fs/exec.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 400731422..26c4efe1a 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -68,6 +68,7 @@
 #include <linux/user_events.h>
 #include <linux/rseq.h>
 #include <linux/ksm.h>
+#include <linux/task_work.h>
 
 #include <linux/uaccess.h>
 #include <asm/mmu_context.h>
@@ -1290,6 +1291,7 @@ int begin_new_exec(struct linux_binprm * bprm)
 	 * Ensure all future errors are fatal.
 	 */
 	bprm->point_of_no_return = true;
+	me->kill_on_efault = true;
 
 	/*
 	 * Make this the only thread in the thread group.
@@ -1896,6 +1898,7 @@ static int bprm_execve(struct linux_binprm *bprm)
 	/* execve succeeded */
 	current->fs->in_exec = 0;
 	current->in_execve = 0;
+	current->kill_on_efault = false;
 	rseq_execve(current);
 	user_events_execve(current);
 	acct_update_integrals(current);
@@ -1907,14 +1910,20 @@ static int bprm_execve(struct linux_binprm *bprm)
 	 * If past the point of no return ensure the code never
 	 * returns to the userspace process.  Use an existing fatal
 	 * signal if present otherwise terminate the process with
-	 * SIGSEGV.
+	 * SIGSEGV.  Run pending work before that in case it is
+	 * terminating the process with a different signal.
 	 */
-	if (bprm->point_of_no_return && !fatal_signal_pending(current))
-		force_fatal_sig(SIGSEGV);
+	if (bprm->point_of_no_return) {
+		task_work_run();
+
+		if (!fatal_signal_pending(current))
+			force_fatal_sig(SIGSEGV);
+	}
 
 	sched_mm_cid_after_execve(current);
 	current->fs->in_exec = 0;
 	current->in_execve = 0;
+	current->kill_on_efault = false;
 
 	return retval;
 }
-- 
2.43.0



^ permalink raw reply	[flat|nested] 14+ messages in thread

* [RESEND][PATCH 3/3] rseq: Ensure SIGBUS delivered on memory failure
  2024-07-23 14:47 [RESEND][PATCH 1/3] x86: Add task_struct flag to force SIGBUS on MCE Andrew Zaborowski
  2024-07-23 14:47 ` [RESEND][PATCH 2/3] execve: Ensure SIGBUS delivered on memory failure Andrew Zaborowski
@ 2024-07-23 14:47 ` Andrew Zaborowski
  2024-08-06  4:37   ` Kees Cook
  2024-08-06  4:36 ` [RESEND][PATCH 1/3] x86: Add task_struct flag to force SIGBUS on MCE Kees Cook
  2 siblings, 1 reply; 14+ messages in thread
From: Andrew Zaborowski @ 2024-07-23 14:47 UTC (permalink / raw)
  To: linux-edac, linux-mm
  Cc: Kees Cook, Tony Luck, Eric Biederman, Borislav Petkov

Uncorrected memory errors for user pages are signaled to processes
using SIGBUS or, if the error happens in a syscall, an error retval
from the syscall.  The SIGBUS is documented in
Documentation/mm/hwpoison.rst#failure-recovery-modes

Once a user task sets t->rseq in the rseq() syscall, if the kernel
cannot access the memory pointed to by t->rseq->rseq_cs, that initial
rseq() and all future syscalls should return an error so understandably
the code just kills the task.

To ensure that SIGBUS is used set the new t->kill_on_efault flag and
run queued task work on rseq_get_rseq_cs() errors to give memory_failure
the chance to run.

Note: the rseq checks run inside resume_user_mode_work() so whenever
_TIF_NOTIFY_RESUME is set.  They do not run on every syscall exit so
I'm not concerned that these extra flag operations are in a hot path,
except with CONFIG_DEBUG_RSEQ.

Signed-off-by: Andrew Zaborowski <andrew.zaborowski@intel.com>
---
 kernel/rseq.c | 25 +++++++++++++++++++++----
 1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/kernel/rseq.c b/kernel/rseq.c
index 9de6e35fe..c5809cd13 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -13,6 +13,7 @@
 #include <linux/syscalls.h>
 #include <linux/rseq.h>
 #include <linux/types.h>
+#include <linux/task_work.h>
 #include <asm/ptrace.h>
 
 #define CREATE_TRACE_POINTS
@@ -320,6 +321,8 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
 	if (unlikely(t->flags & PF_EXITING))
 		return;
 
+	t->kill_on_efault = true;
+
 	/*
 	 * regs is NULL if and only if the caller is in a syscall path.  Skip
 	 * fixup and leave rseq_cs as is so that rseq_sycall() will detect and
@@ -330,13 +333,18 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
 		if (unlikely(ret < 0))
 			goto error;
 	}
-	if (unlikely(rseq_update_cpu_node_id(t)))
-		goto error;
-	return;
+	if (likely(!rseq_update_cpu_node_id(t)))
+		goto out;
 
 error:
+	/* Allow task work to override signr */
+	task_work_run();
+
 	sig = ksig ? ksig->sig : 0;
 	force_sigsegv(sig);
+
+out:
+	t->kill_on_efault = false;
 }
 
 #ifdef CONFIG_DEBUG_RSEQ
@@ -353,8 +361,17 @@ void rseq_syscall(struct pt_regs *regs)
 
 	if (!t->rseq)
 		return;
-	if (rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs))
+
+	t->kill_on_efault = true;
+
+	if (rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs)) {
+		/* Allow task work to override signr */
+		task_work_run();
+
 		force_sig(SIGSEGV);
+	}
+
+	t->kill_on_efault = false;
 }
 
 #endif
-- 
2.43.0



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RESEND][PATCH 3/3] rseq: Ensure SIGBUS delivered on memory failure
  2024-07-23 14:47 ` [RESEND][PATCH 3/3] rseq: " Andrew Zaborowski
@ 2024-08-06  4:37   ` Kees Cook
  2024-08-06  7:51     ` Peter Zijlstra
  0 siblings, 1 reply; 14+ messages in thread
From: Kees Cook @ 2024-08-06  4:37 UTC (permalink / raw)
  To: Andrew Zaborowski
  Cc: linux-edac, linux-mm, Tony Luck, Eric Biederman, Borislav Petkov,
	Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney, Boqun Feng

On Tue, Jul 23, 2024 at 04:47:52PM +0200, Andrew Zaborowski wrote:
> Uncorrected memory errors for user pages are signaled to processes
> using SIGBUS or, if the error happens in a syscall, an error retval
> from the syscall.  The SIGBUS is documented in
> Documentation/mm/hwpoison.rst#failure-recovery-modes
> 
> Once a user task sets t->rseq in the rseq() syscall, if the kernel
> cannot access the memory pointed to by t->rseq->rseq_cs, that initial
> rseq() and all future syscalls should return an error so understandably
> the code just kills the task.
> 
> To ensure that SIGBUS is used set the new t->kill_on_efault flag and
> run queued task work on rseq_get_rseq_cs() errors to give memory_failure
> the chance to run.
> 
> Note: the rseq checks run inside resume_user_mode_work() so whenever
> _TIF_NOTIFY_RESUME is set.  They do not run on every syscall exit so
> I'm not concerned that these extra flag operations are in a hot path,
> except with CONFIG_DEBUG_RSEQ.
> 
> Signed-off-by: Andrew Zaborowski <andrew.zaborowski@intel.com>
> ---
>  kernel/rseq.c | 25 +++++++++++++++++++++----

Can an rseq maintainer please review this? I can carry it via the execve
tree with the related patches...

-Kees

>  1 file changed, 21 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/rseq.c b/kernel/rseq.c
> index 9de6e35fe..c5809cd13 100644
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -13,6 +13,7 @@
>  #include <linux/syscalls.h>
>  #include <linux/rseq.h>
>  #include <linux/types.h>
> +#include <linux/task_work.h>
>  #include <asm/ptrace.h>
>  
>  #define CREATE_TRACE_POINTS
> @@ -320,6 +321,8 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
>  	if (unlikely(t->flags & PF_EXITING))
>  		return;
>  
> +	t->kill_on_efault = true;
> +
>  	/*
>  	 * regs is NULL if and only if the caller is in a syscall path.  Skip
>  	 * fixup and leave rseq_cs as is so that rseq_sycall() will detect and
> @@ -330,13 +333,18 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
>  		if (unlikely(ret < 0))
>  			goto error;
>  	}
> -	if (unlikely(rseq_update_cpu_node_id(t)))
> -		goto error;
> -	return;
> +	if (likely(!rseq_update_cpu_node_id(t)))
> +		goto out;
>  
>  error:
> +	/* Allow task work to override signr */
> +	task_work_run();
> +
>  	sig = ksig ? ksig->sig : 0;
>  	force_sigsegv(sig);
> +
> +out:
> +	t->kill_on_efault = false;
>  }
>  
>  #ifdef CONFIG_DEBUG_RSEQ
> @@ -353,8 +361,17 @@ void rseq_syscall(struct pt_regs *regs)
>  
>  	if (!t->rseq)
>  		return;
> -	if (rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs))
> +
> +	t->kill_on_efault = true;
> +
> +	if (rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs)) {
> +		/* Allow task work to override signr */
> +		task_work_run();
> +
>  		force_sig(SIGSEGV);
> +	}
> +
> +	t->kill_on_efault = false;
>  }
>  
>  #endif
> -- 
> 2.43.0
> 

-- 
Kees Cook


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RESEND][PATCH 3/3] rseq: Ensure SIGBUS delivered on memory failure
  2024-08-06  4:37   ` Kees Cook
@ 2024-08-06  7:51     ` Peter Zijlstra
  2024-08-06 14:19       ` Mathieu Desnoyers
  0 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2024-08-06  7:51 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andrew Zaborowski, linux-edac, linux-mm, Tony Luck,
	Eric Biederman, Borislav Petkov, Mathieu Desnoyers,
	Paul E. McKenney, Boqun Feng, oleg

On Mon, Aug 05, 2024 at 09:37:48PM -0700, Kees Cook wrote:
> On Tue, Jul 23, 2024 at 04:47:52PM +0200, Andrew Zaborowski wrote:
> > Uncorrected memory errors for user pages are signaled to processes
> > using SIGBUS or, if the error happens in a syscall, an error retval
> > from the syscall.  The SIGBUS is documented in
> > Documentation/mm/hwpoison.rst#failure-recovery-modes
> > 
> > Once a user task sets t->rseq in the rseq() syscall, if the kernel
> > cannot access the memory pointed to by t->rseq->rseq_cs, that initial
> > rseq() and all future syscalls should return an error so understandably
> > the code just kills the task.
> > 
> > To ensure that SIGBUS is used set the new t->kill_on_efault flag and
> > run queued task work on rseq_get_rseq_cs() errors to give memory_failure
> > the chance to run.
> > 
> > Note: the rseq checks run inside resume_user_mode_work() so whenever
> > _TIF_NOTIFY_RESUME is set.  They do not run on every syscall exit so
> > I'm not concerned that these extra flag operations are in a hot path,
> > except with CONFIG_DEBUG_RSEQ.
> > 
> > Signed-off-by: Andrew Zaborowski <andrew.zaborowski@intel.com>
> > ---
> >  kernel/rseq.c | 25 +++++++++++++++++++++----
> 
> Can an rseq maintainer please review this? I can carry it via the execve
> tree with the related patches...

*sigh*,.. because get_maintainers just doesn't work or something?

Anyway, I'm confused by the signal code (as always), why isn't the
task_work_run() in get_signal() sufficient?

At some point we're going to run into trouble with sprinkling
task_work_run() around willy nilly :/

> 
> >  1 file changed, 21 insertions(+), 4 deletions(-)
> > 
> > diff --git a/kernel/rseq.c b/kernel/rseq.c
> > index 9de6e35fe..c5809cd13 100644
> > --- a/kernel/rseq.c
> > +++ b/kernel/rseq.c
> > @@ -13,6 +13,7 @@
> >  #include <linux/syscalls.h>
> >  #include <linux/rseq.h>
> >  #include <linux/types.h>
> > +#include <linux/task_work.h>
> >  #include <asm/ptrace.h>
> >  
> >  #define CREATE_TRACE_POINTS
> > @@ -320,6 +321,8 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
> >  	if (unlikely(t->flags & PF_EXITING))
> >  		return;
> >  
> > +	t->kill_on_efault = true;
> > +
> >  	/*
> >  	 * regs is NULL if and only if the caller is in a syscall path.  Skip
> >  	 * fixup and leave rseq_cs as is so that rseq_sycall() will detect and
> > @@ -330,13 +333,18 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
> >  		if (unlikely(ret < 0))
> >  			goto error;
> >  	}
> > -	if (unlikely(rseq_update_cpu_node_id(t)))
> > -		goto error;
> > -	return;
> > +	if (likely(!rseq_update_cpu_node_id(t)))
> > +		goto out;
> >  
> >  error:
> > +	/* Allow task work to override signr */
> > +	task_work_run();
> > +
> >  	sig = ksig ? ksig->sig : 0;
> >  	force_sigsegv(sig);
> > +
> > +out:
> > +	t->kill_on_efault = false;
> >  }
> >  
> >  #ifdef CONFIG_DEBUG_RSEQ
> > @@ -353,8 +361,17 @@ void rseq_syscall(struct pt_regs *regs)
> >  
> >  	if (!t->rseq)
> >  		return;
> > -	if (rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs))
> > +
> > +	t->kill_on_efault = true;
> > +
> > +	if (rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs)) {
> > +		/* Allow task work to override signr */
> > +		task_work_run();
> > +
> >  		force_sig(SIGSEGV);
> > +	}
> > +
> > +	t->kill_on_efault = false;
> >  }
> >  
> >  #endif
> > -- 
> > 2.43.0
> > 
> 
> -- 
> Kees Cook


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RESEND][PATCH 3/3] rseq: Ensure SIGBUS delivered on memory failure
  2024-08-06  7:51     ` Peter Zijlstra
@ 2024-08-06 14:19       ` Mathieu Desnoyers
  0 siblings, 0 replies; 14+ messages in thread
From: Mathieu Desnoyers @ 2024-08-06 14:19 UTC (permalink / raw)
  To: Peter Zijlstra, Kees Cook
  Cc: Andrew Zaborowski, linux-edac, linux-mm, Tony Luck,
	Eric Biederman, Borislav Petkov, Paul E. McKenney, Boqun Feng,
	oleg

On 2024-08-06 03:51, Peter Zijlstra wrote:
> On Mon, Aug 05, 2024 at 09:37:48PM -0700, Kees Cook wrote:
>> On Tue, Jul 23, 2024 at 04:47:52PM +0200, Andrew Zaborowski wrote:
>>> Uncorrected memory errors for user pages are signaled to processes
>>> using SIGBUS or, if the error happens in a syscall, an error retval
>>> from the syscall.  The SIGBUS is documented in
>>> Documentation/mm/hwpoison.rst#failure-recovery-modes
>>>
>>> Once a user task sets t->rseq in the rseq() syscall, if the kernel
>>> cannot access the memory pointed to by t->rseq->rseq_cs, that initial
>>> rseq() and all future syscalls should return an error so understandably
>>> the code just kills the task.
>>>
>>> To ensure that SIGBUS is used set the new t->kill_on_efault flag and
>>> run queued task work on rseq_get_rseq_cs() errors to give memory_failure
>>> the chance to run.
>>>
>>> Note: the rseq checks run inside resume_user_mode_work() so whenever
>>> _TIF_NOTIFY_RESUME is set.  They do not run on every syscall exit so
>>> I'm not concerned that these extra flag operations are in a hot path,
>>> except with CONFIG_DEBUG_RSEQ.
>>>
>>> Signed-off-by: Andrew Zaborowski <andrew.zaborowski@intel.com>
>>> ---
>>>   kernel/rseq.c | 25 +++++++++++++++++++++----
>>
>> Can an rseq maintainer please review this? I can carry it via the execve
>> tree with the related patches...
> 
> *sigh*,.. because get_maintainers just doesn't work or something?
> 
> Anyway, I'm confused by the signal code (as always), why isn't the
> task_work_run() in get_signal() sufficient?
> 
> At some point we're going to run into trouble with sprinkling
> task_work_run() around willy nilly :/

I agree with Peter: adding explicit calls to task_work_run all over
the kernel does not appear to be an elegant solution.

One thing I am missing is a clear motivation for adding this code:
what is the real-world use-case that benefits from getting this SIGBUS
rather than a SIGSEGV or -EFAULT ?

Also, I feel like we should investigate turning SIGSEGV into SIGBUS
at signal delivery rather than handle this here and there in the kernel
code.

Thoughts ?

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RESEND][PATCH 1/3] x86: Add task_struct flag to force SIGBUS on MCE
  2024-07-23 14:47 [RESEND][PATCH 1/3] x86: Add task_struct flag to force SIGBUS on MCE Andrew Zaborowski
  2024-07-23 14:47 ` [RESEND][PATCH 2/3] execve: Ensure SIGBUS delivered on memory failure Andrew Zaborowski
  2024-07-23 14:47 ` [RESEND][PATCH 3/3] rseq: " Andrew Zaborowski
@ 2024-08-06  4:36 ` Kees Cook
  2024-08-06  8:35   ` Borislav Petkov
  2 siblings, 1 reply; 14+ messages in thread
From: Kees Cook @ 2024-08-06  4:36 UTC (permalink / raw)
  To: Andrew Zaborowski
  Cc: linux-edac, linux-mm, Tony Luck, Eric Biederman, Borislav Petkov, x86

On Tue, Jul 23, 2024 at 04:47:50PM +0200, Andrew Zaborowski wrote:
> Uncorrected memory errors for user pages are signaled to processes
> using SIGBUS or, if the error happens in a syscall, an error retval
> from the syscall.  The SIGBUS is documented in
> Documentation/mm/hwpoison.rst#failure-recovery-modes
> 
> But there are corner cases where we cannot or don't want to return a
> plain error from the syscall.  Subsequent commits covers two such cases:
> execve and rseq.  Current code, in both places, will kill the task with a
> SIGSEGV on error.  While not explicitly stated, it can be argued that it
> should be a SIGBUS, for consistency and for the benefit of the userspace
> signal handlers.  Even if the process cannot handle the signal, perhaps
> the parent process can.  This was the case in the scenario that
> motivated this patch.
> 
> In both cases, the architecture's exception handler (MCE handler on x86)
> will queue a call to memory_failure.  This doesn't work because the
> syscall-specific code sees the -EFAULT and terminates the task before
> the queued work runs.
> 
> To fix this: 1. let pending work run in the error cases in both places.
> 
> And 2. on MCE, ensure memory_failure() is passed MF_ACTION_REQUIRED so
> that the SIGBUS is queued.  Normally when the MCE is in a syscall,
> a fixup of return IP and a call to kill_me_never() are what we want.
> But in this case it's necessary to queue kill_me_maybe() which will set
> MF_ACTION_REQUIRED which is checked by memory_failure().
> 
> To do this the syscall code will set current->kill_on_efault, a new
> task_struct flag.  Check that flag in
> arch/x86/kernel/cpu/mce/core.c:do_machine_check()
> 
> Note: the flag is not x86 specific even if only x86 handling is being
> added here.  The definition could be guarded by #ifdef
> CONFIG_MEMORY_FAILURE, but it would then need set/clear utilities.
> 
> Signed-off-by: Andrew Zaborowski <andrew.zaborowski@intel.com>
> ---
> Resending through an SMTP server that won't add the company footer.
> 
> This is a v2 of
> https://lore.kernel.org/linux-mm/20240501015340.3014724-1-andrew.zaborowski@intel.com/
> In the v1 the existing flag current->in_execve was being reused instead
> of adding a new one.  Kees Cook commented in
> https://lore.kernel.org/linux-mm/202405010915.465AF19@keescook/ that
> current->in_execve is going away.  Lacking a better idea and seeing
> that execve() and rseq() would benefit from using a common mechanism, I
> decided to add this new flag.
> 
> Perhaps with a better name current->kill_on_efault could replace
> brpm->point_of_no_return to offset the pain of having this extra flag.
> ---
>  arch/x86/kernel/cpu/mce/core.c | 18 +++++++++++++++++-

Since this touches arch/x86/, can an x86 maintainer review this? I can
carry this via the execve tree...

-Kees

>  include/linux/sched.h          |  2 ++
>  2 files changed, 19 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index ad0623b65..13f2ace3d 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -1611,7 +1611,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
>  			if (p)
>  				SetPageHWPoison(p);
>  		}
> -	} else {
> +	} else if (!current->kill_on_efault) {
>  		/*
>  		 * Handle an MCE which has happened in kernel space but from
>  		 * which the kernel can recover: ex_has_fault_handler() has
> @@ -1628,6 +1628,22 @@ noinstr void do_machine_check(struct pt_regs *regs)
>  
>  		if (m.kflags & MCE_IN_KERNEL_COPYIN)
>  			queue_task_work(&m, msg, kill_me_never);
> +	} else {
> +		/*
> +		 * Even with recovery code extra handling is required when
> +		 * we're not returning to userspace after error (e.g. in
> +		 * execve() beyond the point of no return) to ensure that
> +		 * a SIGBUS is delivered.
> +		 */
> +		if (m.kflags & MCE_IN_KERNEL_RECOV) {
> +			if (!fixup_exception(regs, X86_TRAP_MC, 0, 0))
> +				mce_panic("Failed kernel mode recovery", &m, msg);
> +		}
> +
> +		if (!mce_usable_address(&m))
> +			queue_task_work(&m, msg, kill_me_now);
> +		else
> +			queue_task_work(&m, msg, kill_me_maybe);
>  	}
>  
>  out:
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 61591ac6e..0cde1ba11 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -975,6 +975,8 @@ struct task_struct {
>  	/* delay due to memory thrashing */
>  	unsigned                        in_thrashing:1;
>  #endif
> +	/* Kill task on user memory access error */
> +	unsigned                        kill_on_efault:1;
>  
>  	unsigned long			atomic_flags; /* Flags requiring atomic access. */
>  
> -- 
> 2.43.0
> 

-- 
Kees Cook


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RESEND][PATCH 1/3] x86: Add task_struct flag to force SIGBUS on MCE
  2024-08-06  4:36 ` [RESEND][PATCH 1/3] x86: Add task_struct flag to force SIGBUS on MCE Kees Cook
@ 2024-08-06  8:35   ` Borislav Petkov
       [not found]     ` <SA1PR11MB69926BFE8EFDA7B3C3D84560E7B82@SA1PR11MB6992.namprd11.prod.outlook.com>
  0 siblings, 1 reply; 14+ messages in thread
From: Borislav Petkov @ 2024-08-06  8:35 UTC (permalink / raw)
  To: Kees Cook, Andrew Zaborowski
  Cc: linux-edac, linux-mm, Tony Luck, Eric Biederman, x86

On August 6, 2024 7:36:40 AM GMT+03:00, Kees Cook <kees@kernel.org> wrote:
>Since this touches arch/x86/, can an x86 maintainer review this? I can
>carry this via the execve tree...

No, we can't until the smoke from the handwaving clears:

>> While not explicitly stated, it can be argued that it
>> should be a SIGBUS, for consistency and for the benefit of the userspace
>> signal handlers.  Even if the process cannot handle the signal, perhaps
>> the parent process can.  This was the case in the scenario that
>> motivated this patch.

I have no clue what that is trying to tell me.

-- 
Sent from a small device: formatting sucks and brevity is inevitable.


^ permalink raw reply	[flat|nested] 14+ messages in thread

[parent not found: <SA1PR11MB69926BFE8EFDA7B3C3D84560E7B82@SA1PR11MB6992.namprd11.prod.outlook.com>]

[parent not found: <CAOq732KXwsKdht55E-Z18choiAYn6dMpXc-TD15B7MOUH1fpxQ@mail.gmail.com>]

[parent not found: <20240808145331.GAZrTb60FX_I3p0Ukx@fat_crate.local>]

* Re: [RESEND][PATCH 1/3] x86: Add task_struct flag to force SIGBUS on MCE
       [not found]         ` <20240808145331.GAZrTb60FX_I3p0Ukx@fat_crate.local>
@ 2024-08-09  1:22           ` Andrew Zaborowski
  2024-08-09  8:34             ` Borislav Petkov
  0 siblings, 1 reply; 14+ messages in thread
From: Andrew Zaborowski @ 2024-08-09  1:22 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: linux-edac, linux-mm, Eric Biederman, x86, Tony Luck

On Thu, 8 Aug 2024 at 16:55, Borislav Petkov <bp@alien8.de> wrote:
> I'm not sure it matters either. You're adding all that code and
> task_struct member just because the kernel sends SIGBUS on a memory
> failure. Oh well.
>
> How is that more beneficial for the overall recovery strategy than
> killing the current task? IOW, what is the real, practical advantage of
> this and why do we want to support it indefinitely?

I don't have a "real world" use case, we hit these two bugs in HW
testing.  Qemu relies on the SIGBUS logic but the execve and rseq
cases cannot be recovered from, the main benefit of sending the
correct signal is perhaps information to the user.

If this cannot be fixed then optimally it should be documented.

As for "all that code", the memory failure handling code is of certain
size and this is a comparatively tiny fix for a tiny issue.

Best regards

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RESEND][PATCH 1/3] x86: Add task_struct flag to force SIGBUS on MCE
  2024-08-09  1:22           ` Andrew Zaborowski
@ 2024-08-09  8:34             ` Borislav Petkov
       [not found]               ` <SA1PR11MB69927AE28B46583DCB5C97DEE7BA2@SA1PR11MB6992.namprd11.prod.outlook.com>
  0 siblings, 1 reply; 14+ messages in thread
From: Borislav Petkov @ 2024-08-09  8:34 UTC (permalink / raw)
  To: Andrew Zaborowski; +Cc: linux-edac, linux-mm, Eric Biederman, x86, Tony Luck

On Fri, Aug 09, 2024 at 03:22:19AM +0200, Andrew Zaborowski wrote:
> I don't have a "real world" use case, we hit these two bugs in HW
> testing.

You inject MCEs or what testing do you mean here?

In what pages? I presume user...

So instead of the process getting killed, you want to return SIGBUS
because, "hey caller, your process encountered an MCE while being
attempted to be executed"?

> Qemu relies on the SIGBUS logic but the execve and rseq
> cases cannot be recovered from, the main benefit of sending the
> correct signal is perhaps information to the user.

You will have that info in the logs - we're usually very loud when we
get an MCE...

> If this cannot be fixed then optimally it should be documented.

I'm not convinced at all that jumping through hoops you're doing, is
worth the effort.

> As for "all that code", the memory failure handling code is of certain
> size and this is a comparatively tiny fix for a tiny issue.

No, I didn't say anything about the memory failure code - it is about
supporting that obscure use case and the additional logic you're adding
to the #MC handler which looks like a real mess already and us having to
support that use case indefinitely.

So why does it matter if a process which is being executed and gets an
MCE beyond the point of no return absolutely needs to return SIGBUS vs
it getting killed and you still get an MCE logged on the machine, in
either case?

I mean, I would understand it when the parent process can do something
meaningful about it but if not, why does it matter at all?

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 14+ messages in thread

[parent not found: <SA1PR11MB69927AE28B46583DCB5C97DEE7BA2@SA1PR11MB6992.namprd11.prod.outlook.com>]

* Re: [RESEND][PATCH 1/3] x86: Add task_struct flag to force SIGBUS on MCE
       [not found]               ` <SA1PR11MB69927AE28B46583DCB5C97DEE7BA2@SA1PR11MB6992.namprd11.prod.outlook.com>
@ 2024-08-10  1:20                 ` Andrew Zaborowski
  2024-08-10  3:21                   ` Borislav Petkov
  0 siblings, 1 reply; 14+ messages in thread
From: Andrew Zaborowski @ 2024-08-10  1:20 UTC (permalink / raw)
  To: Borislav Petkov, linux-edac, linux-mm, Eric Biederman, x86, Tony,
	Andrew Zaborowski

Borislav Petkov <bp@alien8.de> wrote:
> So instead of the process getting killed, you want to return SIGBUS
> because, "hey caller, your process encountered an MCE while being
> attempted to be executed"?

The tests could be changed to expect the SIGSEGV but in this case it
seemed that the test was good and the kernel was misbehaving.  One of
the authors of the MCE handling code confirmed that.

>
> > Qemu relies on the SIGBUS logic but the execve and rseq
> > cases cannot be recovered from, the main benefit of sending the
> > correct signal is perhaps information to the user.
>
> You will have that info in the logs - we're usually very loud when we
> get an MCE...

True, though that's hard to link to a specific process crash.  It's
also hard to extract the page address in the process's address space
from that, although I don't think there's a current use case.

>
> > If this cannot be fixed then optimally it should be documented.
>
> I'm not convinced at all that jumping through hoops you're doing, is
> worth the effort.

That could be, again this could be fixed in the documentation instead.

>
> > As for "all that code", the memory failure handling code is of certain
> > size and this is a comparatively tiny fix for a tiny issue.
>
> No, I didn't say anything about the memory failure code - it is about

I was replying to your comment about the size of the change.

> supporting that obscure use case and the additional logic you're adding
> to the #MC handler which looks like a real mess already and us having to
> support that use case indefinitely.

Supporting something generally includes supporting the common and the
obscure cases.  From the user's point of view the kernel has been
committed to supporting these scenarios indefinitely or until the
deprecation of the SIGBUS-on-memory-error logic, and simply has a bug.

>
> So why does it matter if a process which is being executed and gets an
> MCE beyond the point of no return absolutely needs to return SIGBUS vs
> it getting killed and you still get an MCE logged on the machine, in
> either case?

A SIGSEGV strongly implies a problem with the program being run, not a
specific instance of it.  A SIGBUS could be not the program's fault,
like in this case.

In these tests the workload was simply relaunched on a SIGBUS which
sounded fair to me.  A qemu VM could similarly be restarted on an
unrecoverable MCE in a page that doesn't belong to the VM but to qemu
itself.

Best regards

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RESEND][PATCH 1/3] x86: Add task_struct flag to force SIGBUS on MCE
  2024-08-10  1:20                 ` Andrew Zaborowski
@ 2024-08-10  3:21                   ` Borislav Petkov
  2024-08-10  3:55                     ` Andrew Zaborowski
  0 siblings, 1 reply; 14+ messages in thread
From: Borislav Petkov @ 2024-08-10  3:21 UTC (permalink / raw)
  To: Andrew Zaborowski
  Cc: linux-edac, linux-mm, Eric Biederman, x86, Tony, Andrew Zaborowski

On Sat, Aug 10, 2024 at 03:20:10AM +0200, Andrew Zaborowski wrote:
> True, though that's hard to link to a specific process crash.

The process name when the MCE gets reported is *actually* there in the
splat: current->comm.

> I was replying to your comment about the size of the change.

My comment was about the code you're adding:

 arch/x86/kernel/cpu/mce/core.c | 18 +++++++++++++++++-
 fs/exec.c                      | 15 ++++++++++++---
 include/linux/sched.h          |  2 ++
 kernel/rseq.c                  | 25 +++++++++++++++++++++----
 4 files changed, 52 insertions(+), 8 deletions(-)

If it is in drivers, I don't care. But it is in main paths and for
a questionable use case.

> Supporting something generally includes supporting the common and the
> obscure cases.

Bullshit. We certainly won't support some obscure use cases just
because.

> From the user's point of view the kernel has been committed to
> supporting these scenarios indefinitely or until the deprecation of
> the SIGBUS-on-memory-error logic, and simply has a bug.

And lemme repeat my question:

So why does it matter if a process which is being executed and gets an
MCE beyond the point of no return absolutely needs to return SIGBUS vs
it getting killed and you still get an MCE logged on the machine, in
either case?

Bug which is seen by whom or what?

If a process dies, it dies.

> In these tests the workload was simply relaunched on a SIGBUS which
> sounded fair to me.  A qemu VM could similarly be restarted on an
> unrecoverable MCE in a page that doesn't belong to the VM but to qemu
> itself.

If an MCE hits at that particular point once in a blue moon, I don't
care. If it is a special use case where you inject an MCE right then and
there to test recovery actions, then that's perhaps a different story.

Usually, a lot of things can be done. As long as there's a valid use
case to support. But since you hesitate to explain what exactly we're
supporting, I can't help you.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RESEND][PATCH 1/3] x86: Add task_struct flag to force SIGBUS on MCE
  2024-08-10  3:21                   ` Borislav Petkov
@ 2024-08-10  3:55                     ` Andrew Zaborowski
  2024-08-10  9:25                       ` Borislav Petkov
  0 siblings, 1 reply; 14+ messages in thread
From: Andrew Zaborowski @ 2024-08-10  3:55 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: linux-edac, linux-mm, Eric Biederman, x86, Tony

On Sat, 10 Aug 2024 at 05:21, Borislav Petkov <bp@alien8.de> wrote:
> On Sat, Aug 10, 2024 at 03:20:10AM +0200, Andrew Zaborowski wrote:
> > True, though that's hard to link to a specific process crash.
>
> The process name when the MCE gets reported is *actually* there in the
> splat: current->comm.

That's the current process.  The list of processes to be signalled is
determined later and not in a simple way.

>
> > Supporting something generally includes supporting the common and the
> > obscure cases.
>
> Bullshit. We certainly won't support some obscure use cases just
> because.

It's simple reliability, if you support something only sometimes no
one can rely on it.  Without a deep analysis of their kernel code
snapshot at least.

>
> > From the user's point of view the kernel has been committed to
> > supporting these scenarios indefinitely or until the deprecation of
> > the SIGBUS-on-memory-error logic, and simply has a bug.
>
> And lemme repeat my question:
>
> So why does it matter if a process which is being executed and gets an
> MCE beyond the point of no return absolutely needs to return SIGBUS vs
> it getting killed and you still get an MCE logged on the machine, in
> either case?
>
> Bug which is seen by whom or what?

In the case I know of, by the parent process, it's basing some
decision on the signal number and the expected behavior from the
kernel even if not unambiguously documented.

Like I said it can be worked around in userspace, my change doesn't
*enable* the use case.

Best regards


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RESEND][PATCH 1/3] x86: Add task_struct flag to force SIGBUS on MCE
  2024-08-10  3:55                     ` Andrew Zaborowski
@ 2024-08-10  9:25                       ` Borislav Petkov
  0 siblings, 0 replies; 14+ messages in thread
From: Borislav Petkov @ 2024-08-10  9:25 UTC (permalink / raw)
  To: Andrew Zaborowski; +Cc: linux-edac, linux-mm, Eric Biederman, x86, Tony

So,

to sum up the thread here:

1. Arguably, tasks which have encountered a memory error should get sent
   a SIGBUS.

2. if a valid use case appears, the proper fix should be not to sprinkle
   special-handling code in random syscalls in a whack-a-mole fashion
   but to note somewhere in the task struct that the task has
   encountered a memory error and then *override* the signal it gets
   sent to it with a SIGBUS one.

   I.e., this should be a generic solution, if anything.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2024-08-10  9:24 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-07-23 14:47 [RESEND][PATCH 1/3] x86: Add task_struct flag to force SIGBUS on MCE Andrew Zaborowski
2024-07-23 14:47 ` [RESEND][PATCH 2/3] execve: Ensure SIGBUS delivered on memory failure Andrew Zaborowski
2024-07-23 14:47 ` [RESEND][PATCH 3/3] rseq: " Andrew Zaborowski
2024-08-06  4:37   ` Kees Cook
2024-08-06  7:51     ` Peter Zijlstra
2024-08-06 14:19       ` Mathieu Desnoyers
2024-08-06  4:36 ` [RESEND][PATCH 1/3] x86: Add task_struct flag to force SIGBUS on MCE Kees Cook
2024-08-06  8:35   ` Borislav Petkov
     [not found]     ` <SA1PR11MB69926BFE8EFDA7B3C3D84560E7B82@SA1PR11MB6992.namprd11.prod.outlook.com>
     [not found]       ` <CAOq732KXwsKdht55E-Z18choiAYn6dMpXc-TD15B7MOUH1fpxQ@mail.gmail.com>
     [not found]         ` <20240808145331.GAZrTb60FX_I3p0Ukx@fat_crate.local>
2024-08-09  1:22           ` Andrew Zaborowski
2024-08-09  8:34             ` Borislav Petkov
     [not found]               ` <SA1PR11MB69927AE28B46583DCB5C97DEE7BA2@SA1PR11MB6992.namprd11.prod.outlook.com>
2024-08-10  1:20                 ` Andrew Zaborowski
2024-08-10  3:21                   ` Borislav Petkov
2024-08-10  3:55                     ` Andrew Zaborowski
2024-08-10  9:25                       ` Borislav Petkov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox