* [PATCH v2] binfmt_elf: Dump smaller VMAs first in ELF cores
@ 2024-08-02 22:37 Brian Mak
2024-08-03 3:25 ` Eric W. Biederman
2024-08-04 15:23 ` [RFC PATCH] piped/ptraced coredump (was: Dump smaller VMAs first in ELF cores) Oleg Nesterov
0 siblings, 2 replies; 10+ messages in thread
From: Brian Mak @ 2024-08-02 22:37 UTC (permalink / raw)
To: Eric W. Biederman, Kees Cook, Alexander Viro, Christian Brauner,
Jan Kara, linux-fsdevel, linux-mm, linux-kernel
Cc: Oleg Nesterov
Large cores may be truncated in some scenarios, such as daemons with
stop timeouts that are not large enough, or a lack of disk space. This
impacts debuggability with large core dumps since critical information
necessary to form a usable backtrace, such as stacks and shared library
information, is omitted.
Attempting to find all the VMAs necessary to form a proper backtrace and
then prioritizing those VMAs specifically while core dumping is complex.
So instead, we can mitigate the impact of core dump truncation by
dumping smaller VMAs first, which may be more likely to contain memory
necessary to form a usable backtrace.
By sorting VMAs by dump size and dumping in that order, we have a
simple, yet effective heuristic.
Signed-off-by: Brian Mak <makb@juniper.net>
---
v2: Edited commit message to include more reasoning for sorting VMAs
Removed conditional VMA sorting with debugfs knob
Above edits suggested by Eric Biederman <ebiederm@xmission.com>
fs/binfmt_elf.c | 32 ++++++++++++++++++++++++++++++--
1 file changed, 30 insertions(+), 2 deletions(-)
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 19fa49cd9907..7feadbdd9ee6 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -37,6 +37,7 @@
#include <linux/elf-randomize.h>
#include <linux/utsname.h>
#include <linux/coredump.h>
+#include <linux/sort.h>
#include <linux/sched.h>
#include <linux/sched/coredump.h>
#include <linux/sched/task_stack.h>
@@ -1990,6 +1991,20 @@ static void fill_extnum_info(struct elfhdr *elf, struct elf_shdr *shdr4extnum,
shdr4extnum->sh_info = segs;
}
+static int cmp_vma_size(const void *vma_meta_lhs_ptr, const void *vma_meta_rhs_ptr)
+{
+ const struct core_vma_metadata *vma_meta_lhs = *(const struct core_vma_metadata **)
+ vma_meta_lhs_ptr;
+ const struct core_vma_metadata *vma_meta_rhs = *(const struct core_vma_metadata **)
+ vma_meta_rhs_ptr;
+
+ if (vma_meta_lhs->dump_size < vma_meta_rhs->dump_size)
+ return -1;
+ if (vma_meta_lhs->dump_size > vma_meta_rhs->dump_size)
+ return 1;
+ return 0;
+}
+
/*
* Actual dumper
*
@@ -2008,6 +2023,7 @@ static int elf_core_dump(struct coredump_params *cprm)
struct elf_shdr *shdr4extnum = NULL;
Elf_Half e_phnum;
elf_addr_t e_shoff;
+ struct core_vma_metadata **sorted_vmas = NULL;
/*
* The number of segs are recored into ELF header as 16bit value.
@@ -2071,9 +2087,20 @@ static int elf_core_dump(struct coredump_params *cprm)
if (!dump_emit(cprm, phdr4note, sizeof(*phdr4note)))
goto end_coredump;
+ /* Allocate memory to sort VMAs and sort. */
+ sorted_vmas = kvmalloc_array(cprm->vma_count, sizeof(*sorted_vmas), GFP_KERNEL);
+
+ if (!sorted_vmas)
+ goto end_coredump;
+
+ for (i = 0; i < cprm->vma_count; i++)
+ sorted_vmas[i] = cprm->vma_meta + i;
+
+ sort(sorted_vmas, cprm->vma_count, sizeof(*sorted_vmas), cmp_vma_size, NULL);
+
/* Write program headers for segments dump */
for (i = 0; i < cprm->vma_count; i++) {
- struct core_vma_metadata *meta = cprm->vma_meta + i;
+ struct core_vma_metadata *meta = sorted_vmas[i];
struct elf_phdr phdr;
phdr.p_type = PT_LOAD;
@@ -2111,7 +2138,7 @@ static int elf_core_dump(struct coredump_params *cprm)
dump_skip_to(cprm, dataoff);
for (i = 0; i < cprm->vma_count; i++) {
- struct core_vma_metadata *meta = cprm->vma_meta + i;
+ struct core_vma_metadata *meta = sorted_vmas[i];
if (!dump_user_range(cprm, meta->start, meta->dump_size))
goto end_coredump;
@@ -2129,6 +2156,7 @@ static int elf_core_dump(struct coredump_params *cprm)
free_note_info(&info);
kfree(shdr4extnum);
kfree(phdr4note);
+ kvfree(sorted_vmas);
return has_dumped;
}
base-commit: 94ede2a3e9135764736221c080ac7c0ad993dc2d
--
2.25.1
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v2] binfmt_elf: Dump smaller VMAs first in ELF cores
2024-08-02 22:37 [PATCH v2] binfmt_elf: Dump smaller VMAs first in ELF cores Brian Mak
@ 2024-08-03 3:25 ` Eric W. Biederman
2024-08-04 15:23 ` [RFC PATCH] piped/ptraced coredump (was: Dump smaller VMAs first in ELF cores) Oleg Nesterov
1 sibling, 0 replies; 10+ messages in thread
From: Eric W. Biederman @ 2024-08-03 3:25 UTC (permalink / raw)
To: Brian Mak
Cc: Kees Cook, Alexander Viro, Christian Brauner, Jan Kara,
linux-fsdevel, linux-mm, linux-kernel, Oleg Nesterov
Brian Mak <makb@juniper.net> writes:
> Large cores may be truncated in some scenarios, such as daemons with
> stop timeouts that are not large enough, or a lack of disk space. This
> impacts debuggability with large core dumps since critical information
> necessary to form a usable backtrace, such as stacks and shared library
> information, is omitted.
>
> Attempting to find all the VMAs necessary to form a proper backtrace and
> then prioritizing those VMAs specifically while core dumping is complex.
> So instead, we can mitigate the impact of core dump truncation by
> dumping smaller VMAs first, which may be more likely to contain memory
> necessary to form a usable backtrace.
I think I would say something like: We attempted to figure out which
VMAs are needed to create a useful backtrace, and it turned out to
be a non-trivial problem. So instead you tried simply sorting
the vma's by size and that worked.
Something to convey that it is a tricky problem without an easy solution,
and that it wasn't worth solving.
> By sorting VMAs by dump size and dumping in that order, we have a
> simple, yet effective heuristic.
I like the improvement to the description, and I like the improved
simplicity of this.
I will repeat myself quickly and say I would like this even better if
the sorting could be moved to the end of dump_vma_snapshot in
fs/coredump.c, and the original array of vmas could be sorted.
That removes all concerns about taking up more memory, as well
as adding the improvement to ELF FDPIC as well.
Sorting the original array will also change fill_files_note,
which will result in the NT_FILE note ordering the mapped files
in the same order as the program headers are written. There
is enough information that I don't think it matters, but consistency
is nice.
Eric
> Signed-off-by: Brian Mak <makb@juniper.net>
> ---
>
> v2: Edited commit message to include more reasoning for sorting VMAs
> Removed conditional VMA sorting with debugfs knob
> Above edits suggested by Eric Biederman <ebiederm@xmission.com>
>
> fs/binfmt_elf.c | 32 ++++++++++++++++++++++++++++++--
> 1 file changed, 30 insertions(+), 2 deletions(-)
>
> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> index 19fa49cd9907..7feadbdd9ee6 100644
> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -37,6 +37,7 @@
> #include <linux/elf-randomize.h>
> #include <linux/utsname.h>
> #include <linux/coredump.h>
> +#include <linux/sort.h>
> #include <linux/sched.h>
> #include <linux/sched/coredump.h>
> #include <linux/sched/task_stack.h>
> @@ -1990,6 +1991,20 @@ static void fill_extnum_info(struct elfhdr *elf, struct elf_shdr *shdr4extnum,
> shdr4extnum->sh_info = segs;
> }
>
> +static int cmp_vma_size(const void *vma_meta_lhs_ptr, const void *vma_meta_rhs_ptr)
> +{
> + const struct core_vma_metadata *vma_meta_lhs = *(const struct core_vma_metadata **)
> + vma_meta_lhs_ptr;
> + const struct core_vma_metadata *vma_meta_rhs = *(const struct core_vma_metadata **)
> + vma_meta_rhs_ptr;
> +
> + if (vma_meta_lhs->dump_size < vma_meta_rhs->dump_size)
> + return -1;
> + if (vma_meta_lhs->dump_size > vma_meta_rhs->dump_size)
> + return 1;
> + return 0;
> +}
> +
> /*
> * Actual dumper
> *
> @@ -2008,6 +2023,7 @@ static int elf_core_dump(struct coredump_params *cprm)
> struct elf_shdr *shdr4extnum = NULL;
> Elf_Half e_phnum;
> elf_addr_t e_shoff;
> + struct core_vma_metadata **sorted_vmas = NULL;
>
> /*
> * The number of segs are recored into ELF header as 16bit value.
> @@ -2071,9 +2087,20 @@ static int elf_core_dump(struct coredump_params *cprm)
> if (!dump_emit(cprm, phdr4note, sizeof(*phdr4note)))
> goto end_coredump;
>
> + /* Allocate memory to sort VMAs and sort. */
> + sorted_vmas = kvmalloc_array(cprm->vma_count, sizeof(*sorted_vmas), GFP_KERNEL);
> +
> + if (!sorted_vmas)
> + goto end_coredump;
> +
> + for (i = 0; i < cprm->vma_count; i++)
> + sorted_vmas[i] = cprm->vma_meta + i;
> +
> + sort(sorted_vmas, cprm->vma_count, sizeof(*sorted_vmas), cmp_vma_size, NULL);
> +
> /* Write program headers for segments dump */
> for (i = 0; i < cprm->vma_count; i++) {
> - struct core_vma_metadata *meta = cprm->vma_meta + i;
> + struct core_vma_metadata *meta = sorted_vmas[i];
> struct elf_phdr phdr;
>
> phdr.p_type = PT_LOAD;
> @@ -2111,7 +2138,7 @@ static int elf_core_dump(struct coredump_params *cprm)
> dump_skip_to(cprm, dataoff);
>
> for (i = 0; i < cprm->vma_count; i++) {
> - struct core_vma_metadata *meta = cprm->vma_meta + i;
> + struct core_vma_metadata *meta = sorted_vmas[i];
>
> if (!dump_user_range(cprm, meta->start, meta->dump_size))
> goto end_coredump;
> @@ -2129,6 +2156,7 @@ static int elf_core_dump(struct coredump_params *cprm)
> free_note_info(&info);
> kfree(shdr4extnum);
> kfree(phdr4note);
> + kvfree(sorted_vmas);
> return has_dumped;
> }
>
>
> base-commit: 94ede2a3e9135764736221c080ac7c0ad993dc2d
* [RFC PATCH] piped/ptraced coredump (was: Dump smaller VMAs first in ELF cores)
2024-08-02 22:37 [PATCH v2] binfmt_elf: Dump smaller VMAs first in ELF cores Brian Mak
2024-08-03 3:25 ` Eric W. Biederman
@ 2024-08-04 15:23 ` Oleg Nesterov
2024-08-04 17:47 ` Linus Torvalds
1 sibling, 1 reply; 10+ messages in thread
From: Oleg Nesterov @ 2024-08-04 15:23 UTC (permalink / raw)
To: Brian Mak, Linus Torvalds
Cc: Eric W. Biederman, Kees Cook, Alexander Viro, Christian Brauner,
Jan Kara, linux-fsdevel, linux-mm, linux-kernel
On 08/02, Brian Mak wrote:
>
> Large cores may be truncated in some scenarios, such as daemons with
> stop timeouts that are not large enough, or a lack of disk space. This
> impacts debuggability with large core dumps since critical information
> necessary to form a usable backtrace, such as stacks and shared library
> information, is omitted.
>
> Attempting to find all the VMAs necessary to form a proper backtrace and
> then prioritizing those VMAs specifically while core dumping is complex.
> So instead, we can mitigate the impact of core dump truncation by
> dumping smaller VMAs first, which may be more likely to contain memory
> necessary to form a usable backtrace.
I thought of another approach... See the simple patch below,
- Incomplete, obviously not for inclusion. I think a new
PTRACE_EVENT_COREDUMP makes sense; this will simplify
the code even more.
- Needs some preparations. In particular, I still think we
should reintroduce SIGNAL_GROUP_COREDUMP regardless of
this feature, but let's not discuss this right now.
This patch adds the new %T specifier to core_pattern, so that
$ echo '|/path/to/dumper %T' > /proc/sys/kernel/core_pattern
means that the coredumping thread will run as a traced child of the
"dumper" process, and it will stop in TASK_TRACED before it calls
binfmt->core_dump().
So the dumper process can extract/save the backtrace/registers/whatever
first, then do PTRACE_CONT or kill the tracee if it doesn't need the
"full" coredump.
Of course this won't work if the dumping thread is already ptraced,
but in this case the debugger has all the necessary info.
What do you think?
Oleg.
---
diff --git a/fs/coredump.c b/fs/coredump.c
index 7f12ff6ad1d3..fbe8e5ae7c00 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -337,6 +337,10 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
case 'C':
err = cn_printf(cn, "%d", cprm->cpu);
break;
+ case 'T':
+ // XXX explain that we don't need get_task_struct()
+ cprm->traceme = current;
+ break;
default:
break;
}
@@ -516,9 +520,30 @@ static int umh_pipe_setup(struct subprocess_info *info, struct cred *new)
/* and disallow core files too */
current->signal->rlim[RLIMIT_CORE] = (struct rlimit){1, 1};
+ if (cp->traceme) {
+ if (ptrace_attach(cp->traceme, PTRACE_SEIZE, 0,0))
+ cp->traceme = NULL;
+ }
+
return err;
}
+static void umh_pipe_cleanup(struct subprocess_info *info)
+{
+ struct coredump_params *cp = (struct coredump_params *)info->data;
+ // XXX: we can't rely on this check, for example
+ // CONFIG_STATIC_USERMODEHELPER_PATH == ""
+ if (cp->traceme) {
+ // XXX: meaningful exit_code/message, maybe new PTRACE_EVENT_
+ ptrace_notify(SIGTRAP, 0);
+
+ spin_lock_irq(&current->sighand->siglock);
+ if (!__fatal_signal_pending(current))
+ clear_thread_flag(TIF_SIGPENDING);
+ spin_unlock_irq(&current->sighand->siglock);
+ }
+}
+
void do_coredump(const kernel_siginfo_t *siginfo)
{
struct core_state core_state;
@@ -637,7 +662,8 @@ void do_coredump(const kernel_siginfo_t *siginfo)
retval = -ENOMEM;
sub_info = call_usermodehelper_setup(helper_argv[0],
helper_argv, NULL, GFP_KERNEL,
- umh_pipe_setup, NULL, &cprm);
+ umh_pipe_setup, umh_pipe_cleanup,
+ &cprm);
if (sub_info)
retval = call_usermodehelper_exec(sub_info,
UMH_WAIT_EXEC);
diff --git a/include/linux/coredump.h b/include/linux/coredump.h
index 0904ba010341..490b6c5e05d8 100644
--- a/include/linux/coredump.h
+++ b/include/linux/coredump.h
@@ -28,6 +28,7 @@ struct coredump_params {
int vma_count;
size_t vma_data_size;
struct core_vma_metadata *vma_meta;
+ struct task_struct *traceme;
};
extern unsigned int core_file_note_size_limit;
diff --git a/include/linux/ptrace.h b/include/linux/ptrace.h
index 90507d4afcd6..13aed4c358b6 100644
--- a/include/linux/ptrace.h
+++ b/include/linux/ptrace.h
@@ -46,6 +46,9 @@ extern int ptrace_access_vm(struct task_struct *tsk, unsigned long addr,
#define PT_EXITKILL (PTRACE_O_EXITKILL << PT_OPT_FLAG_SHIFT)
#define PT_SUSPEND_SECCOMP (PTRACE_O_SUSPEND_SECCOMP << PT_OPT_FLAG_SHIFT)
+extern int ptrace_attach(struct task_struct *task, long request,
+ unsigned long addr, unsigned long flags);
+
extern long arch_ptrace(struct task_struct *child, long request,
unsigned long addr, unsigned long data);
extern int ptrace_readdata(struct task_struct *tsk, unsigned long src, char __user *dst, int len);
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index d5f89f9ef29f..47f1e09f8fc9 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -406,9 +406,8 @@ static inline void ptrace_set_stopped(struct task_struct *task, bool seize)
}
}
-static int ptrace_attach(struct task_struct *task, long request,
- unsigned long addr,
- unsigned long flags)
+int ptrace_attach(struct task_struct *task, long request,
+ unsigned long addr, unsigned long flags)
{
bool seize = (request == PTRACE_SEIZE);
int retval;
* Re: [RFC PATCH] piped/ptraced coredump (was: Dump smaller VMAs first in ELF cores)
2024-08-04 15:23 ` [RFC PATCH] piped/ptraced coredump (was: Dump smaller VMAs first in ELF cores) Oleg Nesterov
@ 2024-08-04 17:47 ` Linus Torvalds
2024-08-04 18:53 ` Oleg Nesterov
2024-08-05 17:56 ` Brian Mak
0 siblings, 2 replies; 10+ messages in thread
From: Linus Torvalds @ 2024-08-04 17:47 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Brian Mak, Eric W. Biederman, Kees Cook, Alexander Viro,
Christian Brauner, Jan Kara, linux-fsdevel, linux-mm,
linux-kernel
On Sun, 4 Aug 2024 at 08:23, Oleg Nesterov <oleg@redhat.com> wrote:
>
> What do you think?
Eww. I really don't like giving the dumper ptrace rights.
I think the real limitations of the "dump to pipe" is that it's just
being very stupid. Which is fine in the sense that core dumps aren't
likely to be a huge priority. But if or when they _are_ a priority,
it's not a great model.
So I prefer the original patch because it's also small, but it's
conceptually much smaller.
That said, even that simplified v2 looks a bit excessive to me.
Does it really help so much to create a new array of core_vma_metadata
pointers - could we not just sort those things in place?
Also, honestly, if the issue is core dump truncation, at some point we
should just support truncating individual mappings rather than the
whole core file anyway. No?
Depending on what the major issue is, we might also tweak the
heuristics for which vmas get written out.
For example, I wouldn't be surprised if there's a fair number of "this
read-only private file mapping gets written out because it has been
written to" due to runtime linking. And I kind of suspect that in many
cases that's not all that interesting.
Anyway, I assume that Brian had some specific problem case that
triggered this all, and I'd like to know a bit more.
Linus
* Re: [RFC PATCH] piped/ptraced coredump (was: Dump smaller VMAs first in ELF cores)
2024-08-04 17:47 ` Linus Torvalds
@ 2024-08-04 18:53 ` Oleg Nesterov
2024-08-04 19:18 ` Linus Torvalds
2024-08-05 17:56 ` Brian Mak
1 sibling, 1 reply; 10+ messages in thread
From: Oleg Nesterov @ 2024-08-04 18:53 UTC (permalink / raw)
To: Linus Torvalds
Cc: Brian Mak, Eric W. Biederman, Kees Cook, Alexander Viro,
Christian Brauner, Jan Kara, linux-fsdevel, linux-mm,
linux-kernel
OK, I won't insist, just a couple of notes.
On 08/04, Linus Torvalds wrote:
>
> On Sun, 4 Aug 2024 at 08:23, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > What do you think?
>
> Eww. I really don't like giving the dumper ptrace rights.
Why?
Apart from SIGKILL, the dumper already has the full control.
And note that the dumper can already use ptrace. It can do, say,
ptrace(PTRACE_SEIZE, PTRACE_O_TRACEEXIT), close stdin, and wait
for PTRACE_EVENT_EXIT.
IIRC some people already do this, %T just makes the usage of ptrace
more convenient/powerful in this case.
> So I prefer the original patch because it's also small, but it's
> conceptually much smaller.
Ah, sorry. I didn't mean that %T makes the Brian's patch unnecessary,
I just wanted to discuss this feature "on a related note".
Oleg.
* Re: [RFC PATCH] piped/ptraced coredump (was: Dump smaller VMAs first in ELF cores)
2024-08-04 18:53 ` Oleg Nesterov
@ 2024-08-04 19:18 ` Linus Torvalds
2024-08-04 20:01 ` Oleg Nesterov
0 siblings, 1 reply; 10+ messages in thread
From: Linus Torvalds @ 2024-08-04 19:18 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Brian Mak, Eric W. Biederman, Kees Cook, Alexander Viro,
Christian Brauner, Jan Kara, linux-fsdevel, linux-mm,
linux-kernel
On Sun, 4 Aug 2024 at 11:53, Oleg Nesterov <oleg@redhat.com> wrote:
>
> Apart from SIGKILL, the dumper already has the full control.
What do you mean? It's a regular usermodehelper. It gets the dump data
as input. That's all the control it has.
> And note that the dumper can already use ptrace.
.. with the normal ptrace() rules, yes.
You realize that some setups literally disable ptrace() system calls,
right? Which your patch now effectively sidesteps.
THAT is why I don't like it. ptrace() is *dangerous*. It is very
typically one of the things that people limit for various reasons.
Just adding some implicit tracing willy-nilly needs to be something
people really worry about.
Linus
* Re: [RFC PATCH] piped/ptraced coredump (was: Dump smaller VMAs first in ELF cores)
2024-08-04 19:18 ` Linus Torvalds
@ 2024-08-04 20:01 ` Oleg Nesterov
0 siblings, 0 replies; 10+ messages in thread
From: Oleg Nesterov @ 2024-08-04 20:01 UTC (permalink / raw)
To: Linus Torvalds
Cc: Brian Mak, Eric W. Biederman, Kees Cook, Alexander Viro,
Christian Brauner, Jan Kara, linux-fsdevel, linux-mm,
linux-kernel
On 08/04, Linus Torvalds wrote:
>
> On Sun, 4 Aug 2024 at 11:53, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > Apart from SIGKILL, the dumper already has the full control.
>
> What do you mean? It's a regular usermodehelper. It gets the dump data
> as input. That's all the control it has.
I meant, the dumping thread can't exit until the dumper reads the data
from stdin or closes the pipe. Until then the dumper can read /proc/pid/mem
and do other things.
> > And note that the dumper can already use ptrace.
>
> .. with the normal ptrace() rules, yes.
>
> You realize that some setups literally disable ptrace() system calls,
> right? Which your patch now effectively sidesteps.
Well. If, say, selinux disables ptrace, then ptrace_attach() in this
patch should also fail.
But if some setups disable sys_ptrace() as a system call... then yes,
I didn't know that.
> THAT is why I don't like it. ptrace() is *dangerous*.
And horrible ;)
> Just adding some implicit tracing willy-nilly needs to be something
> people really worry about.
Ok, as I said I won't insist.
Oleg.
* Re: [RFC PATCH] piped/ptraced coredump (was: Dump smaller VMAs first in ELF cores)
2024-08-04 17:47 ` Linus Torvalds
2024-08-04 18:53 ` Oleg Nesterov
@ 2024-08-05 17:56 ` Brian Mak
2024-08-05 19:10 ` Linus Torvalds
1 sibling, 1 reply; 10+ messages in thread
From: Brian Mak @ 2024-08-05 17:56 UTC (permalink / raw)
To: Linus Torvalds
Cc: Oleg Nesterov, Eric W. Biederman, Kees Cook, Alexander Viro,
Christian Brauner, Jan Kara, linux-fsdevel, linux-mm,
linux-kernel
On Aug 4, 2024, at 10:47 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Sun, 4 Aug 2024 at 08:23, Oleg Nesterov <oleg@redhat.com> wrote:
>>
>> What do you think?
>
> Eww. I really don't like giving the dumper ptrace rights.
>
> I think the real limitations of the "dump to pipe" is that it's just
> being very stupid. Which is fine in the sense that core dumps aren't
> likely to be a huge priority. But if or when they _are_ a priority,
> it's not a great model.
>
> So I prefer the original patch because it's also small, but it's
> conceptually much smaller.
>
> That said, even that simplified v2 looks a bit excessive to me.
>
> Does it really help so much to create a new array of core_vma_metadata
> pointers - could we not just sort those things in place?
Hi Linus,
Thanks for taking the time to reply.
Yep, I don't see any immediate reason for why we can't sort this in
place to begin with.
Thanks, Eric, for originally bringing this up. Will send out a v3 with
these edits.
> Also, honestly, if the issue is core dump truncation, at some point we
> should just support truncating individual mappings rather than the
> whole core file anyway. No?
Do you mean support truncating VMAs in addition to sorting or as a
replacement for sorting? If you mean in addition, then I agree: there may
be some VMAs that are known to not contain information critical to
debugging, but that may still aid it, and therefore have lower priority.
If you mean as a replacement to sorting, then we'd need to know exactly
which VMAs to keep/discard, which is a non-trivial task, as discussed in
v1 of my patch, and so it doesn't seem like a viable alternative.
> Depending on what the major issue is, we might also tweak the
> heuristics for which vmas get written out.
>
> For example, I wouldn't be surprised if there's a fair number of "this
> read-only private file mapping gets written out because it has been
> written to" due to runtime linking. And I kind of suspect that in many
> cases that's not all that interesting.
>
> Anyway, I assume that Brian had some specific problem case that
> triggered this all, and I'd like to know a bit more.
Yes, there were a couple problem cases that triggered the need for this
patch. I'll repeat what I said in v1 about this:
At Juniper, we have some daemons that can consume a lot of memory and,
upon crash, can result in core dumps of several GBs. While dumping,
we've encountered these two scenarios resulting in an unusable core:
1. Disk space is low at the time of core dump, resulting in a truncated
core once the disk is full.
2. A daemon has a TimeoutStopSec option configured in its systemd unit
file, where, upon core dumping, it could time out (triggering a SIGKILL)
if the core dump is too large and is taking too long to dump.
In both scenarios, we see that the core file is already several GB, and
still does not contain the information necessary to form a backtrace,
thus creating the need for this change. In the second scenario, we are
unable to increase the timeout option due to our recovery time objective
requirements.
Best,
Brian Mak
> Linus
* Re: [RFC PATCH] piped/ptraced coredump (was: Dump smaller VMAs first in ELF cores)
2024-08-05 17:56 ` Brian Mak
@ 2024-08-05 19:10 ` Linus Torvalds
2024-08-05 21:27 ` Brian Mak
0 siblings, 1 reply; 10+ messages in thread
From: Linus Torvalds @ 2024-08-05 19:10 UTC (permalink / raw)
To: Brian Mak
Cc: Oleg Nesterov, Eric W. Biederman, Kees Cook, Alexander Viro,
Christian Brauner, Jan Kara, linux-fsdevel, linux-mm,
linux-kernel
On Mon, 5 Aug 2024 at 10:56, Brian Mak <makb@juniper.net> wrote:
>
> Do you mean support truncating VMAs in addition to sorting or as a
> replacement to sorting? If you mean in addition, then I agree, there may
> be some VMAs that are known to not contain information critical to
> debugging, but may aid, and therefore have less priority.
I'd consider it a completely separate issue, so it would be
independent of the sorting.
We have "ulimit -c" to limit core sizes, but I think it might be
interesting to have a separate "limit individual mapping sizes" logic.
We already have that as a concept: vma_dump_size() could easily limit
the vma dump size, but currently only picks "all or nothing", except
for executable mappings that contain actual ELF headers (then it will
dump the first page only).
And honestly, *particularly* if you have a limit on the core size, I
suspect you'd be better off dumping some of all vma's rather than
dumping all of some vma's.
Now, your sorting approach obviously means that large vma's no longer
stop smaller ones from dumping, so it does take care of that part of
it. But I do wonder if we should just in general not dump crazy big
vmas if the dump size has been limited.
Linus
* Re: [RFC PATCH] piped/ptraced coredump (was: Dump smaller VMAs first in ELF cores)
2024-08-05 19:10 ` Linus Torvalds
@ 2024-08-05 21:27 ` Brian Mak
0 siblings, 0 replies; 10+ messages in thread
From: Brian Mak @ 2024-08-05 21:27 UTC (permalink / raw)
To: Linus Torvalds
Cc: Oleg Nesterov, Eric W. Biederman, Kees Cook, Alexander Viro,
Christian Brauner, Jan Kara, linux-fsdevel, linux-mm,
linux-kernel
On Aug 5, 2024, at 12:10 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Mon, 5 Aug 2024 at 10:56, Brian Mak <makb@juniper.net> wrote:
>>
>> Do you mean support truncating VMAs in addition to sorting or as a
>> replacement to sorting? If you mean in addition, then I agree, there may
>> be some VMAs that are known to not contain information critical to
>> debugging, but may aid, and therefore have less priority.
>
> I'd consider it a completely separate issue, so it would be
> independent of the sorting.
>
> We have "ulimit -c" to limit core sizes, but I think it might be
> interesting to have a separate "limit individual mapping sizes" logic.
>
> We already have that as a concept: vma_dump_size() could easily limit
> the vma dump size, but currently only picks "all or nothing", except
> for executable mappings that contain actual ELF headers (then it will
> dump the first page only).
>
> And honestly, *particularly* if you have a limit on the core size, I
> suspect you'd be better off dumping some of all vma's rather than
> dumping all of some vma's.
Oh ok, I understand what you're suggesting now. I like the concept of
limiting the sizes of individual mappings, but I don't really like the
idea of a fixed maximum size like with "ulimit -c". In cases where there
is plenty of free disk space, a user might want larger cores to debug
more effectively. In cases (even on the same machine) where there is
suddenly less disk space available, a user would want that cutoff to be
smaller so that they can effectively grab some of all VMAs.
Also, in cases like the systemd timeout scenario where there is a time
limit for dumping, then the amount to dump would be variable depending
on the core pattern script and/or throughput of the medium the core is
being written to. In this scenario, the maximum size cannot be
determined ahead of time.
However, making it so that we don't need a maximum size determined ahead
of time (and can just terminate the core dumping) seems difficult. We
could make it so that VMAs are dumped piece by piece, one VMA at a time,
until it either reaches the end or gets terminated. Not sure what an
effective way to implement this would be while staying within the
confines of the ELF specification though, i.e. how can this be properly
streamed out and still be in ELF format?
> Now, your sorting approach obviously means that large vma's no longer
> stop smaller ones from dumping, so it does take care of that part of
> it. But I do wonder if we should just in general not dump crazy big
> vmas if the dump size has been limited.
Google actually did something like this in an old core dumper library,
where they excluded large VMAs until the core dump is at or below the
dump size limit:
Git: https://github.com/anatol/google-coredumper.git
Reference: src/elfcore.c, L1030
It's not a bad idea to exclude large VMAs in scenarios where there are
limits, but again, not a huge fan of the predetermined dump size limit.
Best,
Brian Mak
> Linus
Thread overview: 10+ messages
2024-08-02 22:37 [PATCH v2] binfmt_elf: Dump smaller VMAs first in ELF cores Brian Mak
2024-08-03 3:25 ` Eric W. Biederman
2024-08-04 15:23 ` [RFC PATCH] piped/ptraced coredump (was: Dump smaller VMAs first in ELF cores) Oleg Nesterov
2024-08-04 17:47 ` Linus Torvalds
2024-08-04 18:53 ` Oleg Nesterov
2024-08-04 19:18 ` Linus Torvalds
2024-08-04 20:01 ` Oleg Nesterov
2024-08-05 17:56 ` Brian Mak
2024-08-05 19:10 ` Linus Torvalds
2024-08-05 21:27 ` Brian Mak