From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
To: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: kosaki.motohiro@jp.fujitsu.com,
	David Rientjes <rientjes@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Rik van Riel <riel@redhat.com>, Paul Menage <menage@google.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [patch -mm v2] mm: introduce oom_adj_child
Date: Fri, 31 Jul 2009 15:50:27 +0900 (JST)
Message-ID: <20090731154823.B6EF.A69D9226@jp.fujitsu.com>
In-Reply-To: <20090731093305.50bcc58d.kamezawa.hiroyu@jp.fujitsu.com>

> On Thu, 30 Jul 2009 12:05:30 -0700 (PDT)
> David Rientjes <rientjes@google.com> wrote:
> 
> > On Thu, 30 Jul 2009, KAMEZAWA Hiroyuki wrote:
> > 
> > > > If you have suggestions for a better name, I'd happily ack it.
> > > > 
> > > 
> > > Simply, reset_oom_adj_at_new_mm_context or some.
> > > 
> > 
> > I think it's preferred to keep the name relatively short which is an 
> > unfortunate requirement in this case.  I also prefer to start the name 
> > with "oom_adj" so it appears alongside /proc/pid/oom_adj when listed 
> > alphabetically.
> > 
> But misleading name is bad.
> 
> 
> 
> > > > > 2. More simple plan is like this, IIUC.
> > > > > 
> > > > >   fix oom-killer's select_bad_process() not to be in deadlock.
> > > > > 
> > > > 
> > > > Alternate ideas?
> > > > 
> > > After brief thinking:
> > > 
> > > 1. move oom_adj from mm_struct to signal struct. or somewhere.
> > >    (see copy_signal())
> > >    Then,
> > >     - all threads in a process will have the same oom_adj.
> > >     - vfork()'ed thread will inherit its parent's oom_adj.   
> > >     - vfork()'ed thread can override oom_adj of its own.
> > > 
> > >     In other words, oom_adj is shared when CLONE_PARENT is not set.
> > > 
> > 
> > Hmm, didn't we talk about signal_struct already?  The problem with that 
> > approach is that oom_adj values represent a killable quantity of memory, 
> > so having multiple threads sharing the same mm_struct with one set to 
> > OOM_DISABLE and the other at +15 will still livelock because the oom 
> > killer can't kill either.
> >
> > > 2. rename  mm_struct's oom_adj as shadow_oom_adj.
> > > 
> > >    update this shadow_oom_adj as the highest oom_adj among
> > >    the values all threads share this mm_struct have.
> > >    This update is done when
> > >    - mm_init()
> > >    - oom_adj is written.
> > > 
> > >    User's 
> > >    # echo XXXX > /proc/<x>/oom_adj
> > >    is not necessary to be very very fast.
> > > 
> > >    I don't think a process which calls vfork() is multi-threaded.
> > > 
> > > 3. use shadow_oom_adj in select_bad_process().
> > > 
> > 
> > Ideas 2 & 3 here seem to be a single proposal.  The problem is that it 
> > still leaves /proc/pid/oom_score to be inconsistent with the badness 
> > scoring that the oom killer will eventually use since if it oom kills one 
> > task, it must kill all tasks sharing the same mm_struct to lead to future 
> > memory freeing.
> > 
> yes.
> 
> > Additionally, if you were to set one thread to OOM_DISABLE, storing the 
> > highest oom_adj value in mm_struct isn't going to help because 
> > oom_kill_task() will still require a tasklist scan to ensure no threads 
> > sharing the mm_struct are OOM_DISABLE and the livelock persists.
> > 
> 
> Why don't you think select_bad_process()-> oom_kill_task() implementation is bad ?
> IMHO, it's bad manner to fix an os-implementation problem by adding _new_ user
> interface which is hard to understand.
> 
> 
> > In other words, the issue here is larger than the inheritance of the 
> > oom_adj value amongst children, it addresses a livelock that neither of 
> > your approaches solve.  The fix actually makes /proc/pid/oom_adj (and 
> > /proc/pid/oom_score) consistent with how the oom killer behaves.
> 
> This oom_adj_child itself is not related to livelock problem. Don't make
> the problem bigger than it is.
> oom_adj_child itself is just a problem how to handle vfork().


I wrote up my proposed patch today.
It has the following characteristics:

o oom_adj is per-process (stored in signal_struct)
o it does not livelock
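
To make the first point concrete, here is a small userspace model (not
kernel code) of the sharing rules I have in mind. All model_* names are
hypothetical stand-ins invented for this illustration. It assumes that
threads of one process share both the mm and the signal struct, while a
vfork()ed child shares the mm but gets its own signal struct that starts
from the parent's value.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins for mm_struct, signal_struct, task_struct. */
struct model_mm {
	int users;			/* how many tasks use this address space */
};

struct model_signal {
	signed char oom_adj;		/* per-process value in this model */
};

struct model_task {
	struct model_mm *mm;
	struct model_signal *signal;
};

/* A new thread shares the mm and the signal struct with its creator. */
static struct model_task model_thread(const struct model_task *parent)
{
	struct model_task t = { parent->mm, parent->signal };

	t.mm->users++;
	return t;
}

/*
 * A vfork()ed child shares the mm but gets a private signal struct,
 * initialized from the parent's (assumption of this model).
 */
static struct model_task model_vfork(const struct model_task *parent)
{
	struct model_task t;

	t.mm = parent->mm;
	t.mm->users++;
	t.signal = malloc(sizeof(*t.signal));
	assert(t.signal != NULL);
	*t.signal = *parent->signal;
	return t;
}
```

In this model, writing oom_adj through the vfork()ed child changes only
the child's signal struct, so the parent keeps its own value even though
both tasks still point at the same mm. That is the behaviour the test
program below is meant to check.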


Please comment.



Patch against 2.6.31-rc4

===========================
Subject: [PATCH] move oom_adj to task->signal

test program
----------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>	/* wait() */

#define BUF_SIZE 128

void oom_adj_print(void)
{
	FILE* file;
	char buf[BUF_SIZE];

	file = fopen("/proc/self/oom_adj", "r");
	if (!file) {
		perror("fopen");
		exit(1);
	}

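	/* bound the read to BUF_SIZE - 1 characters plus the terminator */
	if (fscanf(file, "%127s", buf) != 1) {
		perror("fscanf");
		exit(1);
	}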
	printf("%s\n", buf);

	fclose(file);
}

void oom_adj_write(int value)
{
	FILE* file;
	int ret;

	file = fopen("/proc/self/oom_adj", "w");
	if (!file) {
		perror("fopen");
		exit(1);
	}

	/* fprintf() returns a negative value on error */
	ret = fprintf(file, "%d", value);
	if (ret < 0) {
		perror("fprintf");
		exit(1);
	}

	fclose(file);
}

int main(void)
{
	int status;

	oom_adj_print();
	oom_adj_write(1);
	oom_adj_print();

	printf("vfork\n");
	if (vfork() == 0) {
		/* child */
		oom_adj_print();
		oom_adj_write(2);
		oom_adj_print();
		_exit(0);
	}
	wait(&status);
	oom_adj_print();

	return 0;
}

test result:
---------------------------------
% ./a.out
0
1
vfork
1
2
1



Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reported-by: Paul Menage <menage@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 fs/proc/base.c           |    7 ++++---
 include/linux/mm_types.h |    3 ++-
 include/linux/oom.h      |    1 +
 include/linux/sched.h    |    2 ++
 kernel/exit.c            |    2 ++
 kernel/fork.c            |    2 ++
 mm/oom_kill.c            |   14 +++++++++-----
 7 files changed, 22 insertions(+), 9 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 3ce5ae9..c64499e 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1008,7 +1008,7 @@ static ssize_t oom_adjust_read(struct file *file, char __user *buf,
 		return -ESRCH;
 	task_lock(task);
 	if (task->mm)
-		oom_adjust = task->mm->oom_adj;
+		oom_adjust = task->signal->oom_adj;
 	else
 		oom_adjust = OOM_DISABLE;
 	task_unlock(task);
@@ -1046,12 +1046,13 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
 		put_task_struct(task);
 		return -EINVAL;
 	}
-	if (oom_adjust < task->mm->oom_adj && !capable(CAP_SYS_RESOURCE)) {
+	if (oom_adjust < task->signal->oom_adj && !capable(CAP_SYS_RESOURCE)) {
 		task_unlock(task);
 		put_task_struct(task);
 		return -EACCES;
 	}
-	task->mm->oom_adj = oom_adjust;
+	task->signal->oom_adj = oom_adjust;
+	task->mm->oom_adj_cached = OOM_CACHE_DEFAULT;
 	task_unlock(task);
 	put_task_struct(task);
 	if (end - buffer == 0)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 7acc843..f93f97f 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -240,7 +240,8 @@ struct mm_struct {
 
 	unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
 
-	s8 oom_adj;	/* OOM kill score adjustment (bit shift) */
+	s8 oom_adj_cached;	/* mirror from signal_struct->oom_adj.
+				   in vfork case, multiple processes use the same mm. */
 
 	cpumask_t cpu_vm_mask;
 
diff --git a/include/linux/oom.h b/include/linux/oom.h
index a7979ba..a219480 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -3,6 +3,7 @@
 
 /* /proc/<pid>/oom_adj set to -17 protects from the oom-killer */
 #define OOM_DISABLE (-17)
+#define OOM_CACHE_DEFAULT (15)
 /* inclusive */
 #define OOM_ADJUST_MIN (-16)
 #define OOM_ADJUST_MAX 15
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3ab08e4..e10b12b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -629,6 +629,8 @@ struct signal_struct {
 	unsigned audit_tty;
 	struct tty_audit_buf *tty_audit_buf;
 #endif
+
+	s8 oom_adj;	/* OOM kill score adjustment (bit shift) */
 };
 
 /* Context switch must be unlocked if interrupts are to be enabled */
diff --git a/kernel/exit.c b/kernel/exit.c
index 869dc22..c741a45 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -48,6 +48,7 @@
 #include <linux/fs_struct.h>
 #include <linux/init_task.h>
 #include <linux/perf_counter.h>
+#include <linux/oom.h>
 #include <trace/events/sched.h>
 
 #include <asm/uaccess.h>
@@ -688,6 +689,7 @@ static void exit_mm(struct task_struct * tsk)
 	enter_lazy_tlb(mm, current);
 	/* We don't want this task to be frozen prematurely */
 	clear_freeze_flag(tsk);
+	mm->oom_adj_cached = OOM_CACHE_DEFAULT;
 	task_unlock(tsk);
 	mm_update_next_owner(mm);
 	mmput(mm);
diff --git a/kernel/fork.c b/kernel/fork.c
index 9b42695..b7cb474 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -62,6 +62,7 @@
 #include <linux/fs_struct.h>
 #include <linux/magic.h>
 #include <linux/perf_counter.h>
+#include <linux/oom.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -426,6 +427,7 @@ static struct mm_struct * mm_init(struct mm_struct * mm, struct task_struct *p)
 	init_rwsem(&mm->mmap_sem);
 	INIT_LIST_HEAD(&mm->mmlist);
 	mm->flags = (current->mm) ? current->mm->flags : default_dump_filter;
+	mm->oom_adj_cached = OOM_CACHE_DEFAULT;
 	mm->core_state = NULL;
 	mm->nr_ptes = 0;
 	set_mm_counter(mm, file_rss, 0);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 175a67a..eae2d78 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -58,7 +58,7 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
 	unsigned long points, cpu_time, run_time;
 	struct mm_struct *mm;
 	struct task_struct *child;
-	int oom_adj;
+	s8 oom_adj;
 
 	task_lock(p);
 	mm = p->mm;
@@ -66,7 +66,10 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
 		task_unlock(p);
 		return 0;
 	}
-	oom_adj = mm->oom_adj;
+
+	if (mm->oom_adj_cached < p->signal->oom_adj)
+		mm->oom_adj_cached = p->signal->oom_adj;
+	oom_adj = mm->oom_adj_cached;
 	if (oom_adj == OOM_DISABLE) {
 		task_unlock(p);
 		return 0;
@@ -307,7 +310,8 @@ static void dump_tasks(const struct mem_cgroup *mem)
 		}
 		printk(KERN_INFO "[%5d] %5d %5d %8lu %8lu %3d     %3d %s\n",
 		       p->pid, __task_cred(p)->uid, p->tgid, mm->total_vm,
-		       get_mm_rss(mm), (int)task_cpu(p), mm->oom_adj, p->comm);
+		       get_mm_rss(mm), (int)task_cpu(p), p->signal->oom_adj,
+		       p->comm);
 		task_unlock(p);
 	} while_each_thread(g, p);
 }
@@ -350,7 +354,7 @@ static int oom_kill_task(struct task_struct *p)
 
 	task_lock(p);
 	mm = p->mm;
-	if (!mm || mm->oom_adj == OOM_DISABLE) {
+	if (!mm || p->signal->oom_adj == OOM_DISABLE) {
 		task_unlock(p);
 		return 1;
 	}
@@ -381,7 +385,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 		printk(KERN_WARNING "%s invoked oom-killer: "
 			"gfp_mask=0x%x, order=%d, oom_adj=%d\n",
 			current->comm, gfp_mask, order,
-			current->mm ? current->mm->oom_adj : OOM_DISABLE);
+			current->mm ? current->signal->oom_adj : OOM_DISABLE);
 		cpuset_print_task_mems_allowed(current);
 		task_unlock(current);
 		dump_stack();
-- 
1.6.0.GIT




