* [BUGFIX for 2.6.36][RESEND][PATCH 1/2] oom: remove totalpage normalization from oom_badness()
@ 2010-08-31 9:29 KOSAKI Motohiro
2010-08-31 9:30 ` [BUGFIX for 2.6.36][RESEND][PATCH 2/2] Revert "oom: deprecate oom_adj tunable" KOSAKI Motohiro
` (2 more replies)
0 siblings, 3 replies; 7+ messages in thread
From: KOSAKI Motohiro @ 2010-08-31 9:29 UTC (permalink / raw)
To: LKML, linux-mm, Andrew Morton, Linus Torvalds, KAMEZAWA Hiroyuki,
David Rientjes
Cc: kosaki.motohiro
OK, this one got no objection except from the original patch author,
so I'll push it to mainline. As a stabilization developer, I'm glad
to have finished this work.
If you think this patch is a bit large, please run:
% git diff a63d83f42^ mm/oom_kill.c
and you'll see this is a minimal revert of the unnecessary change.
Thanks.
===================================================================
From 938ce3a7aa79ae4a6cbc275259d586086c41eb87 Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Date: Fri, 27 Aug 2010 15:24:09 +0900
Subject: [PATCH 1/2] oom: remove totalpage normalization from oom_badness()
The current oom_score_adj is completely broken because it is strongly
bound to Google's use case and ignores all others.
1) Priority inversion
As Kamezawa-san pointed out, this breaks cgroup and lxc environments.
He said,
> Assume 2 processes, A and B, which have oom_score_adj of 300 and 0,
> and A uses 200M, B uses 1G of memory, on a 4G system.
>
> On the whole system:
> A's score = (200M * 1000)/4G + 300 = 350
> B's score = (1G * 1000)/4G = 250
>
> In a cpuset with 2G of memory:
> A's score = (200M * 1000)/2G + 300 = 400
> B's score = (1G * 1000)/2G = 500
>
> This priority inversion doesn't happen in the current system.
2) Ratio-based points don't work on large machines.
oom_score_adj normalizes the oom score to the 0-1000 range,
but if the machine has 1TB of memory, 1 point (i.e. 0.1%) means
1GB. That is not suitable for a tuning parameter.
As I said, a proportional tuning parameter carries a
scalability risk.
3) No reason to implement ABI breakage.
The old tuning parameter means:
oom-score = oom-base-score x 2^oom_adj
The new tuning parameter means:
oom-score = oom-base-score * 1000 / (totalram + totalswap) + oom_score_adj
But the normalization by (totalram + totalswap) can be done in
userland too, because both totalram and totalswap are exposed via
/proc. So there is no reason to introduce this funny new equation.
4) totalram-based normalization assumes a flat memory model.
For example, on an asymmetric NUMA machine, fat-node memory and
thin-node memory might deserve different weights.
In other words, totalram-based priority is just one policy. A fixed,
workload-dependent policy probably shouldn't be embedded in the kernel.
So this patch removes the *ugly* total_pages normalization completely.
Google can calculate it in userland!
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
fs/proc/base.c | 33 ++---------
include/linux/oom.h | 16 +-----
include/linux/sched.h | 2 +-
mm/oom_kill.c | 144 ++++++++++++++++++++-----------------------------
4 files changed, 68 insertions(+), 127 deletions(-)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index a1c43e7..90ba487 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -434,8 +434,7 @@ static int proc_oom_score(struct task_struct *task, char *buffer)
read_lock(&tasklist_lock);
if (pid_alive(task))
- points = oom_badness(task, NULL, NULL,
- totalram_pages + total_swap_pages);
+ points = oom_badness(task, NULL, NULL);
read_unlock(&tasklist_lock);
return sprintf(buffer, "%lu\n", points);
}
@@ -1056,15 +1055,7 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
current->comm, task_pid_nr(current),
task_pid_nr(task), task_pid_nr(task));
task->signal->oom_adj = oom_adjust;
- /*
- * Scale /proc/pid/oom_score_adj appropriately ensuring that a maximum
- * value is always attainable.
- */
- if (task->signal->oom_adj == OOM_ADJUST_MAX)
- task->signal->oom_score_adj = OOM_SCORE_ADJ_MAX;
- else
- task->signal->oom_score_adj = (oom_adjust * OOM_SCORE_ADJ_MAX) /
- -OOM_DISABLE;
+
unlock_task_sighand(task, &flags);
put_task_struct(task);
@@ -1081,8 +1072,8 @@ static ssize_t oom_score_adj_read(struct file *file, char __user *buf,
size_t count, loff_t *ppos)
{
struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode);
- char buffer[PROC_NUMBUF];
- int oom_score_adj = OOM_SCORE_ADJ_MIN;
+ char buffer[21];
+ long oom_score_adj = 0;
unsigned long flags;
size_t len;
@@ -1093,7 +1084,7 @@ static ssize_t oom_score_adj_read(struct file *file, char __user *buf,
unlock_task_sighand(task, &flags);
}
put_task_struct(task);
- len = snprintf(buffer, sizeof(buffer), "%d\n", oom_score_adj);
+ len = snprintf(buffer, sizeof(buffer), "%ld\n", oom_score_adj);
return simple_read_from_buffer(buf, count, ppos, buffer, len);
}
@@ -1101,7 +1092,7 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
size_t count, loff_t *ppos)
{
struct task_struct *task;
- char buffer[PROC_NUMBUF];
+ char buffer[21];
unsigned long flags;
long oom_score_adj;
int err;
@@ -1115,9 +1106,6 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
err = strict_strtol(strstrip(buffer), 0, &oom_score_adj);
if (err)
return -EINVAL;
- if (oom_score_adj < OOM_SCORE_ADJ_MIN ||
- oom_score_adj > OOM_SCORE_ADJ_MAX)
- return -EINVAL;
task = get_proc_task(file->f_path.dentry->d_inode);
if (!task)
@@ -1134,15 +1122,6 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
}
task->signal->oom_score_adj = oom_score_adj;
- /*
- * Scale /proc/pid/oom_adj appropriately ensuring that OOM_DISABLE is
- * always attainable.
- */
- if (task->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
- task->signal->oom_adj = OOM_DISABLE;
- else
- task->signal->oom_adj = (oom_score_adj * OOM_ADJUST_MAX) /
- OOM_SCORE_ADJ_MAX;
unlock_task_sighand(task, &flags);
put_task_struct(task);
return count;
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 5e3aa83..21006dc 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -12,13 +12,6 @@
#define OOM_ADJUST_MIN (-16)
#define OOM_ADJUST_MAX 15
-/*
- * /proc/<pid>/oom_score_adj set to OOM_SCORE_ADJ_MIN disables oom killing for
- * pid.
- */
-#define OOM_SCORE_ADJ_MIN (-1000)
-#define OOM_SCORE_ADJ_MAX 1000
-
#ifdef __KERNEL__
#include <linux/sched.h>
@@ -40,8 +33,9 @@ enum oom_constraint {
CONSTRAINT_MEMCG,
};
-extern unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
- const nodemask_t *nodemask, unsigned long totalpages);
+/* The badness from the OOM killer */
+extern unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
+ const nodemask_t *nodemask);
extern int try_set_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
@@ -62,10 +56,6 @@ static inline void oom_killer_enable(void)
oom_killer_disabled = false;
}
-/* The badness from the OOM killer */
-extern unsigned long badness(struct task_struct *p, struct mem_cgroup *mem,
- const nodemask_t *nodemask, unsigned long uptime);
-
extern struct task_struct *find_lock_task_mm(struct task_struct *p);
/* sysctls */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1e2a6db..5e61d60 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -622,7 +622,7 @@ struct signal_struct {
#endif
int oom_adj; /* OOM kill score adjustment (bit shift) */
- int oom_score_adj; /* OOM kill score adjustment */
+ long oom_score_adj; /* OOM kill score adjustment */
};
/* Context switch must be unlocked if interrupts are to be enabled */
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index fc81cb2..c1beda0 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -143,55 +143,41 @@ static bool oom_unkillable_task(struct task_struct *p, struct mem_cgroup *mem,
/**
* oom_badness - heuristic function to determine which candidate task to kill
* @p: task struct of which task we should calculate
- * @totalpages: total present RAM allowed for page allocation
*
* The heuristic for determining which task to kill is made to be as simple and
* predictable as possible. The goal is to return the highest value for the
* task consuming the most memory to avoid subsequent oom failures.
*/
-unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
- const nodemask_t *nodemask, unsigned long totalpages)
+unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
+ const nodemask_t *nodemask)
{
- int points;
+ unsigned long points;
+ unsigned long points_orig;
+ int oom_adj = p->signal->oom_adj;
+ long oom_score_adj = p->signal->oom_score_adj;
- if (oom_unkillable_task(p, mem, nodemask))
- return 0;
- p = find_lock_task_mm(p);
- if (!p)
+ if (oom_unkillable_task(p, mem, nodemask))
return 0;
-
- /*
- * Shortcut check for OOM_SCORE_ADJ_MIN so the entire heuristic doesn't
- * need to be executed for something that cannot be killed.
- */
- if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
- task_unlock(p);
+ if (oom_adj == OOM_DISABLE)
return 0;
- }
/*
* When the PF_OOM_ORIGIN bit is set, it indicates the task should have
* priority for oom killing.
*/
- if (p->flags & PF_OOM_ORIGIN) {
- task_unlock(p);
- return 1000;
- }
+ if (p->flags & PF_OOM_ORIGIN)
+ return ULONG_MAX;
- /*
- * The memory controller may have a limit of 0 bytes, so avoid a divide
- * by zero, if necessary.
- */
- if (!totalpages)
- totalpages = 1;
+ p = find_lock_task_mm(p);
+ if (!p)
+ return 0;
/*
* The baseline for the badness score is the proportion of RAM that each
* task's rss and swap space use.
*/
- points = (get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS)) * 1000 /
- totalpages;
+ points = (get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS));
task_unlock(p);
/*
@@ -199,18 +185,28 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
* implementation used by LSMs.
*/
if (has_capability_noaudit(p, CAP_SYS_ADMIN))
- points -= 30;
+ points -= points / 32;
/*
- * /proc/pid/oom_score_adj ranges from -1000 to +1000 such that it may
- * either completely disable oom killing or always prefer a certain
- * task.
+ * Adjust the score by oom_adj and oom_score_adj.
*/
- points += p->signal->oom_score_adj;
+ points_orig = points;
+ points += oom_score_adj;
+ if ((oom_score_adj > 0) && (points < points_orig))
+ points = ULONG_MAX; /* may be overflow */
+ if ((oom_score_adj < 0) && (points > points_orig))
+ points = 0; /* may be underflow */
+
+ if (oom_adj) {
+ if (oom_adj > 0) {
+ if (!points)
+ points = 1;
+ points <<= oom_adj;
+ } else
+ points >>= -(oom_adj);
+ }
- if (points < 0)
- return 0;
- return (points < 1000) ? points : 1000;
+ return points;
}
/*
@@ -218,17 +214,11 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
*/
#ifdef CONFIG_NUMA
static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
- gfp_t gfp_mask, nodemask_t *nodemask,
- unsigned long *totalpages)
+ gfp_t gfp_mask, nodemask_t *nodemask)
{
struct zone *zone;
struct zoneref *z;
enum zone_type high_zoneidx = gfp_zone(gfp_mask);
- bool cpuset_limited = false;
- int nid;
-
- /* Default to all available memory */
- *totalpages = totalram_pages + total_swap_pages;
if (!zonelist)
return CONSTRAINT_NONE;
@@ -245,33 +235,21 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
* the page allocator means a mempolicy is in effect. Cpuset policy
* is enforced in get_page_from_freelist().
*/
- if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask)) {
- *totalpages = total_swap_pages;
- for_each_node_mask(nid, *nodemask)
- *totalpages += node_spanned_pages(nid);
+ if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask))
return CONSTRAINT_MEMORY_POLICY;
- }
/* Check this allocation failure is caused by cpuset's wall function */
for_each_zone_zonelist_nodemask(zone, z, zonelist,
high_zoneidx, nodemask)
if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
- cpuset_limited = true;
+ return CONSTRAINT_CPUSET;
- if (cpuset_limited) {
- *totalpages = total_swap_pages;
- for_each_node_mask(nid, cpuset_current_mems_allowed)
- *totalpages += node_spanned_pages(nid);
- return CONSTRAINT_CPUSET;
- }
return CONSTRAINT_NONE;
}
#else
static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
- gfp_t gfp_mask, nodemask_t *nodemask,
- unsigned long *totalpages)
+ gfp_t gfp_mask, nodemask_t *nodemask)
{
- *totalpages = totalram_pages + total_swap_pages;
return CONSTRAINT_NONE;
}
#endif
@@ -282,16 +260,16 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
*
* (not docbooked, we don't want this one cluttering up the manual)
*/
-static struct task_struct *select_bad_process(unsigned int *ppoints,
- unsigned long totalpages, struct mem_cgroup *mem,
- const nodemask_t *nodemask)
+static struct task_struct *select_bad_process(unsigned long *ppoints,
+ struct mem_cgroup *mem,
+ const nodemask_t *nodemask)
{
struct task_struct *p;
struct task_struct *chosen = NULL;
*ppoints = 0;
for_each_process(p) {
- unsigned int points;
+ unsigned long points;
if (oom_unkillable_task(p, mem, nodemask))
continue;
@@ -323,10 +301,10 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
return ERR_PTR(-1UL);
chosen = p;
- *ppoints = 1000;
+ *ppoints = ULONG_MAX;
}
- points = oom_badness(p, mem, nodemask, totalpages);
+ points = oom_badness(p, mem, nodemask);
if (points > *ppoints) {
chosen = p;
*ppoints = points;
@@ -371,7 +349,7 @@ static void dump_tasks(const struct mem_cgroup *mem)
continue;
}
- pr_info("[%5d] %5d %5d %8lu %8lu %3u %3d %5d %s\n",
+ pr_info("[%5d] %5d %5d %8lu %8lu %3u %3d %5ld %s\n",
task->pid, task_uid(task), task->tgid,
task->mm->total_vm, get_mm_rss(task->mm),
task_cpu(task), task->signal->oom_adj,
@@ -385,7 +363,7 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
{
task_lock(current);
pr_warning("%s invoked oom-killer: gfp_mask=0x%x, order=%d, "
- "oom_adj=%d, oom_score_adj=%d\n",
+ "oom_adj=%d, oom_score_adj=%ld\n",
current->comm, gfp_mask, order, current->signal->oom_adj,
current->signal->oom_score_adj);
cpuset_print_task_mems_allowed(current);
@@ -426,14 +404,13 @@ static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
#undef K
static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
- unsigned int points, unsigned long totalpages,
- struct mem_cgroup *mem, nodemask_t *nodemask,
- const char *message)
+ unsigned long points, struct mem_cgroup *mem,
+ nodemask_t *nodemask, const char *message)
{
struct task_struct *victim = p;
struct task_struct *child;
struct task_struct *t = p;
- unsigned int victim_points = 0;
+ unsigned long victim_points = 0;
if (printk_ratelimit())
dump_header(p, gfp_mask, order, mem);
@@ -449,7 +426,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
}
task_lock(p);
- pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n",
+ pr_err("%s: Kill process %d (%s) score %lu or sacrifice child\n",
message, task_pid_nr(p), p->comm, points);
task_unlock(p);
@@ -461,13 +438,12 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
*/
do {
list_for_each_entry(child, &t->children, sibling) {
- unsigned int child_points;
+ unsigned long child_points;
/*
* oom_badness() returns 0 if the thread is unkillable
*/
- child_points = oom_badness(child, mem, nodemask,
- totalpages);
+ child_points = oom_badness(child, mem, nodemask);
if (child_points > victim_points) {
victim = child;
victim_points = child_points;
@@ -505,19 +481,17 @@ static void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
#ifdef CONFIG_CGROUP_MEM_RES_CTLR
void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
{
- unsigned long limit;
- unsigned int points = 0;
+ unsigned long points = 0;
struct task_struct *p;
check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, 0);
- limit = mem_cgroup_get_limit(mem) >> PAGE_SHIFT;
read_lock(&tasklist_lock);
retry:
- p = select_bad_process(&points, limit, mem, NULL);
+ p = select_bad_process(&points, mem, NULL);
if (!p || PTR_ERR(p) == -1UL)
goto out;
- if (oom_kill_process(p, gfp_mask, 0, points, limit, mem, NULL,
+ if (oom_kill_process(p, gfp_mask, 0, points, mem, NULL,
"Memory cgroup out of memory"))
goto retry;
out:
@@ -642,9 +616,8 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
int order, nodemask_t *nodemask)
{
struct task_struct *p;
- unsigned long totalpages;
unsigned long freed = 0;
- unsigned int points;
+ unsigned long points;
enum oom_constraint constraint = CONSTRAINT_NONE;
int killed = 0;
@@ -668,8 +641,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
* Check if there were limitations on the allocation (only relevant for
* NUMA) that may require different handling.
*/
- constraint = constrained_alloc(zonelist, gfp_mask, nodemask,
- &totalpages);
+ constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
check_panic_on_oom(constraint, gfp_mask, order);
read_lock(&tasklist_lock);
@@ -681,14 +653,14 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
* non-zero, current could not be killed so we must fallback to
* the tasklist scan.
*/
- if (!oom_kill_process(current, gfp_mask, order, 0, totalpages,
+ if (!oom_kill_process(current, gfp_mask, order, 0,
NULL, nodemask,
"Out of memory (oom_kill_allocating_task)"))
goto out;
}
retry:
- p = select_bad_process(&points, totalpages, NULL,
+ p = select_bad_process(&points, NULL,
constraint == CONSTRAINT_MEMORY_POLICY ? nodemask :
NULL);
if (PTR_ERR(p) == -1UL)
@@ -701,7 +673,7 @@ retry:
panic("Out of memory and no killable processes...\n");
}
- if (oom_kill_process(p, gfp_mask, order, points, totalpages, NULL,
+ if (oom_kill_process(p, gfp_mask, order, points, NULL,
nodemask, "Out of memory"))
goto retry;
killed = 1;
--
1.6.5.2
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
* [BUGFIX for 2.6.36][RESEND][PATCH 2/2] Revert "oom: deprecate oom_adj tunable"
2010-08-31 9:29 [BUGFIX for 2.6.36][RESEND][PATCH 1/2] oom: remove totalpage normalization from oom_badness() KOSAKI Motohiro
@ 2010-08-31 9:30 ` KOSAKI Motohiro
2010-09-01 22:18 ` [BUGFIX for 2.6.36][RESEND][PATCH 1/2] oom: remove totalpage normalization from oom_badness() David Rientjes
2010-09-08 2:44 ` KOSAKI Motohiro
2 siblings, 0 replies; 7+ messages in thread
From: KOSAKI Motohiro @ 2010-08-31 9:30 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: LKML, linux-mm, Andrew Morton, Linus Torvalds, KAMEZAWA Hiroyuki,
David Rientjes
oom_adj is not only a kernel knob, it is also used as an application
interface, so adding a new knob is no reason to deprecate it.
Also, after the previous patch, oom_score_adj can't be used to set
OOM_DISABLE; we still need the "echo -17 > /proc/<pid>/oom_adj" idiom.
This reverts commit 51b1bd2ace1595b72956224deda349efa880b693.
---
Documentation/feature-removal-schedule.txt | 25 -------------------------
Documentation/filesystems/proc.txt | 3 ---
fs/proc/base.c | 8 --------
include/linux/oom.h | 3 ---
4 files changed, 0 insertions(+), 39 deletions(-)
diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
index 842aa9d..aff4d11 100644
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -151,31 +151,6 @@ Who: Eric Biederman <ebiederm@xmission.com>
---------------------------
-What: /proc/<pid>/oom_adj
-When: August 2012
-Why: /proc/<pid>/oom_adj allows userspace to influence the oom killer's
- badness heuristic used to determine which task to kill when the kernel
- is out of memory.
-
- The badness heuristic has since been rewritten since the introduction of
- this tunable such that its meaning is deprecated. The value was
- implemented as a bitshift on a score generated by the badness()
- function that did not have any precise units of measure. With the
- rewrite, the score is given as a proportion of available memory to the
- task allocating pages, so using a bitshift which grows the score
- exponentially is, thus, impossible to tune with fine granularity.
-
- A much more powerful interface, /proc/<pid>/oom_score_adj, was
- introduced with the oom killer rewrite that allows users to increase or
- decrease the badness() score linearly. This interface will replace
- /proc/<pid>/oom_adj.
-
- A warning will be emitted to the kernel log if an application uses this
- deprecated interface. After it is printed once, future warnings will be
- suppressed until the kernel is rebooted.
-
----------------------------
-
What: remove EXPORT_SYMBOL(kernel_thread)
When: August 2006
Files: arch/*/kernel/*_ksyms.c
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index a6aca87..cf1295c 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -1285,9 +1285,6 @@ scaled linearly with /proc/<pid>/oom_score_adj.
Writing to /proc/<pid>/oom_score_adj or /proc/<pid>/oom_adj will change the
other with its scaled value.
-NOTICE: /proc/<pid>/oom_adj is deprecated and will be removed, please see
-Documentation/feature-removal-schedule.txt.
-
Caveat: when a parent task is selected, the oom killer will sacrifice any first
generation children with seperate address spaces instead, if possible. This
avoids servers and important system daemons from being killed and loses the
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 90ba487..55a16f2 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1046,14 +1046,6 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
return -EACCES;
}
- /*
- * Warn that /proc/pid/oom_adj is deprecated, see
- * Documentation/feature-removal-schedule.txt.
- */
- printk_once(KERN_WARNING "%s (%d): /proc/%d/oom_adj is deprecated, "
- "please use /proc/%d/oom_score_adj instead.\n",
- current->comm, task_pid_nr(current),
- task_pid_nr(task), task_pid_nr(task));
task->signal->oom_adj = oom_adjust;
unlock_task_sighand(task, &flags);
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 21006dc..394f2e6 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -2,9 +2,6 @@
#define __INCLUDE_LINUX_OOM_H
/*
- * /proc/<pid>/oom_adj is deprecated, see
- * Documentation/feature-removal-schedule.txt.
- *
* /proc/<pid>/oom_adj set to -17 protects from the oom-killer
*/
#define OOM_DISABLE (-17)
--
1.6.5.2
* Re: [BUGFIX for 2.6.36][RESEND][PATCH 1/2] oom: remove totalpage normalization from oom_badness()
2010-08-31 9:29 [BUGFIX for 2.6.36][RESEND][PATCH 1/2] oom: remove totalpage normalization from oom_badness() KOSAKI Motohiro
2010-08-31 9:30 ` [BUGFIX for 2.6.36][RESEND][PATCH 2/2] Revert "oom: deprecate oom_adj tunable" KOSAKI Motohiro
@ 2010-09-01 22:18 ` David Rientjes
2010-09-08 2:44 ` KOSAKI Motohiro
2010-09-08 2:44 ` KOSAKI Motohiro
2 siblings, 1 reply; 7+ messages in thread
From: David Rientjes @ 2010-09-01 22:18 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: LKML, linux-mm, Andrew Morton, Linus Torvalds, KAMEZAWA Hiroyuki
On Tue, 31 Aug 2010, KOSAKI Motohiro wrote:
> OK, this one got no objection except from the original patch author.
Would you care to respond to my objections?
I replied to these two patches earlier with my nack, here they are:
http://marc.info/?l=linux-mm&m=128273555323993
http://marc.info/?l=linux-mm&m=128337879310476
Please carry on a useful debate of the issues rather than continually
resending patches and labeling them as bugfixes, which they aren't.
> so I'll push it to mainline. As a stabilization developer, I'm glad
> to have finished this work.
>
You're not the maintainer of this code, patches go through Andrew.
That said, I'm really tired of you trying to make this personal with me;
I've been very respectful and accommodating during this discussion and I
hope that you will be the same.
Thanks.
* Re: [BUGFIX for 2.6.36][RESEND][PATCH 1/2] oom: remove totalpage normalization from oom_badness()
2010-08-31 9:29 [BUGFIX for 2.6.36][RESEND][PATCH 1/2] oom: remove totalpage normalization from oom_badness() KOSAKI Motohiro
2010-08-31 9:30 ` [BUGFIX for 2.6.36][RESEND][PATCH 2/2] Revert "oom: deprecate oom_adj tunable" KOSAKI Motohiro
2010-09-01 22:18 ` [BUGFIX for 2.6.36][RESEND][PATCH 1/2] oom: remove totalpage normalization from oom_badness() David Rientjes
@ 2010-09-08 2:44 ` KOSAKI Motohiro
2 siblings, 0 replies; 7+ messages in thread
From: KOSAKI Motohiro @ 2010-09-08 2:44 UTC (permalink / raw)
To: Andrew Morton
Cc: kosaki.motohiro, LKML, linux-mm, Linus Torvalds,
KAMEZAWA Hiroyuki, David Rientjes
> OK, this one got no objection except from the original patch author,
> so I'll push it to mainline. As a stabilization developer, I'm glad
> to have finished this work.
>
> If you think this patch is a bit large, please run:
> % git diff a63d83f42^ mm/oom_kill.c
> and you'll see this is a minimal revert of the unnecessary change.
Andrew, please don't let this one slip; I don't want it delayed any further.
I made the patch as you requested, but got no response. As a stabilization
developer I can't permit this userland breakage and this broken status;
please join in fixing it. Sadly, as the delay increases, I will have to
switch to a full revert instead of your suggested approach.
To spell it out: I don't want to continue this crazy discussion. A userland
breakage bug is a bug, nothing else. I don't want to talk about this
anymore, even if it only takes five minutes. I don't think any rarely-used
feature should die, but a ZERO-USER FEATURE SHOULDN'T BREAK USERLAND.
* Re: [BUGFIX for 2.6.36][RESEND][PATCH 1/2] oom: remove totalpage normalization from oom_badness()
2010-09-01 22:18 ` [BUGFIX for 2.6.36][RESEND][PATCH 1/2] oom: remove totalpage normalization from oom_badness() David Rientjes
@ 2010-09-08 2:44 ` KOSAKI Motohiro
2010-09-08 3:21 ` David Rientjes
0 siblings, 1 reply; 7+ messages in thread
From: KOSAKI Motohiro @ 2010-09-08 2:44 UTC (permalink / raw)
To: David Rientjes
Cc: kosaki.motohiro, LKML, linux-mm, Andrew Morton, Linus Torvalds,
KAMEZAWA Hiroyuki
> On Tue, 31 Aug 2010, KOSAKI Motohiro wrote:
>
> > OK, this one got no objection except from the original patch author.
>
> Would you care to respond to my objections?
>
> I replied to these two patches earlier with my nack, here they are:
>
> http://marc.info/?l=linux-mm&m=128273555323993
> http://marc.info/?l=linux-mm&m=128337879310476
>
> Please carry on a useful debate of the issues rather than continually
> resending patches and labeling them as bugfixes, which they aren't.
You are still talking only about your own use case. Why should we care? Why?
Why don't you fix the code yourself? Why do you continue this selfish
development? Why? I can't understand it.
> > so I'll push it to mainline. As a stabilization developer, I'm glad
> > to have finished this work.
> >
>
> You're not the maintainer of this code, patches go through Andrew.
>
> That said, I'm really tired of you trying to make this personal with me;
> I've been very respectful and accommodating during this discussion and I
> hope that you will be the same.
As I said, you only need to stop breaking userland and fix the code
immediately. You can't expect a stabilization developer to tolerate
userland and code breakage.
* Re: [BUGFIX for 2.6.36][RESEND][PATCH 1/2] oom: remove totalpage normalization from oom_badness()
2010-09-08 2:44 ` KOSAKI Motohiro
@ 2010-09-08 3:21 ` David Rientjes
2010-09-08 8:24 ` David Rientjes
0 siblings, 1 reply; 7+ messages in thread
From: David Rientjes @ 2010-09-08 3:21 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: LKML, linux-mm, Andrew Morton, Linus Torvalds, KAMEZAWA Hiroyuki
On Wed, 8 Sep 2010, KOSAKI Motohiro wrote:
> > > OK, this one got no objection except from the original patch author.
> >
> > Would you care to respond to my objections?
> >
> > I replied to these two patches earlier with my nack, here they are:
> >
> > http://marc.info/?l=linux-mm&m=128273555323993
> > http://marc.info/?l=linux-mm&m=128337879310476
> >
> > Please carry on a useful debate of the issues rather than continually
> > resending patches and labeling them as bugfixes, which they aren't.
>
> You are still talking only about your own use case. Why should we care? Why?
It's an example of how the new interface may be used to represent oom
killing priorities for an aggregate of tasks competing for the same set of
resources.
> Why don't you fix the code yourself? Why do you continue this selfish
> development? Why? I can't understand it.
>
I can only reiterate what I've said before (and you can be assured I'll
only keep it technical and professional even though you've always made
this personal with me): current users of /proc/pid/oom_adj only polarize a
task to either disable oom killing (-17 or -16), or always prefer a task
(+15). Very, very few users tune it to anything in between, and when it's
done, it's relative to other oom_adj values.
A single example of a /proc/pid/oom_adj usecase has not been presented
that shows anybody using it as a function of either an application's
expected memory usage or of the system capacity. Those two variables are
important for oom_adj to make any sense since its old definition was
basically badness = mm->total_vm << oom_adj for positive oom_adj and
badness = mm->total_vm >> -oom_adj for negative oom_adj. If an
application, system daemon, or job scheduler does not tune it without
consideration to the amount of expected RAM usage or system RAM capacity,
it doesn't make any sense. You're welcome to present such a user at this
time.
That said, I felt it was possible to use the current usecase for
/proc/pid/oom_adj to expand upon its applicability by introducing
/proc/pid/oom_score_adj with a much higher resolution and ability to stay
static based on the relative importance of a task compared to others
sharing the same resources in a dynamic environment (memcg limits
changing, cpuset mems added, mempolicy nodes changing, etc).
Thus, my introduction of oom_score_adj causes no regression for real-world
users of /proc/pid/oom_adj and allows users of cgroups and mempolicies a
much more powerful interface to tune oom killing priority.
* Re: [BUGFIX for 2.6.36][RESEND][PATCH 1/2] oom: remove totalpage normalization from oom_badness()
2010-09-08 3:21 ` David Rientjes
@ 2010-09-08 8:24 ` David Rientjes
0 siblings, 0 replies; 7+ messages in thread
From: David Rientjes @ 2010-09-08 8:24 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: LKML, linux-mm, Andrew Morton, Linus Torvalds, KAMEZAWA Hiroyuki
On Tue, 7 Sep 2010, David Rientjes wrote:
> Thus, my introduction of oom_score_adj causes no regression for real-world
> users of /proc/pid/oom_adj and allows users of cgroups and mempolicies a
> much more powerful interface to tune oom killing priority.
>
I want to elaborate on this point just a little more because I feel that
Andrew is in the unpleasant position of trying to judge whether the
introduction of this feature actually has a negative side-effect for
current users of oom_adj, and I want to stress that it doesn't.
The problem being reported here is that current users of oom_adj will now
experience altered semantics of the value if it doesn't lie at the
absolute extremes such as +15 or -16. The prior implementation would, in
its simplest form, do this:
badness_score = task->mm->total_vm << oom_adj
for a positive oom_adj and this:
badness_score = task->mm->total_vm >> -oom_adj
for a negative oom_adj. That implementation required that whoever set
oom_adj knew the expected RAM usage of the task at the time of oom and
the system capacity. Otherwise, it would be impossible to judge an
appropriate value for the bitshift without knowing the total VM size of
the task and how its score would rank in comparison to other tasks.
The comparison is important because we use these scores as an indicator of
the "memory hogging" task to kill and we know that the system is oom so
the remainder of system RAM that this application is not using may be
consumed by another that we would therefore want to select instead.
It's my contention that nobody is currently using oom_adj in that way, and
if they are, they are more deserving of a finer-grained utility that
doesn't operate exponentially on the VM size, which leaves much to be
desired. oom_score_adj, for these users that aren't using cpusets or
memcg or mempolicies, allows for the same static value that scales
linearly as opposed to exponentially.
Current users of oom_adj do one of two things:
- polarize the value such that a task is always preferred (+15) or
always protected (-16) [or, completely disabled, -17], or
- set up very rough approximations of oom killing priority such as
one task at +10 and another at +5.
The latter is certainly in the very, very small minority and has
arbitrary values to imply that the former task should be selected over
the latter iff the latter has exploded in memory use such that it's
using much more than expected and is probably leaking memory.
My contention that we are safe in proceeding with what is currently in
2.6.36-rc3 is based on the assumption that all users currently use oom_adj
in one of the two ways enumerated above. If that assumption isn't
accepted, then I believe the revert criteria is to show a user, any user,
of oom_adj that considers both expected memory usage and system capacity
of the application it is tuning the value for and does so in comparison to
other tasks on the system. Unless that can be shown, I do not believe the
revert criteria has been met in this case.
Furthermore, the characterization of the above as being a "bug" that
affects endusers is inaccurate. The vast majority of Linux users do not
use cpusets, memcg, mempolicies, or memory hotplug. For those users, the
proportional scores that are used by oom_score_adj stay static since the
system capacity remains static. oom_score_adj is, in fact,
exceptionally more powerful for the very users this change is
inaccurately described as regressing: if they are tuning oom_adj based
on the specific memory usage of their applications and the system
capacity, they may continue to do so with a rough linear approximation
via oom_adj as they always did, or use oom_score_adj with a _much_
higher resolution (1/1000th of RAM) that scales linearly rather than
exponentially.
For the users of cpusets or memcg, existing users of oom_adj will see a
rough approximation of the priority (and always an exact equivalent in the
high-probability case where they are polarizing the task with +15 or -17)
while using the old interface. The badness score _will_ change if the set
of cpuset mems or the memcg limit changes, which is new behavior. This is
the exact equivalent according to the oom killer as if we were using
memory hotplug on a system and hot-adding memory or hot-removing memory:
badness scores that have a certain value may no longer have the same kill
priority because we're using more or less memory. Considering the
unconstrained, system-wide oom case: if we hot-added memory and are now
oom and the memory usage of a task hasn't changed, it may no longer be the
first task to be killed anymore because another task's usage may now cause
its badness score to exceed the former. Likewise, if we hot-removed
memory and are now oom and the memory usage of a task hasn't changed, it
may now be selected first because all other tasks' usage may now be less
than ours.
That's the exact same behavior as oom_score_adj when constrained to a
cpuset or memcg. See either of them as a virtualized system with a set of
resources and an aggregate of tasks competing for those resources.
Priorities will change as memory is allowed or restricted, just like we
always have had with memory hotplug, but we now allow users to define the
proportion of that working set they are allowed instead of a static value.
Static oom_adj scores never work appropriately in a dynamic environment
because we don't know the capacity when we set the value (remember the two
prerequisites to use oom_adj: expected RAM usage of the application, and
memory capacity available to it).
For the users of mempolicies, the entire oom killer rewrite has changed
how those ooms are handled: prior to the rewrite, the oom killer would
simply kill current. Now, the tasklist is iterated if the
oom_kill_allocating_task sysctl is not selected and the badness scores are
used to select a task. The prior behavior of oom_adj in these oom
contexts is therefore unrelated to this discussion.
This explains the power and necessity of oom_score_adj for users who use
cpusets, memcg, or mempolicies. Those environments are dynamic and we
_can't_ expect oom_adj to be written anytime a task changes cgroups or a
cpuset mem is added, a memcg limit is reduced, or a node is added to a
mempolicy. We _can_ expect the admin to know the priority of killing jobs
relative to others competing for the same set of now fully depleted
memory if they are using the tunable.
Given the above, it's not possible to meet the revert criteria. The real
question to be asking in this case is not whether we need to revert
oom_score_adj, but rather whether we need dual interfaces to exist:
oom_adj and oom_score_adj. I believe that we should only have a single
interface available in the kernel; since oom_score_adj is much more
powerful than oom_adj, acts at a higher resolution, respects the dynamic
nature of cgroups, provides a rough approximation for users of oom_adj,
and gives an exact equivalent for those polarizing its value, it should
be the interface that remains, and oom_adj should be deprecated.
Thread overview: 7+ messages
2010-08-31 9:29 [BUGFIX for 2.6.36][RESEND][PATCH 1/2] oom: remove totalpage normalization from oom_badness() KOSAKI Motohiro
2010-08-31 9:30 ` [BUGFIX for 2.6.36][RESEND][PATCH 2/2] Revert "oom: deprecate oom_adj tunable" KOSAKI Motohiro
2010-09-01 22:18 ` [BUGFIX for 2.6.36][RESEND][PATCH 1/2] oom: remove totalpage normalization from oom_badness() David Rientjes
2010-09-08 2:44 ` KOSAKI Motohiro
2010-09-08 3:21 ` David Rientjes
2010-09-08 8:24 ` David Rientjes
2010-09-08 2:44 ` KOSAKI Motohiro