[RFC][PATCH 0/8] RSS controller for containers

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [RFC][PATCH 0/8] RSS controller for containers
@ 2006-11-09 19:35 Balbir Singh
  2006-11-09 19:35 ` [RFC][PATCH 1/8] Fix resource groups parsing, while assigning shares Balbir Singh
                   ` (7 more replies)
  0 siblings, 8 replies; 23+ messages in thread
From: Balbir Singh @ 2006-11-09 19:35 UTC (permalink / raw)
  To: Linux MM
  Cc: dev, ckrm-tech, Linux Kernel Mailing List, haveblue, rohitseth,
	Balbir Singh

Here is set of patches that implements a *simple minded* RSS controller
for containers.  It would be nice to split up the memory controller design
and implementation in phases

1. RSS control
2. Page Cache control (with split clean and dirty accounting/control)
3. mlock() control
4. Kernel accounting and control

The beancounter implementation follows a very similar approach. The split
up makes the design of the controller easier. RSS for example, can be tracked
per mm_struct. Page Cache could be tracked per inode, per thread
or per mm_struct (depending on what form is most suitable).

The definition of RSS was debated on lkml, please see

	http://lkml.org/lkml/2006/10/10/130

This patchset is a proof of concept implementation and the accounting can
be easily adapted to meet the definition of RSS as and when it is re-defined
or revisited. The changes required should be small.

The reclamation logic has been borrowed from Dave Hansen's challenged
memory controller and from shrink_all_memory(). The accounting was inspired
from Rohit Seth's container patches.

The good
--------

No additional pointers required in struct page.
There is also a lot of scope for code reuse in tracking the rss of a process
(this reuse is yet to be exploited).

The not so good
---------------
The patches contain a lot of debugging code.

Applying the patches
--------------------
This patchset has been developed on top of 2.6.19-rc2 with the latest
containers patch applied.

To run and test this patch, additional fixes are required.

Please see

	http://lkml.org/lkml/2006/11/6/10
	http://lkml.org/lkml/2006/11/6/245

Series
------
container-res-groups-fix-parsing.patch
container-memctlr-setup.patch
container-memctlr-callbacks.patch
container-memctlr-acct.patch
container-memctlr-task-migration.patch
container-memctlr-shares.patch
container-memctlr-reclaim.patch

Setup
-----
To test the series, here's what you need to do

0. Get the latest containers patches against 2.6.19-rc2
1. Apply all the fixes
2. Apply these patches
3. Build the kernel and mount the container filesystem
	mount -t container container /container

4. Disable cpuset's (to simply assignment of tasks to resource groups)

	cd /container
	echo 0 > cpuset_enabled

5. Add the current task to a new group

	mkdir /container/a
	echo $$ > tasks
	cat memctlr_stats

6. Set limits

	echo "res=memctlr,max_shares=10" > memctlr_shares

7. Spin the system, hang it, revolve it, crash it!!
8. Please provide feedback, both code review and any thing else that
   can be useful for further development

Testing
-------
Kernbench was run on these patches and it did not show any significant
overhead in the tests.

-- 

	Balbir Singh,
	Linux Technology Center,
	IBM Software Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC][PATCH 1/8] Fix resource groups parsing, while assigning shares
  2006-11-09 19:35 [RFC][PATCH 0/8] RSS controller for containers Balbir Singh
@ 2006-11-09 19:35 ` Balbir Singh
  2006-11-09 19:35 ` [RFC][PATCH 2/8] RSS controller setup Balbir Singh
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Balbir Singh @ 2006-11-09 19:35 UTC (permalink / raw)
  To: Linux MM
  Cc: dev, Linux Kernel Mailing List, ckrm-tech, haveblue, rohitseth,
	Balbir Singh


echo adds a "\n" to the end of a string. When this string is copied from
user space, we need to remove it, so that match_token() can parse
the user space string correctly

Signed-off-by: Balbir Singh <balbir@in.ibm.com>
---

 kernel/res_group/rgcs.c |    6 ++++++
 1 file changed, 6 insertions(+)

diff -puN kernel/res_group/rgcs.c~container-res-groups-fix-parsing kernel/res_group/rgcs.c
--- linux-2.6.19-rc2/kernel/res_group/rgcs.c~container-res-groups-fix-parsing	2006-11-09 23:08:10.000000000 +0530
+++ linux-2.6.19-rc2-balbir/kernel/res_group/rgcs.c	2006-11-09 23:08:10.000000000 +0530
@@ -241,6 +241,12 @@ ssize_t res_group_file_write(struct cont
 	}
 	buf[nbytes] = 0;	/* nul-terminate */
 
+	/*
+	 * Ignore "\n". It might come in from echo(1)
+	 */
+	if (buf[nbytes - 1] == '\n')
+		buf[nbytes - 1] = 0;
+
 	container_manage_lock();
 
 	if (container_is_removed(cont)) {
_

-- 

	Balbir Singh,
	Linux Technology Center,
	IBM Software Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC][PATCH 2/8] RSS controller setup
  2006-11-09 19:35 [RFC][PATCH 0/8] RSS controller for containers Balbir Singh
  2006-11-09 19:35 ` [RFC][PATCH 1/8] Fix resource groups parsing, while assigning shares Balbir Singh
@ 2006-11-09 19:35 ` Balbir Singh
  2006-11-09 19:35 ` [RFC][PATCH 3/8] RSS controller add callbacks Balbir Singh
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Balbir Singh @ 2006-11-09 19:35 UTC (permalink / raw)
  To: Linux MM
  Cc: dev, ckrm-tech, Linux Kernel Mailing List, haveblue, rohitseth,
	Balbir Singh


Basic setup for a controller written for resource groups. This patch
registers a dummy controller.


Signed-off-by: Balbir Singh <balbir@in.ibm.com>
---

 include/linux/memctlr.h    |   31 ++++++++++++++
 init/Kconfig               |   11 +++++
 kernel/res_group/Makefile  |    1 
 kernel/res_group/memctlr.c |   94 +++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 137 insertions(+)

diff -puN /dev/null include/linux/memctlr.h
--- /dev/null	2006-05-31 06:45:07.000000000 +0530
+++ linux-2.6.19-rc2-balbir/include/linux/memctlr.h	2006-11-09 23:56:03.000000000 +0530
@@ -0,0 +1,31 @@
+/*
+ * Memory controller - "Account and Control Memory Usage"
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2006
+ *
+ * Author: Balbir Singh <balbir@in.ibm.com>
+ *
+ */
+
+#ifndef _LINUX_MEMCTRL_H
+#define _LINUX_MEMCTRL_H
+
+#ifdef CONFIG_RES_GROUPS_MEMORY
+#include <linux/res_group_rc.h>
+#endif /* CONFIG_RES_GROUPS_MEMORY */
+
+#endif /* _LINUX_MEMCTRL_H */
diff -puN init/Kconfig~container-memctlr-setup init/Kconfig
--- linux-2.6.19-rc2/init/Kconfig~container-memctlr-setup	2006-11-09 23:09:03.000000000 +0530
+++ linux-2.6.19-rc2-balbir/init/Kconfig	2006-11-09 23:56:47.000000000 +0530
@@ -325,6 +325,17 @@ config RES_GROUPS_NUMTASKS
 
 	  Say N if unsure, Y to use the feature.
 
+config RES_GROUPS_MEMORY
+	bool "Memory Controller for RSS"
+	depends on RES_GROUPS
+	default y
+	help
+	  Provides a Resource Controller for Resource Groups.
+	  It limits the resident pages of the tasks belonging to the resource
+	  group.
+
+	  Say N if unsure, Y to use the feature.
+
 endmenu
 config SYSCTL
 	bool
diff -puN kernel/res_group/Makefile~container-memctlr-setup kernel/res_group/Makefile
--- linux-2.6.19-rc2/kernel/res_group/Makefile~container-memctlr-setup	2006-11-09 23:09:03.000000000 +0530
+++ linux-2.6.19-rc2-balbir/kernel/res_group/Makefile	2006-11-09 23:09:03.000000000 +0530
@@ -1,2 +1,3 @@
 obj-y = res_group.o shares.o rgcs.o
 obj-$(CONFIG_RES_GROUPS_NUMTASKS) += numtasks.o
+obj-$(CONFIG_RES_GROUPS_MEMORY) += memctlr.o
diff -puN /dev/null kernel/res_group/memctlr.c
--- /dev/null	2006-05-31 06:45:07.000000000 +0530
+++ linux-2.6.19-rc2-balbir/kernel/res_group/memctlr.c	2006-11-09 23:56:03.000000000 +0530
@@ -0,0 +1,94 @@
+/*
+ * Memory controller - "Account and Control Memory Usage"
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2006
+ *
+ * Author: Balbir Singh <balbir@in.ibm.com>
+ *
+ */
+
+/*
+ * Simple memory controller.
+ * Supports limits, guarantees not supported right now
+ *
+ * Tasks are group'ed virtually by thread groups - Add more details
+ */
+
+#include <linux/module.h>
+#include <linux/res_group_rc.h>
+#include <linux/memctlr.h>
+
+static const char res_ctlr_name[] = "memctlr";
+static struct resource_group *root_rgroup;
+
+struct mem_counter {
+	atomic_long_t	rss;
+};
+
+struct memctlr {
+	struct resource_group *rgroup;		/* My resource group	*/
+	struct res_shares shares;		/* My shares		*/
+
+	struct mem_counter counter;		/* Accounting information */
+	/* Statistics */
+	int successes;
+	int failures;
+};
+
+struct res_controller memctlr_rg;
+
+static struct memctlr *get_memctlr_from_shares(struct res_shares *shares)
+{
+	if (shares)
+		return container_of(shares, struct memctlr, shares);
+	return NULL;
+}
+
+static struct memctlr *get_memctlr(struct resource_group *rgroup)
+{
+	return get_memctlr_from_shares(get_controller_shares(rgroup,
+								&memctlr_rg));
+}
+
+struct res_controller memctlr_rg = {
+	.name = res_ctlr_name,
+	.ctlr_id = NO_RES_ID,
+	.alloc_shares_struct = NULL,
+	.free_shares_struct = NULL,
+	.move_task = NULL,
+	.shares_changed = NULL,
+	.show_stats = NULL,
+};
+
+int __init memctlr_init(void)
+{
+	if (memctlr_rg.ctlr_id != NO_RES_ID)
+		return -EBUSY;	/* already registered */
+	return register_controller(&memctlr_rg);
+}
+
+void __exit memctlr_exit(void)
+{
+	int rc;
+	do {
+		rc = unregister_controller(&memctlr_rg);
+	} while (rc == -EBUSY);
+	BUG_ON(rc != 0);
+}
+
+module_init(memctlr_init);
+module_exit(memctlr_exit);
_

-- 

	Balbir Singh,
	Linux Technology Center,
	IBM Software Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC][PATCH 3/8] RSS controller add callbacks
  2006-11-09 19:35 [RFC][PATCH 0/8] RSS controller for containers Balbir Singh
  2006-11-09 19:35 ` [RFC][PATCH 1/8] Fix resource groups parsing, while assigning shares Balbir Singh
  2006-11-09 19:35 ` [RFC][PATCH 2/8] RSS controller setup Balbir Singh
@ 2006-11-09 19:35 ` Balbir Singh
  2006-11-09 19:36 ` [RFC][PATCH 4/8] RSS controller accounting Balbir Singh
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Balbir Singh @ 2006-11-09 19:35 UTC (permalink / raw)
  To: Linux MM
  Cc: dev, Linux Kernel Mailing List, ckrm-tech, haveblue, rohitseth,
	Balbir Singh


Add callbacks to allocate and free instances of the controller as the
hierarchy of resource groups is modified.

Signed-off-by: Balbir Singh <balbir@in.ibm.com>
---

 kernel/res_group/memctlr.c |   58 ++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 55 insertions(+), 3 deletions(-)

diff -puN kernel/res_group/memctlr.c~container-memctlr-callbacks kernel/res_group/memctlr.c
--- linux-2.6.19-rc2/kernel/res_group/memctlr.c~container-memctlr-callbacks	2006-11-09 21:42:35.000000000 +0530
+++ linux-2.6.19-rc2-balbir/kernel/res_group/memctlr.c	2006-11-09 21:42:35.000000000 +0530
@@ -34,6 +34,8 @@
 
 static const char res_ctlr_name[] = "memctlr";
 static struct resource_group *root_rgroup;
+static const char version[] = "0.01";
+static struct memctlr *memctlr_root;
 
 struct mem_counter {
 	atomic_long_t	rss;
@@ -64,14 +66,64 @@ static struct memctlr *get_memctlr(struc
 								&memctlr_rg));
 }
 
+static void memctlr_init_new(struct memctlr *res)
+{
+	res->shares.min_shares = SHARE_DONT_CARE;
+	res->shares.max_shares = SHARE_DONT_CARE;
+	res->shares.child_shares_divisor = SHARE_DEFAULT_DIVISOR;
+	res->shares.unused_min_shares = SHARE_DEFAULT_DIVISOR;
+}
+
+static struct res_shares *memctlr_alloc_instance(struct resource_group *rgroup)
+{
+	struct memctlr *res;
+
+	res = kzalloc(sizeof(struct memctlr), GFP_KERNEL);
+	if (!res)
+		return NULL;
+	res->rgroup = rgroup;
+	memctlr_init_new(res);
+	if (is_res_group_root(rgroup)) {
+		root_rgroup = rgroup;
+		memctlr_root = res;
+		printk("Memory Controller version %s\n", version);
+	}
+	return &res->shares;
+}
+
+static void memctlr_free_instance(struct res_shares *shares)
+{
+	struct memctlr *res, *parres;
+
+	res = get_memctlr_from_shares(shares);
+	BUG_ON(!res);
+	/*
+	 * Containers do not allow removal of groups that have tasks
+	 * associated with them. To free a container, it must be empty.
+	 * Handle transfer of charges in the move_task notification
+	 */
+	kfree(res);
+}
+
+static ssize_t memctlr_show_stats(struct res_shares *shares, char *buf,
+					size_t len)
+{
+	int i = 0;
+
+	i += snprintf(buf, len, "Accounting will be added soon\n");
+	buf += i;
+	len -= i;
+	return i;
+}
+
 struct res_controller memctlr_rg = {
 	.name = res_ctlr_name,
 	.ctlr_id = NO_RES_ID,
-	.alloc_shares_struct = NULL,
-	.free_shares_struct = NULL,
+	.alloc_shares_struct = memctlr_alloc_instance,
+	.free_shares_struct = memctlr_free_instance,
 	.move_task = NULL,
 	.shares_changed = NULL,
-	.show_stats = NULL,
+	.show_stats = memctlr_show_stats,
 };
 
 int __init memctlr_init(void)
_

-- 

	Balbir Singh,
	Linux Technology Center,
	IBM Software Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC][PATCH 4/8] RSS controller accounting
  2006-11-09 19:35 [RFC][PATCH 0/8] RSS controller for containers Balbir Singh
                   ` (2 preceding siblings ...)
  2006-11-09 19:35 ` [RFC][PATCH 3/8] RSS controller add callbacks Balbir Singh
@ 2006-11-09 19:36 ` Balbir Singh
  2006-11-10  9:06   ` Pavel Emelianov
  2006-11-09 19:36 ` [RFC][PATCH 5/8] RSS controller task migration support Balbir Singh
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 23+ messages in thread
From: Balbir Singh @ 2006-11-09 19:36 UTC (permalink / raw)
  To: Linux MM
  Cc: dev, ckrm-tech, Linux Kernel Mailing List, haveblue, rohitseth,
	Balbir Singh


Account RSS usage of a task and the associated container. The definition
of RSS was debated and discussed in the following thread

	http://lkml.org/lkml/2006/10/10/130


The code tracks all resident pages (including shared pages) as RSS. This patch
can easily adapt to the definition of RSS that will be agreed upon. This
implementation provides a proof of concept RSS controller.

The accounting is inspired from Rohit Seth's container patches.

TODO's

1. Merge file_rss and anon_rss tracking with the current rss tracking to
   maximize code reuse
2. Add/remove RSS tracking as the definition of RSS evolves


Signed-off-by: Balbir Singh <balbir@in.ibm.com>
---

 include/linux/memctlr.h    |   26 +++++++++++
 include/linux/rmap.h       |    2 
 include/linux/sched.h      |   11 +++++
 kernel/fork.c              |    6 ++
 kernel/res_group/memctlr.c |   99 +++++++++++++++++++++++++++++++++++++++++++--
 mm/rmap.c                  |    6 ++
 6 files changed, 145 insertions(+), 5 deletions(-)

diff -puN include/linux/sched.h~container-memctlr-acct include/linux/sched.h
--- linux-2.6.19-rc2/include/linux/sched.h~container-memctlr-acct	2006-11-09 21:46:22.000000000 +0530
+++ linux-2.6.19-rc2-balbir/include/linux/sched.h	2006-11-09 21:46:22.000000000 +0530
@@ -88,6 +88,10 @@ struct sched_param {
 struct exec_domain;
 struct futex_pi_state;
 
+struct memctlr;
+struct container;
+struct mem_counter;
+
 /*
  * List of flags we want to share for kernel threads,
  * if only because they are not used by them anyway.
@@ -355,6 +359,13 @@ struct mm_struct {
 	/* aio bits */
 	rwlock_t		ioctx_list_lock;
 	struct kioctx		*ioctx_list;
+#ifdef CONFIG_RES_GROUPS_MEMORY
+	struct container	*container;
+	/*
+	 * Try and merge anon and file rss accounting
+	 */
+	struct mem_counter	*counter;
+#endif
 };
 
 struct sighand_struct {
diff -puN kernel/res_group/memctlr.c~container-memctlr-acct kernel/res_group/memctlr.c
--- linux-2.6.19-rc2/kernel/res_group/memctlr.c~container-memctlr-acct	2006-11-09 21:46:22.000000000 +0530
+++ linux-2.6.19-rc2-balbir/kernel/res_group/memctlr.c	2006-11-09 21:47:06.000000000 +0530
@@ -37,6 +37,8 @@ static struct resource_group *root_rgrou
 static const char version[] = "0.01";
 static struct memctlr *memctlr_root;
 
+#define MEMCTLR_MAGIC	0xdededede
+
 struct mem_counter {
 	atomic_long_t	rss;
 };
@@ -49,6 +51,7 @@ struct memctlr {
 	/* Statistics */
 	int successes;
 	int failures;
+	int magic;
 };
 
 struct res_controller memctlr_rg;
@@ -66,12 +69,91 @@ static struct memctlr *get_memctlr(struc
 								&memctlr_rg));
 }
 
+static void memctlr_init_mem_counter(struct mem_counter *counter)
+{
+	atomic_long_set(&counter->rss, 0);
+}
+
+int mm_init_mem_counter(struct mm_struct *mm)
+{
+	mm->counter = kmalloc(sizeof(struct mem_counter), GFP_KERNEL);
+	if (!mm->counter)
+		return -ENOMEM;
+	memctlr_init_mem_counter(mm->counter);
+	return 0;
+}
+
+void mm_free_mem_counter(struct mm_struct *mm)
+{
+	kfree(mm->counter);
+}
+
+void mm_assign_container(struct mm_struct *mm, struct task_struct *p)
+{
+	rcu_read_lock();
+	mm->container = rcu_dereference(p->container);
+	rcu_read_unlock();
+}
+
+static inline struct memctlr *get_memctlr_from_page(struct page *page)
+{
+	struct resource_group *rgroup;
+	struct memctlr *res;
+
+	/*
+	 * Is the resource groups infrastructure initialized?
+	 */
+	if (!memctlr_root)
+		return NULL;
+
+	rcu_read_lock();
+	rgroup = (struct resource_group *)rcu_dereference(current->container);
+	rcu_read_unlock();
+
+	res = get_memctlr(rgroup);
+	if (!res)
+		return NULL;
+
+	BUG_ON(res->magic != MEMCTLR_MAGIC);
+	return res;
+}
+
+
+void memctlr_inc_rss(struct page *page)
+{
+	struct memctlr *res;
+
+	res = get_memctlr_from_page(page);
+	if (!res)
+		return;
+
+	atomic_long_inc(&current->mm->counter->rss);
+	atomic_long_inc(&res->counter.rss);
+}
+
+void memctlr_dec_rss(struct page *page)
+{
+	struct memctlr *res;
+
+	res = get_memctlr_from_page(page);
+	if (!res)
+		return;
+
+	atomic_long_dec(&res->counter.rss);
+
+	if ((current->flags & PF_EXITING) && !current->mm)
+		return;
+	atomic_long_dec(&current->mm->counter->rss);
+}
+
 static void memctlr_init_new(struct memctlr *res)
 {
 	res->shares.min_shares = SHARE_DONT_CARE;
 	res->shares.max_shares = SHARE_DONT_CARE;
 	res->shares.child_shares_divisor = SHARE_DEFAULT_DIVISOR;
 	res->shares.unused_min_shares = SHARE_DEFAULT_DIVISOR;
+
+	memctlr_init_mem_counter(&res->counter);
 }
 
 static struct res_shares *memctlr_alloc_instance(struct resource_group *rgroup)
@@ -83,6 +165,7 @@ static struct res_shares *memctlr_alloc_
 		return NULL;
 	res->rgroup = rgroup;
 	memctlr_init_new(res);
+	res->magic = MEMCTLR_MAGIC;
 	if (is_res_group_root(rgroup)) {
 		root_rgroup = rgroup;
 		memctlr_root = res;
@@ -93,7 +176,7 @@ static struct res_shares *memctlr_alloc_
 
 static void memctlr_free_instance(struct res_shares *shares)
 {
-	struct memctlr *res, *parres;
+	struct memctlr *res;
 
 	res = get_memctlr_from_shares(shares);
 	BUG_ON(!res);
@@ -108,12 +191,19 @@ static void memctlr_free_instance(struct
 static ssize_t memctlr_show_stats(struct res_shares *shares, char *buf,
 					size_t len)
 {
-	int i = 0;
+	int i = 0, j = 0;
+	struct memctlr *res;
+
+	res = get_memctlr_from_shares(shares);
+	BUG_ON(!res);
 
-	i += snprintf(buf, len, "Accounting will be added soon\n");
+	i = snprintf(buf, len, "RSS Pages %ld\n",
+			atomic_long_read(&res->counter.rss));
 	buf += i;
 	len -= i;
-	return i;
+	j += i;
+
+	return j;
 }
 
 struct res_controller memctlr_rg = {
@@ -142,5 +232,6 @@ void __exit memctlr_exit(void)
 	BUG_ON(rc != 0);
 }
 
+
 module_init(memctlr_init);
 module_exit(memctlr_exit);
diff -puN include/linux/memctlr.h~container-memctlr-acct include/linux/memctlr.h
--- linux-2.6.19-rc2/include/linux/memctlr.h~container-memctlr-acct	2006-11-09 21:46:22.000000000 +0530
+++ linux-2.6.19-rc2-balbir/include/linux/memctlr.h	2006-11-09 21:46:22.000000000 +0530
@@ -26,6 +26,32 @@
 
 #ifdef CONFIG_RES_GROUPS_MEMORY
 #include <linux/res_group_rc.h>
+
+extern int mm_init_mem_counter(struct mm_struct *mm);
+extern void mm_assign_container(struct mm_struct *mm, struct task_struct *p);
+extern void memctlr_inc_rss(struct page *page);
+extern void memctlr_dec_rss(struct page *page);
+extern void mm_free_mem_counter(struct mm_struct *mm);
+
+#else /* CONFIG_RES_GROUPS_MEMORY */
+
+void memctlr_inc_rss(struct page *page)
+{}
+
+void memctlr_dec_rss(struct page *page)
+{}
+
+int mm_init_mem_counter(struct mm_struct *mm)
+{
+	return 0;
+}
+
+void mm_assign_container(struct mm_struct *mm, struct task_struct *p)
+{}
+
+void mm_free_mem_counter(struct mm_struct *mm)
+{}
+
 #endif /* CONFIG_RES_GROUPS_MEMORY */
 
 #endif /* _LINUX_MEMCTRL_H */
diff -puN kernel/fork.c~container-memctlr-acct kernel/fork.c
--- linux-2.6.19-rc2/kernel/fork.c~container-memctlr-acct	2006-11-09 21:46:22.000000000 +0530
+++ linux-2.6.19-rc2-balbir/kernel/fork.c	2006-11-09 21:46:22.000000000 +0530
@@ -49,6 +49,7 @@
 #include <linux/taskstats_kern.h>
 #include <linux/random.h>
 #include <linux/numtasks.h>
+#include <linux/memctlr.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -340,11 +341,14 @@ static struct mm_struct * mm_init(struct
 	mm->ioctx_list = NULL;
 	mm->free_area_cache = TASK_UNMAPPED_BASE;
 	mm->cached_hole_size = ~0UL;
+	if (mm_init_mem_counter(mm) < 0)
+		goto mem_fail;
 
 	if (likely(!mm_alloc_pgd(mm))) {
 		mm->def_flags = 0;
 		return mm;
 	}
+mem_fail:
 	free_mm(mm);
 	return NULL;
 }
@@ -372,6 +376,7 @@ struct mm_struct * mm_alloc(void)
 void fastcall __mmdrop(struct mm_struct *mm)
 {
 	BUG_ON(mm == &init_mm);
+	mm_free_mem_counter(mm);
 	mm_free_pgd(mm);
 	destroy_context(mm);
 	free_mm(mm);
@@ -544,6 +549,7 @@ static int copy_mm(unsigned long clone_f
 
 good_mm:
 	tsk->mm = mm;
+	mm_assign_container(mm, tsk);
 	tsk->active_mm = mm;
 	return 0;
 
diff -puN mm/rmap.c~container-memctlr-acct mm/rmap.c
--- linux-2.6.19-rc2/mm/rmap.c~container-memctlr-acct	2006-11-09 21:46:22.000000000 +0530
+++ linux-2.6.19-rc2-balbir/mm/rmap.c	2006-11-09 21:46:22.000000000 +0530
@@ -537,6 +537,7 @@ void page_add_anon_rmap(struct page *pag
 	if (atomic_inc_and_test(&page->_mapcount))
 		__page_set_anon_rmap(page, vma, address);
 	/* else checking page index and mapping is racy */
+	memctlr_inc_rss(page);
 }
 
 /*
@@ -553,6 +554,7 @@ void page_add_new_anon_rmap(struct page 
 {
 	atomic_set(&page->_mapcount, 0); /* elevate count by 1 (starts at -1) */
 	__page_set_anon_rmap(page, vma, address);
+	memctlr_inc_rss(page);
 }
 
 /**
@@ -565,6 +567,7 @@ void page_add_file_rmap(struct page *pag
 {
 	if (atomic_inc_and_test(&page->_mapcount))
 		__inc_zone_page_state(page, NR_FILE_MAPPED);
+	memctlr_inc_rss(page);
 }
 
 /**
@@ -596,8 +599,9 @@ void page_remove_rmap(struct page *page)
 		if (page_test_and_clear_dirty(page))
 			set_page_dirty(page);
 		__dec_zone_page_state(page,
-				PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED);
+				PageAnon(page) ?  NR_ANON_PAGES : NR_FILE_MAPPED);
 	}
+	memctlr_dec_rss(page, mm);
 }
 
 /*
diff -puN include/linux/rmap.h~container-memctlr-acct include/linux/rmap.h
--- linux-2.6.19-rc2/include/linux/rmap.h~container-memctlr-acct	2006-11-09 21:46:22.000000000 +0530
+++ linux-2.6.19-rc2-balbir/include/linux/rmap.h	2006-11-09 21:46:22.000000000 +0530
@@ -8,6 +8,7 @@
 #include <linux/slab.h>
 #include <linux/mm.h>
 #include <linux/spinlock.h>
+#include <linux/memctlr.h>
 
 /*
  * The anon_vma heads a list of private "related" vmas, to scan if
@@ -84,6 +85,7 @@ void page_remove_rmap(struct page *);
 static inline void page_dup_rmap(struct page *page)
 {
 	atomic_inc(&page->_mapcount);
+	memctlr_inc_rss(page);
 }
 
 /*
_

-- 

	Balbir Singh,
	Linux Technology Center,
	IBM Software Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC][PATCH 5/8] RSS controller task migration support
  2006-11-09 19:35 [RFC][PATCH 0/8] RSS controller for containers Balbir Singh
                   ` (3 preceding siblings ...)
  2006-11-09 19:36 ` [RFC][PATCH 4/8] RSS controller accounting Balbir Singh
@ 2006-11-09 19:36 ` Balbir Singh
  2006-11-09 19:36 ` [RFC][PATCH 6/8] RSS controller shares allocation Balbir Singh
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Balbir Singh @ 2006-11-09 19:36 UTC (permalink / raw)
  To: Linux MM
  Cc: dev, Linux Kernel Mailing List, ckrm-tech, haveblue, rohitseth,
	Balbir Singh


Support migration of tasks across groups. Migration uses the accounting
information tracked in the mm_struct to add/delete RSS from the container as
a process migrates from one container to the next.

This patch also adds a /proc/<tid>/memacct interface for debugging purposes.
/proc/<tid>/memacct prints the rss of the task

1. As accounted by the patches
2. By walking the page tables of the process


Signed-off-by: Balbir Singh <balbir@in.ibm.com>
---

 fs/proc/base.c             |    4 
 include/linux/memctlr.h    |    9 +
 include/linux/rmap.h       |    6 -
 kernel/res_group/memctlr.c |  228 ++++++++++++++++++++++++++++++++++++++++++---
 mm/filemap_xip.c           |    2 
 mm/fremap.c                |    2 
 mm/memory.c                |    6 -
 mm/rmap.c                  |    6 -
 8 files changed, 236 insertions(+), 27 deletions(-)

diff -puN kernel/res_group/memctlr.c~container-memctlr-task-migration kernel/res_group/memctlr.c
--- linux-2.6.19-rc2/kernel/res_group/memctlr.c~container-memctlr-task-migration	2006-11-09 21:56:49.000000000 +0530
+++ linux-2.6.19-rc2-balbir/kernel/res_group/memctlr.c	2006-11-09 21:56:49.000000000 +0530
@@ -31,10 +31,12 @@
 #include <linux/module.h>
 #include <linux/res_group_rc.h>
 #include <linux/memctlr.h>
+#include <linux/mm.h>
+#include <asm/pgtable.h>
 
 static const char res_ctlr_name[] = "memctlr";
 static struct resource_group *root_rgroup;
-static const char version[] = "0.01";
+static const char version[] = "0.05";
 static struct memctlr *memctlr_root;
 
 #define MEMCTLR_MAGIC	0xdededede
@@ -52,6 +54,7 @@ struct memctlr {
 	int successes;
 	int failures;
 	int magic;
+	spinlock_t lock;
 };
 
 struct res_controller memctlr_rg;
@@ -95,7 +98,7 @@ void mm_assign_container(struct mm_struc
 	rcu_read_unlock();
 }
 
-static inline struct memctlr *get_memctlr_from_page(struct page *page)
+static inline struct memctlr *get_task_memctlr(struct task_struct *p)
 {
 	struct resource_group *rgroup;
 	struct memctlr *res;
@@ -107,7 +110,7 @@ static inline struct memctlr *get_memctl
 		return NULL;
 
 	rcu_read_lock();
-	rgroup = (struct resource_group *)rcu_dereference(current->container);
+	rgroup = (struct resource_group *)rcu_dereference(p->container);
 	rcu_read_unlock();
 
 	res = get_memctlr(rgroup);
@@ -119,31 +122,54 @@ static inline struct memctlr *get_memctl
 }
 
 
-void memctlr_inc_rss(struct page *page)
+void memctlr_inc_rss_mm(struct page *page, struct mm_struct *mm)
 {
 	struct memctlr *res;
 
-	res = get_memctlr_from_page(page);
-	if (!res)
+	res = get_task_memctlr(current);
+	if (!res) {
+		printk(KERN_INFO "inc_rss no res set *---*\n");
 		return;
+	}
 
-	atomic_long_inc(&current->mm->counter->rss);
+	spin_lock(&res->lock);
+	atomic_long_inc(&mm->counter->rss);
 	atomic_long_inc(&res->counter.rss);
+	spin_unlock(&res->lock);
 }
 
-void memctlr_dec_rss(struct page *page)
+void memctlr_inc_rss(struct page *page)
 {
 	struct memctlr *res;
+	struct mm_struct *mm = get_task_mm(current);
 
-	res = get_memctlr_from_page(page);
-	if (!res)
+	res = get_task_memctlr(current);
+	if (!res) {
+		printk(KERN_INFO "inc_rss no res set *---*\n");
 		return;
+	}
 
-	atomic_long_dec(&res->counter.rss);
+	spin_lock(&res->lock);
+	atomic_long_inc(&mm->counter->rss);
+	atomic_long_inc(&res->counter.rss);
+	spin_unlock(&res->lock);
+	mmput(mm);
+}
 
-	if ((current->flags & PF_EXITING) && !current->mm)
+void memctlr_dec_rss(struct page *page, struct mm_struct *mm)
+{
+	struct memctlr *res;
+
+	res = get_task_memctlr(current);
+	if (!res) {
+		printk(KERN_INFO "dec_rss no res set *---*\n");
 		return;
-	atomic_long_dec(&current->mm->counter->rss);
+	}
+
+	spin_lock(&res->lock);
+	atomic_long_dec(&res->counter.rss);
+	atomic_long_dec(&mm->counter->rss);
+	spin_unlock(&res->lock);
 }
 
 static void memctlr_init_new(struct memctlr *res)
@@ -154,6 +180,7 @@ static void memctlr_init_new(struct memc
 	res->shares.unused_min_shares = SHARE_DEFAULT_DIVISOR;
 
 	memctlr_init_mem_counter(&res->counter);
+	spin_lock_init(&res->lock);
 }
 
 static struct res_shares *memctlr_alloc_instance(struct resource_group *rgroup)
@@ -188,6 +215,122 @@ static void memctlr_free_instance(struct
 	kfree(res);
 }
 
+static long count_pte_rss(struct vm_area_struct *vma, pmd_t *pmd,
+				unsigned long addr, unsigned long end)
+{
+	pte_t *pte;
+	long count = 0;
+
+	do {
+		pte = pte_offset_map(pmd, addr);
+		if (!pte_present(*pte))
+			continue;
+		count++;
+		pte_unmap(pte);
+	} while (pte++, addr += PAGE_SIZE, (addr != end));
+
+	return count;
+}
+
+static long count_pmd_rss(struct vm_area_struct *vma, pud_t *pud,
+				unsigned long addr, unsigned long end)
+{
+	pmd_t *pmd;
+	unsigned long next;
+	long count = 0;
+
+	pmd = pmd_offset(pud, addr);
+	do {
+		next = pmd_addr_end(addr, end);
+		if (pmd_none_or_clear_bad(pmd))
+			continue;
+		count += count_pte_rss(vma, pmd, addr, next);
+	} while (pmd++, addr = next, (addr != end));
+
+	return count;
+}
+
+static long count_pud_rss(struct vm_area_struct *vma, pgd_t *pgd,
+				unsigned long addr, unsigned long end)
+{
+	pud_t *pud;
+	unsigned long next;
+	long count = 0;
+
+	pud = pud_offset(pgd, addr);
+	do {
+		next = pud_addr_end(addr, end);
+		if (pud_none_or_clear_bad(pud))
+			continue;
+		count += count_pmd_rss(vma, pud, addr, next);
+	} while (pud++, addr = next, (addr != end));
+
+	return count;
+}
+
+static long count_pgd_rss(struct vm_area_struct *vma)
+{
+	unsigned long addr, next, end;
+	pgd_t *pgd;
+	long count = 0;
+
+	addr = vma->vm_start;
+	end = vma->vm_end;
+
+	pgd = pgd_offset(vma->vm_mm, addr);
+	do {
+		next = pgd_addr_end(addr, end);
+		if (pgd_none_or_clear_bad(pgd))
+			continue;
+		count += count_pud_rss(vma, pgd, addr, next);
+	} while (pgd++, addr = next, (addr != end));
+	return count;
+}
+
+static long count_rss(struct task_struct *p)
+{
+	int count = 0;
+	struct mm_struct *mm = get_task_mm(p);
+	struct vm_area_struct *vma = mm->mmap;
+
+	if (!mm)
+		return 0;
+
+	down_read(&mm->mmap_sem);
+	spin_lock(&mm->page_table_lock);
+
+	while (vma) {
+		count += count_pgd_rss(vma);
+		vma = vma->vm_next;
+	}
+
+	spin_unlock(&mm->page_table_lock);
+	up_read(&mm->mmap_sem);
+	mmput(mm);
+	return count;
+}
+
+int  proc_memacct(struct task_struct *p, char *buf)
+{
+	int i = 0, j = 0;
+	struct mm_struct *mm = get_task_mm(p);
+
+	if (!mm)
+		return sprintf(buf, "no mm associated with the task\n");
+
+	i = sprintf(buf, "rss pages %ld\n",
+			atomic_long_read(&mm->counter->rss));
+	buf += i;
+	j += i;
+
+	i = sprintf(buf, "pg table walk rss pages %ld\n", count_rss(p));
+	buf += i;
+	j += i;
+
+	mmput(mm);
+	return j;
+}
+
 static ssize_t memctlr_show_stats(struct res_shares *shares, char *buf,
 					size_t len)
 {
@@ -206,12 +349,69 @@ static ssize_t memctlr_show_stats(struct
 	return j;
 }
 
+static void double_res_lock(struct memctlr *old, struct memctlr *new)
+{
+	BUG_ON(old == new);
+	if (&old->lock > &new->lock) {
+		spin_lock(&old->lock);
+		spin_lock(&new->lock);
+	} else {
+		spin_lock(&new->lock);
+		spin_lock(&old->lock);
+	}
+}
+
+static void double_res_unlock(struct memctlr *old, struct memctlr *new)
+{
+	BUG_ON(old == new);
+	if (&old->lock > &new->lock) {
+		spin_unlock(&new->lock);
+		spin_unlock(&old->lock);
+	} else {
+		spin_unlock(&old->lock);
+		spin_unlock(&new->lock);
+	}
+}
+
+static void memctlr_move_task(struct task_struct *p, struct res_shares *old,
+				struct res_shares *new)
+{
+	struct memctlr *oldres, *newres;
+	long rss_pages;
+
+	if (old == new)
+		return;
+
+	/*
+	 * If a task has no mm structure associated with it we have
+	 * nothing to do
+	 */
+	if (!old || !new)
+		return;
+
+	if (p->pid != p->tgid)
+		return;
+
+	oldres = get_memctlr_from_shares(old);
+	newres = get_memctlr_from_shares(new);
+
+	double_res_lock(oldres, newres);
+
+	rss_pages = atomic_long_read(&p->mm->counter->rss);
+	atomic_long_sub(rss_pages, &oldres->counter.rss);
+
+	mm_assign_container(p->mm, p);
+	atomic_long_add(rss_pages, &newres->counter.rss);
+
+	double_res_unlock(oldres, newres);
+}
+
 struct res_controller memctlr_rg = {
 	.name = res_ctlr_name,
 	.ctlr_id = NO_RES_ID,
 	.alloc_shares_struct = memctlr_alloc_instance,
 	.free_shares_struct = memctlr_free_instance,
-	.move_task = NULL,
+	.move_task = memctlr_move_task,
 	.shares_changed = NULL,
 	.show_stats = memctlr_show_stats,
 };
diff -puN fs/proc/base.c~container-memctlr-task-migration fs/proc/base.c
--- linux-2.6.19-rc2/fs/proc/base.c~container-memctlr-task-migration	2006-11-09 21:56:49.000000000 +0530
+++ linux-2.6.19-rc2-balbir/fs/proc/base.c	2006-11-09 21:56:49.000000000 +0530
@@ -72,6 +72,7 @@
 #include <linux/audit.h>
 #include <linux/poll.h>
 #include <linux/nsproxy.h>
+#include <linux/memctlr.h>
 #include "internal.h"
 
 /* NOTE:
@@ -1759,6 +1760,9 @@ static struct pid_entry tgid_base_stuff[
 #ifdef CONFIG_NUMA
 	REG("numa_maps",  S_IRUGO, numa_maps),
 #endif
+#ifdef CONFIG_RES_GROUPS_MEMORY
+	INF("memacct",	  S_IRUGO, memacct),
+#endif
 	REG("mem",        S_IRUSR|S_IWUSR, mem),
 #ifdef CONFIG_SECCOMP
 	REG("seccomp",    S_IRUSR|S_IWUSR, seccomp),
diff -puN include/linux/memctlr.h~container-memctlr-task-migration include/linux/memctlr.h
--- linux-2.6.19-rc2/include/linux/memctlr.h~container-memctlr-task-migration	2006-11-09 21:56:49.000000000 +0530
+++ linux-2.6.19-rc2-balbir/include/linux/memctlr.h	2006-11-09 21:56:49.000000000 +0530
@@ -30,15 +30,20 @@
 extern int mm_init_mem_counter(struct mm_struct *mm);
 extern void mm_assign_container(struct mm_struct *mm, struct task_struct *p);
 extern void memctlr_inc_rss(struct page *page);
-extern void memctlr_dec_rss(struct page *page);
+extern void memctlr_inc_rss_mm(struct page *page, struct mm_struct *mm);
+extern void memctlr_dec_rss(struct page *page, struct mm_struct *mm);
 extern void mm_free_mem_counter(struct mm_struct *mm);
+extern int  proc_memacct(struct task_struct *task, char *buffer);
 
 #else /* CONFIG_RES_GROUPS_MEMORY */
 
 void memctlr_inc_rss(struct page *page)
 {}
 
-void memctlr_dec_rss(struct page *page)
+void memctlr_inc_rss_mm(struct page *page, struct mm_struct *mm)
+{}
+
+void memctlr_dec_rss(struct page *page, struct mm_struct *mm)
 {}
 
 int mm_init_mem_counter(struct mm_struct *mm)
diff -puN mm/filemap_xip.c~container-memctlr-task-migration mm/filemap_xip.c
--- linux-2.6.19-rc2/mm/filemap_xip.c~container-memctlr-task-migration	2006-11-09 21:56:49.000000000 +0530
+++ linux-2.6.19-rc2-balbir/mm/filemap_xip.c	2006-11-09 21:56:49.000000000 +0530
@@ -189,7 +189,7 @@ __xip_unmap (struct address_space * mapp
 			/* Nuke the page table entry. */
 			flush_cache_page(vma, address, pte_pfn(*pte));
 			pteval = ptep_clear_flush(vma, address, pte);
-			page_remove_rmap(page);
+			page_remove_rmap(page, mm);
 			dec_mm_counter(mm, file_rss);
 			BUG_ON(pte_dirty(pteval));
 			pte_unmap_unlock(pte, ptl);
diff -puN mm/fremap.c~container-memctlr-task-migration mm/fremap.c
--- linux-2.6.19-rc2/mm/fremap.c~container-memctlr-task-migration	2006-11-09 21:56:49.000000000 +0530
+++ linux-2.6.19-rc2-balbir/mm/fremap.c	2006-11-09 21:56:49.000000000 +0530
@@ -33,7 +33,7 @@ static int zap_pte(struct mm_struct *mm,
 		if (page) {
 			if (pte_dirty(pte))
 				set_page_dirty(page);
-			page_remove_rmap(page);
+			page_remove_rmap(page, mm);
 			page_cache_release(page);
 		}
 	} else {
diff -puN mm/memory.c~container-memctlr-task-migration mm/memory.c
--- linux-2.6.19-rc2/mm/memory.c~container-memctlr-task-migration	2006-11-09 21:56:49.000000000 +0530
+++ linux-2.6.19-rc2-balbir/mm/memory.c	2006-11-09 21:56:49.000000000 +0530
@@ -481,7 +481,7 @@ copy_one_pte(struct mm_struct *dst_mm, s
 	page = vm_normal_page(vma, addr, pte);
 	if (page) {
 		get_page(page);
-		page_dup_rmap(page);
+		page_dup_rmap(page, dst_mm);
 		rss[!!PageAnon(page)]++;
 	}
 
@@ -681,7 +681,7 @@ static unsigned long zap_pte_range(struc
 					mark_page_accessed(page);
 				file_rss--;
 			}
-			page_remove_rmap(page);
+			page_remove_rmap(page, mm);
 			tlb_remove_page(tlb, page);
 			continue;
 		}
@@ -1575,7 +1575,7 @@ gotten:
 	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
 	if (likely(pte_same(*page_table, orig_pte))) {
 		if (old_page) {
-			page_remove_rmap(old_page);
+			page_remove_rmap(old_page, mm);
 			if (!PageAnon(old_page)) {
 				dec_mm_counter(mm, file_rss);
 				inc_mm_counter(mm, anon_rss);
diff -puN mm/rmap.c~container-memctlr-task-migration mm/rmap.c
--- linux-2.6.19-rc2/mm/rmap.c~container-memctlr-task-migration	2006-11-09 21:56:49.000000000 +0530
+++ linux-2.6.19-rc2-balbir/mm/rmap.c	2006-11-09 21:56:49.000000000 +0530
@@ -576,7 +576,7 @@ void page_add_file_rmap(struct page *pag
  *
  * The caller needs to hold the pte lock.
  */
-void page_remove_rmap(struct page *page)
+void page_remove_rmap(struct page *page, struct mm_struct *mm)
 {
 	if (atomic_add_negative(-1, &page->_mapcount)) {
 		if (unlikely(page_mapcount(page) < 0)) {
@@ -689,7 +689,7 @@ static int try_to_unmap_one(struct page 
 		dec_mm_counter(mm, file_rss);
 
 
-	page_remove_rmap(page);
+	page_remove_rmap(page, mm);
 	page_cache_release(page);
 
 out_unmap:
@@ -779,7 +779,7 @@ static void try_to_unmap_cluster(unsigne
 		if (pte_dirty(pteval))
 			set_page_dirty(page);
 
-		page_remove_rmap(page);
+		page_remove_rmap(page, mm);
 		page_cache_release(page);
 		dec_mm_counter(mm, file_rss);
 		(*mapcount)--;
diff -puN include/linux/rmap.h~container-memctlr-task-migration include/linux/rmap.h
--- linux-2.6.19-rc2/include/linux/rmap.h~container-memctlr-task-migration	2006-11-09 21:56:49.000000000 +0530
+++ linux-2.6.19-rc2-balbir/include/linux/rmap.h	2006-11-09 21:56:49.000000000 +0530
@@ -73,7 +73,7 @@ void __anon_vma_link(struct vm_area_stru
 void page_add_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
 void page_add_new_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
 void page_add_file_rmap(struct page *);
-void page_remove_rmap(struct page *);
+void page_remove_rmap(struct page *, struct mm_struct *);
 
 /**
  * page_dup_rmap - duplicate pte mapping to a page
@@ -82,10 +82,10 @@ void page_remove_rmap(struct page *);
  * For copy_page_range only: minimal extract from page_add_rmap,
  * avoiding unnecessary tests (already checked) so it's quicker.
  */
-static inline void page_dup_rmap(struct page *page)
+static inline void page_dup_rmap(struct page *page, struct mm_struct *mm)
 {
 	atomic_inc(&page->_mapcount);
-	memctlr_inc_rss(page);
+	memctlr_inc_rss_mm(page, mm);
 }
 
 /*
_

-- 

	Balbir Singh,
	Linux Technology Center,
	IBM Software Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC][PATCH 6/8] RSS controller shares allocation
  2006-11-09 19:35 [RFC][PATCH 0/8] RSS controller for containers Balbir Singh
                   ` (4 preceding siblings ...)
  2006-11-09 19:36 ` [RFC][PATCH 5/8] RSS controller task migration support Balbir Singh
@ 2006-11-09 19:36 ` Balbir Singh
  2006-11-10  9:11   ` Pavel Emelianov
  2006-11-09 19:36 ` [RFC][PATCH 7/8] RSS controller fix resource groups parsing Balbir Singh
  2006-11-09 19:36 ` [RFC][PATCH 8/8] RSS controller support reclamation Balbir Singh
  7 siblings, 1 reply; 23+ messages in thread
From: Balbir Singh @ 2006-11-09 19:36 UTC (permalink / raw)
  To: Linux MM
  Cc: dev, ckrm-tech, Linux Kernel Mailing List, haveblue, rohitseth,
	Balbir Singh


Support shares assignment and propagation.

Signed-off-by: Balbir Singh <balbir@in.ibm.com>
---

 kernel/res_group/memctlr.c |   59 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 58 insertions(+), 1 deletion(-)

diff -puN kernel/res_group/memctlr.c~container-memctlr-shares kernel/res_group/memctlr.c
--- linux-2.6.19-rc2/kernel/res_group/memctlr.c~container-memctlr-shares	2006-11-09 22:20:28.000000000 +0530
+++ linux-2.6.19-rc2-balbir/kernel/res_group/memctlr.c	2006-11-09 22:20:28.000000000 +0530
@@ -32,6 +32,7 @@
 #include <linux/res_group_rc.h>
 #include <linux/memctlr.h>
 #include <linux/mm.h>
+#include <linux/swap.h>
 #include <asm/pgtable.h>
 
 static const char res_ctlr_name[] = "memctlr";
@@ -55,6 +56,7 @@ struct memctlr {
 	int failures;
 	int magic;
 	spinlock_t lock;
+	long nr_pages;
 };
 
 struct res_controller memctlr_rg;
@@ -180,6 +182,7 @@ static void memctlr_init_new(struct memc
 	res->shares.unused_min_shares = SHARE_DEFAULT_DIVISOR;
 
 	memctlr_init_mem_counter(&res->counter);
+	res->nr_pages = SHARE_DONT_CARE;
 	spin_lock_init(&res->lock);
 }
 
@@ -196,6 +199,7 @@ static struct res_shares *memctlr_alloc_
 	if (is_res_group_root(rgroup)) {
 		root_rgroup = rgroup;
 		memctlr_root = res;
+		res->nr_pages = nr_free_pages();
 		printk("Memory Controller version %s\n", version);
 	}
 	return &res->shares;
@@ -346,6 +350,11 @@ static ssize_t memctlr_show_stats(struct
 	len -= i;
 	j += i;
 
+	i = snprintf(buf, len, "Max Allowed Pages %ld\n", res->nr_pages);
+
+	buf += i;
+	len -= i;
+	j += i;
 	return j;
 }
 
@@ -406,13 +415,61 @@ static void memctlr_move_task(struct tas
 	double_res_unlock(oldres, newres);
 }
 
+static void recalc_and_propagate(struct memctlr *res, struct memctlr *parres)
+{
+	struct resource_group *child = NULL;
+	int child_divisor;
+	u64 numerator;
+	struct memctlr *child_res;
+
+	if (parres) {
+		if (res->shares.max_shares == SHARE_DONT_CARE ||
+			parres->shares.max_shares == SHARE_DONT_CARE)
+			return;
+
+		child_divisor = parres->shares.child_shares_divisor;
+		if (child_divisor == 0)
+			return;
+
+		numerator = (u64)(parres->shares.unused_min_shares *
+				res->shares.max_shares);
+		do_div(numerator, child_divisor);
+		numerator = (u64)(parres->nr_pages * numerator);
+		do_div(numerator, SHARE_DEFAULT_DIVISOR);
+		res->nr_pages = numerator;
+	}
+
+	for_each_child(child, res->rgroup) {
+		child_res = get_memctlr(child);
+		BUG_ON(!child_res);
+		recalc_and_propagate(child_res, res);
+	}
+
+}
+
+static void memctlr_shares_changed(struct res_shares *shares)
+{
+	struct memctlr *res, *parres;
+
+	res = get_memctlr_from_shares(shares);
+	if (!res)
+		return;
+
+	if (is_res_group_root(res->rgroup))
+		parres = NULL;
+	else
+		parres = get_memctlr((struct container *)res->rgroup->parent);
+
+	recalc_and_propagate(res, parres);
+}
+
 struct res_controller memctlr_rg = {
 	.name = res_ctlr_name,
 	.ctlr_id = NO_RES_ID,
 	.alloc_shares_struct = memctlr_alloc_instance,
 	.free_shares_struct = memctlr_free_instance,
 	.move_task = memctlr_move_task,
-	.shares_changed = NULL,
+	.shares_changed = memctlr_shares_changed,
 	.show_stats = memctlr_show_stats,
 };
 
_

-- 

	Balbir Singh,
	Linux Technology Center,
	IBM Software Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC][PATCH 7/8] RSS controller fix resource groups parsing
  2006-11-09 19:35 [RFC][PATCH 0/8] RSS controller for containers Balbir Singh
                   ` (5 preceding siblings ...)
  2006-11-09 19:36 ` [RFC][PATCH 6/8] RSS controller shares allocation Balbir Singh
@ 2006-11-09 19:36 ` Balbir Singh
  2006-11-10  9:13   ` Pavel Emelianov
  2006-11-09 19:36 ` [RFC][PATCH 8/8] RSS controller support reclamation Balbir Singh
  7 siblings, 1 reply; 23+ messages in thread
From: Balbir Singh @ 2006-11-09 19:36 UTC (permalink / raw)
  To: Linux MM
  Cc: dev, Linux Kernel Mailing List, ckrm-tech, haveblue, rohitseth,
	Balbir Singh


echo adds a "\n" to the end of a string. When this string is copied from
user space, we need to remove it, so that match_token() can parse
the user space string correctly

Signed-off-by: Balbir Singh <balbir@in.ibm.com>
---

 kernel/res_group/rgcs.c |    6 ++++++
 1 file changed, 6 insertions(+)

diff -puN kernel/res_group/rgcs.c~container-res-groups-fix-parsing kernel/res_group/rgcs.c
--- linux-2.6.19-rc2/kernel/res_group/rgcs.c~container-res-groups-fix-parsing	2006-11-09 23:08:10.000000000 +0530
+++ linux-2.6.19-rc2-balbir/kernel/res_group/rgcs.c	2006-11-09 23:08:10.000000000 +0530
@@ -241,6 +241,12 @@ ssize_t res_group_file_write(struct cont
 	}
 	buf[nbytes] = 0;	/* nul-terminate */
 
+	/*
+	 * Ignore "\n". It might come in from echo(1)
+	 */
+	if (buf[nbytes - 1] == '\n')
+		buf[nbytes - 1] = 0;
+
 	container_manage_lock();
 
 	if (container_is_removed(cont)) {
_

-- 

	Balbir Singh,
	Linux Technology Center,
	IBM Software Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC][PATCH 8/8] RSS controller support reclamation
  2006-11-09 19:35 [RFC][PATCH 0/8] RSS controller for containers Balbir Singh
                   ` (6 preceding siblings ...)
  2006-11-09 19:36 ` [RFC][PATCH 7/8] RSS controller fix resource groups parsing Balbir Singh
@ 2006-11-09 19:36 ` Balbir Singh
  2006-11-09 19:45   ` Arjan van de Ven
  2006-11-10  8:54   ` Pavel Emelianov
  7 siblings, 2 replies; 23+ messages in thread
From: Balbir Singh @ 2006-11-09 19:36 UTC (permalink / raw)
  To: Linux MM
  Cc: dev, ckrm-tech, Linux Kernel Mailing List, haveblue, rohitseth,
	Balbir Singh


Reclaim memory as we hit the max_shares limit. The code for reclamation
is inspired from Dave Hansen's challenged memory controller and from the
shrink_all_memory() code

Reclamation can be triggered from two paths

1. While incrementing the RSS, we hit the limit of the container
2. A container is resized, such that it's new limit is below its current
   RSS

In (1) reclamation takes place in the background.

TODO's

1. max_shares currently works like a soft limit. The RSS can grow beyond it's
   limit. One possible fix is to introduce a soft limit (reclaim when the
   container hits the soft limit) and fail when we hit the hard limit

Signed-off-by: Balbir Singh <balbir@in.ibm.com>
---

 include/linux/memctlr.h    |   17 ++++++
 kernel/fork.c              |    1 
 kernel/res_group/memctlr.c |  116 ++++++++++++++++++++++++++++++++++++++-------
 mm/rmap.c                  |   72 +++++++++++++++++++++++++++
 mm/vmscan.c                |   72 +++++++++++++++++++++++++++
 5 files changed, 260 insertions(+), 18 deletions(-)

diff -puN mm/vmscan.c~container-memctlr-reclaim mm/vmscan.c
--- linux-2.6.19-rc2/mm/vmscan.c~container-memctlr-reclaim	2006-11-09 22:21:11.000000000 +0530
+++ linux-2.6.19-rc2-balbir/mm/vmscan.c	2006-11-09 22:21:11.000000000 +0530
@@ -36,6 +36,8 @@
 #include <linux/rwsem.h>
 #include <linux/delay.h>
 #include <linux/kthread.h>
+#include <linux/container.h>
+#include <linux/memctlr.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -65,6 +67,9 @@ struct scan_control {
 	int swappiness;
 
 	int all_unreclaimable;
+
+	int overlimit;
+	void *container;	/* Added as void * to avoid #ifdef's */
 };
 
 /*
@@ -811,6 +816,10 @@ force_reclaim_mapped:
 		cond_resched();
 		page = lru_to_page(&l_hold);
 		list_del(&page->lru);
+		if (!memctlr_page_reclaim(page, sc->container, sc->overlimit)) {
+			list_add(&page->lru, &l_active);
+			continue;
+		}
 		if (page_mapped(page)) {
 			if (!reclaim_mapped ||
 			    (total_swap_pages == 0 && PageAnon(page)) ||
@@ -1008,6 +1017,8 @@ unsigned long try_to_free_pages(struct z
 		.swap_cluster_max = SWAP_CLUSTER_MAX,
 		.may_swap = 1,
 		.swappiness = vm_swappiness,
+		.overlimit = SC_OVERLIMIT_NONE,
+		.container = NULL,
 	};
 
 	count_vm_event(ALLOCSTALL);
@@ -1104,6 +1115,8 @@ static unsigned long balance_pgdat(pg_da
 		.may_swap = 1,
 		.swap_cluster_max = SWAP_CLUSTER_MAX,
 		.swappiness = vm_swappiness,
+		.overlimit = SC_OVERLIMIT_NONE,
+		.container = NULL,
 	};
 
 loop_again:
@@ -1324,7 +1337,7 @@ void wakeup_kswapd(struct zone *zone, in
 	wake_up_interruptible(&pgdat->kswapd_wait);
 }
 
-#ifdef CONFIG_PM
+#if defined(CONFIG_PM) || defined(CONFIG_RES_GROUPS_MEMORY)
 /*
  * Helper function for shrink_all_memory().  Tries to reclaim 'nr_pages' pages
  * from LRU lists system-wide, for given pass and priority, and returns the
@@ -1368,7 +1381,60 @@ static unsigned long shrink_all_zones(un
 
 	return ret;
 }
+#endif
 
+#ifdef CONFIG_RES_GROUPS_MEMORY
+/*
+ * Modelled after shrink_all_memory
+ */
+unsigned long memctlr_shrink_container_memory(unsigned long nr_pages,
+						struct container *container,
+						int overlimit)
+{
+	unsigned long lru_pages;
+	unsigned long ret = 0;
+	int pass;
+	struct zone *zone;
+	struct scan_control sc = {
+		.gfp_mask = GFP_KERNEL,
+		.may_swap = 0,
+		.swap_cluster_max = nr_pages,
+		.may_writepage = 1,
+		.swappiness = vm_swappiness,
+		.overlimit = overlimit,
+		.container = container,
+	};
+
+	lru_pages = 0;
+	for_each_zone(zone)
+		lru_pages += zone->nr_active + zone->nr_inactive;
+
+	for (pass = 0; pass < 5; pass++) {
+		int prio;
+
+		/* Force reclaiming mapped pages in the passes #3 and #4 */
+		if (pass > 2) {
+			sc.may_swap = 1;
+			sc.swappiness = 100;
+		}
+
+		for (prio = DEF_PRIORITY; prio >= 0; prio--) {
+			unsigned long nr_to_scan = nr_pages - ret;
+
+			sc.nr_scanned = 0;
+			ret += shrink_all_zones(nr_to_scan, prio, pass, &sc);
+			if (ret >= nr_pages)
+				break;
+
+			if (sc.nr_scanned && prio < DEF_PRIORITY - 2)
+				blk_congestion_wait(WRITE, HZ / 10);
+		}
+	}
+	return ret;
+}
+#endif
+
+#ifdef CONFIG_PM
 /*
  * Try to free `nr_pages' of memory, system-wide, and return the number of
  * freed pages.
@@ -1390,6 +1456,8 @@ unsigned long shrink_all_memory(unsigned
 		.swap_cluster_max = nr_pages,
 		.may_writepage = 1,
 		.swappiness = vm_swappiness,
+		.overlimit = SC_OVERLIMIT_NONE,
+		.container = NULL,
 	};
 
 	current->reclaim_state = &reclaim_state;
@@ -1585,6 +1653,8 @@ static int __zone_reclaim(struct zone *z
 					SWAP_CLUSTER_MAX),
 		.gfp_mask = gfp_mask,
 		.swappiness = vm_swappiness,
+		.overlimit = SC_OVERLIMIT_NONE,
+		.container = NULL,
 	};
 	unsigned long slab_reclaimable;
 
diff -puN kernel/res_group/memctlr.c~container-memctlr-reclaim kernel/res_group/memctlr.c
--- linux-2.6.19-rc2/kernel/res_group/memctlr.c~container-memctlr-reclaim	2006-11-09 22:21:11.000000000 +0530
+++ linux-2.6.19-rc2-balbir/kernel/res_group/memctlr.c	2006-11-09 22:21:11.000000000 +0530
@@ -33,6 +33,7 @@
 #include <linux/memctlr.h>
 #include <linux/mm.h>
 #include <linux/swap.h>
+#include <linux/workqueue.h>
 #include <asm/pgtable.h>
 
 static const char res_ctlr_name[] = "memctlr";
@@ -40,7 +41,10 @@ static struct resource_group *root_rgrou
 static const char version[] = "0.05";
 static struct memctlr *memctlr_root;
 
-#define MEMCTLR_MAGIC	0xdededede
+static void memctlr_callback(void *data);
+static atomic_long_t failed_inc_rss;
+static atomic_long_t failed_dec_rss;
+
 
 struct mem_counter {
 	atomic_long_t	rss;
@@ -57,9 +61,12 @@ struct memctlr {
 	int magic;
 	spinlock_t lock;
 	long nr_pages;
+	int reclaim_in_progress;
 };
 
 struct res_controller memctlr_rg;
+static DECLARE_WORK(memctlr_work, memctlr_callback, NULL);
+#define MEMCTLR_MAGIC	0xdededede
 
 static struct memctlr *get_memctlr_from_shares(struct res_shares *shares)
 {
@@ -96,7 +103,7 @@ void mm_free_mem_counter(struct mm_struc
 void mm_assign_container(struct mm_struct *mm, struct task_struct *p)
 {
 	rcu_read_lock();
-	mm->container = rcu_dereference(p->container);
+	rcu_assign_pointer(mm->container, rcu_dereference(p->container));
 	rcu_read_unlock();
 }
 
@@ -123,38 +130,64 @@ static inline struct memctlr *get_task_m
 	return res;
 }
 
-
-void memctlr_inc_rss_mm(struct page *page, struct mm_struct *mm)
+static void memctlr_callback(void *data)
 {
-	struct memctlr *res;
+	struct memctlr *res = (struct memctlr *)data;
+	long rss;
+	unsigned long nr_shrink = 0;
 
-	res = get_task_memctlr(current);
-	if (!res) {
-		printk(KERN_INFO "inc_rss no res set *---*\n");
-		return;
-	}
+	BUG_ON(!res);
 
 	spin_lock(&res->lock);
-	atomic_long_inc(&mm->counter->rss);
-	atomic_long_inc(&res->counter.rss);
+	rss = atomic_long_read(&res->counter.rss);
+	if ((rss > res->nr_pages) && (res->nr_pages > 0))
+		nr_shrink = rss - ((res->nr_pages * 4) / 5);
+	spin_unlock(&res->lock);
+
+	if (nr_shrink)
+		memctlr_shrink_container_memory(nr_shrink, res->rgroup,
+						SC_OVERLIMIT_ONE);
+	spin_lock(&res->lock);
+	res->reclaim_in_progress = 0;
 	spin_unlock(&res->lock);
 }
 
-void memctlr_inc_rss(struct page *page)
+void memctlr_inc_rss_mm(struct page *page, struct mm_struct *mm)
 {
 	struct memctlr *res;
-	struct mm_struct *mm = get_task_mm(current);
+	long rss;
 
 	res = get_task_memctlr(current);
 	if (!res) {
-		printk(KERN_INFO "inc_rss no res set *---*\n");
+		atomic_long_inc(&failed_inc_rss);
 		return;
 	}
 
 	spin_lock(&res->lock);
 	atomic_long_inc(&mm->counter->rss);
 	atomic_long_inc(&res->counter.rss);
+	rss = atomic_long_read(&res->counter.rss);
+	if ((res->nr_pages < rss) && (res->nr_pages > 0)) {
+		/*
+		 * Reclaim if we exceed our limit
+		 * Schedule a job to do so
+	 	*/
+		if (res->reclaim_in_progress)
+			goto done;
+		res->reclaim_in_progress = 1;
+		spin_unlock(&res->lock);
+		PREPARE_WORK(&memctlr_work, memctlr_callback, res);
+		schedule_work(&memctlr_work);
+		return;
+	}
+done:
 	spin_unlock(&res->lock);
+}
+
+void memctlr_inc_rss(struct page *page)
+{
+	struct mm_struct *mm = get_task_mm(current);
+	memctlr_inc_rss_mm(page, mm);
 	mmput(mm);
 }
 
@@ -162,9 +195,9 @@ void memctlr_dec_rss(struct page *page, 
 {
 	struct memctlr *res;
 
-	res = get_task_memctlr(current);
+	res = get_memctlr(mm->container);
 	if (!res) {
-		printk(KERN_INFO "dec_rss no res set *---*\n");
+		atomic_long_inc(&failed_dec_rss);
 		return;
 	}
 
@@ -183,6 +216,7 @@ static void memctlr_init_new(struct memc
 
 	memctlr_init_mem_counter(&res->counter);
 	res->nr_pages = SHARE_DONT_CARE;
+	res->reclaim_in_progress = 0;
 	spin_lock_init(&res->lock);
 }
 
@@ -200,6 +234,7 @@ static struct res_shares *memctlr_alloc_
 		root_rgroup = rgroup;
 		memctlr_root = res;
 		res->nr_pages = nr_free_pages();
+		res->shares.max_shares = SHARE_DEFAULT_DIVISOR;
 		printk("Memory Controller version %s\n", version);
 	}
 	return &res->shares;
@@ -355,6 +390,20 @@ static ssize_t memctlr_show_stats(struct
 	buf += i;
 	len -= i;
 	j += i;
+
+	i = snprintf(buf, len, "Failed INC RSS Pages %ld\n",
+			atomic_long_read(&failed_inc_rss));
+
+	buf += i;
+	len -= i;
+	j += i;
+
+	i = snprintf(buf, len, "Failed DEC RSS Pages %ld\n",
+			atomic_long_read(&failed_dec_rss));
+
+	buf += i;
+	len -= i;
+	j += i;
 	return j;
 }
 
@@ -421,6 +470,8 @@ static void recalc_and_propagate(struct 
 	int child_divisor;
 	u64 numerator;
 	struct memctlr *child_res;
+	long rss;
+	unsigned long nr_shrink = 0;
 
 	if (parres) {
 		if (res->shares.max_shares == SHARE_DONT_CARE ||
@@ -445,6 +496,35 @@ static void recalc_and_propagate(struct 
 		recalc_and_propagate(child_res, res);
 	}
 
+	/*
+	 * Reclaim if our limit was shrunk
+	 */
+	spin_lock(&res->lock);
+	rss = atomic_long_read(&res->counter.rss);
+	if ((rss > res->nr_pages) && (res->nr_pages > 0))
+		nr_shrink = rss - ((res->nr_pages * 4) / 5);
+	spin_unlock(&res->lock);
+
+	if (nr_shrink)
+		memctlr_shrink_container_memory(nr_shrink, NULL,
+						SC_OVERLIMIT_ALL);
+}
+
+int memctlr_over_limit(struct container *container)
+{
+	struct resource_group *rgroup = container;
+	struct memctlr *res;
+	int ret = 0;
+
+	res = get_memctlr(rgroup);
+	if (!res)
+		return ret;
+
+	spin_lock(&res->lock);
+	if (atomic_long_read(&res->counter.rss) > res->nr_pages)
+		ret = 1;
+	spin_unlock(&res->lock);
+	return ret;
 }
 
 static void memctlr_shares_changed(struct res_shares *shares)
@@ -477,6 +557,8 @@ int __init memctlr_init(void)
 {
 	if (memctlr_rg.ctlr_id != NO_RES_ID)
 		return -EBUSY;	/* already registered */
+	atomic_long_set(&failed_inc_rss, 0);
+	atomic_long_set(&failed_dec_rss, 0);
 	return register_controller(&memctlr_rg);
 }
 
diff -puN include/linux/memctlr.h~container-memctlr-reclaim include/linux/memctlr.h
--- linux-2.6.19-rc2/include/linux/memctlr.h~container-memctlr-reclaim	2006-11-09 22:21:11.000000000 +0530
+++ linux-2.6.19-rc2-balbir/include/linux/memctlr.h	2006-11-09 22:21:11.000000000 +0530
@@ -34,6 +34,12 @@ extern void memctlr_inc_rss_mm(struct pa
 extern void memctlr_dec_rss(struct page *page, struct mm_struct *mm);
 extern void mm_free_mem_counter(struct mm_struct *mm);
 extern int  proc_memacct(struct task_struct *task, char *buffer);
+extern unsigned long memctlr_shrink_container_memory(unsigned long nr_pages,
+						struct container *container,
+						int overlimit);
+extern int memctlr_page_reclaim(struct page *page, void *container,
+				int overlimit);
+extern int memctlr_over_limit(struct container *container);
 
 #else /* CONFIG_RES_GROUPS_MEMORY */
 
@@ -54,9 +60,20 @@ int mm_init_mem_counter(struct mm_struct
 void mm_assign_container(struct mm_struct *mm, struct task_struct *p)
 {}
 
+int memctlr_page_reclaim(struct page *page, void *container, int overlimit)
+{
+	return 1;
+}
+
 void mm_free_mem_counter(struct mm_struct *mm)
 {}
 
 #endif /* CONFIG_RES_GROUPS_MEMORY */
 
+enum {
+	SC_OVERLIMIT_NONE,	/* The scan is container independent */
+	SC_OVERLIMIT_ONE,	/* Scan the one container specified */
+	SC_OVERLIMIT_ALL,	/* Scan all containers */
+};
+
 #endif /* _LINUX_MEMCTRL_H */
diff -puN mm/rmap.c~container-memctlr-reclaim mm/rmap.c
--- linux-2.6.19-rc2/mm/rmap.c~container-memctlr-reclaim	2006-11-09 22:21:11.000000000 +0530
+++ linux-2.6.19-rc2-balbir/mm/rmap.c	2006-11-09 22:21:11.000000000 +0530
@@ -604,6 +604,78 @@ void page_remove_rmap(struct page *page,
 	memctlr_dec_rss(page, mm);
 }
 
+#ifdef CONFIG_RES_GROUPS_MEMORY
+/*
+ * Can we push this code down to try_to_unmap()?
+ */
+int memctlr_page_reclaim(struct page *page, void *container, int overlimit)
+{
+	int ret = 0;
+
+	if (overlimit == SC_OVERLIMIT_NONE)
+		return 1;
+	if (container == NULL && overlimit != SC_OVERLIMIT_ALL)
+		return 1;
+
+	if (!page_mapped(page))
+		return 0;
+
+	if (PageAnon(page)) {
+		struct anon_vma *anon_vma;
+		struct vm_area_struct *vma;
+
+		anon_vma = page_lock_anon_vma(page);
+		if (!anon_vma)
+			return 0;
+
+		list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
+			if (memctlr_over_limit(vma->vm_mm->container) &&
+				((container == vma->vm_mm->container) ||
+				  (overlimit == SC_OVERLIMIT_ALL))) {
+				ret = 1;
+				break;
+			}
+		}
+		spin_unlock(&anon_vma->lock);
+	} else {
+		struct address_space *mapping = page_mapping(page);
+		pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+		struct vm_area_struct *vma;
+		struct prio_tree_iter iter;
+
+		if (!mapping)
+			return 0;
+
+		spin_lock(&mapping->i_mmap_lock);
+		vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff,
+								pgoff) {
+			if (memctlr_over_limit(vma->vm_mm->container) &&
+				((container == vma->vm_mm->container) ||
+				  (overlimit == SC_OVERLIMIT_ALL))) {
+				ret = 1;
+				break;
+			}
+		}
+		if (ret)
+			goto done;
+
+		list_for_each_entry(vma, &mapping->i_mmap_nonlinear,
+					shared.vm_set.list) {
+			if (memctlr_over_limit(vma->vm_mm->container) &&
+				((container == vma->vm_mm->container) ||
+				  (overlimit == SC_OVERLIMIT_ALL))) {
+				ret = 1;
+				break;
+			}
+		}
+done:
+		spin_unlock(&mapping->i_mmap_lock);
+	}
+
+	return ret;
+}
+#endif
+
 /*
  * Subfunctions of try_to_unmap: try_to_unmap_one called
  * repeatedly from either try_to_unmap_anon or try_to_unmap_file.
diff -puN kernel/fork.c~container-memctlr-reclaim kernel/fork.c
--- linux-2.6.19-rc2/kernel/fork.c~container-memctlr-reclaim	2006-11-09 22:21:11.000000000 +0530
+++ linux-2.6.19-rc2-balbir/kernel/fork.c	2006-11-09 22:21:11.000000000 +0530
@@ -364,6 +364,7 @@ struct mm_struct * mm_alloc(void)
 	if (mm) {
 		memset(mm, 0, sizeof(*mm));
 		mm = mm_init(mm);
+		mm_assign_container(mm, current);
 	}
 	return mm;
 }
_

-- 

	Balbir Singh,
	Linux Technology Center,
	IBM Software Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 8/8] RSS controller support reclamation
  2006-11-09 19:36 ` [RFC][PATCH 8/8] RSS controller support reclamation Balbir Singh
@ 2006-11-09 19:45   ` Arjan van de Ven
  2006-11-10  1:56     ` [ckrm-tech] " Balbir Singh
  2006-11-10  8:54   ` Pavel Emelianov
  1 sibling, 1 reply; 23+ messages in thread
From: Arjan van de Ven @ 2006-11-09 19:45 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Linux MM, dev, ckrm-tech, Linux Kernel Mailing List, haveblue, rohitseth

On Fri, 2006-11-10 at 01:06 +0530, Balbir Singh wrote:
> 
> Reclaim memory as we hit the max_shares limit. The code for reclamation
> is inspired from Dave Hansen's challenged memory controller and from the
> shrink_all_memory() code

Hmm.. I seem to remember that all previous RSS rlimit attempts actually
fell flat on their face because of the reclaim-on-rss-overflow behavior;
in the shared page / cached page (equally important!) case, it means
process A (or container A) suddenly penalizes process B (or container B)
by making B have pagecache misses because A was using a low RSS limit.

Unmapping the page makes sense, sure, and even moving then to inactive
lists or whatever that is called in the vm today, but reclaim... that's
expensive...

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [ckrm-tech] [RFC][PATCH 8/8] RSS controller support reclamation
  2006-11-09 19:45   ` Arjan van de Ven
@ 2006-11-10  1:56     ` Balbir Singh
  0 siblings, 0 replies; 23+ messages in thread
From: Balbir Singh @ 2006-11-10  1:56 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: dev, ckrm-tech, haveblue, Linux Kernel Mailing List, Linux MM, rohitseth

On 11/10/06, Arjan van de Ven <arjan@infradead.org> wrote:
> On Fri, 2006-11-10 at 01:06 +0530, Balbir Singh wrote:
> >
> > Reclaim memory as we hit the max_shares limit. The code for reclamation
> > is inspired from Dave Hansen's challenged memory controller and from the
> > shrink_all_memory() code
>
>
> Hmm.. I seem to remember that all previous RSS rlimit attempts actually
> fell flat on their face because of the reclaim-on-rss-overflow behavior;
> in the shared page / cached page (equally important!) case, it means
> process A (or container A) suddenly penalizes process B (or container B)
> by making B have pagecache misses because A was using a low RSS limit.
>
> Unmapping the page makes sense, sure, and even moving then to inactive
> lists or whatever that is called in the vm today, but reclaim... that's
> expensive...
>

I see your point, one of things we could do is that we could track
shared and cached pages separately and not be so severe on them.

I'll play around with this idea and see what I come up with.

Thanks for the feedback,
Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 8/8] RSS controller support reclamation
  2006-11-09 19:36 ` [RFC][PATCH 8/8] RSS controller support reclamation Balbir Singh
  2006-11-09 19:45   ` Arjan van de Ven
@ 2006-11-10  8:54   ` Pavel Emelianov
  2006-11-10  9:16     ` Balbir Singh
  1 sibling, 1 reply; 23+ messages in thread
From: Pavel Emelianov @ 2006-11-10  8:54 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Linux MM, dev, ckrm-tech, Linux Kernel Mailing List, haveblue, rohitseth

Balbir Singh wrote:
> Reclaim memory as we hit the max_shares limit. The code for reclamation
> is inspired from Dave Hansen's challenged memory controller and from the
> shrink_all_memory() code
> 
> Reclamation can be triggered from two paths
> 
> 1. While incrementing the RSS, we hit the limit of the container
> 2. A container is resized, such that it's new limit is below its current
>    RSS
> 
> In (1) reclamation takes place in the background.

Hmm... This is not a hard limit in this case, right? And in case
of overloaded system from the moment reclamation thread is woken
up till the moment it starts shrinking zones container may touch
too many pages...

That's not good.

> TODO's
> 
> 1. max_shares currently works like a soft limit. The RSS can grow beyond it's
>    limit. One possible fix is to introduce a soft limit (reclaim when the
>    container hits the soft limit) and fail when we hit the hard limit

Such soft limit doesn't help also. It just makes effects on
low-loaded system smoother.

And what about a hard limit - how would you fail in page fault in
case of limit hit? SIGKILL/SEGV is not an option - in this case we
should run synchronous reclamation. This is done in beancounter
patches v6 we've sent recently.

> Signed-off-by: Balbir Singh <balbir@in.ibm.com>
> ---
> 
> --- linux-2.6.19-rc2/mm/vmscan.c~container-memctlr-reclaim	2006-11-09 22:21:11.000000000 +0530
> +++ linux-2.6.19-rc2-balbir/mm/vmscan.c	2006-11-09 22:21:11.000000000 +0530
> @@ -36,6 +36,8 @@
>  #include <linux/rwsem.h>
>  #include <linux/delay.h>
>  #include <linux/kthread.h>
> +#include <linux/container.h>
> +#include <linux/memctlr.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/div64.h>
> @@ -65,6 +67,9 @@ struct scan_control {
>  	int swappiness;
>  
>  	int all_unreclaimable;
> +
> +	int overlimit;
> +	void *container;	/* Added as void * to avoid #ifdef's */
>  };
>  
>  /*
> @@ -811,6 +816,10 @@ force_reclaim_mapped:
>  		cond_resched();
>  		page = lru_to_page(&l_hold);
>  		list_del(&page->lru);
> +		if (!memctlr_page_reclaim(page, sc->container, sc->overlimit)) {
> +			list_add(&page->lru, &l_active);
> +			continue;
> +		}
>  		if (page_mapped(page)) {
>  			if (!reclaim_mapped ||
>  			    (total_swap_pages == 0 && PageAnon(page)) ||

[snip] See comment below.

>  
> +#ifdef CONFIG_RES_GROUPS_MEMORY
> +/*
> + * Modelled after shrink_all_memory
> + */
> +unsigned long memctlr_shrink_container_memory(unsigned long nr_pages,
> +						struct container *container,
> +						int overlimit)
> +{
> +	unsigned long lru_pages;
> +	unsigned long ret = 0;
> +	int pass;
> +	struct zone *zone;
> +	struct scan_control sc = {
> +		.gfp_mask = GFP_KERNEL,
> +		.may_swap = 0,
> +		.swap_cluster_max = nr_pages,
> +		.may_writepage = 1,
> +		.swappiness = vm_swappiness,
> +		.overlimit = overlimit,
> +		.container = container,
> +	};
> +

[snip]

> +		for (prio = DEF_PRIORITY; prio >= 0; prio--) {
> +			unsigned long nr_to_scan = nr_pages - ret;
> +
> +			sc.nr_scanned = 0;
> +			ret += shrink_all_zones(nr_to_scan, prio, pass, &sc);
> +			if (ret >= nr_pages)
> +				break;
> +
> +			if (sc.nr_scanned && prio < DEF_PRIORITY - 2)
> +				blk_congestion_wait(WRITE, HZ / 10);
> +		}
> +	}
> +	return ret;
> +}
> +#endif

Please correct me if I'm wrong, but does this reclamation work like
"run over all the zones' lists searching for page whose controller
is sc->container" ?

[snip]

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 4/8] RSS controller accounting
  2006-11-09 19:36 ` [RFC][PATCH 4/8] RSS controller accounting Balbir Singh
@ 2006-11-10  9:06   ` Pavel Emelianov
  2006-11-10  9:29     ` Balbir Singh
  0 siblings, 1 reply; 23+ messages in thread
From: Pavel Emelianov @ 2006-11-10  9:06 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Linux MM, dev, ckrm-tech, Linux Kernel Mailing List, haveblue, rohitseth

Balbir Singh wrote:
> Account RSS usage of a task and the associated container. The definition
> of RSS was debated and discussed in the following thread
> 
> 	http://lkml.org/lkml/2006/10/10/130
> 
> 
> The code tracks all resident pages (including shared pages) as RSS. This patch
> can easily adapt to the definition of RSS that will be agreed upon. This
> implementation provides a proof of concept RSS controller.
> 
> The accounting is inspired from Rohit Seth's container patches.
> 
> TODO's
> 
> 1. Merge file_rss and anon_rss tracking with the current rss tracking to
>    maximize code reuse
> 2. Add/remove RSS tracking as the definition of RSS evolves
> 
> 
> Signed-off-by: Balbir Singh <balbir@in.ibm.com>
> ---
> 

[snip]

> --- linux-2.6.19-rc2/kernel/res_group/memctlr.c~container-memctlr-acct	2006-11-09 21:46:22.000000000 +0530
> +++ linux-2.6.19-rc2-balbir/kernel/res_group/memctlr.c	2006-11-09 21:47:06.000000000 +0530
> @@ -37,6 +37,8 @@ static struct resource_group *root_rgrou
>  static const char version[] = "0.01";
>  static struct memctlr *memctlr_root;
>  
> +#define MEMCTLR_MAGIC	0xdededede
> +
>  struct mem_counter {
>  	atomic_long_t	rss;
>  };
> @@ -49,6 +51,7 @@ struct memctlr {
>  	/* Statistics */
>  	int successes;
>  	int failures;
> +	int magic;

What is this magic for? Is it just for debugging?

[snip]

> +static inline struct memctlr *get_memctlr_from_page(struct page *page)
> +{
> +	struct resource_group *rgroup;
> +	struct memctlr *res;
> +
> +	/*
> +	 * Is the resource groups infrastructure initialized?
> +	 */
> +	if (!memctlr_root)
> +		return NULL;
> +
> +	rcu_read_lock();
> +	rgroup = (struct resource_group *)rcu_dereference(current->container);
> +	rcu_read_unlock();
> +
> +	res = get_memctlr(rgroup);
> +	if (!res)
> +		return NULL;
> +
> +	BUG_ON(res->magic != MEMCTLR_MAGIC);
> +	return res;
> +}

I don't see how page passed to this function is involved into
'struct memctlr *res' determining. Could you comment this?

[snip]

> --- linux-2.6.19-rc2/mm/rmap.c~container-memctlr-acct	2006-11-09 21:46:22.000000000 +0530
> +++ linux-2.6.19-rc2-balbir/mm/rmap.c	2006-11-09 21:46:22.000000000 +0530
> @@ -537,6 +537,7 @@ void page_add_anon_rmap(struct page *pag
>  	if (atomic_inc_and_test(&page->_mapcount))
>  		__page_set_anon_rmap(page, vma, address);
>  	/* else checking page index and mapping is racy */
> +	memctlr_inc_rss(page);
>  }
>  
>  /*
> @@ -553,6 +554,7 @@ void page_add_new_anon_rmap(struct page 
>  {
>  	atomic_set(&page->_mapcount, 0); /* elevate count by 1 (starts at -1) */
>  	__page_set_anon_rmap(page, vma, address);
> +	memctlr_inc_rss(page);
>  }
>  
>  /**
> @@ -565,6 +567,7 @@ void page_add_file_rmap(struct page *pag
>  {
>  	if (atomic_inc_and_test(&page->_mapcount))
>  		__inc_zone_page_state(page, NR_FILE_MAPPED);
> +	memctlr_inc_rss(page);

Consider a task maps one file page 100 times in different places
and touches 'all of them'. In this case I see that you'll get
100 in rss counter while real rss will be just 1.

>  }
>  
>  /**
> @@ -596,8 +599,9 @@ void page_remove_rmap(struct page *page)
>  		if (page_test_and_clear_dirty(page))
>  			set_page_dirty(page);
>  		__dec_zone_page_state(page,
> -				PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED);
> +				PageAnon(page) ?  NR_ANON_PAGES : NR_FILE_MAPPED);

What is this extra space after a question-mark for?

>  	}
> +	memctlr_dec_rss(page, mm);
>  }
>  
>  /*
> diff -puN include/linux/rmap.h~container-memctlr-acct include/linux/rmap.h
> --- linux-2.6.19-rc2/include/linux/rmap.h~container-memctlr-acct	2006-11-09 21:46:22.000000000 +0530
> +++ linux-2.6.19-rc2-balbir/include/linux/rmap.h	2006-11-09 21:46:22.000000000 +0530
> @@ -8,6 +8,7 @@
>  #include <linux/slab.h>
>  #include <linux/mm.h>
>  #include <linux/spinlock.h>
> +#include <linux/memctlr.h>
>  
>  /*
>   * The anon_vma heads a list of private "related" vmas, to scan if
> @@ -84,6 +85,7 @@ void page_remove_rmap(struct page *);
>  static inline void page_dup_rmap(struct page *page)
>  {
>  	atomic_inc(&page->_mapcount);
> +	memctlr_inc_rss(page);
>  }

I'm not sure this is correct. page_dup_rmap() happens in the context
of forking process and thus you'll increment rss counter on current.
But this must be incremented at new task's counter, mustn't it?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 6/8] RSS controller shares allocation
  2006-11-09 19:36 ` [RFC][PATCH 6/8] RSS controller shares allocation Balbir Singh
@ 2006-11-10  9:11   ` Pavel Emelianov
  2006-11-10 10:27     ` [ckrm-tech] " Balbir Singh
  0 siblings, 1 reply; 23+ messages in thread
From: Pavel Emelianov @ 2006-11-10  9:11 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Linux MM, dev, ckrm-tech, Linux Kernel Mailing List, haveblue, rohitseth

Balbir Singh wrote:
> Support shares assignment and propagation.
> 
> Signed-off-by: Balbir Singh <balbir@in.ibm.com>
> ---
> 
>  kernel/res_group/memctlr.c |   59 ++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 58 insertions(+), 1 deletion(-)

[snip]

> +static void recalc_and_propagate(struct memctlr *res, struct memctlr *parres)
> +{
> +	struct resource_group *child = NULL;
> +	int child_divisor;
> +	u64 numerator;
> +	struct memctlr *child_res;
> +
> +	if (parres) {
> +		if (res->shares.max_shares == SHARE_DONT_CARE ||
> +			parres->shares.max_shares == SHARE_DONT_CARE)
> +			return;
> +
> +		child_divisor = parres->shares.child_shares_divisor;
> +		if (child_divisor == 0)
> +			return;
> +
> +		numerator = (u64)(parres->shares.unused_min_shares *
> +				res->shares.max_shares);
> +		do_div(numerator, child_divisor);
> +		numerator = (u64)(parres->nr_pages * numerator);
> +		do_div(numerator, SHARE_DEFAULT_DIVISOR);
> +		res->nr_pages = numerator;
> +	}
> +
> +	for_each_child(child, res->rgroup) {
> +		child_res = get_memctlr(child);
> +		BUG_ON(!child_res);
> +		recalc_and_propagate(child_res, res);

Recursion? Won't it eat all the stack in case of a deep tree?

> +	}
> +
> +}
> +
> +static void memctlr_shares_changed(struct res_shares *shares)
> +{
> +	struct memctlr *res, *parres;
> +
> +	res = get_memctlr_from_shares(shares);
> +	if (!res)
> +		return;
> +
> +	if (is_res_group_root(res->rgroup))
> +		parres = NULL;
> +	else
> +		parres = get_memctlr((struct container *)res->rgroup->parent);
> +
> +	recalc_and_propagate(res, parres);
> +}
> +
>  struct res_controller memctlr_rg = {
>  	.name = res_ctlr_name,
>  	.ctlr_id = NO_RES_ID,
>  	.alloc_shares_struct = memctlr_alloc_instance,
>  	.free_shares_struct = memctlr_free_instance,
>  	.move_task = memctlr_move_task,
> -	.shares_changed = NULL,
> +	.shares_changed = memctlr_shares_changed,

I didn't find where in this patches this callback is called.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 7/8] RSS controller fix resource groups parsing
  2006-11-09 19:36 ` [RFC][PATCH 7/8] RSS controller fix resource groups parsing Balbir Singh
@ 2006-11-10  9:13   ` Pavel Emelianov
  2006-11-10  9:32     ` Balbir Singh
  0 siblings, 1 reply; 23+ messages in thread
From: Pavel Emelianov @ 2006-11-10  9:13 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Linux MM, dev, Linux Kernel Mailing List, ckrm-tech, haveblue, rohitseth

Balbir Singh wrote:
> echo adds a "\n" to the end of a string. When this string is copied from
> user space, we need to remove it, so that match_token() can parse
> the user space string correctly
> 
> Signed-off-by: Balbir Singh <balbir@in.ibm.com>
> ---
> 
>  kernel/res_group/rgcs.c |    6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff -puN kernel/res_group/rgcs.c~container-res-groups-fix-parsing kernel/res_group/rgcs.c
> --- linux-2.6.19-rc2/kernel/res_group/rgcs.c~container-res-groups-fix-parsing	2006-11-09 23:08:10.000000000 +0530
> +++ linux-2.6.19-rc2-balbir/kernel/res_group/rgcs.c	2006-11-09 23:08:10.000000000 +0530
> @@ -241,6 +241,12 @@ ssize_t res_group_file_write(struct cont
>  	}
>  	buf[nbytes] = 0;	/* nul-terminate */
>  
> +	/*
> +	 * Ignore "\n". It might come in from echo(1)

Why not inform user he should call echo -n?

> +	 */
> +	if (buf[nbytes - 1] == '\n')
> +		buf[nbytes - 1] = 0;
> +
>  	container_manage_lock();
>  
>  	if (container_is_removed(cont)) {
> _
> 

That's the same patch as in [PATCH 1/8] mail. Did you attached
a wrong one?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 8/8] RSS controller support reclamation
  2006-11-10  8:54   ` Pavel Emelianov
@ 2006-11-10  9:16     ` Balbir Singh
  2006-11-10  9:29       ` Pavel Emelianov
  0 siblings, 1 reply; 23+ messages in thread
From: Balbir Singh @ 2006-11-10  9:16 UTC (permalink / raw)
  To: Pavel Emelianov
  Cc: Linux MM, dev, ckrm-tech, Linux Kernel Mailing List, haveblue, rohitseth

Pavel Emelianov wrote:
> Balbir Singh wrote:
>> Reclaim memory as we hit the max_shares limit. The code for reclamation
>> is inspired from Dave Hansen's challenged memory controller and from the
>> shrink_all_memory() code
>>
>> Reclamation can be triggered from two paths
>>
>> 1. While incrementing the RSS, we hit the limit of the container
>> 2. A container is resized, such that it's new limit is below its current
>>    RSS
>>
>> In (1) reclamation takes place in the background.
> 
> Hmm... This is not a hard limit in this case, right? And in case
> of overloaded system from the moment reclamation thread is woken
> up till the moment it starts shrinking zones container may touch
> too many pages...
> 
> That's not good.

Yes, please see my comments in the TODO's. Hard limits should be easy
to implement, it's a question of calling the correct routine based
on policy.

> 
>> TODO's
>>
>> 1. max_shares currently works like a soft limit. The RSS can grow beyond it's
>>    limit. One possible fix is to introduce a soft limit (reclaim when the
>>    container hits the soft limit) and fail when we hit the hard limit
> 
> Such soft limit doesn't help also. It just makes effects on
> low-loaded system smoother.
> 
> And what about a hard limit - how would you fail in page fault in
> case of limit hit? SIGKILL/SEGV is not an option - in this case we
> should run synchronous reclamation. This is done in beancounter
> patches v6 we've sent recently.
> 

I thought about running synchronous reclamation, but then did not follow
that approach, I was not sure if calling the reclaim routines from the
page fault context is a good thing to do. It's worth trying out, since
it would provide better control over rss.


>> Signed-off-by: Balbir Singh <balbir@in.ibm.com>
>> ---
>>
>> --- linux-2.6.19-rc2/mm/vmscan.c~container-memctlr-reclaim	2006-11-09 22:21:11.000000000 +0530
>> +++ linux-2.6.19-rc2-balbir/mm/vmscan.c	2006-11-09 22:21:11.000000000 +0530
>> @@ -36,6 +36,8 @@
>>  #include <linux/rwsem.h>
>>  #include <linux/delay.h>
>>  #include <linux/kthread.h>
>> +#include <linux/container.h>
>> +#include <linux/memctlr.h>
>>  
>>  #include <asm/tlbflush.h>
>>  #include <asm/div64.h>
>> @@ -65,6 +67,9 @@ struct scan_control {
>>  	int swappiness;
>>  
>>  	int all_unreclaimable;
>> +
>> +	int overlimit;
>> +	void *container;	/* Added as void * to avoid #ifdef's */
>>  };
>>  
>>  /*
>> @@ -811,6 +816,10 @@ force_reclaim_mapped:
>>  		cond_resched();
>>  		page = lru_to_page(&l_hold);
>>  		list_del(&page->lru);
>> +		if (!memctlr_page_reclaim(page, sc->container, sc->overlimit)) {
>> +			list_add(&page->lru, &l_active);
>> +			continue;
>> +		}
>>  		if (page_mapped(page)) {
>>  			if (!reclaim_mapped ||
>>  			    (total_swap_pages == 0 && PageAnon(page)) ||
> 
> [snip] See comment below.
> 
>>  
>> +#ifdef CONFIG_RES_GROUPS_MEMORY
>> +/*
>> + * Modelled after shrink_all_memory
>> + */
>> +unsigned long memctlr_shrink_container_memory(unsigned long nr_pages,
>> +						struct container *container,
>> +						int overlimit)
>> +{
>> +	unsigned long lru_pages;
>> +	unsigned long ret = 0;
>> +	int pass;
>> +	struct zone *zone;
>> +	struct scan_control sc = {
>> +		.gfp_mask = GFP_KERNEL,
>> +		.may_swap = 0,
>> +		.swap_cluster_max = nr_pages,
>> +		.may_writepage = 1,
>> +		.swappiness = vm_swappiness,
>> +		.overlimit = overlimit,
>> +		.container = container,
>> +	};
>> +
> 
> [snip]
> 
>> +		for (prio = DEF_PRIORITY; prio >= 0; prio--) {
>> +			unsigned long nr_to_scan = nr_pages - ret;
>> +
>> +			sc.nr_scanned = 0;
>> +			ret += shrink_all_zones(nr_to_scan, prio, pass, &sc);
>> +			if (ret >= nr_pages)
>> +				break;
>> +
>> +			if (sc.nr_scanned && prio < DEF_PRIORITY - 2)
>> +				blk_congestion_wait(WRITE, HZ / 10);
>> +		}
>> +	}
>> +	return ret;
>> +}
>> +#endif
> 
> Please correct me if I'm wrong, but does this reclamation work like
> "run over all the zones' lists searching for page whose controller
> is sc->container" ?
> 

Yeah, that's correct. The code can also reclaim memory from all over-the-limit
containers (by passing SC_OVERLIMIT_ALL). The idea behind using such a scheme
is to ensure that the global LRU list is not broken.


-- 
	Thanks for the feedback,
	Balbir Singh,
	Linux Technology Center,
	IBM Software Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 4/8] RSS controller accounting
  2006-11-10  9:06   ` Pavel Emelianov
@ 2006-11-10  9:29     ` Balbir Singh
  0 siblings, 0 replies; 23+ messages in thread
From: Balbir Singh @ 2006-11-10  9:29 UTC (permalink / raw)
  To: Pavel Emelianov
  Cc: Linux MM, dev, ckrm-tech, Linux Kernel Mailing List, haveblue, rohitseth

Pavel Emelianov wrote:
> Balbir Singh wrote:
>> Account RSS usage of a task and the associated container. The definition
>> of RSS was debated and discussed in the following thread
>>
>> 	http://lkml.org/lkml/2006/10/10/130
>>
>>
>> The code tracks all resident pages (including shared pages) as RSS. This patch
>> can easily adapt to the definition of RSS that will be agreed upon. This
>> implementation provides a proof of concept RSS controller.
>>
>> The accounting is inspired from Rohit Seth's container patches.
>>
>> TODO's
>>
>> 1. Merge file_rss and anon_rss tracking with the current rss tracking to
>>    maximize code reuse
>> 2. Add/remove RSS tracking as the definition of RSS evolves
>>
>>
>> Signed-off-by: Balbir Singh <balbir@in.ibm.com>
>> ---
>>
> 
> [snip]
> 
>> --- linux-2.6.19-rc2/kernel/res_group/memctlr.c~container-memctlr-acct	2006-11-09 21:46:22.000000000 +0530
>> +++ linux-2.6.19-rc2-balbir/kernel/res_group/memctlr.c	2006-11-09 21:47:06.000000000 +0530
>> @@ -37,6 +37,8 @@ static struct resource_group *root_rgrou
>>  static const char version[] = "0.01";
>>  static struct memctlr *memctlr_root;
>>  
>> +#define MEMCTLR_MAGIC	0xdededede
>> +
>>  struct mem_counter {
>>  	atomic_long_t	rss;
>>  };
>> @@ -49,6 +51,7 @@ struct memctlr {
>>  	/* Statistics */
>>  	int successes;
>>  	int failures;
>> +	int magic;
> 
> What is this magic for? Is it just for debugging?
> 

Yes

> [snip]
> 
>> +static inline struct memctlr *get_memctlr_from_page(struct page *page)
>> +{
>> +	struct resource_group *rgroup;
>> +	struct memctlr *res;
>> +
>> +	/*
>> +	 * Is the resource groups infrastructure initialized?
>> +	 */
>> +	if (!memctlr_root)
>> +		return NULL;
>> +
>> +	rcu_read_lock();
>> +	rgroup = (struct resource_group *)rcu_dereference(current->container);
>> +	rcu_read_unlock();
>> +
>> +	res = get_memctlr(rgroup);
>> +	if (!res)
>> +		return NULL;
>> +
>> +	BUG_ON(res->magic != MEMCTLR_MAGIC);
>> +	return res;
>> +}
> 
> I don't see how page passed to this function is involved into
> 'struct memctlr *res' determining. Could you comment this?
> 

Yeah, from page is a misnomer. We just use the current task task.
I'll fix the naming convention


> [snip]
> 
>> --- linux-2.6.19-rc2/mm/rmap.c~container-memctlr-acct	2006-11-09 21:46:22.000000000 +0530
>> +++ linux-2.6.19-rc2-balbir/mm/rmap.c	2006-11-09 21:46:22.000000000 +0530
>> @@ -537,6 +537,7 @@ void page_add_anon_rmap(struct page *pag
>>  	if (atomic_inc_and_test(&page->_mapcount))
>>  		__page_set_anon_rmap(page, vma, address);
>>  	/* else checking page index and mapping is racy */
>> +	memctlr_inc_rss(page);
>>  }
>>  
>>  /*
>> @@ -553,6 +554,7 @@ void page_add_new_anon_rmap(struct page 
>>  {
>>  	atomic_set(&page->_mapcount, 0); /* elevate count by 1 (starts at -1) */
>>  	__page_set_anon_rmap(page, vma, address);
>> +	memctlr_inc_rss(page);
>>  }
>>  
>>  /**
>> @@ -565,6 +567,7 @@ void page_add_file_rmap(struct page *pag
>>  {
>>  	if (atomic_inc_and_test(&page->_mapcount))
>>  		__inc_zone_page_state(page, NR_FILE_MAPPED);
>> +	memctlr_inc_rss(page);
> 
> Consider a task maps one file page 100 times in different places
> and touches 'all of them'. In this case I see that you'll get
> 100 in rss counter while real rss will be just 1.
> 

Hmmm... something for me to think about. Depending on how we define
RSS, the code for accounting should be easy to add & modify depending
on how we define RSS. But you bring up a very good point.

>>  }
>>  
>>  /**
>> @@ -596,8 +599,9 @@ void page_remove_rmap(struct page *page)
>>  		if (page_test_and_clear_dirty(page))
>>  			set_page_dirty(page);
>>  		__dec_zone_page_state(page,
>> -				PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED);
>> +				PageAnon(page) ?  NR_ANON_PAGES : NR_FILE_MAPPED);
> 
> What is this extra space after a question-mark for?

This is again something I changed and looks my undo was not very good.
Please ignore it, I'll remove it from the diff.

> 
>>  	}
>> +	memctlr_dec_rss(page, mm);
>>  }
>>  
>>  /*
>> diff -puN include/linux/rmap.h~container-memctlr-acct include/linux/rmap.h
>> --- linux-2.6.19-rc2/include/linux/rmap.h~container-memctlr-acct	2006-11-09 21:46:22.000000000 +0530
>> +++ linux-2.6.19-rc2-balbir/include/linux/rmap.h	2006-11-09 21:46:22.000000000 +0530
>> @@ -8,6 +8,7 @@
>>  #include <linux/slab.h>
>>  #include <linux/mm.h>
>>  #include <linux/spinlock.h>
>> +#include <linux/memctlr.h>
>>  
>>  /*
>>   * The anon_vma heads a list of private "related" vmas, to scan if
>> @@ -84,6 +85,7 @@ void page_remove_rmap(struct page *);
>>  static inline void page_dup_rmap(struct page *page)
>>  {
>>  	atomic_inc(&page->_mapcount);
>> +	memctlr_inc_rss(page);
>>  }
> 
> I'm not sure this is correct. page_dup_rmap() happens in the context
> of forking process and thus you'll increment rss counter on current.
> But this must be incremented at new task's counter, mustn't it?

This is fixed in the next patch container-memctlr-task-migration.
Thanks for spotting it.

-- 
	Thanks,
	Balbir Singh,
	Linux Technology Center,
	IBM Software Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 8/8] RSS controller support reclamation
  2006-11-10  9:16     ` Balbir Singh
@ 2006-11-10  9:29       ` Pavel Emelianov
  2006-11-10 12:42         ` Balbir Singh
  0 siblings, 1 reply; 23+ messages in thread
From: Pavel Emelianov @ 2006-11-10  9:29 UTC (permalink / raw)
  To: balbir
  Cc: Pavel Emelianov, Linux MM, dev, ckrm-tech,
	Linux Kernel Mailing List, haveblue, rohitseth

Balbir Singh wrote:

[snip]

>> And what about a hard limit - how would you fail in page fault in
>> case of limit hit? SIGKILL/SEGV is not an option - in this case we
>> should run synchronous reclamation. This is done in beancounter
>> patches v6 we've sent recently.
>>
> 
> I thought about running synchronous reclamation, but then did not follow
> that approach, I was not sure if calling the reclaim routines from the
> page fault context is a good thing to do. It's worth trying out, since

Each page fault potentially calls reclamation by allocating
required page with __GFP_IO | __GFP_FS bits set. Synchronous
reclamation in page fault is really normal.

[snip]

>> Please correct me if I'm wrong, but does this reclamation work like
>> "run over all the zones' lists searching for page whose controller
>> is sc->container" ?
>>
> 
> Yeah, that's correct. The code can also reclaim memory from all over-the-limit

OK. What if I have a container with 100 pages limit in a 4Gb
(~ million of pages) machine and this group starts reclaiming
its pages. In case this group uses its pages heavily they will
be at the beginning of an LRU list and reclamation code would
have to scan through all (million) pages before it finds proper
ones. This is not optimal!

> containers (by passing SC_OVERLIMIT_ALL). The idea behind using such a scheme
> is to ensure that the global LRU list is not broken.

isolate_lru_pages() helps in this. As far as I remember this
was introduced to reduce lru lock contention and keep lru
lists integrity.

In beancounters patches this is used to shrink BC's pages.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 7/8] RSS controller fix resource groups parsing
  2006-11-10  9:13   ` Pavel Emelianov
@ 2006-11-10  9:32     ` Balbir Singh
  0 siblings, 0 replies; 23+ messages in thread
From: Balbir Singh @ 2006-11-10  9:32 UTC (permalink / raw)
  To: Pavel Emelianov
  Cc: Linux MM, dev, Linux Kernel Mailing List, ckrm-tech, haveblue, rohitseth

Pavel Emelianov wrote:
> Balbir Singh wrote:
>> echo adds a "\n" to the end of a string. When this string is copied from
>> user space, we need to remove it, so that match_token() can parse
>> the user space string correctly
>>
>> Signed-off-by: Balbir Singh <balbir@in.ibm.com>
>> ---
>>
>>  kernel/res_group/rgcs.c |    6 ++++++
>>  1 file changed, 6 insertions(+)
>>
>> diff -puN kernel/res_group/rgcs.c~container-res-groups-fix-parsing kernel/res_group/rgcs.c
>> --- linux-2.6.19-rc2/kernel/res_group/rgcs.c~container-res-groups-fix-parsing	2006-11-09 23:08:10.000000000 +0530
>> +++ linux-2.6.19-rc2-balbir/kernel/res_group/rgcs.c	2006-11-09 23:08:10.000000000 +0530
>> @@ -241,6 +241,12 @@ ssize_t res_group_file_write(struct cont
>>  	}
>>  	buf[nbytes] = 0;	/* nul-terminate */
>>  
>> +	/*
>> +	 * Ignore "\n". It might come in from echo(1)
> 
> Why not inform user he should call echo -n?

Yes, but what if the user does not use it? We can't afford to do the
wrong thing. But it's a good point, I'll document and recommend that
the users use echo -n.


> 
>> +	 */
>> +	if (buf[nbytes - 1] == '\n')
>> +		buf[nbytes - 1] = 0;
>> +
>>  	container_manage_lock();
>>  
>>  	if (container_is_removed(cont)) {
>> _
>>
> 
> That's the same patch as in [PATCH 1/8] mail. Did you attached
> a wrong one?

Yeah... I moved this patch from #7 to #1 and did not remove it.
Sorry!

-- 
	Thanks,
	Balbir Singh,
	Linux Technology Center,
	IBM Software Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [ckrm-tech] [RFC][PATCH 6/8] RSS controller shares allocation
  2006-11-10  9:11   ` Pavel Emelianov
@ 2006-11-10 10:27     ` Balbir Singh
  2006-11-10 10:32       ` Pavel Emelianov
  0 siblings, 1 reply; 23+ messages in thread
From: Balbir Singh @ 2006-11-10 10:27 UTC (permalink / raw)
  To: Pavel Emelianov
  Cc: dev, ckrm-tech, haveblue, Linux Kernel Mailing List, Linux MM, rohitseth

Pavel Emelianov wrote:
> Balbir Singh wrote:
>> Support shares assignment and propagation.
>>
>> Signed-off-by: Balbir Singh <balbir@in.ibm.com>
>> ---
>>
>>  kernel/res_group/memctlr.c |   59 ++++++++++++++++++++++++++++++++++++++++++++-
>>  1 file changed, 58 insertions(+), 1 deletion(-)
> 
> [snip]
> 
>> +static void recalc_and_propagate(struct memctlr *res, struct memctlr *parres)
>> +{
>> +	struct resource_group *child = NULL;
>> +	int child_divisor;
>> +	u64 numerator;
>> +	struct memctlr *child_res;
>> +
>> +	if (parres) {
>> +		if (res->shares.max_shares == SHARE_DONT_CARE ||
>> +			parres->shares.max_shares == SHARE_DONT_CARE)
>> +			return;
>> +
>> +		child_divisor = parres->shares.child_shares_divisor;
>> +		if (child_divisor == 0)
>> +			return;
>> +
>> +		numerator = (u64)(parres->shares.unused_min_shares *
>> +				res->shares.max_shares);
>> +		do_div(numerator, child_divisor);
>> +		numerator = (u64)(parres->nr_pages * numerator);
>> +		do_div(numerator, SHARE_DEFAULT_DIVISOR);
>> +		res->nr_pages = numerator;
>> +	}
>> +
>> +	for_each_child(child, res->rgroup) {
>> +		child_res = get_memctlr(child);
>> +		BUG_ON(!child_res);
>> +		recalc_and_propagate(child_res, res);
> 
> Recursion? Won't it eat all the stack in case of a deep tree?

The depth of the hierarchy can be controlled. Recursion is needed
to do a DFS walk

> 
>> +	}
>> +
>> +}
>> +
>> +static void memctlr_shares_changed(struct res_shares *shares)
>> +{
>> +	struct memctlr *res, *parres;
>> +
>> +	res = get_memctlr_from_shares(shares);
>> +	if (!res)
>> +		return;
>> +
>> +	if (is_res_group_root(res->rgroup))
>> +		parres = NULL;
>> +	else
>> +		parres = get_memctlr((struct container *)res->rgroup->parent);
>> +
>> +	recalc_and_propagate(res, parres);
>> +}
>> +
>>  struct res_controller memctlr_rg = {
>>  	.name = res_ctlr_name,
>>  	.ctlr_id = NO_RES_ID,
>>  	.alloc_shares_struct = memctlr_alloc_instance,
>>  	.free_shares_struct = memctlr_free_instance,
>>  	.move_task = memctlr_move_task,
>> -	.shares_changed = NULL,
>> +	.shares_changed = memctlr_shares_changed,
> 
> I didn't find where in this patches this callback is called.

It's a part of the resource groups infrastructure. It's been ported
on top of Paul Menage's containers patches. The code can be easily
adapted to work directly with containers instead of resource groups
if required.



-- 

	Balbir Singh,
	Linux Technology Center,
	IBM Software Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [ckrm-tech] [RFC][PATCH 6/8] RSS controller shares allocation
  2006-11-10 10:27     ` [ckrm-tech] " Balbir Singh
@ 2006-11-10 10:32       ` Pavel Emelianov
  2006-11-10 12:55         ` Balbir Singh
  0 siblings, 1 reply; 23+ messages in thread
From: Pavel Emelianov @ 2006-11-10 10:32 UTC (permalink / raw)
  To: balbir
  Cc: Pavel Emelianov, dev, ckrm-tech, haveblue,
	Linux Kernel Mailing List, Linux MM, rohitseth

[snip]

>>> +	for_each_child(child, res->rgroup) {
>>> +		child_res = get_memctlr(child);
>>> +		BUG_ON(!child_res);
>>> +		recalc_and_propagate(child_res, res);
>> Recursion? Won't it eat all the stack in case of a deep tree?
> 
> The depth of the hierarchy can be controlled. Recursion is needed
> to do a DFS walk

That's another point against recursion - bad root can
crash the kernel... If we are about to give container's
users ability to make their own subtrees then we *must*
avoid recursion. There's an algorithm that allows one
to walk the tree like this w/o recursion.

[snip]

>> I didn't find where in this patches this callback is called.
> 
> It's a part of the resource groups infrastructure. It's been ported
> on top of Paul Menage's containers patches. The code can be easily
> adapted to work directly with containers instead of resource groups
> if required.


Could you please give me a link to the patch where this
is called?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 8/8] RSS controller support reclamation
  2006-11-10  9:29       ` Pavel Emelianov
@ 2006-11-10 12:42         ` Balbir Singh
  0 siblings, 0 replies; 23+ messages in thread
From: Balbir Singh @ 2006-11-10 12:42 UTC (permalink / raw)
  To: Pavel Emelianov
  Cc: Linux MM, dev, ckrm-tech, Linux Kernel Mailing List, haveblue, rohitseth

Pavel Emelianov wrote:
> Balbir Singh wrote:
> 
> [snip]
> 
>>> And what about a hard limit - how would you fail in page fault in
>>> case of limit hit? SIGKILL/SEGV is not an option - in this case we
>>> should run synchronous reclamation. This is done in beancounter
>>> patches v6 we've sent recently.
>>>
>> I thought about running synchronous reclamation, but then did not follow
>> that approach, I was not sure if calling the reclaim routines from the
>> page fault context is a good thing to do. It's worth trying out, since
> 
> Each page fault potentially calls reclamation by allocating
> required page with __GFP_IO | __GFP_FS bits set. Synchronous
> reclamation in page fault is really normal.

True. I don't know what I was thinking, thanks for making me think
straight.

> 
> [snip]
> 
>>> Please correct me if I'm wrong, but does this reclamation work like
>>> "run over all the zones' lists searching for page whose controller
>>> is sc->container" ?
>>>
>> Yeah, that's correct. The code can also reclaim memory from all over-the-limit
> 
> OK. What if I have a container with 100 pages limit in a 4Gb
> (~ million of pages) machine and this group starts reclaiming
> its pages. In case this group uses its pages heavily they will
> be at the beginning of an LRU list and reclamation code would
> have to scan through all (million) pages before it finds proper
> ones. This is not optimal!
> 

Yes, thats possible. The trade off is between

The cost associated with traversing that list while reclaiming
and the complexity associated with task migration. If we keep
a per-container list of pages, during task migration, you'll have
to migrate pages (of the task) from the list to the new container.

>> containers (by passing SC_OVERLIMIT_ALL). The idea behind using such a scheme
>> is to ensure that the global LRU list is not broken.
> 
> isolate_lru_pages() helps in this. As far as I remember this
> was introduced to reduce lru lock contention and keep lru
> lists integrity.
> 
> In beancounters patches this is used to shrink BC's pages.

I'll look at isolate_lru_pages() to see if the reclaim can be optimized.

Thanks for your feedback,


-- 

	Balbir Singh,
	Linux Technology Center,
	IBM Software Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [ckrm-tech] [RFC][PATCH 6/8] RSS controller shares allocation
  2006-11-10 10:32       ` Pavel Emelianov
@ 2006-11-10 12:55         ` Balbir Singh
  0 siblings, 0 replies; 23+ messages in thread
From: Balbir Singh @ 2006-11-10 12:55 UTC (permalink / raw)
  To: Pavel Emelianov
  Cc: dev, ckrm-tech, haveblue, Linux Kernel Mailing List, Linux MM, rohitseth

Pavel Emelianov wrote:
> [snip]
> 
>>>> +	for_each_child(child, res->rgroup) {
>>>> +		child_res = get_memctlr(child);
>>>> +		BUG_ON(!child_res);
>>>> +		recalc_and_propagate(child_res, res);
>>> Recursion? Won't it eat all the stack in case of a deep tree?
>> The depth of the hierarchy can be controlled. Recursion is needed
>> to do a DFS walk
> 
> That's another point against recursion - bad root can
> crash the kernel... If we are about to give container's
> users ability to make their own subtrees then we *must*
> avoid recursion. There's an algorithm that allows one
> to walk the tree like this w/o recursion.

Bad pointers are always bad, whether they are the root or
any other pointer. Tree traversal is a generic infrastructure issue
for any infrastructure that supports a hierarchy.

Are you talking about threaded trees? Yes, they can be traversed
without recursion. I need to recheck my DS reference to double
check.

> 
> [snip]
> 
>>> I didn't find where in this patches this callback is called.
>> It's a part of the resource groups infrastructure. It's been ported
>> on top of Paul Menage's containers patches. The code can be easily
>> adapted to work directly with containers instead of resource groups
>> if required.
> 
> 
> Could you please give me a link to the patch where this
> is called?

Please see

http://www.mail-archive.com/ckrm-tech@lists.sourceforge.net/msg03333.html

-- 

	Balbir Singh,
	Linux Technology Center,
	IBM Software Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2006-11-11  0:57 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-11-09 19:35 [RFC][PATCH 0/8] RSS controller for containers Balbir Singh
2006-11-09 19:35 ` [RFC][PATCH 1/8] Fix resource groups parsing, while assigning shares Balbir Singh
2006-11-09 19:35 ` [RFC][PATCH 2/8] RSS controller setup Balbir Singh
2006-11-09 19:35 ` [RFC][PATCH 3/8] RSS controller add callbacks Balbir Singh
2006-11-09 19:36 ` [RFC][PATCH 4/8] RSS controller accounting Balbir Singh
2006-11-10  9:06   ` Pavel Emelianov
2006-11-10  9:29     ` Balbir Singh
2006-11-09 19:36 ` [RFC][PATCH 5/8] RSS controller task migration support Balbir Singh
2006-11-09 19:36 ` [RFC][PATCH 6/8] RSS controller shares allocation Balbir Singh
2006-11-10  9:11   ` Pavel Emelianov
2006-11-10 10:27     ` [ckrm-tech] " Balbir Singh
2006-11-10 10:32       ` Pavel Emelianov
2006-11-10 12:55         ` Balbir Singh
2006-11-09 19:36 ` [RFC][PATCH 7/8] RSS controller fix resource groups parsing Balbir Singh
2006-11-10  9:13   ` Pavel Emelianov
2006-11-10  9:32     ` Balbir Singh
2006-11-09 19:36 ` [RFC][PATCH 8/8] RSS controller support reclamation Balbir Singh
2006-11-09 19:45   ` Arjan van de Ven
2006-11-10  1:56     ` [ckrm-tech] " Balbir Singh
2006-11-10  8:54   ` Pavel Emelianov
2006-11-10  9:16     ` Balbir Singh
2006-11-10  9:29       ` Pavel Emelianov
2006-11-10 12:42         ` Balbir Singh

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox