* [NUMA] Display and modify the memory policy of a process through /proc/<pid>/numa_policy
From: Christoph Lameter @ 2005-07-15 1:39 UTC
To: linux-mm; +Cc: linux-ia64, pj
This patch adds a new proc entry for each process called "numa_policy".
When read, this file outputs a text string describing the memory policy of the process.
A new policy may be written to "numa_policy" in order to change the memory
policy for the process. The following strings may be written to
/proc/<pid>/numa_policy:
default -> Reset allocation policy to default
prefer=<node> -> Prefer allocation on specified node
interleave={nodelist} -> Interleaved allocation on the given nodes
bind={zonelist} -> Restrict allocation to the specified zones.
Zones are specified either by giving only the node number or by using the
notation <node>/<zone name>, e.g. 3/normal, 1/high or 0/dma.
Additionally, the patch adds write capability to "numa_maps". One can write
a VMA address followed by a policy to that file to change the mempolicy of an
individual virtual memory area, e.g.
echo "2aaaaaaab000 bind={0/Normal}" >numa_maps
This is compatible with the output format of numa_maps.
These functions are a core requirement for managing the memory allocation
of processes dynamically. This may be done manually by the administrator as described
here, or by a batch process manager that manages the memory on a NUMA system.
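As a rough sketch of how such a manager might drive the new file from C (the pid,
policy string and error handling below are only illustrative):

/* Illustrative sketch only: set the task-wide policy of a process through
 * the /proc/<pid>/numa_policy file added by this patch. */
#include <stdio.h>
#include <errno.h>

static int set_task_numa_policy(int pid, const char *policy)
{
        char path[64];
        FILE *f;
        int ret = 0;

        snprintf(path, sizeof(path), "/proc/%d/numa_policy", pid);
        f = fopen(path, "w");
        if (!f)
                return -errno;
        /* policy is e.g. "default", "prefer=1", "interleave={2,3}" or "bind={3/normal}" */
        if (fputs(policy, f) == EOF)
                ret = -EIO;
        if (fclose(f) && !ret)
                ret = -errno;
        return ret;
}

int main(void)
{
        /* Example: prefer node 1 for future allocations of pid 12024 */
        return set_task_numa_policy(12024, "prefer=1") ? 1 : 0;
}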
The patch requires my numa_maps patch from Andrew Morton's tree.
Here is an example. We want to reorganize how process 12024 is allocating memory.
We would like to allocate most pages on node 1. However, we would like the
heap pages to be allocated interleaved on nodes 2 and 3 to allow better throughput.
cd /proc/12024/
echo "prefer=1" >numa_policy
margin:/proc/12024 # cat numa_maps
00000000 prefer=1 MaxRef=0 Pages=0 Mapped=0
2000000000000000 prefer=1 MaxRef=42 Pages=11 Mapped=11 N0=3 N1=2 N2=2 N3=4
2000000000038000 prefer=1 MaxRef=1 Pages=2 Mapped=2 Anon=2 N1=2
2000000000040000 prefer=1 MaxRef=1 Pages=1 Mapped=1 Anon=1 N1=1
2000000000058000 prefer=1 MaxRef=42 Pages=59 Mapped=59 N0=14 N1=16 N2=15 N3=14
2000000000260000 prefer=1 MaxRef=0 Pages=0 Mapped=0
2000000000268000 prefer=1 MaxRef=1 Pages=2 Mapped=2 Anon=2 N1=2
2000000000274000 prefer=1 MaxRef=1 Pages=3 Mapped=3 Anon=3 N1=3
2000000000280000 prefer=1 MaxRef=8 Pages=3 Mapped=3 N0=3
2000000000300000 prefer=1 MaxRef=8 Pages=2 Mapped=2 N0=2
2000000000318000 prefer=1 MaxRef=1 Pages=1 Mapped=1 Anon=1 N1=1
4000000000000000 prefer=1 MaxRef=6 Pages=2 Mapped=2 N1=2
6000000000004000 prefer=1 MaxRef=1 Pages=1 Mapped=1 Anon=1 N1=1
6000000000008000 prefer=1 MaxRef=1 Pages=1 Mapped=1 Anon=1 N1=1
60000fff7fffc000 prefer=1 MaxRef=1 Pages=1 Mapped=1 Anon=1 N1=1
60000ffffff3c000 prefer=1 MaxRef=1 Pages=1 Mapped=1 Anon=1 N1=1
margin:/proc/12024 # cat maps
00000000-00004000 r--p 00000000 00:00 0
2000000000000000-200000000002c000 r-xp 00000000 08:04 516 /lib/ld-2.3.3.so
2000000000038000-2000000000040000 rw-p 00028000 08:04 516 /lib/ld-2.3.3.so
2000000000040000-2000000000044000 rw-p 2000000000040000 00:00 0
2000000000058000-2000000000260000 r-xp 00000000 08:04 54707842 /lib/tls/libc.so.6.1
2000000000260000-2000000000268000 ---p 00208000 08:04 54707842 /lib/tls/libc.so.6.1
2000000000268000-2000000000274000 rw-p 00200000 08:04 54707842 /lib/tls/libc.so.6.1
2000000000274000-2000000000280000 rw-p 2000000000274000 00:00 0
2000000000280000-20000000002b4000 r--p 00000000 08:04 9126923 /usr/lib/locale/en_US.utf8/LC_CTYPE
2000000000300000-2000000000308000 r--s 00000000 08:04 60071467 /usr/lib/gconv/gconv-modules.cache
2000000000318000-2000000000328000 rw-p 2000000000318000 00:00 0
4000000000000000-4000000000008000 r-xp 00000000 08:04 29576399 /sbin/mingetty
6000000000004000-6000000000008000 rw-p 00004000 08:04 29576399 /sbin/mingetty
6000000000008000-600000000002c000 rw-p 6000000000008000 00:00 0 [heap]
60000fff7fffc000-60000fff80000000 rw-p 60000fff7fffc000 00:00 0
60000ffffff3c000-60000ffffff90000 rw-p 60000ffffff3c000 00:00 0 [stack]
a000000000000000-a000000000020000 ---p 00000000 00:00 0 [vdso]
echo "2xxxx interleave={2,3}" >numa_maps
margin:/proc/12024 # cat numa_maps
00000000 prefer=1 MaxRef=0 Pages=0 Mapped=0
2000000000000000 prefer=1 MaxRef=42 Pages=11 Mapped=11 N0=3 N1=2 N2=2 N3=4
2000000000038000 prefer=1 MaxRef=1 Pages=2 Mapped=2 Anon=2 N1=2
2000000000040000 prefer=1 MaxRef=1 Pages=1 Mapped=1 Anon=1 N1=1
2000000000058000 prefer=1 MaxRef=42 Pages=59 Mapped=59 N0=14 N1=16 N2=15 N3=14
2000000000260000 prefer=1 MaxRef=0 Pages=0 Mapped=0
2000000000268000 prefer=1 MaxRef=1 Pages=2 Mapped=2 Anon=2 N1=2
2000000000274000 prefer=1 MaxRef=1 Pages=3 Mapped=3 Anon=3 N1=3
2000000000280000 prefer=1 MaxRef=8 Pages=3 Mapped=3 N0=3
2000000000300000 prefer=1 MaxRef=8 Pages=2 Mapped=2 N0=2
2000000000318000 prefer=1 MaxRef=1 Pages=1 Mapped=1 Anon=1 N1=1
4000000000000000 prefer=1 MaxRef=6 Pages=2 Mapped=2 N1=2
6000000000004000 prefer=1 MaxRef=1 Pages=1 Mapped=1 Anon=1 N1=1
6000000000008000 interleave={2,3} MaxRef=1 Pages=1 Mapped=1 Anon=1 N1=1
60000fff7fffc000 prefer=1 MaxRef=1 Pages=1 Mapped=1 Anon=1 N1=1
60000ffffff3c000 prefer=1 MaxRef=1 Pages=1 Mapped=1 Anon=1 N1=1
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Index: linux-2.6.13-rc3/fs/proc/base.c
===================================================================
--- linux-2.6.13-rc3.orig/fs/proc/base.c 2005-07-15 00:40:17.000000000 +0000
+++ linux-2.6.13-rc3/fs/proc/base.c 2005-07-15 01:00:26.000000000 +0000
@@ -65,7 +65,10 @@ enum pid_directory_inos {
PROC_TGID_STAT,
PROC_TGID_STATM,
PROC_TGID_MAPS,
+#ifdef CONFIG_NUMA
PROC_TGID_NUMA_MAPS,
+ PROC_TGID_NUMA_POLICY,
+#endif
PROC_TGID_MOUNTS,
PROC_TGID_WCHAN,
#ifdef CONFIG_SCHEDSTATS
@@ -103,7 +106,10 @@ enum pid_directory_inos {
PROC_TID_STAT,
PROC_TID_STATM,
PROC_TID_MAPS,
+#ifdef CONFIG_NUMA
PROC_TID_NUMA_MAPS,
+ PROC_TID_NUMA_POLICY,
+#endif
PROC_TID_MOUNTS,
PROC_TID_WCHAN,
#ifdef CONFIG_SCHEDSTATS
@@ -148,6 +154,7 @@ static struct pid_entry tgid_base_stuff[
E(PROC_TGID_MAPS, "maps", S_IFREG|S_IRUGO),
#ifdef CONFIG_NUMA
E(PROC_TGID_NUMA_MAPS, "numa_maps", S_IFREG|S_IRUGO),
+ E(PROC_TGID_NUMA_POLICY, "numa_policy", S_IFREG|S_IRUGO|S_IWUSR),
#endif
E(PROC_TGID_MEM, "mem", S_IFREG|S_IRUSR|S_IWUSR),
#ifdef CONFIG_SECCOMP
@@ -187,6 +194,7 @@ static struct pid_entry tid_base_stuff[]
E(PROC_TID_MAPS, "maps", S_IFREG|S_IRUGO),
#ifdef CONFIG_NUMA
E(PROC_TID_NUMA_MAPS, "numa_maps", S_IFREG|S_IRUGO),
+ E(PROC_TID_NUMA_POLICY, "numa_policy", S_IFREG|S_IRUGO|S_IWUSR),
#endif
E(PROC_TID_MEM, "mem", S_IFREG|S_IRUSR|S_IWUSR),
#ifdef CONFIG_SECCOMP
@@ -524,24 +532,8 @@ static struct file_operations proc_maps_
};
#ifdef CONFIG_NUMA
-extern struct seq_operations proc_pid_numa_maps_op;
-static int numa_maps_open(struct inode *inode, struct file *file)
-{
- struct task_struct *task = proc_task(inode);
- int ret = seq_open(file, &proc_pid_numa_maps_op);
- if (!ret) {
- struct seq_file *m = file->private_data;
- m->private = task;
- }
- return ret;
-}
-
-static struct file_operations proc_numa_maps_operations = {
- .open = numa_maps_open,
- .read = seq_read,
- .llseek = seq_lseek,
- .release = seq_release,
-};
+extern struct file_operations proc_numa_maps_operations;
+extern struct file_operations proc_numa_policy_operations;
#endif
extern struct seq_operations mounts_op;
@@ -1558,6 +1550,10 @@ static struct dentry *proc_pident_lookup
case PROC_TGID_NUMA_MAPS:
inode->i_fop = &proc_numa_maps_operations;
break;
+ case PROC_TID_NUMA_POLICY:
+ case PROC_TGID_NUMA_POLICY:
+ inode->i_fop = &proc_numa_policy_operations;
+ break;
#endif
case PROC_TID_MEM:
case PROC_TGID_MEM:
Index: linux-2.6.13-rc3/mm/mempolicy.c
===================================================================
--- linux-2.6.13-rc3.orig/mm/mempolicy.c 2005-07-15 00:40:17.000000000 +0000
+++ linux-2.6.13-rc3/mm/mempolicy.c 2005-07-15 01:01:48.000000000 +0000
@@ -1170,3 +1170,214 @@ void numa_default_policy(void)
{
sys_set_mempolicy(MPOL_DEFAULT, NULL, 0);
}
+
+/*
+ * Convert a mempolicy into a string.
+ * Returns the number of characters in buffer (if positive)
+ * or an error (negative)
+ */
+int mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
+{
+ char *p = buffer;
+ char *e = buffer + maxlen;
+ int first = 1;
+ int node;
+ struct zone **z;
+
+ if (!pol || pol->policy == MPOL_DEFAULT) {
+ strcpy(buffer,"default");
+ return 7;
+ }
+
+ if (pol->policy == MPOL_PREFERRED) {
+ if (e < p + 8 /* fixed string size */ + 4 /* max len of node number */)
+ return -ENOSPC;
+
+ sprintf(p, "prefer=%d", pol->v.preferred_node);
+ return strlen(buffer);
+
+ } else if (pol->policy == MPOL_BIND) {
+
+ if (e < p + 9 + 4)
+ return -ENOSPC;
+
+ p+= sprintf(p, "bind={");
+
+ for (z = pol->v.zonelist->zones; *z ; *z++) {
+ if (!first)
+ *p++ = ',';
+ else
+ first = 0;
+ if (e < p + 2 + 4 + strlen((*z)->name))
+ return -ENOSPC;
+ p += sprintf(p, "%d/%s", (*z)->zone_pgdat->node_id, (*z)->name);
+ }
+
+ *p++ = '}';
+ *p++ = 0;
+ return p-buffer;
+
+ } else if (pol->policy == MPOL_INTERLEAVE) {
+
+ if (e < p + 14 + 4)
+ return -ENOSPC;
+
+ p += sprintf(p, "interleave={");
+
+ for_each_node(node)
+ if (test_bit(node, pol->v.nodes)) {
+ if (!first)
+ *p++ = ',';
+ else
+ first = 0;
+ if (e < p + 2 /* min bytes that follow */ + 4 /* node number */)
+ return -ENOSPC;
+ p += sprintf(p, "%d", node);
+ }
+
+ *p++ = '}';
+ *p++ = 0;
+ return p-buffer;
+ }
+ BUG();
+ return -EFAULT;
+}
+
+/*
+ * Convert a representation of a memory policy from text
+ * form to binary.
+ *
+ * Returns either a memory policy or NULL for error.
+ */
+struct mempolicy *str_to_mpol(char *buffer, char **end)
+{
+ char *p;
+ struct mempolicy *pol;
+ int node;
+ size_t size;
+
+ if (strnicmp(buffer, "default", 7) == 0) {
+
+ *end = buffer + 7;
+ return &default_policy;
+
+ }
+
+ pol = __mpol_copy(&default_policy);
+ if (IS_ERR(pol))
+ return NULL;
+
+ if (strnicmp(buffer, "prefer=", 7) == 0) {
+
+ node = simple_strtoul(buffer + 7, &p, 10);
+ if (node >= MAX_NUMNODES || !node_online(node))
+ goto out;
+
+ pol->policy = MPOL_PREFERRED;
+ pol->v.preferred_node = node;
+
+ } else if (strnicmp(buffer, "interleave={", 12) == 0) {
+
+ pol->policy = MPOL_INTERLEAVE;
+ p = buffer + 12;
+ bitmap_zero(pol->v.nodes, MAX_NUMNODES);
+
+ do {
+ node = simple_strtoul(p, &p, 10);
+
+ /* Check here for cpuset restrictions on nodes */
+ if (node >= MAX_NUMNODES || !node_online(node))
+ goto out;
+ set_bit(node, pol->v.nodes);
+
+ } while (*p++ == ',');
+
+ if (p[-1] != '}' || bitmap_empty(pol->v.nodes, MAX_NUMNODES))
+ goto out;
+
+ } else if (strnicmp(buffer, "bind={", 6) == 0) {
+
+ struct zonelist *zonelist = kmalloc(sizeof(struct zonelist), GFP_KERNEL);
+ struct zone **z = zonelist->zones;
+ struct zonelist *new;
+
+ pol->policy = MPOL_BIND;
+ p = buffer + 6;
+
+ do {
+ pg_data_t *pgdat;
+ struct zone *zone = NULL;
+
+ node = simple_strtoul(p, &p, 10);
+
+ /* Try to find the pgdat for the specified node */
+ for_each_pgdat(pgdat) {
+ if (pgdat->node_id == node) {
+ zone = pgdat->node_zones;
+ break;
+ }
+ }
+ if (!zone || node >= MAX_NUMNODES || !node_online(node))
+ goto bind_out;
+
+ /*
+ * If there is no zone specified then take the first
+ * zone. Otherwise we need to look for a matching name
+ */
+ if (*p == '/') {
+ char *start = ++p;
+ struct zone *q;
+ struct zone *found = NULL;
+
+ /* Find end of the zone name */
+ while (*p && *p != ',' && *p != '}')
+ p++;
+
+ if (start == p)
+ goto bind_out;
+ /*
+ * Go through the zones in this node and check
+ * if any have the name we are looking for
+ */
+ for(q = zone; q < zone + MAX_NR_ZONES; q++) {
+ if (strnicmp(q->name, start, p-start) == 0) {
+ found = q;
+ break;
+ }
+ }
+ zone = found;
+ }
+
+ if (!zone || z > zonelist->zones + MAX_NUMNODES * MAX_NR_ZONES)
+ goto bind_out;
+ *z++ = zone;
+
+ } while (*p++ == ',');
+
+ if (p[-1] != '}') {
+bind_out:
+ kfree(zonelist);
+ goto out;
+ }
+
+ /* Allocate only the necessary elements */
+ *z++ = NULL;
+ size = (z - zonelist->zones) * sizeof(struct zonelist *);
+ new = kmalloc(size, GFP_KERNEL);
+ if (!new)
+ goto out;
+ memcpy(new, zonelist, size);
+ kfree(zonelist);
+
+ pol->v.zonelist = new;
+
+ } else {
+out:
+ __mpol_free(pol);
+ return NULL;
+ }
+
+ *end = p;
+ return pol;
+}
+
Index: linux-2.6.13-rc3/fs/proc/task_mmu.c
===================================================================
--- linux-2.6.13-rc3.orig/fs/proc/task_mmu.c 2005-07-15 00:40:17.000000000 +0000
+++ linux-2.6.13-rc3/fs/proc/task_mmu.c 2005-07-15 01:00:26.000000000 +0000
@@ -286,15 +286,15 @@ static struct numa_maps *get_numa_maps(c
return md;
}
+#define MAX_MEMPOL_STRING_SIZE 50
+
static int show_numa_map(struct seq_file *m, void *v)
{
struct task_struct *task = m->private;
struct vm_area_struct *vma = v;
- struct mempolicy *pol;
struct numa_maps *md;
- struct zone **z;
int n;
- int first;
+ char buffer[MAX_MEMPOL_STRING_SIZE];
if (!vma->vm_mm)
return 0;
@@ -303,46 +303,11 @@ static int show_numa_map(struct seq_file
if (!md)
return 0;
- seq_printf(m, "%08lx", vma->vm_start);
- pol = get_vma_policy(task, vma, vma->vm_start);
- /* Print policy */
- switch (pol->policy) {
- case MPOL_PREFERRED:
- seq_printf(m, " prefer=%d", pol->v.preferred_node);
- break;
- case MPOL_BIND:
- seq_printf(m, " bind={");
- first = 1;
- for (z = pol->v.zonelist->zones; *z; z++) {
-
- if (!first)
- seq_putc(m, ',');
- else
- first = 0;
- seq_printf(m, "%d/%s", (*z)->zone_pgdat->node_id,
- (*z)->name);
- }
- seq_putc(m, '}');
- break;
- case MPOL_INTERLEAVE:
- seq_printf(m, " interleave={");
- first = 1;
- for_each_node(n) {
- if (test_bit(n, pol->v.nodes)) {
- if (!first)
- seq_putc(m,',');
- else
- first = 0;
- seq_printf(m, "%d",n);
- }
- }
- seq_putc(m, '}');
- break;
- default:
- seq_printf(m," default");
- break;
- }
- seq_printf(m, " MaxRef=%lu Pages=%lu Mapped=%lu",
+ if (mpol_to_str(buffer, sizeof(buffer), get_vma_policy(task, vma, vma->vm_start)) <0)
+ return 0;
+
+ seq_printf(m, "%08lx %s MaxRef=%lu Pages=%lu Mapped=%lu",
+ vma->vm_start, buffer,
md->mapcount_max, md->pages, md->mapped);
if (md->anon)
seq_printf(m," Anon=%lu",md->anon);
@@ -364,4 +329,134 @@ struct seq_operations proc_pid_numa_maps
.stop = m_stop,
.show = show_numa_map
};
+
+/*
+ * Retrieval and setting of the memory policy for a task
+ */
+static ssize_t numa_policy_read(struct file *file, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct task_struct *task = proc_task(file->f_dentry->d_inode);
+ char buffer[MAX_MEMPOL_STRING_SIZE]; /* Should this really be on the stack ?? */
+ size_t len;
+ loff_t __ppos = *ppos;
+
+ len = mpol_to_str(buffer, MAX_MEMPOL_STRING_SIZE, task->mempolicy);
+ if (__ppos >= len)
+ return 0;
+ if (count > len-__ppos)
+ count = len-__ppos;
+ if (copy_to_user(buf, buffer + __ppos, count))
+ return -EFAULT;
+ *ppos = __ppos + count;
+ return count;
+}
+
+static ssize_t numa_policy_write(struct file *file, const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct task_struct *task = proc_task(file->f_dentry->d_inode);
+ char buffer[MAX_MEMPOL_STRING_SIZE], *end;
+ struct mempolicy *pol, *old_policy;
+
+ if (!capable(CAP_SYS_RESOURCE))
+ return -EPERM;
+ memset(buffer, 0, MAX_MEMPOL_STRING_SIZE);
+ if (count > MAX_MEMPOL_STRING_SIZE || !task->mm)
+ return -EINVAL;
+ if (copy_from_user(buffer, buf, count))
+ return -EFAULT;
+
+ pol = str_to_mpol(buffer, &end);
+ if (!pol)
+ return -EINVAL;
+ if (*end == '\n')
+ end++;
+
+ old_policy = task->mempolicy;
+
+
+ if (!mpol_equal(pol, old_policy)) {
+ if (pol->policy == MPOL_DEFAULT)
+ pol = NULL;
+
+ task->mempolicy = pol;
+ mpol_free(old_policy);
+ } else
+ mpol_free(pol);
+
+ return end - buffer;
+}
+
+
+struct file_operations proc_numa_policy_operations = {
+ .read = numa_policy_read,
+ .write = numa_policy_write
+};
+
+static ssize_t numa_vma_policy_write(struct file *file, const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct task_struct *task = proc_task(file->f_dentry->d_inode);
+ struct vm_area_struct *vma;
+ unsigned long addr;
+ char buffer[MAX_MEMPOL_STRING_SIZE];
+ char *p, *end;
+ struct mempolicy *pol, *old_policy;
+
+ if (!capable(CAP_SYS_RESOURCE))
+ return -EPERM;
+ memset(buffer, 0, MAX_MEMPOL_STRING_SIZE);
+ if (count > MAX_MEMPOL_STRING_SIZE || !task->mm)
+ return -EINVAL;
+ if (copy_from_user(buffer, buf, count))
+ return -EFAULT;
+
+ /* Extract VMA address and find the vma */
+ addr = simple_strtoul(buffer, &p, 16);
+ if (*p++ != ' ')
+ return -EINVAL;
+ vma = find_vma(task->mm, addr);
+ if (!vma || vma->vm_end < addr)
+ return -EINVAL;
+
+ pol = str_to_mpol(p, &end);
+ if (!pol)
+ return -EINVAL;
+ if (*end == '\n')
+ end++;
+
+ old_policy = vma->vm_policy;
+
+ if (!mpol_equal(pol, old_policy)) {
+ if (pol->policy == MPOL_DEFAULT)
+ pol = NULL;
+
+ vma->vm_policy = pol;
+ mpol_free(old_policy);
+ } else
+ mpol_free(pol);
+
+ return end - buffer;
+}
+
+static int numa_maps_open(struct inode *inode, struct file *file)
+{
+ struct task_struct *task = proc_task(inode);
+ int ret = seq_open(file, &proc_pid_numa_maps_op);
+ if (!ret) {
+ struct seq_file *m = file->private_data;
+ m->private = task;
+ }
+ return ret;
+}
+
+struct file_operations proc_numa_maps_operations = {
+ .open = numa_maps_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+ .write = numa_vma_policy_write
+};
+
#endif
Index: linux-2.6.13-rc3/include/linux/mempolicy.h
===================================================================
--- linux-2.6.13-rc3.orig/include/linux/mempolicy.h 2005-07-15 00:40:17.000000000 +0000
+++ linux-2.6.13-rc3/include/linux/mempolicy.h 2005-07-15 01:00:26.000000000 +0000
@@ -156,6 +156,10 @@ struct mempolicy *get_vma_policy(struct
extern void numa_default_policy(void);
extern void numa_policy_init(void);
+/* Conversion functions for /proc interface */
+int mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol);
+struct mempolicy *str_to_mpol(char *buffer, char **end);
+
#else
struct mempolicy {};
* Re: [NUMA] Display and modify the memory policy of a process through /proc/<pid>/numa_policy
From: Paul Jackson @ 2005-07-15 3:50 UTC
To: Christoph Lameter; +Cc: linux-mm, linux-ia64
This patch puzzles me. Some of my questions are probably answered
by the code, but I tend to read the commentary and comments first,
to "get my bearings." I failed to get said bearings ... as you
shall soon realize.
How does this patch relate to Andi's mbind/mempolicy support?
How does it relate to cpusets?
What are the essential feature(s) not provided by the above which
this patch adds?
What are some situations/scenarios in which this facility or additional
features would be useful, and how would it be used therein?
Why yet another parser/displayer for lists of numbers, rather than
use lib/bitmap.c: bitmap_scnlistprintf, bitmap_parselist?
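(For illustration only, using those helpers for the interleave node mask might
look roughly like this; this is a sketch, not code from the patch:)

/* Sketch only: format/parse the MPOL_INTERLEAVE node mask with the
 * lib/bitmap.c list helpers instead of the open-coded loops in the patch. */
#include <linux/bitmap.h>
#include <linux/nodemask.h>
#include <linux/mempolicy.h>

static int interleave_nodes_to_str(char *buf, int buflen, struct mempolicy *pol)
{
        /* emits list syntax such as "2-3" or "0,2-4" */
        return bitmap_scnlistprintf(buf, buflen, pol->v.nodes, MAX_NUMNODES);
}

static int interleave_nodes_from_str(const char *buf, struct mempolicy *pol)
{
        /* accepts the same "2-3" / "0,2-4" list syntax, returns 0 or -errno */
        return bitmap_parselist(buf, pol->v.nodes, MAX_NUMNODES);
}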
This patch seems to be closely related to the mempolicy work (does
it just provide another way to manipulate and display such)? But it
uses a file system interface, rather than a system call interface.
Until now, mempolicy has used system calls, and cpusets a file-system
style API. This patch seems to remove this nice, simple, albeit
inessential, distinction.
The comment:
/* Check here for cpuset restrictions on nodes */
doesn't seem to be followed by any code involving cpusets. I guess
this is an "XXX" (open issue) comment, and not a comment on the code
that seems to follow it.
The key question I have is thus: this seems to be more additional
detail in an API and implementing code than I understand the
requirement for. To repeat one of the questions above - what are
the essential feature(s) this patch adds?
What are the options that one could consider for the API style,
and how do you end up recommending this particular choice?
Could you speak to the motivation for setting this policy per-task,
rather than per-cpuset? I suspect that there is good motivation for
this choice, but I'd like to see it spelled out.
I'd like to think that some way could be found to accomplish this
patch with quite a bit less "fussy parsing" code. Such code is a
pretty much guaranteed pain in the backside, both to code against from
user space and to maintain in the kernel.
I'm a little surprised one can just force the mempolicy of another
task's vma without any interlocking/synchronization that I noticed:
+static ssize_t numa_vma_policy_write(struct file *file, const char __user *buf,
+ size_t count, loff_t *ppos)
+{
...
+ old_policy = vma->vm_policy;
+
+ if (!mpol_equal(pol, old_policy)) {
+ if (pol->policy == MPOL_DEFAULT)
+ pol = NULL;
+
+ vma->vm_policy = pol;
+ mpol_free(old_policy);
+ } else
+ mpol_free(pol);
But I'm no expert in this code, so perhaps the above is safe.
How many ways do we end up with to query a task's mempolicy?
Superficially, it seems to include get_mempolicy, last weeks
numa_maps patch and this patch's support for reading the
new /proc/<pid>/numa_policy files. Are all three mechanisms
needed, and do they each provide something valuable and unique?
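(For comparison, the system call route - using the numaif.h wrapper shipped with
numactl - can only ask about the calling task, roughly like this sketch:)

/* Sketch only: a task querying its own policy with get_mempolicy().
 * Uses the numaif.h wrapper from the numactl package; link with -lnuma. */
#include <numaif.h>
#include <stdio.h>

int main(void)
{
        int mode;
        unsigned long nodes[16] = { 0 };        /* room for 1024 node bits */

        if (get_mempolicy(&mode, nodes, sizeof(nodes) * 8, NULL, 0) < 0) {
                perror("get_mempolicy");
                return 1;
        }
        printf("mode=%d first nodemask word=%lx\n", mode, nodes[0]);
        return 0;
}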
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* RE: [NUMA] Display and modify the memory policy of a process through /proc/<pid>/numa_policy
From: Chen, Kenneth W @ 2005-07-15 4:52 UTC
To: 'Christoph Lameter', linux-mm; +Cc: linux-ia64, pj
Christoph Lameter wrote on Thursday, July 14, 2005 6:40 PM
> This patch adds a new proc entry for each process called "numa_policy".
>
> When read, this file outputs a text string describing the memory policy
> of the process.
> A new policy may be written to "numa_policy" in order to change the memory
> policy for the process. The following strings may be written to
> /proc/<pid>/numa_policy:
>
> Additionally, the patch adds write capability to "numa_maps". One can write
> a VMA address followed by a policy to that file to change the mempolicy of an
> individual virtual memory area, e.g.
This looks a lot like back door access to libnuma and numactl capability.
Are you sure libnuma and numactl won't suit your needs?
* RE: [NUMA] Display and modify the memory policy of a process through /proc/<pid>/numa_policy
From: Christoph Lameter @ 2005-07-15 5:07 UTC
To: Chen, Kenneth W; +Cc: linux-mm, linux-ia64, pj
On Thu, 14 Jul 2005, Chen, Kenneth W wrote:
> > Additionally, the patch adds write capability to "numa_maps". One can write
> > a VMA address followed by a policy to that file to change the mempolicy of an
> > individual virtual memory area, e.g.
>
> This looks a lot like back door access to libnuma and numactl capability.
> Are you sure libnuma and numactl won't suit your needs?
The functionality offered here is different. numactl's main concern is
starting processes. libnuma is mostly concerned with a process
controlling its own memory allocation.
This is an implementation that deals with monitoring and managing running
processes. For an effective batch scheduler we need outside control
over memory policy. It needs to be easy to see what is going on in the
system (numa_maps) and easy to manipulate (numa_policy).
These two control files allow the monitoring and control of the memory policy
of an existing process down to the vma level.
I plan to add another patch soon that will also tie page migration
into this. Basically this will be implemented by allowing one to do
echo "<vma-address> N<sourcenode>(<nr-pages>) <targetnode>"
>/proc/<pid>/numa_maps
(echoing the output format of numa_maps).
Doing page migration at the vma level avoids the necessity to analyze the
vma's of a process in kernel space and simplifies the implementation of
page migration significantly. A batch scheduler or a system
administrator can control individual vma's. They can make their own
decisions about whether a shared library should be migrated or not, etc.
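To make the proposed format concrete, a user space helper would compose such a
request roughly like this (sketch only; the write format above is proposed, not
implemented yet, and the numbers are examples):

/* Sketch only: compose the *proposed* migration request
 * "<vma-address> N<sourcenode>(<nr-pages>) <targetnode>" for numa_maps. */
#include <stdio.h>

static int request_vma_migration(int pid, unsigned long vma_start,
                                 int source_node, unsigned long nr_pages,
                                 int target_node)
{
        char path[64];
        FILE *f;
        int ret;

        snprintf(path, sizeof(path), "/proc/%d/numa_maps", pid);
        f = fopen(path, "w");
        if (!f)
                return -1;
        ret = fprintf(f, "%lx N%d(%lu) %d\n",
                      vma_start, source_node, nr_pages, target_node);
        fclose(f);
        return ret < 0 ? -1 : 0;
}

/* e.g. request_vma_migration(12024, 0x6000000000008000UL, 1, 100, 2); */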
* RE: [NUMA] Display and modify the memory policy of a process through /proc/<pid>/numa_policy
From: Chen, Kenneth W @ 2005-07-15 5:55 UTC
To: 'Christoph Lameter'; +Cc: linux-mm, linux-ia64, pj
Christoph Lameter wrote on Thursday, July 14, 2005 10:08 PM
> On Thu, 14 Jul 2005, Chen, Kenneth W wrote:
> > > Additionally, the patch adds write capability to "numa_maps". One can write
> > > a VMA address followed by a policy to that file to change the mempolicy of an
> > > individual virtual memory area, e.g.
> >
> > This looks a lot like back door access to libnuma and numactl capability.
> > Are you sure libnuma and numactl won't suit your needs?
>
> The functionality offered here is different. numactl's main concern is
> starting processes. libnuma is mostly concerned with a process
> controlling its own memory allocation.
>
> This is an implementation that deals with monitoring and managing running
> processes. For an effective batch scheduler we need outside control
> over memory policy.
I want to warn you that controlling numa policy externally to the app
is extremely unreliable and difficult, since in-kernel numa policy
is only enforced for new allocations. When pages inside the vma have already
been touched before you echo the policy into the proc file, it has no effect.
That means one needs some synchronization point between the time the sysadmin
echoes a desired policy into the /proc file and the time the app touches the
memory. It sounds like you have another patch in the pipeline to address that.
But there is always some usage model where this will break down (me thinking
interleave mode...).
> It needs to be easy to see what is going on in the system (numa_maps)
Yeah, I like the numa_maps a lot :-)
- Ken
* Re: [NUMA] Display and modify the memory policy of a process through /proc/<pid>/numa_policy
From: Paul Jackson @ 2005-07-15 6:05 UTC
To: Christoph Lameter; +Cc: kenneth.w.chen, linux-mm, linux-ia64, Andi Kleen
Christoph wrote:
> This is an implementation that deals with monitoring and managing running
> processes.
So is this patch roughly equivalent to adding a pid to the
mbind/set_mempolicy/get_mempolicy system calls?
Not that I am advocating for or against doing that. But this
seems like a lot of code, with new and exciting API details, just to
add a pid argument, if such it be.
Andi - could you remind us all why you chose not to have a pid argument
in these calls?
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [NUMA] Display and modify the memory policy of a process through /proc/<pid>/numa_policy
From: Andi Kleen @ 2005-07-15 11:46 UTC
To: Paul Jackson; +Cc: Christoph Lameter, kenneth.w.chen, linux-mm, linux-ia64
On Thu, 14 Jul 2005 23:05:01 -0700
Paul Jackson <pj@sgi.com> wrote:
> Christoph wrote:
> > This is an implementation that deals with monitoring and managing running
> > processes.
>
> So is this patch roughly equivalent to adding a pid to the
> mbind/set_mempolicy/get_mempolicy system calls?
>
> Not that I am advocating for or against doing that. But this
> seems like a lot of code, with new and exciting API details, just to
> add a pid argument, if such it be.
>
> Andi - could you remind us all why you chose not to have a pid argument
> in these calls?
Because of locking issues, and because I don't think external processes
should mess with the virtual addresses of other processes. There is
just no way to do the latter cleanly and race-free.
I haven't seen the patch, but from the description it sounds wrong.
-Andi
* Re: [NUMA] Display and modify the memory policy of a process through /proc/<pid>/numa_policy
From: Christoph Lameter @ 2005-07-15 16:06 UTC
To: Paul Jackson; +Cc: kenneth.w.chen, linux-mm, linux-ia64, Andi Kleen
On Thu, 14 Jul 2005, Paul Jackson wrote:
> Christoph wrote:
> > This is an implementation that deals with monitoring and managing running
> > processes.
>
> So is this patch roughly equivalent to adding a pid to the
> mbind/set_mempolicy/get_mempolicy system calls?
Yes. Almost.
> Not that I am advocating for or against doing that. But this
> seems like a lot of code, with new and exciting API details, just to
> add a pid argument, if such it be.
I think the syscall interface is plainly wrong for monitoring and managing
a process. The /proc interface is designed to monitor processes and it
allows the modification of process characteristics. This is the natural
way to implement viewing of numa allocation maps, the runtime changes
to allocation strategies and finally something that migrates pages of a
vma between nodes.
A syscall interface implies that you have to write user space programs
with associated libraries to display and manipulate values. As
demonstrated this is really not necessary. Implementation via /proc
is fairly simple.
* Re: [NUMA] Display and modify the memory policy of a process through /proc/<pid>/numa_policy
From: Paul Jackson @ 2005-07-15 21:04 UTC
To: Christoph Lameter; +Cc: kenneth.w.chen, linux-mm, linux-ia64, ak
Christoph wrote:
> I think the syscall interface is plainly wrong for monitoring and managing
> a process.
Well ... actually I'd have to agree with that. I chose a filesys
interface for cpusets for similar reasons.
However in this case, the added functionality seems so close to
mbind/mempolicy that one has to at least give consideration to
remaining consistent with that style of interface.
These questions of interface style (filesys or syscall) probably don't
matter, however, at least not yet. First we need to make sense of
the larger issues that Ken and Andi raise, of whether this is a good
thing to do.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [NUMA] Display and modify the memory policy of a process through /proc/<pid>/numa_policy
From: Andi Kleen @ 2005-07-15 21:12 UTC
To: Paul Jackson; +Cc: Christoph Lameter, kenneth.w.chen, linux-mm, linux-ia64, ak
> These questions of interface style (filesys or syscall) probably don't
> matter, however, at least not yet. First we need to make sense of
> the larger issues that Ken and Andi raise, of whether this is a good
> thing to do.
In my opinion, detailed reporting to external processes of the node affinity
of specific memory areas is a mistake. It's too fine-grained and not useful
outside the process itself (external users don't, or shouldn't, know anything
about process virtual addresses). The information is too volatile and can
change at any time, without nice ways to lock it (no, SIGSTOP is not an
acceptable way).
Some people might find it useful for debugging NUMA kernel code,
but that doesn't mean it has to go into the kernel.
For statistics purposes probably just some counters are enough,
either generated on demand or kept up to date. On demand would
probably be slow and keeping counts would bloat mm_struct because
it would need some MAX_NUMNODES-sized arrays. Not sure which
is the better tradeoff.
-Andi
* Re: [NUMA] Display and modify the memory policy of a process through /proc/<pid>/numa_policy
From: Christoph Lameter @ 2005-07-15 21:20 UTC
To: Andi Kleen; +Cc: Paul Jackson, kenneth.w.chen, linux-mm, linux-ia64
On Fri, 15 Jul 2005, Andi Kleen wrote:
> > These questions of interface style (filesys or syscall) probably don't
> > matter, however, at least not yet. First we need to make sense of
> > the larger issues that Ken and Andi raise, of whether this is a good
> > thing to do.
>
> In my opinion, detailed reporting to external processes of the node affinity
> of specific memory areas is a mistake. It's too fine-grained and not useful
> outside the process itself (external users don't, or shouldn't, know anything
> about process virtual addresses). The information is too volatile and can
> change at any time, without nice ways to lock it (no, SIGSTOP is not an
> acceptable way).
It is very useful to a batch scheduler that can dynamically move memory
between nodes. It needs to know exactly where the pages are including the
vma information. It is also of utmost importance to a sysadmin that wants
to control the memory placement of an important application to have
information about the process and be able to influence future allocations
as well as to move existing pages.
The volatility has to be taken into account by the batch scheduler or by
the sysadmin manipulating the program. Typically both know much more about
the expected and future behavior of the application than the kernel.
And yes, SIGSTOP is acceptable if the application's behavior on STOP ->
Continue is known to the administrator or the batch scheduler. I do not
think that this is required though.
Imagine an important batch data run that has been running for 2 days and
will run 3 more days. Now some nodes are running out of memory and the
performance suffers. The batch scheduler or sysadmin will be able to
inspect the situation and improve the performance by changing memory
policies and/or moving pages. The batch scheduler / admin knows
which processes are important and may stop other processes in order for
the critical process to finish in time.
* Re: [NUMA] Display and modify the memory policy of a process through /proc/<pid>/numa_policy
From: Andi Kleen @ 2005-07-15 21:47 UTC
To: Christoph Lameter
Cc: Andi Kleen, Paul Jackson, kenneth.w.chen, linux-mm, linux-ia64
> It is very useful to a batch scheduler that can dynamically move memory
> between nodes. It needs to know exactly where the pages are including the
> vma information.
You mean for relative placement in node groups?
Ray's code was supposed to handle that in the kernel.
You pass a mapping array to the syscall and it does the rest.
We had a big discussion about that some months ago; I suggest
you review it.
So for what does that batch monstrosity need to know
about the VMAs?
> It is also of utmost importance to a sysadmin that wants
> to control the memory placement of an important application to have
> information about the process and be able to influence future allocations
> as well as to move existing pages.
I don't believe any admin will mess with virtual addresses.
I added the capability to numactl for shared memory
areas because I first thought it would be useful, but as far
as I know nobody was interested in it. (will probably remove
it again)
But for "uncooperative" programs working on bigger objects
like threads/files/shm areas/processes makes much more sense. And gives
much cleaner interfaces too.
Now I can see some people being interested in more fine grained
policy, but the only sane way to do that is to change the source
code and use libnuma.
Basically to mess with finegrained virtual addresses you need code access,
and when you have that you can as well do it well and add
libnuma and recompile.
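Something like this, for example (libnuma as shipped with numactl; sizes and
node numbers are made up):

/* Sketch only: fine-grained placement from within the program via libnuma.
 * Link with -lnuma. */
#include <numa.h>

int main(void)
{
        char *a, *b;

        if (numa_available() < 0)
                return 1;

        a = numa_alloc_onnode(1 << 20, 1);      /* 1MB preferably on node 1 */
        b = numa_alloc_interleaved(1 << 20);    /* 1MB interleaved over allowed nodes */
        if (!a || !b)
                return 1;
        /* ... touch and use the memory; placement happens at fault time ... */
        numa_free(a, 1 << 20);
        numa_free(b, 1 << 20);
        return 0;
}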
-Andi
* Re: [NUMA] Display and modify the memory policy of a process through /proc/<pid>/numa_policy
From: Christoph Lameter @ 2005-07-15 21:55 UTC
To: Andi Kleen; +Cc: Paul Jackson, kenneth.w.chen, linux-mm, linux-ia64
On Fri, 15 Jul 2005, Andi Kleen wrote:
> So for what does that batch monstrosity need to know
> about the VMAs?
It needs to know where the memory of a process is. Thus
/proc/<pid>/numa_maps.
> I don't believe any admin will mess with virtual addresses.
No but they will mess with vma's which are only identifiable by the
starting virtual address.
> But for "uncooperative" programs working on bigger objects
> like threads/files/shm areas/processes makes much more sense. And gives
> much cleaner interfaces too.
Look at the existing patches and you see a huge complexity and heuristics
because the kernel guesses which vma's to migrate. If the vma are
exposed to the batch scheduler / admin then things become much easier to
implement and the batch scheduler / admin has finer grained control.
> Now I can see some people being interested in more fine grained
> policy, but the only sane way to do that is to change the source
> code and use libnuma.
Can libnuma change the memory policy and move pages of existing processes?
> Basically to mess with finegrained virtual addresses you need code access,
> and when you have that you can as well do it well and add
> libnuma and recompile.
libnuma is pretty heavy and AFAIK does not have the functionality that is
required here.
* Re: [NUMA] Display and modify the memory policy of a process through /proc/<pid>/numa_policy
From: Andi Kleen @ 2005-07-15 22:07 UTC
To: Christoph Lameter
Cc: Andi Kleen, Paul Jackson, kenneth.w.chen, linux-mm, linux-ia64
On Fri, Jul 15, 2005 at 02:55:45PM -0700, Christoph Lameter wrote:
> On Fri, 15 Jul 2005, Andi Kleen wrote:
>
> > So for what does that batch monstrosity need to know
> > about the VMAs?
>
> It needs to know where the memory of a process is. Thus
For that the counters I proposed are enough.
> /proc/<pid>/numa_maps.
All it should do is start processes on specific nodes
(that already should work)
and perhaps later migrate processes from some set of specific
nodes to another set of specific nodes (using Ray's page
migration call).
I don't see where knowledge of specific VMAs is needed anywhere
in this.
>
> > I don't believe any admin will mess with virtual addresses.
>
> No but they will mess with vma's which are only identifiable by the
> starting virtual address.
What for?
>
> > But for "uncooperative" programs working on bigger objects
> > like threads/files/shm areas/processes makes much more sense. And gives
> > much cleaner interfaces too.
>
> Look at the existing patches and you see a huge complexity and heuristics
> because the kernel guesses which vma's to migrate. If the vma are
The kernel doesn't guess, it knows exactly.
> exposed to the batch scheduler / admin then things become much easier to
> implement and the batch scheduler / admin has finer grained control.
So you want to tear up the interface Ray came up with and we discussed
and agreed on, and replace it with something completely different
that uses this ugly /proc file? I don't think that's a good idea.
>
> > Now I can see some people being interested in more fine grained
> > policy, but the only sane way to do that is to change the source
> > code and use libnuma.
>
> Can libnuma change the memory policy and move pages of existing processes?
If someone hooks it into mbind() sure. But most likely
such changes would be handled by migrate_pages()
>
> > Basically to mess with finegrained virtual addresses you need code access,
> > and when you have that you can as well do it well and add
> > libnuma and recompile.
>
> libnuma is pretty heavy and AFAIK does not have the functionality that is
Heavy??? You're not serious, right?
-Andi
* Re: [NUMA] Display and modify the memory policy of a process through /proc/<pid>/numa_policy
From: Christoph Lameter @ 2005-07-15 22:30 UTC
To: Andi Kleen; +Cc: Paul Jackson, kenneth.w.chen, linux-mm, linux-ia64
On Sat, 16 Jul 2005, Andi Kleen wrote:
> > > I don't believe any admin will mess with virtual addresses.
> >
> > No but they will mess with vma's which are only identifiable by the
> > starting virtual address.
>
> What for?
For page migration.
> > Look at the existing patches and you see a huge complexity and heuristics
> > because the kernel guesses which vma's to migrate. If the vma are
> The kernel doesn't guess, it knows exactly.
Maybe you need to reread the discussion on page migration that ended up
with filesystem modifications?
> > > Now I can see some people being interested in more fine grained
> > > policy, but the only sane way to do that is to change the source
> > > code and use libnuma.
> >
> > Can libnuma change the memory policy and move pages of existing processes?
>
> If someone hooks it into mbind() sure. But most likely
> such changes would be handled by migrate_pages()
I cannot imagine that migrate_pages makes it into the kernel in its
current form. It combines multiple functionalities that need to be
separate (it updates the memory policy, clears the page cache, deals
with memory policy translations and then uses heuristics to guess which
vma's to transfer) and then provides a complex function for moving pages
between groups of nodes.
Therefore:
1. Updating the memory policy is something that can be useful in other
settings as well so it needs to be separate. The patch we are discussing
does exactly that. The batch scheduler or the sysadmin can invoke this
functionality before migrating pages if necessary.
2. Clearing the page cache is some work pursued by someone else. The batch
scheduler or the sysadmin can invoke this function if necessary before
migrating pages.
3. Memory policy translations better be done in user space. The batch
scheduler /sysadmin knows which node has what pages so it can easily
develop a page movement scheme that is optimal for the process.
4. Moving pages should be a simple function like
migrate_pages(vma, from-node, nr-pages, to-node)
(sketched below). The batch scheduler / sysadmin can invoke this function
multiple times to move memory between groups of nodes or to move only part
of the memory from a node (which was not really supported by Ray's
migrate_pages; instead another heuristic guessed how much to move and there
was no option of partial moves).
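A sketch of the primitive meant in point 4 (illustrative only; no such
function exists in the tree today):

/* Sketch only: the per-vma migration primitive proposed in point 4.
 * Move up to nr_pages pages of @vma that currently sit on @from_node
 * over to @to_node; returns the number of pages actually moved. */
#include <linux/mm.h>

long migrate_vma_pages(struct vm_area_struct *vma, int from_node,
                       unsigned long nr_pages, int to_node);

/*
 * Splitting a vma that sits on node 1 across nodes 2 and 3 then becomes
 * two calls by the batch scheduler:
 *
 *      moved  = migrate_vma_pages(vma, 1, pages_for_node2, 2);
 *      moved += migrate_vma_pages(vma, 1, remaining_pages, 3);
 */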
* Re: [NUMA] Display and modify the memory policy of a process through /proc/<pid>/numa_policy
From: Andi Kleen @ 2005-07-15 22:37 UTC
To: Christoph Lameter
Cc: Andi Kleen, Paul Jackson, kenneth.w.chen, linux-mm, linux-ia64
On Fri, Jul 15, 2005 at 03:30:40PM -0700, Christoph Lameter wrote:
> I cannot imagine that migrate_pages makes it into the kernel in its
> current form. It combines multiple functionalities that need to be
> separate (it updates the memory policy, clears the page cache, deals
> with memory policy translations and then uses heuristics to guess which
> vma's to transfer) and then provides a complex function for moving pages
> between groups of nodes.
>
> Therefore:
>
> 1. Updating the memory policy is something that can be useful in other
> settings as well so it needs to be separate. The patch we are discussing
Not for external processes except in the narrow special case
of migrating everything. External processes shouldn't
know about virtual addresses of other people.
> 3. Memory policy translations better be done in user space. The batch
> scheduler /sysadmin knows which node has what pages so it can easily
> develop a page movement scheme that is optimal for the process.
I don't think the existing policies are complex enough to make
this useful. The mapping for page migration for all of
them is quite straightforward.
-Andi
* Re: [NUMA] Display and modify the memory policy of a process through /proc/<pid>/numa_policy
From: Christoph Lameter @ 2005-07-15 22:49 UTC
To: Andi Kleen; +Cc: Paul Jackson, kenneth.w.chen, linux-mm, linux-ia64
On Sat, 16 Jul 2005, Andi Kleen wrote:
> > 1. Updating the memory policy is something that can be useful in other
> > settings as well so it needs to be separate. The patch we are discussing
>
> Not for external processes except in the narrow special case
> of migrating everything. External processes shouldn't
> know about virtual addresses of other people.
Updating the memory policy is also useful if memory on one node gets
short and you want to redirect allocations to a node that has memory free.
A batch scheduler may anticipate memory shortages and redirect memory
allocations in order to avoid page migration.
> > 3. Memory policy translations better be done in user space. The batch
> > scheduler /sysadmin knows which node has what pages so it can easily
> > develop a page movement scheme that is optimal for the process.
>
> I don't think the existing policies are complex enough to make
> this useful. The mapping for page migration for all of
> them is quite straightforward.
I'd rather have that logic in userspace rather than fix up page_migrate
again and again and again. Automatic recalculation of memory policies is
likely an unexpected side effect of the existing page migration code.
Policies should only change with explicit instructions from user space and
not as a side effect of page migration.
And curiously, with the old page migration code the only way to change
a memory policy is by page migration, and this happens automatically behind
your back.
* Re: [NUMA] Display and modify the memory policy of a process through /proc/<pid>/numa_policy
From: Andi Kleen @ 2005-07-15 22:56 UTC
To: Christoph Lameter
Cc: Andi Kleen, Paul Jackson, kenneth.w.chen, linux-mm, linux-ia64
On Fri, Jul 15, 2005 at 03:49:33PM -0700, Christoph Lameter wrote:
> On Sat, 16 Jul 2005, Andi Kleen wrote:
>
> > > 1. Updating the memory policy is something that can be useful in other
> > > settings as well so it need to be separate. The patch we are discussing
> >
> > Not for external processes except in the narrow special case
> > of migrating everything. External processes shouldn't
> > know about virtual addresses of other people.
>
> Updating the memory policy is also useful if memory on one node gets
> short and you want to redirect allocations to a node that has memory free.
If you use MEMBIND just specify all the nodes upfront and it'll
do the normal fallback in them.
If you use PREFERRED it'll do that automatically anyways.
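(Sketch of the upfront-binding case with the numaif.h wrapper from numactl;
node numbers are examples, link with -lnuma:)

/* Sketch only: bind future allocations to nodes 2 and 3 so that normal
 * fallback happens within that set. */
#include <numaif.h>

static int bind_to_nodes_2_and_3(void)
{
        unsigned long nodemask = (1UL << 2) | (1UL << 3);

        return set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask) * 8);
}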
>
> A batch scheduler may anticipate memory shortages and redirect memory
> allocations in order to avoid page migration.
I think that job belongs more in the kernel. After all we don't
want to move half of our VM into your proprietary scheduler.
> I'd rather have that logic in userspace rather than fix up page_migrate
> again and again and again. Automatic recalculation of memory policies is
> likely an unexpected side effect of the existing page migration code.
Only if you migrate again and again.
>
> Policies should only change with explicit instructions from user space and
> not as a side effect of page migration.
Well, page migration would be an "explicit instruction from user space".
>
> And curiously, with the old page migration code the only way to change
> a memory policy is by page migration, and this happens automatically behind
> your back.
mbind can change policy at any time. Just only for the local
process, as that is the only one who has enough information
to really do this.
-Andi
* Re: [NUMA] Display and modify the memory policy of a process through /proc/<pid>/numa_policy
From: Christoph Lameter @ 2005-07-15 23:11 UTC
To: Andi Kleen; +Cc: Paul Jackson, kenneth.w.chen, linux-mm, linux-ia64
On Sat, 16 Jul 2005, Andi Kleen wrote:
> > Updating the memory policy is also useful if memory on one node gets
> > short and you want to redirect allocations to a node that has memory free.
>
> If you use MEMBIND just specify all the nodes upfront and it'll
> do the normal fallback in them.
>
> If you use PREFERRED it'll do that automatically anyways.
No it won't. If you know that you are going to start a process that must
run on node 3 and know it's going to use 2G but there is only 1G free,
then you may want to modify the policy of an existing huge process on
node 3 that is still allocating, so that it goes to node 2, which just
happens to have free space.
> > A batch scheduler may anticipate memory shortages and redirect memory
> > allocations in order to avoid page migration.
> I think that job belongs more in the kernel. After all we don't
> want to move half of our VM into your proprietary scheduler.
Care to tell me which proprietary scheduler you are talking about? I was
not aware of the existence of such a thing. I am particularly surprised that
this proprietary scheduler exists before we have a working interface.
And you are now going to implement automatic page migration into the
existing scheduler?
> > I'd rather have that logic in userspace rather than fix up page_migrate
> > again and again and again. Automatic recalculation of memory policies is
> > likely an unexpected side effect of the existing page migration code.
>
> Only if you migrate again and again.
If you encounter a different situation then you may need a different address
translation. F.e. let's say you want to move a process from nodes 3 and 4 to
node 5. That won't work with the existing patches. Or you want a process
running on node 1 to be split across nodes 2 and 3: you want 1G to be moved to
node 2 and the rest to node 3. That cannot be done with the old page migration.
> > Policies should only change with explicit instructions from user space and
> > not as a side effect of page migration.
>
> Well, page migration would be an "explicit instruction from user space".
Existing page migration does not specify a memory policy, it just
translates it. And it's inflexible and unable to handle some common
situations described above.
> > And curiously, with the old page migration code the only way to change
> > a memory policy is by page migration, and this happens automatically behind
> > your back.
>
> mbind can change policy at any time, but only for the local
> process, as that is the only one with enough information
> to really do this.
Which makes mbind useless for the sysadmin and/or batch scheduler in the
scenarios we are discussing. That is the key reason why we need this patch.
* Re: [NUMA] Display and modify the memory policy of a process through /proc/<pid>/numa_policy
2005-07-15 23:11 ` Christoph Lameter
@ 2005-07-15 23:44 ` Andi Kleen
2005-07-15 23:56 ` Christoph Lameter
` (2 more replies)
0 siblings, 3 replies; 35+ messages in thread
From: Andi Kleen @ 2005-07-15 23:44 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andi Kleen, Paul Jackson, kenneth.w.chen, linux-mm, linux-ia64
On Fri, Jul 15, 2005 at 04:11:00PM -0700, Christoph Lameter wrote:
> On Sat, 16 Jul 2005, Andi Kleen wrote:
>
> > > Updating the memory policy is also useful if memory on one node gets
> > > short and you want to redirect allocations to a node that has memory free.
> >
> > If you use MEMBIND just specify all the nodes upfront and it'll
> > do the normal fallback in them.
> >
> > If you use PREFERRED it'll do that automatically anyway.
>
> No it won't. If you know that you are going to start a process that must
> run on node 3 and know it's going to use 2G but there is only 1G free,
> then you may want to modify the policy of an existing huge process on
> node 3 that is still allocating, so that it goes to node 2, which just happens to have
> free space.
I think you should leave that to the kernel.
> > > A batch scheduler may anticipate memory shortages and redirect memory
> > > allocations in order to avoid page migration.
> > I think that job belongs more in the kernel. After all we don't
> > want to move half of our VM into your proprietary scheduler.
>
> Care to tell me which proprietary scheduler you are talking about? I was
That SGI batch scheduler with its incredibly long specification
list you guys seem to want to mess up all interfaces
for. If I can download the source to it, please supply a URL.
> And you are now going to implement automatic page migration into the
> existing scheduler?
Hmm? You mean the kernel CPU scheduler? Nobody is planning to add
page migration to that.
>
> > > I'd rather have that logic in userspace rather than fix up page_migrate
> > > again and again and again. Automatic recalculation of memory policies is
> > > likely an unexpected side effect of the existing page migration code.
> >
> > Only if you migrate again and again.
>
> If you encounter a different situation then you may need a different address
> translation. For example, let's say you want to move a process from nodes 3 and 4 to
> node 5. That won't work with the existing patches. Or you want a process
> running on node 1 to be split across nodes 2 and 3: 1G should be moved to
> node 2 and the rest to node 3. That cannot be done with the old page migration.
Ok, let's review it slowly. Why would you want to move 1GB
of an existing process and another GB to different nodes?
There are two goals: either best memory latency (local memory) or best
memory bandwidth (interleaved memory).
Considering you want to optimize for latency:
- It doesn't make sense here because your external agent doesn't know
which thread is using the first GB and which thread is using the last 2GBs.
Most likely they use malloc and everything is pretty much mixed up.
That is information only the code knows or the kernel indirectly from its
first touch policy. But you need it otherwise you violate local
memory policy for one thread or another.
In short blocks of memory are useless here because they have no
relationship to what the code actually does.
If you want to optimize for bandwidth:
- A similar problem applies. The first GB and last GB of memory have no
relationship to how the memory is interleaved.
So it doesn't make much sense to work on smaller pieces
than processes here. Files are corner cases, but they can
be already handled with some existing patches to mbind.
-Andi
* Re: [NUMA] Display and modify the memory policy of a process through /proc/<pid>/numa_policy
2005-07-15 23:44 ` Andi Kleen
@ 2005-07-15 23:56 ` Christoph Lameter
2005-07-16 2:01 ` Andi Kleen
2005-07-16 0:00 ` David Singleton
2005-07-16 0:16 ` Steve Neuner
2 siblings, 1 reply; 35+ messages in thread
From: Christoph Lameter @ 2005-07-15 23:56 UTC (permalink / raw)
To: Andi Kleen; +Cc: Paul Jackson, kenneth.w.chen, linux-mm, linux-ia64
On Sat, 16 Jul 2005, Andi Kleen wrote:
> > If you encounter a different situation then you may need a different address
> > translation. For example, let's say you want to move a process from nodes 3 and 4 to
> > node 5. That won't work with the existing patches. Or you want a process
> > running on node 1 to be split across nodes 2 and 3: 1G should be moved to
> > node 2 and the rest to node 3. That cannot be done with the old page migration.
>
> Ok, let's review it slowly. Why would you want to move 1GB
> of an existing process and another GB to different nodes?
Many reasons. One is to optimize access, e.g. by interleaving. Or there just happens
to be space on those nodes and one needs the space on this node for
something else.
> Considering you want to optimize for latency:
> - It doesn't make sense here because your external agent doesn't know
> which thread is using the first GB and which thread is using the last 2GBs.
> Most likely they use malloc and everything is pretty much mixed up.
> That is information only the code knows or the kernel indirectly from its
> first touch policy. But you need it otherwise you violate local
> memory policy for one thread or another.
>
> In short blocks of memory are useless here because they have no
> relationship to what the code actually does.
>
> If you want to optimize for bandwidth:
>
> - A similar problem applies. The first GB and last GB of memory have no
> relationship to how the memory is interleaved.
>
> So it doesn't make much sense to work on smaller pieces
> than processes here. Files are corner cases, but they can
> be already handled with some existing patches to mbind.
You are now prescribing how things have to be done. This is not manual
page migration anymore. Manual page migration would allow control over
the memory locations of a process.
Let's say I want neither of the above. I just need to run a process on a
certain node because there is disk storage attached to that node and the
other processes need to get out of the way for the next 30 minutes.
One always needs control over what is migrated. Ideally one would be able
to specify that only the vma containing the huge amount of sparsely
accessed data is to be migrated if memory becomes tight but the process
continues to run on the same node. The stack and text segments and
libraries should stay on the node.
On the other hand, if the process is migrated to another node by
the scheduler, then one may want to migrate the text segment and the
stack but leave the 6G data vma where it originally was.
It all boils down to the following:
Are you willing to allow us to control memory placement? Or will it be
automatic? If automatic, then maybe you need to get rid of libnuma
and numactl and put it all in the scheduler. Otherwise please give us full control
and not some half-way measures.
* Re: [NUMA] Display and modify the memory policy of a process through /proc/<pid>/numa_policy
2005-07-15 23:44 ` Andi Kleen
2005-07-15 23:56 ` Christoph Lameter
@ 2005-07-16 0:00 ` David Singleton
2005-07-16 0:16 ` Steve Neuner
2 siblings, 0 replies; 35+ messages in thread
From: David Singleton @ 2005-07-16 0:00 UTC (permalink / raw)
To: Andi Kleen
Cc: Christoph Lameter, Paul Jackson, kenneth.w.chen, linux-mm, linux-ia64
Andi Kleen wrote:
>>No it won't. If you know that you are going to start a process that must
>>run on node 3 and know it's going to use 2G but there is only 1G free,
>>then you may want to modify the policy of an existing huge process on
>>node 3 that is still allocating, so that it goes to node 2, which just happens to have
>>free space.
>
> I think you should leave that to the kernel.
>
But the kernel doesn't know about these future requirements.
A batch scheduler does.
>
>>>>A batch scheduler may anticipate memory shortages and redirect memory
>>>>allocations in order to avoid page migration.
>>>
>>>I think that job belongs more in the kernel. After all we don't
>>>want to move half of our VM into your proprietary scheduler.
>>
>>Care to tell me which proprietary scheduler you are talking about? I was
>
>
> That SGI batch scheduler with its incredibly long specification
> list you guys seem to want to mess up all interfaces
> for. If I can download the source to it, please supply a URL.
I think SGI is just trying to accommodate users (like us) with our
own schedulers.
>
>
>>And you are now going to implement automatic page migration into the
>>existing scheduler?
>
>
> Hmm? You mean the kernel CPU scheduler? Nobody is planning to add
> page migration to that.
Exactly. Some of us think we can do a half decent job of manually
controlling page migration. What is the harm in letting us "shoot
ourselves in the foot" trying?
--
--------------------------------------------------------------------------
ANU Supercomputer Facility
David.Singleton@anu.edu.au and APAC National Facility
Phone: +61 2 6125 4389 Leonard Huxley Bldg (No. 56)
Fax: +61 2 6125 8199 Australian National University
Canberra, ACT, 0200, Australia
--------------------------------------------------------------------------
* Re: [NUMA] Display and modify the memory policy of a process through /proc/<pid>/numa_policy
2005-07-15 23:44 ` Andi Kleen
2005-07-15 23:56 ` Christoph Lameter
2005-07-16 0:00 ` David Singleton
@ 2005-07-16 0:16 ` Steve Neuner
2 siblings, 0 replies; 35+ messages in thread
From: Steve Neuner @ 2005-07-16 0:16 UTC (permalink / raw)
To: Andi Kleen
Cc: Christoph Lameter, Paul Jackson, kenneth.w.chen, linux-mm, linux-ia64
> That SGI batch scheduler with its incredibly long specification
> list you guys seem to want to mess up all interfaces
> for. If I can download the source to it, please supply a URL.
Hi,
SGI does not have or ship a batch scheduler product. However,
many Linux and other OS customers want and use both open source
and 3rd-party products that provide this capability. For example,
check out:
http://www.platform.com/products/HPC/
http://www.osc.edu/hpc/software/apps/pbs.shtml
http://www.altair.com/software/pbs_abo.htm
http://www.clusterresources.com/products/maui/
Hope that helps.
--steve
* Re: [NUMA] Display and modify the memory policy of a process through /proc/<pid>/numa_policy
2005-07-15 23:56 ` Christoph Lameter
@ 2005-07-16 2:01 ` Andi Kleen
2005-07-16 15:14 ` Christoph Lameter
2005-07-16 23:30 ` Paul Jackson
0 siblings, 2 replies; 35+ messages in thread
From: Andi Kleen @ 2005-07-16 2:01 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andi Kleen, Paul Jackson, kenneth.w.chen, linux-mm, linux-ia64
> One always needs control over what is migrated. Ideally one would be able
> to specify that only the vma containing the huge amount of sparsely
> accessed data is to be migrated if memory becomes tight but the process
> continues to run on the same node. The stack and text segments and
> libraries should stay on the node.
There is no way to do sane locking from user space
for such external manipulation of arbitrary mappings. You need
to do it in the kernel.
BTW all your talk about VMAs is useless here anyway because
NUMA policies don't necessarily match VMAs and neither does
allocated memory.
> Are you willing to allow us to control memory placement? Or will it be
> automatically? If automatically then maybe you need to get rid of libnuma
> and numactl and put it all in the scheduler. Otherwise please full control
> and not some half-way measures.
Without my NUMA policy code you wouldn't have any usable NUMA policy today.
But my goal is definitely to keep the kernel interfaces for this
clean. And what you're proposing is *not* clean.
I think the per VMA approach is fundamentally wrong because
virtual addresses are not something an external user can safely
access. Doing it on higher level objects allows better interfaces
and better locking, and as far as I can see process/shm segment/file
are the only useful objects for this.
It should basically work like swapping without the need to SIGSTOP
the target.
-Andi
* Re: [NUMA] Display and modify the memory policy of a process through /proc/<pid>/numa_policy
2005-07-16 2:01 ` Andi Kleen
@ 2005-07-16 15:14 ` Christoph Lameter
2005-07-16 22:39 ` Paul Jackson
2005-07-16 23:30 ` Paul Jackson
1 sibling, 1 reply; 35+ messages in thread
From: Christoph Lameter @ 2005-07-16 15:14 UTC (permalink / raw)
To: Andi Kleen; +Cc: Paul Jackson, kenneth.w.chen, linux-mm, linux-ia64
On Sat, 16 Jul 2005, Andi Kleen wrote:
> There is no way to do sane locking from user space
> for such external manipulation of arbitrary mappings. You need
> to do it in the kernel.
These operations do not have to be reliable, just best effort. Locking is up
to the user, and the user can check whether it worked by inspecting the proc files.
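For example, a best-effort update of a single mapping followed by a read-back check could look like the sketch below. The numa_maps write format (a VMA start address followed by a policy string) is taken from the proposal in this thread, and the pid and address are made up.

/*
 * Sketch only: best-effort change of one VMA's policy from outside the
 * task, followed by a re-read of /proc/<pid>/numa_maps to see whether it
 * took effect. The write format is the one proposed in this thread.
 */
#include <stdio.h>

int main(void)
{
    const int pid = 4242;          /* illustrative target process */
    char path[64], line[512];
    FILE *f;

    snprintf(path, sizeof(path), "/proc/%d/numa_maps", pid);

    /* Ask for the VMA starting at this (made-up) address to prefer node 1. */
    f = fopen(path, "w");
    if (f) {
        fputs("6000000000004000 prefer=1\n", f);
        fclose(f);
    }

    /* The VMA may have been split, merged or unmapped in the meantime,
     * so simply re-read the file and inspect the current state. */
    f = fopen(path, "r");
    if (f) {
        while (fgets(line, sizeof(line), f))
            fputs(line, stdout);
        fclose(f);
    }
    return 0;
}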
> BTW all your talk about VMAs is useless here anyway because
> NUMA policies don't necessarily match VMAs and neither does
> allocated memory.
NUMA policies are per vma. See the definition of vm_area_struct.
> Without my NUMA policy code you wouldn't have any usable NUMA policy today.
> But my goal is definitely to keep the kernel interfaces for this
> clean. And what you're proposing is *not* clean.
Then come up with an alternative that is cleaner.
> I think the per VMA approach is fundamentally wrong because
> virtual addresses are not something an external user can safely
> access. Doing it on higher level objects allows better interfaces
> and better locking, and as far as I can see process/shm segment/file
> are the only useful objects for this.
Then you need to remove the association between the VMA and memory
policies. Otherwise statements like this do not make sense.
/proc/<pid>/maps already exposes the virtual addresses to user space. The
address is only used to identify the VMA; there is no use of "virtual
addresses" per se.
Plus the libnuma interfaces also rely on addresses.
We can number the vma's if that makes you feel better and refer to the
number of the vma.
* Re: [NUMA] Display and modify the memory policy of a process through /proc/<pid>/numa_policy
2005-07-16 15:14 ` Christoph Lameter
@ 2005-07-16 22:39 ` Paul Jackson
0 siblings, 0 replies; 35+ messages in thread
From: Paul Jackson @ 2005-07-16 22:39 UTC (permalink / raw)
To: Christoph Lameter; +Cc: ak, kenneth.w.chen, linux-mm, linux-ia64
Christoph wrote:
> We can number the vma's if that makes you feel better and refer to the
> number of the vma.
I really doubt you want to go down that path.
VMA's are a kernel internal detail. They can come and go, be merged
and split, in the dark of the night, transparent to user space.
One might argue that virtual address ranges (rather than VMAs) are
appropriate to be manipulated from outside the task, on the other
side of the position that Andi takes. Not that I am arguing such --
Andi is making stronger points against such than I am able to refute.
But VMA's are not really visible to user space, except via diagnostic
displays (and Andi makes a good case against even those); not
even the VMA's within one's own task.
No, I don't think you want to consider numbering VMA's for the purposes
of manipulating them.
===
My intuition is that we are seeing a clash of computing models here.
Main memory is becoming hundreds, even thousands, of times slower
than internal CPU operations. And main memory is, under the rubric
"NUMA", becoming no longer a monolithic resource on larger systems.
Within a few years, high end workstations will join the ranks of
NUMA systems, just as they have already joined the ranks of SMP
systems.
What was once the private business of each individual task, its
address space and the placement of its memory, is now becoming the
proper business of system-wide administration. This is because memory
placement can have substantial effects on system and job performance.
We see a variation of this clash on issues of how the kernel should
consume and place its internal memory for caches, buffers and such.
What used to be the private business of the kernel, guided by the
rule that it is best to consume almost all available memory to
cache something, is becoming a system problem, as it can be
counterproductive on NUMA systems.
Folks like SGI, on the high end of big honkin NUMA iron, are seeing
it first. As system architectures become more complex, and scaled-down
NUMA architectures come into more widespread use, others will
see it as well, though with no doubt different tradeoffs and ordering
of requirements than SGI and its competitors notice currently.
However, I would presume that Andi is entirely correct, and that
the architecture of the kernel does not allow one task safely to
manipulate another's address space or memory placement thereunder,
at least not in a way that we mere mortals can understand.
A rock and a hard place.
Just brainstorming ... if one could load a kernel module that could
be called in the context of a target task, either when it is returning
from kernel space or was entering or leaving a timer/resched interrupt
taken while in user space, where that module could, if it were so
coded, munge the task's memory placement, then would this provide a
basis for solutions to some of these problems?
I am presuming that the hooks for such a module, given it was under
GPL license, would be modest, and of minimum burden to the great
majority of systems that had no such need.
In short, when memory management and placement has such a dominant
impact on overall system performance, it cannot remain solely the
private business of each application. We need to look for a safe and
simple means to enable external (from outside the task) management
of a task's memory, without requiring massive surgery on a large body
of critical code that is (quite properly) not designed to handle such.
And we have to co-exist with the folks pushing Linux in the other
direction, embedded in wrist watches or whatever. Those folks will
properly refuse to waste any non-trivial number of brain or CPU cycles
on NUMA requirements.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [NUMA] Display and modify the memory policy of a process through /proc/<pid>/numa_policy
2005-07-16 2:01 ` Andi Kleen
2005-07-16 15:14 ` Christoph Lameter
@ 2005-07-16 23:30 ` Paul Jackson
2005-07-17 1:55 ` Christoph Lameter
2005-07-17 3:21 ` Christoph Lameter
1 sibling, 2 replies; 35+ messages in thread
From: Paul Jackson @ 2005-07-16 23:30 UTC (permalink / raw)
To: Andi Kleen; +Cc: clameter, kenneth.w.chen, linux-mm, linux-ia64
Andi wrote:
> I think the per VMA approach is fundamentally wrong because
> virtual addresses are not something an external user can safely
> access.
Earlier, he also wrote:
> In short blocks of memory are useless here because they have no
> relationship to what the code actually does.
There are two questions here - should we and can we.
On the one hand, I hear Andi saying we should not want to alter the
placement of pages allocated to an external task at such a fine level
of granularity.
On the other hand, I hear him saying we can't do it, because the
locking cannot be safely handled.
There is also one confusion that I sometimes succumb to, reading these
replies - between memory policies to control future allocations and
memory policies to relocate already allocated memory.
I think between the numa calls (mbind, set_mempolicy) and cpusets,
we have a decent array of mechanisms to control future allocations.
The full set of features required may not be complete, but the
framework seems to be in place, and the majority of what features we
will need are supported now.
We are lacking in sufficient means to relocate already allocated
user memory.
I'd disagree with Andi that we should not support rearranging memory
at a fine granularity. For most systems and most applications, Andi
is no doubt right. But for some systems and some applications, such
as big long running tightly parallel applications on NUMA systems,
placement is often well understood and closely managed at a fine
granularity, because algorithm and memory placement closely interact,
and can have a huge impact on performance.
I willingly bow to Andi's expertise when he says we can't do it now
because memory structures and placement cannot be safely modified
from outside a task.
But I don't agree that we shouldn't look for a way to do it.
We need a way to safely rearrange the placement of already allocated
user memory pages, at a fine granularity (per physical page), without
significant impact to the main body of kernel memory management code.
I think that must mean code operating within the context of the target
task. I suspect that means at least a portion of this code must be
operating within kernel space. It should enable external, system
administrator imposed, per-page relocation of already allocated memory.
In some cases, the details of the code that decide what page should
go where will be very specific to a situation, and belong in user
space, or at most, a loadable kernel module, certainly not in main
line kernel code.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [NUMA] Display and modify the memory policy of a process through /proc/<pid>/numa_policy
2005-07-16 23:30 ` Paul Jackson
@ 2005-07-17 1:55 ` Christoph Lameter
2005-07-17 3:50 ` Paul Jackson
2005-07-17 3:21 ` Christoph Lameter
1 sibling, 1 reply; 35+ messages in thread
From: Christoph Lameter @ 2005-07-17 1:55 UTC (permalink / raw)
To: Paul Jackson; +Cc: Andi Kleen, kenneth.w.chen, linux-mm, linux-ia64
On Sat, 16 Jul 2005, Paul Jackson wrote:
> On the other hand, I hear him saying we can't do it, because the
> locking cannot be safely handled.
That should have been brought up earlier because the page migration
patches by Ray always modified the policy and Andi agreed to that.
We can certainly find a way to provide proper locking for policy changes
if there are concerns. The most trivial would be to require atomic
modifications via cmpxchg.
However, there is a fundamental issue with both the application and the one who
manages the process from the outside making changes to the policy.
Currently only the application can make these changes, which avoids locking
issues but also restricts the usefulness of these policies since they then
cannot be used from the outside to manage the memory allocation behavior
of a process.
If both are making changes then the outside controller may find that
the memory allocation policy suddenly changes, and an application already using
libnuma may experience unexpected changes in memory policy. However,
libnuma/numactl is used when memory areas are set up to define how the
system should treat these memory areas. The outside management always
works with the settings already established by the application and
modifies those. So in practice there will be little chance of
interference.
> There is also one confusion that I sometimes succumb to, reading these
> replies - between memory policies to control future allocations and
> memory policies to relocate already allocated memory.
>
> I think between the numa calls (mbind, set_mempolicy) and cpusets,
> we have a decent array of mechanisms to control future allocations.
> The full set of features required may not be complete, but the
> framework seems to be in place, and the majority of what features we
> will need are supported now.
Correct. We could implement the changing of policies via an extension of
the existing libnuma. That could be easily done as far as I can tell. If
that is done then the patch that I proposed is no longer necessary. But
then libnuma needs to also be extended to
1. Allow the discovery of the memory policies of each vma for each process
in a system. Otherwise intelligent decisions about page migration cannot
be made and we end up with the kernel guessing which vma's to migrate and
we cannot control migration of the text segments separately from the data
segment etc.
2. Add a function call to migrate pages in a particular vma to another
node. I.e.
sys_page_migrate(pid, address, from_node, to_node, nr_pages)
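A purely hypothetical userspace wrapper for such a call might look like the sketch below. No such system call exists at the time of this thread, so there is no real syscall number; the stub only spells out the proposed argument order and falls back to ENOSYS.

/*
 * Hypothetical sketch of the call suggested above. __NR_page_migrate is
 * not a real syscall number; unless one is provided, the stub reports
 * ENOSYS. Only the proposed argument order is illustrated.
 */
#include <errno.h>
#include <sys/types.h>
#include <sys/syscall.h>
#include <unistd.h>

static long page_migrate(pid_t pid, unsigned long address,
                         int from_node, int to_node, unsigned long nr_pages)
{
#ifdef __NR_page_migrate
    return syscall(__NR_page_migrate, pid, address,
                   from_node, to_node, nr_pages);
#else
    (void)pid; (void)address; (void)from_node;
    (void)to_node; (void)nr_pages;
    errno = ENOSYS;
    return -1;
#endif
}

int main(void)
{
    /* Example: move up to 1024 pages of the VMA at a made-up address in
     * process 4242 from node 1 to node 2. */
    return page_migrate(4242, 0x6000000000004000UL, 1, 2, 1024) < 0 ? 1 : 0;
}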
* Re: [NUMA] Display and modify the memory policy of a process through /proc/<pid>/numa_policy
2005-07-16 23:30 ` Paul Jackson
2005-07-17 1:55 ` Christoph Lameter
@ 2005-07-17 3:21 ` Christoph Lameter
2005-07-17 4:51 ` Paul Jackson
1 sibling, 1 reply; 35+ messages in thread
From: Christoph Lameter @ 2005-07-17 3:21 UTC (permalink / raw)
To: Paul Jackson; +Cc: Andi Kleen, kenneth.w.chen, linux-mm, linux-ia64
On Sat, 16 Jul 2005, Paul Jackson wrote:
> On the other hand, I hear him saying we can't do it, because the
> locking cannot be safely handled.
Here is one approach to locking using xchg. This is restricted only to the
policy fields on task_struct and vm_area_struct. One could also
synchronize by taking the alloc_lock in task_struct. I did not use xchg
during the population of vm_area_struct and task_struct and also not
during the destruction of these structures.
There may be additional races that need to be dealt with depending on
when the task struct and vm_area_struct become visible through the /proc
filesystem. However, these races are then general races affecting the use
of other fields in the /proc filesystem.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Index: linux-2.6.13-rc3/mm/mempolicy.c
===================================================================
--- linux-2.6.13-rc3.orig/mm/mempolicy.c 2005-07-16 20:07:04.000000000 -0700
+++ linux-2.6.13-rc3/mm/mempolicy.c 2005-07-16 20:07:06.000000000 -0700
@@ -349,7 +349,7 @@ check_range(struct mm_struct *mm, unsign
static int policy_vma(struct vm_area_struct *vma, struct mempolicy *new)
{
int err = 0;
- struct mempolicy *old = vma->vm_policy;
+ struct mempolicy *old;
PDprintk("vma %lx-%lx/%lx vm_ops %p vm_file %p set_policy %p\n",
vma->vm_start, vma->vm_end, vma->vm_pgoff,
@@ -360,7 +360,7 @@ static int policy_vma(struct vm_area_str
err = vma->vm_ops->set_policy(vma, new);
if (!err) {
mpol_get(new);
- vma->vm_policy = new;
+ old = xchg(&vma->vm_policy, new);
mpol_free(old);
}
return err;
@@ -451,8 +451,7 @@ asmlinkage long sys_set_mempolicy(int mo
new = mpol_new(mode, nodes);
if (IS_ERR(new))
return PTR_ERR(new);
- mpol_free(current->mempolicy);
- current->mempolicy = new;
+ mpol_free(xchg(&current->mempolicy, new));
if (new && new->policy == MPOL_INTERLEAVE)
current->il_next = find_first_bit(new->v.nodes, MAX_NUMNODES);
return 0;
Index: linux-2.6.13-rc3/kernel/exit.c
===================================================================
--- linux-2.6.13-rc3.orig/kernel/exit.c 2005-07-12 21:46:46.000000000 -0700
+++ linux-2.6.13-rc3/kernel/exit.c 2005-07-16 20:07:06.000000000 -0700
@@ -851,8 +851,7 @@ fastcall NORET_TYPE void do_exit(long co
tsk->exit_code = code;
exit_notify(tsk);
#ifdef CONFIG_NUMA
- mpol_free(tsk->mempolicy);
- tsk->mempolicy = NULL;
+ mpol_free(xchg(&tsk->mempolicy, NULL));
#endif
BUG_ON(!(current->flags & PF_DEAD));
Index: linux-2.6.13-rc3/include/linux/mm.h
===================================================================
--- linux-2.6.13-rc3.orig/include/linux/mm.h 2005-07-12 21:46:46.000000000 -0700
+++ linux-2.6.13-rc3/include/linux/mm.h 2005-07-16 20:07:06.000000000 -0700
@@ -107,7 +107,9 @@ struct vm_area_struct {
atomic_t vm_usage; /* refcount (VMAs shared if !MMU) */
#endif
#ifdef CONFIG_NUMA
- struct mempolicy *vm_policy; /* NUMA policy for the VMA */
+ struct mempolicy *vm_policy; /* NUMA policy for the VMA, may be updated only
+ * with xchg or cmpxchg
+ */
#endif
};
Index: linux-2.6.13-rc3/include/linux/sched.h
===================================================================
--- linux-2.6.13-rc3.orig/include/linux/sched.h 2005-07-16 19:54:14.000000000 -0700
+++ linux-2.6.13-rc3/include/linux/sched.h 2005-07-16 20:07:06.000000000 -0700
@@ -761,7 +761,10 @@ struct task_struct {
clock_t acct_stimexpd; /* clock_t-converted stime since last update */
#endif
#ifdef CONFIG_NUMA
- struct mempolicy *mempolicy;
+ struct mempolicy *mempolicy; /* Only update via xchg or cmpxchg because mempolicy
+ * may be changed from outside of the process
+ * context
+ */
short il_next;
#endif
#ifdef CONFIG_CPUSETS
Index: linux-2.6.13-rc3/fs/proc/task_mmu.c
===================================================================
--- linux-2.6.13-rc3.orig/fs/proc/task_mmu.c 2005-07-16 20:07:04.000000000 -0700
+++ linux-2.6.13-rc3/fs/proc/task_mmu.c 2005-07-16 20:08:49.000000000 -0700
@@ -357,7 +357,7 @@ static ssize_t numa_policy_write(struct
{
struct task_struct *task = proc_task(file->f_dentry->d_inode);
char buffer[MAX_MEMPOL_STRING_SIZE], *end;
- struct mempolicy *pol, *old_policy;
+ struct mempolicy *pol;
if (!capable(CAP_SYS_RESOURCE))
return -EPERM;
@@ -373,17 +373,10 @@ static ssize_t numa_policy_write(struct
if (*end == '\n')
end++;
- old_policy = task->mempolicy;
+ if (pol->policy == MPOL_DEFAULT)
+ pol = NULL;
-
- if (!mpol_equal(pol, old_policy)) {
- if (pol->policy == MPOL_DEFAULT)
- pol = NULL;
-
- task->mempolicy = pol;
- mpol_free(old_policy);
- } else
- mpol_free(pol);
+ mpol_free(xchg(&task->mempolicy, pol));
return end - buffer;
}
@@ -402,7 +395,7 @@ static ssize_t numa_vma_policy_write(str
unsigned long addr;
char buffer[MAX_MEMPOL_STRING_SIZE];
char *p, *end;
- struct mempolicy *pol, *old_policy;
+ struct mempolicy *pol;
if (!capable(CAP_SYS_RESOURCE))
return -EPERM;
@@ -426,16 +419,10 @@ static ssize_t numa_vma_policy_write(str
if (*end == '\n')
end++;
- old_policy = vma->vm_policy;
+ if (pol->policy == MPOL_DEFAULT)
+ pol = NULL;
- if (!mpol_equal(pol, old_policy)) {
- if (pol->policy == MPOL_DEFAULT)
- pol = NULL;
-
- vma->vm_policy = pol;
- mpol_free(old_policy);
- } else
- mpol_free(pol);
+ mpol_free(xchg(&vma->vm_policy, pol));
return end - buffer;
}
* Re: [NUMA] Display and modify the memory policy of a process through /proc/<pid>/numa_policy
2005-07-17 1:55 ` Christoph Lameter
@ 2005-07-17 3:50 ` Paul Jackson
2005-07-17 5:56 ` Christoph Lameter
0 siblings, 1 reply; 35+ messages in thread
From: Paul Jackson @ 2005-07-17 3:50 UTC (permalink / raw)
To: Christoph Lameter; +Cc: ak, kenneth.w.chen, linux-mm, linux-ia64
Christoph wrote:
> Correct. We could implement the changing of policies via an extension
> of the existing libnuma. That could be easily done as far as I can
> tell. If that is done then the patch that I proposed is no longer
> necessary. But then libnuma needs to also be extended to
>
> 1. Allow the discovery of the memory policies of each vma for each
> process
I'm missing something here. Are you saying that just a change to
libnuma would suffice to accomplish what you sought with this patch?
If that's the case, we don't need a kernel patch, right?
And despite Andi's urging us to only access these facilities via
libnuma, there is no law to that effect that I know of. At the least,
you could present user-level-only code that accomplishes the objective
of this patch set, with no kernel change.
I don't think that is possible, short of gross hackery on /dev/mem.
I think some sort of kernel change is required to enable one task to
change the numa policy of another task.
What the heck, over ??
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [NUMA] Display and modify the memory policy of a process through /proc/<pid>/numa_policy
2005-07-17 3:21 ` Christoph Lameter
@ 2005-07-17 4:51 ` Paul Jackson
2005-07-17 6:00 ` Christoph Lameter
0 siblings, 1 reply; 35+ messages in thread
From: Paul Jackson @ 2005-07-17 4:51 UTC (permalink / raw)
To: Christoph Lameter; +Cc: ak, kenneth.w.chen, linux-mm, linux-ia64
Christoph wrote:
> Here is one approach to locking using xchg.
What I see here doesn't change the behaviour of the
kernel any - just adds some locked exchanges, right?
I thought the hard part was having some other task
change the current tasks mempolicy. For example,
how does one task sync another tasks mempolicy up
with its cpuset, or synchronously get the policies
zonelist or preferred node set correctly?
I guess that this approach is intended to show how
to make it easy to add that hard part, right?
... whatever ... guess I'm still missing something ...
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [NUMA] Display and modify the memory policy of a process through /proc/<pid>/numa_policy
2005-07-17 3:50 ` Paul Jackson
@ 2005-07-17 5:56 ` Christoph Lameter
2005-07-17 7:22 ` Paul Jackson
0 siblings, 1 reply; 35+ messages in thread
From: Christoph Lameter @ 2005-07-17 5:56 UTC (permalink / raw)
To: Paul Jackson; +Cc: ak, kenneth.w.chen, linux-mm, linux-ia64
On Sat, 16 Jul 2005, Paul Jackson wrote:
> I'm missing something here. Are you saying that just a change to
> libnuma would suffice to accomplish what you sought with this patch?
It's a quite significant change, but yes, of course you can do that if you
really favor libnuma and IMHO want to make it difficult to maintain and to
use.
> If that's the case, we don't need a kernel patch, right?
Sure.
> And despite Andi's urging us to only access these facilities via
> libnuma, there is no law to that affect that I know of. At the least,
> you could present user level only code that accomplished the object
> of this patch set, with no kernel change.
Sure you can write a series of tools that accomplish the same.
> I don't think that is possible, short of gross hackery on /dev/mem.
> I think some sort of kernel change is required to enable one task to
> change the numa policy of another task.
Yes, doing what I said to libnuma would require a significant rework of the
APIs and the kernel libnuma stuff. It's easier to implement the whole thing
using /proc; then no libraries would need to be modified and no tools need to
be written. Just accept the patch that I posted, fix up whatever has to be
fixed, and we are done.
* Re: [NUMA] Display and modify the memory policy of a process through /proc/<pid>/numa_policy
2005-07-17 4:51 ` Paul Jackson
@ 2005-07-17 6:00 ` Christoph Lameter
2005-07-17 8:17 ` Paul Jackson
0 siblings, 1 reply; 35+ messages in thread
From: Christoph Lameter @ 2005-07-17 6:00 UTC (permalink / raw)
To: Paul Jackson; +Cc: ak, kenneth.w.chen, linux-mm, linux-ia64
On Sat, 16 Jul 2005, Paul Jackson wrote:
> Christoph wrote:
> > Here is one approach to locking using xchg.
>
> What I see here doesn't change the behaviour of the
> kernel any - just adds some locked exchanges, right?
Correct.
> I thought the hard part was having some other task
> change the current tasks mempolicy. For example,
> how does one task sync another tasks mempolicy up
> with its cpuset, or synchronously get the policies
> zonelist or preferred node set correctly?
Could you give me some more detail on how this should integrate with
cpusets? I am not aware of anything that I would call "hard".
What do you mean by synchronously? The proc changes make best-effort
modifications. There is no transactional behavior that allows changing
multiple items at once, nor is there any guarantee that the vma you are
changing is still there after you have read /proc/<pid>/numa_maps. Why
would such synchronicity be necessary?
> I guess that this approach is intended to show how
> to make it easy to add that hard part, right?
This is intended to provide race-free updates of the memory policy.
* Re: [NUMA] Display and modify the memory policy of a process through /proc/<pid>/numa_policy
2005-07-17 5:56 ` Christoph Lameter
@ 2005-07-17 7:22 ` Paul Jackson
0 siblings, 0 replies; 35+ messages in thread
From: Paul Jackson @ 2005-07-17 7:22 UTC (permalink / raw)
To: Christoph Lameter; +Cc: ak, kenneth.w.chen, linux-mm, linux-ia64
Christoph, responding to pj:
> > I'm missing something here. Are you saying that just a change to
> > libnuma would suffice to accomplish what you sought with this patch?
>
> Its a quite significant change but yes of course you can do that ...
I am totally stumped. I have no idea how to do what you have in mind.
The mbind, set_mempolicy and get_mempolicy system calls plainly and
simply apply only to the current task, and it would take changes in
kernel code and the system call API to change that fact in any
sensible way.
You've dropped one hint: it's a quite significant change.
If you have the patience, could you drop a couple more hints on how
to do this (make this change by just changing libnuma)? Perhaps with
a little more technical meat on their bones?
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [NUMA] Display and modify the memory policy of a process through /proc/<pid>/numa_policy
2005-07-17 6:00 ` Christoph Lameter
@ 2005-07-17 8:17 ` Paul Jackson
0 siblings, 0 replies; 35+ messages in thread
From: Paul Jackson @ 2005-07-17 8:17 UTC (permalink / raw)
To: Christoph Lameter; +Cc: ak, kenneth.w.chen, linux-mm, linux-ia64
Christoph wrote:
> Could you give me some more detail on how this should integrate with
> cpusets? I am not aware of any thing that I would call "hard".
I can't speak to how "hard" it is, but what I have in mind is the
following lines from the mm/mempolicy.c get_nodes() routine:
/* Update current mems_allowed */
cpuset_update_current_mems_allowed();
/* Ignore nodes not set in current->mems_allowed */
cpuset_restrict_to_mems_allowed(nodes);
These lines ensure that the current task's mems_allowed is up to date
with any constraints imposed by the task's cpuset, and then they
restrict the nodes to that mems_allowed.
Offhand, I do not know a safe way to update a task's mems_allowed
from its cpuset, except within the task's context. This is why
'mems_generation' and cpuset_update_current_mems_allowed() exist.
If you can find a way, more power to you. I could simplify the
cpuset mems_generation apparatus if I had such a way.
The above get_nodes() routine is called by mbind() and set_mempolicy(),
when passing in a list of memory nodes as part of a memory policy.
> What do you mean by synchronously?
Probably what Andi is referring to when he worries about locking.
If so, he certainly understands this better than I.
But for example, I notice that the check_range() routine is called
for mbind() requests. The check_range() code does a bunch of poking
around in the current task's vma structs. How do you propose to allow
a separate task to do this safely?
Also, there are several dereferences of the pointer 'current', and of
further mm and vma state referenced via current, to pick up various
attributes of the current task and its memory. Each one of these
has to be examined, I presume, in order to determine what accesses
can safely be done from an external task, and still obtain consistent
results.
> There is no transactional behavior that allows the changes of multiple
> items at once, nor is there any guarantee that the vma you are changing
> is still there after you have read /proc/<pid>/numa_maps. Why would
> such synchronicity be necessary?
I agree that such is neither possible, present, nor necessary.
I am worried about what happens within a single mbind or set_mempolicy
call attempted on an external task, not what happens between one such
call and the next.
Clearly the mm/mempolicy code for mbind and set_mempolicy was written
with the assumption that it applied to the current task, its mm
and vmas, and hence the current task was stuck inside this code.
A variety of task and memory state is read and written, without
need for much locking, because we are single threaded in the only
task that is allowed to modify this state. The author of this code
repeatedly expresses concerns that external modification will fail
due to locking issues.
To me, that means it will take, at best, a careful and detailed
analysis to have any hope of safe external modification of this state,
if it is possible at all.
This is why I suspect we need a way to plug in code that executes in
the context of a task, to apply externally determined changes to the
tasks memory layout.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401