Hi Michal, I've done the performance testing, please check it out.

>> Yes this is all understood but the level of the overhead is not really
>> clear. So the question is whether this will induce a visible overhead.
>> Because from the maintainability point of view it is much less costly to
>> have a clear life time model. Right now we have a mix of reference
>> counting and per-task requirements which is rather subtle and easy to
>> get wrong. In an ideal world we would have get_vma_policy always
>> returning a reference counted policy or NULL. If we really need to
>> optimize for cache line bouncing we can go with per cpu reference
>> counters (something that was not available at the time the mempolicy
>> code has been introduced).
>>
>> So I am not saying that the task_work based solution is not possible I
>> just think that this looks like a good opportunity to get from the
>> existing subtle model.

Test tools:
  numactl -m 0-3 ./run-mmtests.sh -n -c configs/config-workload-aim9-pagealloc test_name

Modification:
get_vma_policy() and get_task_policy() now always return a reference-counted
policy, except for the static policies (default_policy and
preferred_node_policy[nid]). All VMA manipulation is protected by a
down_read(), so mpol_get() can be called directly to take a refcount on the
mpol. There is no such lock in the task->mempolicy context, however, so
task->mempolicy must be protected by task_lock():

struct mempolicy *get_task_policy(struct task_struct *p)
{
	struct mempolicy *pol;
	int node;

	if (p->mempolicy) {
		task_lock(p);
		pol = p->mempolicy;
		mpol_get(pol);
		task_unlock(p);
		if (pol)
			return pol;
	}
	.....
}

Test Case 1:
Describe: Test directly, no other user processes.
Result: This degrades performance by about 1% to 3%.
For more information, please see the attachment: mpol.txt

aim9
Hmean     page_test    484561.68 (   0.00%)   471039.34 *  -2.79%*
Hmean     brk_test    1400702.48 (   0.00%)  1388949.10 *  -0.84%*
Hmean     exec_test      2339.45 (   0.00%)     2278.41 *  -2.61%*
Hmean     fork_test      6500.02 (   0.00%)     6500.17 *   0.00%*

Test Case 2:
Describe: Added a user process, top.
Result: This degrades performance by about 2.1%.
For more information, please see the attachment: mpol_top.txt

Hmean     page_test    477916.47 (   0.00%)   467829.01 *  -2.11%*
Hmean     brk_test    1351439.76 (   0.00%)  1373663.90 *   1.64%*
Hmean     exec_test      2312.24 (   0.00%)     2296.06 *  -0.70%*
Hmean     fork_test      6483.46 (   0.00%)     6472.06 *  -0.18%*

Test Case 3:
Describe: Added a daemon that reads /proc/$test_pid/status, which acquires
task_lock:
	while :; do cat /proc/$(pidof singleuser)/status; done
Result: The baseline itself degrades from 484561 (case 1) to 438591 (about
10%) once the daemon is running, and on top of that the degradation from
the patch in case 3 is about 3.2%.
For more information, please see the attachment: mpol_status.txt

Hmean     page_test    438591.97 (   0.00%)   424251.22 *  -3.27%*
Hmean     brk_test    1268906.57 (   0.00%)  1278100.12 *   0.72%*
Hmean     exec_test      2301.19 (   0.00%)     2192.71 *  -4.71%*
Hmean     fork_test      6453.24 (   0.00%)     6090.48 *  -5.62%*

Thanks,
Zhongkun.