* [PATCH 0/13] memory cgroup updates v4
@ 2008-09-22 10:51 KAMEZAWA Hiroyuki
2008-09-22 10:55 ` [PATCH 1/13] memcg: avoid accounting special mapping KAMEZAWA Hiroyuki
` (13 more replies)
0 siblings, 14 replies; 37+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-09-22 10:51 UTC (permalink / raw)
To: linux-mm; +Cc: balbir, nishimura, xemul, LKML
This is a series of patches for the memory resource controller.
Based on the Sep 18 mmotm. This passed some tests and seems to work well.
It consists of the following:
- fixes.
* fixes do_swap_page() handling.
- new feature
* "root" cgroup is treated as nolimit.
* implements account_move() and move account at force_empty rather than
forgeting all.
* atomic page_cgroup->flags.
* page_cgroup lookup system. (and page_cgroup.h is added.)
- optimization
* per-cpu statistics updates.
- remove page_cgroup pointer from struct page.
- lazy lru add/remove
Performance numbers are below (8-CPU Xeon, 64-bit); not so bad.
mmotm-2.6.27-rc6+ (2008/9/18 version)
==
Execl Throughput 2311.6 lps (29.9 secs, 3 samples)
C Compiler Throughput 1331.9 lpm (60.4 secs, 3 samples)
Shell Scripts (1 concurrent) 7500.7 lpm (60.0 secs, 3 samples)
Shell Scripts (8 concurrent) 3031.0 lpm (60.0 secs, 3 samples)
Shell Scripts (16 concurrent) 1729.7 lpm (60.0 secs, 3 samples)
Dc: sqrt(2) to 99 decimal places 99310.2 lpm (30.0 secs, 3 samples)
after all patches
==
Execl Throughput 2308.7 lps (29.9 secs, 3 samples)
C Compiler Throughput 1343.4 lpm (60.3 secs, 3 samples)
Shell Scripts (1 concurrent) 7451.7 lpm (60.0 secs, 3 samples)
Shell Scripts (8 concurrent) 3024.0 lpm (60.0 secs, 3 samples)
Shell Scripts (16 concurrent) 1752.0 lpm (60.0 secs, 3 samples)
Dc: sqrt(2) to 99 decimal places 99255.3 lpm (30.0 secs, 3 samples)
after all patches + padding added to make "struct page" 64 bytes ;)
==
Execl Throughput 2332.2 lps (29.9 secs, 3 samples)
C Compiler Throughput 1345.3 lpm (60.4 secs, 3 samples)
Shell Scripts (1 concurrent) 7564.3 lpm (60.0 secs, 3 samples)
Shell Scripts (8 concurrent) 3075.0 lpm (60.0 secs, 3 samples)
Shell Scripts (16 concurrent) 1755.0 lpm (60.0 secs, 3 samples)
Dc: sqrt(2) to 99 decimal places 99979.3 lpm (30.0 secs, 3 samples)
This patch set saves 8 bytes per struct page
(when CONFIG_CGROUP_MEM_RES_CTLR=y but memcg is disabled at boot).
On this 48GB machine, that is 48 * 1024 * 1024 * 1024 / 4096 * 8 bytes = 96MB.
Maybe good for distros and users who don't want memcg.
Extra memory usage by this hash routine is 16KB for the root hash table on this
machine, plus 20 bytes per 128MB of memory (and some per-cpu area).
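As a rough worked example from the figures above (just arithmetic, not a
measurement): 48GB / 128MB = 384 chunks, and 384 * 20 bytes is about 7.5KB of
per-chunk entries, so the lookup structures cost roughly 16KB + 7.5KB = 24KB
here, against the 96MB saved in struct page.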
IMHO, because the "lookup_page_cgroup" interface is added, updates/optimizations
of the internal lookup method can be done later. (We can do some.)
So I don't want to discuss special-lookup-support-for-some-memory-model
or which-is-the-quicker-lookup-method etc. too much here.
Brief description.
1/13 .... special mapping fix. (NEW)
=> avoid accounting pages not on LRU...which we cannot reclaim.
2/13 .... account swap-cache under lock.
=> move accounting of swap-cache under lock for avoiding unnecessary race.
3/13 .... make the root cgroup unlimited.
=> fix the root cgroup's memory limit to be unlimited.
4/13 .... atomic flags for page_cgroup
=> make page_cgroup->flags atomic.
5/13 .... implement the move_account function.
=> add a function for moving a page_cgroup's accounting to another cgroup.
6/13 ... force_empty to migrate accounting
=> move all accounting to the root cgroup rather than forgetting it all.
7/13 ... make mapping NULL (clean up)
=> ensure page->mapping is NULL before calling mem_cgroup_uncharge_cache_page().
8/13 ...optimize cpustat
=> optimize access to per-cpu statistics for memcg.
9/13 ...lookup page_cgroup (CHANGED)
=> preallocate all page_cgroup at boot and remove page->page_cgroup pointer.
10/13...page_cgroup lookaside buffer
=> helps looking up page_cgroup from page.
11/13...lazy lru freeing of page_cgroup (NEW)
=> do removal from the LRU in a batched manner, like pagevec.
12/13...lazy lru add of page_cgroup (NEW)
=> do addition to the LRU in a batched manner, like pagevec.
13/13...swap accounting fix. (NEW)
=> fix a race in swap accounting (which can happen)
by introducing a new precharge/commit/cancel protocol (a conceptual sketch
of this kind of interface follows this list).
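The precharge/commit/cancel flow itself is in patch 13/13 and is not quoted
here; the toy userspace sketch below only illustrates the general idea, with
made-up names (toy_precharge/toy_commit/toy_cancel) that are NOT the interface
added by the patch: reserve the charge first, then either commit it once the
page is really in place, or cancel it on failure, so racy paths never have to
uncharge a live page.

#include <stdbool.h>
#include <stdio.h>

struct toy_counter { long usage, limit; };

static bool toy_precharge(struct toy_counter *c)	/* reserve one charge */
{
	if (c->usage + 1 > c->limit)
		return false;
	c->usage++;
	return true;
}

static void toy_commit(struct toy_counter *c)		/* keep the reservation */
{
	(void)c;	/* nothing to undo; the charge stays */
}

static void toy_cancel(struct toy_counter *c)		/* undo the reservation */
{
	c->usage--;
}

int main(void)
{
	struct toy_counter c = { .usage = 0, .limit = 1 };
	bool raced = false;	/* pretend nobody else charged the page */

	if (!toy_precharge(&c))
		return 1;	/* the kernel path would return VM_FAULT_OOM */
	if (!raced)
		toy_commit(&c);	/* the page is really ours: keep the charge */
	else
		toy_cancel(&c);	/* lost a race: give the charge back */

	printf("usage=%ld\n", c.usage);
	return 0;
}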
Some patches are big but not complicated, I think.
Patch 13/13 introduces a brand-new concept.
So please review and tell me your opinion.
Thanks,
-Kame
* [PATCH 1/13] memcg: avoid accounting special mapping
2008-09-22 10:51 [PATCH 0/13] memory cgroup updates v4 KAMEZAWA Hiroyuki
@ 2008-09-22 10:55 ` KAMEZAWA Hiroyuki
2008-09-22 10:57 ` [PATCH 2/13] memcg: account fault-in swap under lock KAMEZAWA Hiroyuki
` (12 subsequent siblings)
13 siblings, 0 replies; 37+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-09-22 10:55 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura, xemul, LKML
There are pages which can be mapped but are never put on the LRU, and they are
not worth accounting (because we can't shrink them, and handling that special
case would need dirty code). We don't want to account pages that are out of
the VM's control.
When special_mapping_fault() is called, page->mapping tends to be NULL and the
page gets charged as an anonymous page. So avoid accounting it in __do_fault();
we can detect this case by checking the "anon" variable.
insert_page() also handles some special pages from drivers.
Changelog:
- new patch.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
mm/memory.c | 18 ++++++------------
mm/rmap.c | 4 ++--
2 files changed, 8 insertions(+), 14 deletions(-)
Index: mmotm-2.6.27-rc6+/mm/memory.c
===================================================================
--- mmotm-2.6.27-rc6+.orig/mm/memory.c
+++ mmotm-2.6.27-rc6+/mm/memory.c
@@ -1323,18 +1323,14 @@ static int insert_page(struct vm_area_st
pte_t *pte;
spinlock_t *ptl;
- retval = mem_cgroup_charge(page, mm, GFP_KERNEL);
- if (retval)
- goto out;
-
retval = -EINVAL;
if (PageAnon(page))
- goto out_uncharge;
+ goto out;
retval = -ENOMEM;
flush_dcache_page(page);
pte = get_locked_pte(mm, addr, &ptl);
if (!pte)
- goto out_uncharge;
+ goto out;
retval = -EBUSY;
if (!pte_none(*pte))
goto out_unlock;
@@ -1350,8 +1346,6 @@ static int insert_page(struct vm_area_st
return retval;
out_unlock:
pte_unmap_unlock(pte, ptl);
-out_uncharge:
- mem_cgroup_uncharge_page(page);
out:
return retval;
}
@@ -2542,7 +2536,7 @@ static int __do_fault(struct mm_struct *
}
- if (mem_cgroup_charge(page, mm, GFP_KERNEL)) {
+ if (anon && mem_cgroup_charge(page, mm, GFP_KERNEL)) {
ret = VM_FAULT_OOM;
goto out;
}
@@ -2584,10 +2578,10 @@ static int __do_fault(struct mm_struct *
/* no need to invalidate: a not-present page won't be cached */
update_mmu_cache(vma, address, entry);
} else {
- mem_cgroup_uncharge_page(page);
- if (anon)
+ if (anon) {
+ mem_cgroup_uncharge_page(page);
page_cache_release(page);
- else
+ } else
anon = 1; /* no anon but release faulted_page */
}
Index: mmotm-2.6.27-rc6+/mm/rmap.c
===================================================================
--- mmotm-2.6.27-rc6+.orig/mm/rmap.c
+++ mmotm-2.6.27-rc6+/mm/rmap.c
@@ -725,8 +725,8 @@ void page_remove_rmap(struct page *page,
page_clear_dirty(page);
set_page_dirty(page);
}
-
- mem_cgroup_uncharge_page(page);
+ if (PageAnon(page))
+ mem_cgroup_uncharge_page(page);
__dec_zone_page_state(page,
PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED);
/*
* [PATCH 2/13] memcg: account fault-in swap under lock.
2008-09-22 10:51 [PATCH 0/13] memory cgroup updates v4 KAMEZAWA Hiroyuki
2008-09-22 10:55 ` [PATCH 1/13] memcg: avoid accounting special mapping KAMEZAWA Hiroyuki
@ 2008-09-22 10:57 ` KAMEZAWA Hiroyuki
2008-09-22 10:58 ` [PATCH 3/13] memcg: nolimit root cgroup KAMEZAWA Hiroyuki
` (11 subsequent siblings)
13 siblings, 0 replies; 37+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-09-22 10:57 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura, xemul, LKML
While page cache charge/uncharge is done under lock_page(), swap cache
isn't. (An anonymous page is charged when it is newly allocated.)
This patch moves do_swap_page()'s charge call under the page lock.
This is good for avoiding an unnecessary slow path in charge().
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
mm/memory.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
Index: mmotm-2.6.27-rc6+/mm/memory.c
===================================================================
--- mmotm-2.6.27-rc6+.orig/mm/memory.c
+++ mmotm-2.6.27-rc6+/mm/memory.c
@@ -2320,15 +2320,14 @@ static int do_swap_page(struct mm_struct
count_vm_event(PGMAJFAULT);
}
+ lock_page(page);
+ delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+
if (mem_cgroup_charge(page, mm, GFP_KERNEL)) {
- delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
ret = VM_FAULT_OOM;
goto out;
}
-
mark_page_accessed(page);
- lock_page(page);
- delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
/*
* Back out if somebody else already faulted in this pte.
* [PATCH 3/13] memcg: nolimit root cgroup
2008-09-22 10:51 [PATCH 0/13] memory cgroup updates v4 KAMEZAWA Hiroyuki
2008-09-22 10:55 ` [PATCH 1/13] memcg: avoid accounting special mapping KAMEZAWA Hiroyuki
2008-09-22 10:57 ` [PATCH 2/13] memcg: account fault-in swap under lock KAMEZAWA Hiroyuki
@ 2008-09-22 10:58 ` KAMEZAWA Hiroyuki
2008-09-22 11:00 ` [PATCH 4/13] memcg: force_empty moving account KAMEZAWA Hiroyuki
` (10 subsequent siblings)
13 siblings, 0 replies; 37+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-09-22 10:58 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura, xemul, LKML
Make the root cgroup of the memory resource controller have no limit.
With this, users cannot set a limit on the root group (writing to its
memory.limit_in_bytes returns -EINVAL); the root cgroup becomes a kind of
trash can.
To account pages which have no owner, such as those created by force_empty,
we need some cgroup with no limit. A patch rewriting force_empty will
follow this one.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Index: mmotm-2.6.27-rc6+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc6+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc6+/mm/memcontrol.c
@@ -136,6 +136,9 @@ struct mem_cgroup {
};
static struct mem_cgroup init_mem_cgroup;
+#define is_root_cgroup(cgrp) ((cgrp) == &init_mem_cgroup)
+
+
/*
* We use the lower bit of the page->page_cgroup pointer as a bit spin
* lock. We need to ensure that page->page_cgroup is at least two
@@ -944,6 +947,10 @@ static int mem_cgroup_write(struct cgrou
switch (cft->private) {
case RES_LIMIT:
+ if (is_root_cgroup(memcg)) {
+ ret = -EINVAL;
+ break;
+ }
/* This function does all necessary parse...reuse it */
ret = res_counter_memparse_write_strategy(buffer, &val);
if (!ret)
Index: mmotm-2.6.27-rc6+/Documentation/controllers/memory.txt
===================================================================
--- mmotm-2.6.27-rc6+.orig/Documentation/controllers/memory.txt
+++ mmotm-2.6.27-rc6+/Documentation/controllers/memory.txt
@@ -121,6 +121,9 @@ The corresponding routines that remove a
a page from Page Cache is used to decrement the accounting counters of the
cgroup.
+The root cgroup cannot have its limit set, but its usage is accounted.
+To control memory usage, you need to create a cgroup.
+
2.3 Shared Page Accounting
Shared pages are accounted on the basis of the first touch approach. The
@@ -172,6 +175,7 @@ We can alter the memory limit:
NOTE: We can use a suffix (k, K, m, M, g or G) to indicate values in kilo,
mega or gigabytes.
+Note: the root cgroup cannot have its limit set.
# cat /cgroups/0/memory.limit_in_bytes
4194304
* [PATCH 4/13] memcg: force_empty moving account
2008-09-22 10:51 [PATCH 0/13] memory cgroup updates v4 KAMEZAWA Hiroyuki
` (2 preceding siblings ...)
2008-09-22 10:58 ` [PATCH 3/13] memcg: nolimit root cgroup KAMEZAWA Hiroyuki
@ 2008-09-22 11:00 ` KAMEZAWA Hiroyuki
2008-09-22 14:23 ` Peter Zijlstra
2008-09-22 14:50 ` kamezawa.hiroyu
2008-09-22 11:02 ` [PATCH 5/13] memcg: cleanup to make mapping null before uncharge KAMEZAWA Hiroyuki
` (9 subsequent siblings)
13 siblings, 2 replies; 37+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-09-22 11:00 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura, xemul, LKML
The current force_empty of the memory resource controller just removes the
page_cgroup.
This means the page is no longer accounted at all, creating an in-use page
which has no page_cgroup.
This patch instead moves the accounting to the "root" cgroup. With this patch,
force_empty doesn't drop the accounting but moves it to the "root" cgroup.
One could think of other enhancements, such as moving the accounting to the
parent.
(But moving to the parent means we have to handle the parent's page "limit";
that needs more complicated work.)
For now, just move the accounting to the root cgroup.
Note: every lock other than the old mem_cgroup's lru_lock
in this path is taken with trylock.
Changelog (v3) -> (v4)
- no changes
Changelog (v2) -> (v3)
- split out mem_cgroup_move_account().
- replaced get_page() with get_page_unless_zero().
(This is necessary to avoid conflicts with migration.)
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Documentation/controllers/memory.txt | 7 ++--
mm/memcontrol.c | 51 +++++++++++++++++++++--------------
2 files changed, 35 insertions(+), 23 deletions(-)
Index: mmotm-2.6.27-rc6+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc6+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc6+/mm/memcontrol.c
@@ -29,6 +29,7 @@
#include <linux/slab.h>
#include <linux/swap.h>
#include <linux/spinlock.h>
+#include <linux/pagemap.h>
#include <linux/fs.h>
#include <linux/seq_file.h>
#include <linux/vmalloc.h>
@@ -976,17 +977,14 @@ int mem_cgroup_resize_limit(struct mem_c
/*
- * This routine traverse page_cgroup in given list and drop them all.
- * *And* this routine doesn't reclaim page itself, just removes page_cgroup.
+ * This routine traverse page_cgroup in given list and move them all.
*/
-#define FORCE_UNCHARGE_BATCH (128)
static void mem_cgroup_force_empty_list(struct mem_cgroup *mem,
struct mem_cgroup_per_zone *mz,
enum lru_list lru)
{
struct page_cgroup *pc;
struct page *page;
- int count = FORCE_UNCHARGE_BATCH;
unsigned long flags;
struct list_head *list;
@@ -996,23 +994,36 @@ static void mem_cgroup_force_empty_list(
while (!list_empty(list)) {
pc = list_entry(list->prev, struct page_cgroup, lru);
page = pc->page;
- get_page(page);
- spin_unlock_irqrestore(&mz->lru_lock, flags);
- /*
- * Check if this page is on LRU. !LRU page can be found
- * if it's under page migration.
- */
- if (PageLRU(page)) {
- __mem_cgroup_uncharge_common(page,
- MEM_CGROUP_CHARGE_TYPE_FORCE);
+ /* For avoiding race with speculative page cache handling. */
+ if (!PageLRU(page) || !get_page_unless_zero(page)) {
+ list_move(&pc->lru, list);
+ spin_unlock_irqrestore(&mz->lru_lock, flags);
+ yield();
+ spin_lock_irqsave(&mz->lru_lock, flags);
+ continue;
+ }
+ if (!trylock_page(page)) {
+ list_move(&pc->lru, list);
put_page(page);
- if (--count <= 0) {
- count = FORCE_UNCHARGE_BATCH;
- cond_resched();
- }
- } else
- cond_resched();
- spin_lock_irqsave(&mz->lru_lock, flags);
+ spin_unlock_irqrestore(&mz->lru_lock, flags);
+ yield();
+ spin_lock_irqsave(&mz->lru_lock, flags);
+ continue;
+ }
+ if (mem_cgroup_move_account(page, pc, mem, &init_mem_cgroup)) {
+ /* some confliction */
+ list_move(&pc->lru, list);
+ unlock_page(page);
+ put_page(page);
+ spin_unlock_irqrestore(&mz->lru_lock, flags);
+ yield();
+ spin_lock_irqsave(&mz->lru_lock, flags);
+ } else {
+ unlock_page(page);
+ put_page(page);
+ }
+ if (atomic_read(&mem->css.cgroup->count) > 0)
+ break;
}
spin_unlock_irqrestore(&mz->lru_lock, flags);
}
Index: mmotm-2.6.27-rc6+/Documentation/controllers/memory.txt
===================================================================
--- mmotm-2.6.27-rc6+.orig/Documentation/controllers/memory.txt
+++ mmotm-2.6.27-rc6+/Documentation/controllers/memory.txt
@@ -207,7 +207,8 @@ The memory.force_empty gives an interfac
# echo 1 > memory.force_empty
-will drop all charges in cgroup. Currently, this is maintained for test.
+will move all charges to the root cgroup.
+(This policy may be modified in the future.)
4. Testing
@@ -238,8 +239,8 @@ reclaimed.
A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a
cgroup might have some charge associated with it, even though all
-tasks have migrated away from it. Such charges are automatically dropped at
-rmdir() if there are no tasks.
+tasks have migrated away from it. Such charges are automatically moved to
+the root cgroup at rmdir() if there are no tasks. (This policy may be changed.)
5. TODO
* [PATCH 5/13] memcg: cleanup to make mapping null before uncharge
2008-09-22 10:51 [PATCH 0/13] memory cgroup updates v4 KAMEZAWA Hiroyuki
` (3 preceding siblings ...)
2008-09-22 11:00 ` [PATCH 4/13] memcg: force_empty moving account KAMEZAWA Hiroyuki
@ 2008-09-22 11:02 ` KAMEZAWA Hiroyuki
2008-09-22 11:03 ` [PATCH 6/13] memcg: optimize per cpu accounting for memcg KAMEZAWA Hiroyuki
` (8 subsequent siblings)
13 siblings, 0 replies; 37+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-09-22 11:02 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura, xemul, LKML
This patch tries to make page->mapping NULL before
mem_cgroup_uncharge_cache_page() is called.
"page->mapping == NULL" is a good check for "whether the page is still in the
radix-tree or not".
This patch also adds a VM_BUG_ON() to mem_cgroup_uncharge_cache_page().
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
mm/filemap.c | 2 +-
mm/memcontrol.c | 1 +
mm/migrate.c | 12 +++++++++---
3 files changed, 11 insertions(+), 4 deletions(-)
Index: mmotm-2.6.27-rc6+/mm/filemap.c
===================================================================
--- mmotm-2.6.27-rc6+.orig/mm/filemap.c
+++ mmotm-2.6.27-rc6+/mm/filemap.c
@@ -116,12 +116,12 @@ void __remove_from_page_cache(struct pag
{
struct address_space *mapping = page->mapping;
- mem_cgroup_uncharge_cache_page(page);
radix_tree_delete(&mapping->page_tree, page->index);
page->mapping = NULL;
mapping->nrpages--;
__dec_zone_page_state(page, NR_FILE_PAGES);
BUG_ON(page_mapped(page));
+ mem_cgroup_uncharge_cache_page(page);
/*
* Some filesystems seem to re-dirty the page even after
Index: mmotm-2.6.27-rc6+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc6+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc6+/mm/memcontrol.c
@@ -859,6 +859,7 @@ void mem_cgroup_uncharge_page(struct pag
void mem_cgroup_uncharge_cache_page(struct page *page)
{
VM_BUG_ON(page_mapped(page));
+ VM_BUG_ON(page->mapping);
__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
}
Index: mmotm-2.6.27-rc6+/mm/migrate.c
===================================================================
--- mmotm-2.6.27-rc6+.orig/mm/migrate.c
+++ mmotm-2.6.27-rc6+/mm/migrate.c
@@ -330,8 +330,6 @@ static int migrate_page_move_mapping(str
__inc_zone_page_state(newpage, NR_FILE_PAGES);
spin_unlock_irq(&mapping->tree_lock);
- if (!PageSwapCache(newpage))
- mem_cgroup_uncharge_cache_page(page);
return 0;
}
@@ -378,7 +376,15 @@ static void migrate_page_copy(struct pag
#endif
ClearPagePrivate(page);
set_page_private(page, 0);
- page->mapping = NULL;
+ /* page->mapping contains a flag for PageAnon() */
+ if (PageAnon(page)) {
+ /* This page is uncharged at try_to_unmap(). */
+ page->mapping = NULL;
+ } else {
+ /* Obsolete file cache should be uncharged */
+ page->mapping = NULL;
+ mem_cgroup_uncharge_cache_page(page);
+ }
/*
* If any waiters have accumulated on the new page then
* [PATCH 6/13] memcg: optimize per cpu accounting for memcg
2008-09-22 10:51 [PATCH 0/13] memory cgroup updates v4 KAMEZAWA Hiroyuki
` (4 preceding siblings ...)
2008-09-22 11:02 ` [PATCH 5/13] memcg: cleanup to make mapping null before uncharge KAMEZAWA Hiroyuki
@ 2008-09-22 11:03 ` KAMEZAWA Hiroyuki
2008-09-22 11:05 ` [PATCH 3.5/13] memcg: make page_cgroup flags to be atomic KAMEZAWA Hiroyuki
` (7 subsequent siblings)
13 siblings, 0 replies; 37+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-09-22 11:03 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura, xemul, LKML
Some obvious optimizations to memcg.
I found mem_cgroup_charge_statistics() is a little big (in object code) and
does unnecessary address calculation.
This patch is an optimization to reduce the size of this function.
Also, res_counter_charge() is 'likely' to succeed.
Changelog v3->v4:
- merged with another leaf patch.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
mm/memcontrol.c | 18 ++++++++++--------
1 file changed, 10 insertions(+), 8 deletions(-)
Index: mmotm-2.6.27-rc6+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc6+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc6+/mm/memcontrol.c
@@ -67,11 +67,10 @@ struct mem_cgroup_stat {
/*
* For accounting under irq disable, no need for increment preempt count.
*/
-static void __mem_cgroup_stat_add_safe(struct mem_cgroup_stat *stat,
+static inline void __mem_cgroup_stat_add_safe(struct mem_cgroup_stat_cpu *stat,
enum mem_cgroup_stat_index idx, int val)
{
- int cpu = smp_processor_id();
- stat->cpustat[cpu].count[idx] += val;
+ stat->count[idx] += val;
}
static s64 mem_cgroup_read_stat(struct mem_cgroup_stat *stat,
@@ -238,18 +237,21 @@ static void mem_cgroup_charge_statistics
{
int val = (charge)? 1 : -1;
struct mem_cgroup_stat *stat = &mem->stat;
+ struct mem_cgroup_stat_cpu *cpustat;
VM_BUG_ON(!irqs_disabled());
+
+ cpustat = &stat->cpustat[smp_processor_id()];
if (PageCgroupCache(pc))
- __mem_cgroup_stat_add_safe(stat, MEM_CGROUP_STAT_CACHE, val);
+ __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_CACHE, val);
else
- __mem_cgroup_stat_add_safe(stat, MEM_CGROUP_STAT_RSS, val);
+ __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_RSS, val);
if (charge)
- __mem_cgroup_stat_add_safe(stat,
+ __mem_cgroup_stat_add_safe(cpustat,
MEM_CGROUP_STAT_PGPGIN_COUNT, 1);
else
- __mem_cgroup_stat_add_safe(stat,
+ __mem_cgroup_stat_add_safe(cpustat,
MEM_CGROUP_STAT_PGPGOUT_COUNT, 1);
}
@@ -681,7 +683,7 @@ static int mem_cgroup_charge_common(stru
css_get(&memcg->css);
}
- while (res_counter_charge(&mem->res, PAGE_SIZE)) {
+ while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE))) {
if (!(gfp_mask & __GFP_WAIT))
goto out;
* [PATCH 3.5/13] memcg: make page_cgroup flags to be atomic
2008-09-22 10:51 [PATCH 0/13] memory cgroup updates v4 KAMEZAWA Hiroyuki
` (5 preceding siblings ...)
2008-09-22 11:03 ` [PATCH 6/13] memcg: optimize per cpu accounting for memcg KAMEZAWA Hiroyuki
@ 2008-09-22 11:05 ` KAMEZAWA Hiroyuki
2008-09-22 11:09 ` [PATCH 3.6/13] memcg: add function to move account KAMEZAWA Hiroyuki
` (6 subsequent siblings)
13 siblings, 0 replies; 37+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-09-22 11:05 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura, xemul, LKML
Sorry, this patch comes after "3".
==
This patch makes page_cgroup->flags use atomic bit operations and defines
functions (and macros) to access it.
This patch itself makes memcg slower, but its final purpose is
to remove lock_page_cgroup() and allow fast access to page_cgroup.
(Total performance will increase after all the patches are applied.)
Before modifying the memory resource controller further, these atomic
operations on the flags are necessary. Most of the flags in this patch are for
the LRU and are modified under mz->lru_lock, but we'll soon add other flags
which are not for the LRU, so we use the atomic version here.
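For reference, the first instantiation below, TESTPCGFLAG(Cache, CACHE),
expands to an accessor like this (SETPCGFLAG/CLEARPCGFLAG generate the
matching set_bit()/clear_bit() variants):

static inline int PageCgroupCache(struct page_cgroup *pc)
{
	return test_bit(PCG_CACHE, &pc->flags);
}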
Changelog: (v3) -> (v4)
- removed unused operations.
- adjusted to new ctype MEM_CGROUP_CHARGE_TYPE_SHMEM
Changelog: (v2) -> (v3)
- renamed macros and flags to be longer name.
- added comments.
- added "default bit set" for File, Shmem, Anon.
Changelog: (preview) -> (v1):
- patch ordering is changed.
- Added macro for defining functions for Test/Set/Clear bit.
- made the names of flags shorter.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
mm/memcontrol.c | 122 +++++++++++++++++++++++++++++++++++++-------------------
1 file changed, 82 insertions(+), 40 deletions(-)
Index: mmotm-2.6.27-rc6+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc6+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc6+/mm/memcontrol.c
@@ -161,12 +161,46 @@ struct page_cgroup {
struct list_head lru; /* per cgroup LRU list */
struct page *page;
struct mem_cgroup *mem_cgroup;
- int flags;
+ unsigned long flags;
};
-#define PAGE_CGROUP_FLAG_CACHE (0x1) /* charged as cache */
-#define PAGE_CGROUP_FLAG_ACTIVE (0x2) /* page is active in this cgroup */
-#define PAGE_CGROUP_FLAG_FILE (0x4) /* page is file system backed */
-#define PAGE_CGROUP_FLAG_UNEVICTABLE (0x8) /* page is unevictableable */
+
+enum {
+ /* flags for mem_cgroup */
+ PCG_CACHE, /* charged as cache */
+ /* flags for LRU placement */
+ PCG_ACTIVE, /* page is active in this cgroup */
+ PCG_FILE, /* page is file system backed */
+ PCG_UNEVICTABLE, /* page is unevictableable */
+};
+
+#define TESTPCGFLAG(uname, lname) \
+static inline int PageCgroup##uname(struct page_cgroup *pc) \
+ { return test_bit(PCG_##lname, &pc->flags); }
+
+#define SETPCGFLAG(uname, lname) \
+static inline void SetPageCgroup##uname(struct page_cgroup *pc)\
+ { set_bit(PCG_##lname, &pc->flags); }
+
+#define CLEARPCGFLAG(uname, lname) \
+static inline void ClearPageCgroup##uname(struct page_cgroup *pc) \
+ { clear_bit(PCG_##lname, &pc->flags); }
+
+
+/* Cache flag is set only once (at allocation) */
+TESTPCGFLAG(Cache, CACHE)
+
+/* LRU management flags (from global-lru definition) */
+TESTPCGFLAG(File, FILE)
+SETPCGFLAG(File, FILE)
+CLEARPCGFLAG(File, FILE)
+
+TESTPCGFLAG(Active, ACTIVE)
+SETPCGFLAG(Active, ACTIVE)
+CLEARPCGFLAG(Active, ACTIVE)
+
+TESTPCGFLAG(Unevictable, UNEVICTABLE)
+SETPCGFLAG(Unevictable, UNEVICTABLE)
+CLEARPCGFLAG(Unevictable, UNEVICTABLE)
static int page_cgroup_nid(struct page_cgroup *pc)
{
@@ -181,21 +215,31 @@ static enum zone_type page_cgroup_zid(st
enum charge_type {
MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
MEM_CGROUP_CHARGE_TYPE_MAPPED,
- MEM_CGROUP_CHARGE_TYPE_FORCE, /* used by force_empty */
MEM_CGROUP_CHARGE_TYPE_SHMEM, /* used by page migration of shmem */
+ MEM_CGROUP_CHARGE_TYPE_FORCE, /* used by force_empty */
+ NR_CHARGE_TYPE,
+};
+
+static const unsigned long
+pcg_default_flags[NR_CHARGE_TYPE] = {
+ ((1 << PCG_CACHE) | (1 << PCG_FILE)),
+ ((1 << PCG_ACTIVE)),
+ ((1 << PCG_ACTIVE) | (1 << PCG_CACHE)),
+ 0,
};
/*
* Always modified under lru lock. Then, not necessary to preempt_disable()
*/
-static void mem_cgroup_charge_statistics(struct mem_cgroup *mem, int flags,
- bool charge)
+static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
+ struct page_cgroup *pc,
+ bool charge)
{
int val = (charge)? 1 : -1;
struct mem_cgroup_stat *stat = &mem->stat;
VM_BUG_ON(!irqs_disabled());
- if (flags & PAGE_CGROUP_FLAG_CACHE)
+ if (PageCgroupCache(pc))
__mem_cgroup_stat_add_safe(stat, MEM_CGROUP_STAT_CACHE, val);
else
__mem_cgroup_stat_add_safe(stat, MEM_CGROUP_STAT_RSS, val);
@@ -296,18 +340,18 @@ static void __mem_cgroup_remove_list(str
{
int lru = LRU_BASE;
- if (pc->flags & PAGE_CGROUP_FLAG_UNEVICTABLE)
+ if (PageCgroupUnevictable(pc))
lru = LRU_UNEVICTABLE;
else {
- if (pc->flags & PAGE_CGROUP_FLAG_ACTIVE)
+ if (PageCgroupActive(pc))
lru += LRU_ACTIVE;
- if (pc->flags & PAGE_CGROUP_FLAG_FILE)
+ if (PageCgroupFile(pc))
lru += LRU_FILE;
}
MEM_CGROUP_ZSTAT(mz, lru) -= 1;
- mem_cgroup_charge_statistics(pc->mem_cgroup, pc->flags, false);
+ mem_cgroup_charge_statistics(pc->mem_cgroup, pc, false);
list_del(&pc->lru);
}
@@ -316,27 +360,27 @@ static void __mem_cgroup_add_list(struct
{
int lru = LRU_BASE;
- if (pc->flags & PAGE_CGROUP_FLAG_UNEVICTABLE)
+ if (PageCgroupUnevictable(pc))
lru = LRU_UNEVICTABLE;
else {
- if (pc->flags & PAGE_CGROUP_FLAG_ACTIVE)
+ if (PageCgroupActive(pc))
lru += LRU_ACTIVE;
- if (pc->flags & PAGE_CGROUP_FLAG_FILE)
+ if (PageCgroupFile(pc))
lru += LRU_FILE;
}
MEM_CGROUP_ZSTAT(mz, lru) += 1;
list_add(&pc->lru, &mz->lists[lru]);
- mem_cgroup_charge_statistics(pc->mem_cgroup, pc->flags, true);
+ mem_cgroup_charge_statistics(pc->mem_cgroup, pc, true);
}
static void __mem_cgroup_move_lists(struct page_cgroup *pc, enum lru_list lru)
{
struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
- int active = pc->flags & PAGE_CGROUP_FLAG_ACTIVE;
- int file = pc->flags & PAGE_CGROUP_FLAG_FILE;
- int unevictable = pc->flags & PAGE_CGROUP_FLAG_UNEVICTABLE;
+ int active = PageCgroupActive(pc);
+ int file = PageCgroupFile(pc);
+ int unevictable = PageCgroupUnevictable(pc);
enum lru_list from = unevictable ? LRU_UNEVICTABLE :
(LRU_FILE * !!file + !!active);
@@ -344,16 +388,20 @@ static void __mem_cgroup_move_lists(stru
return;
MEM_CGROUP_ZSTAT(mz, from) -= 1;
-
+ /*
+ * However this is done under mz->lru_lock, another flags, which
+ * are not related to LRU, will be modified from out-of-lock.
+ * We have to use atomic set/clear flags.
+ */
if (is_unevictable_lru(lru)) {
- pc->flags &= ~PAGE_CGROUP_FLAG_ACTIVE;
- pc->flags |= PAGE_CGROUP_FLAG_UNEVICTABLE;
+ ClearPageCgroupActive(pc);
+ SetPageCgroupUnevictable(pc);
} else {
if (is_active_lru(lru))
- pc->flags |= PAGE_CGROUP_FLAG_ACTIVE;
+ SetPageCgroupActive(pc);
else
- pc->flags &= ~PAGE_CGROUP_FLAG_ACTIVE;
- pc->flags &= ~PAGE_CGROUP_FLAG_UNEVICTABLE;
+ ClearPageCgroupActive(pc);
+ ClearPageCgroupUnevictable(pc);
}
MEM_CGROUP_ZSTAT(mz, lru) += 1;
@@ -590,16 +638,7 @@ static int mem_cgroup_charge_common(stru
* If a page is accounted as a page cache, insert to inactive list.
* If anon, insert to active list.
*/
- if (ctype == MEM_CGROUP_CHARGE_TYPE_CACHE) {
- pc->flags = PAGE_CGROUP_FLAG_CACHE;
- if (page_is_file_cache(page))
- pc->flags |= PAGE_CGROUP_FLAG_FILE;
- else
- pc->flags |= PAGE_CGROUP_FLAG_ACTIVE;
- } else if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
- pc->flags = PAGE_CGROUP_FLAG_ACTIVE;
- else /* MEM_CGROUP_CHARGE_TYPE_SHMEM */
- pc->flags = PAGE_CGROUP_FLAG_CACHE | PAGE_CGROUP_FLAG_ACTIVE;
+ pc->flags = pcg_default_flags[ctype];
lock_page_cgroup(page);
if (unlikely(page_get_page_cgroup(page))) {
@@ -678,8 +717,12 @@ int mem_cgroup_cache_charge(struct page
if (unlikely(!mm))
mm = &init_mm;
- return mem_cgroup_charge_common(page, mm, gfp_mask,
+ if (page_is_file_cache(page))
+ return mem_cgroup_charge_common(page, mm, gfp_mask,
MEM_CGROUP_CHARGE_TYPE_CACHE, NULL);
+ else
+ return mem_cgroup_charge_common(page, mm, gfp_mask,
+ MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
}
/*
@@ -707,8 +750,7 @@ __mem_cgroup_uncharge_common(struct page
VM_BUG_ON(pc->page != page);
if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
- && ((pc->flags & PAGE_CGROUP_FLAG_CACHE)
- || page_mapped(page)))
+ && ((PageCgroupCache(pc) || page_mapped(page))))
goto unlock;
mz = page_cgroup_zoneinfo(pc);
@@ -758,7 +800,7 @@ int mem_cgroup_prepare_migration(struct
if (pc) {
mem = pc->mem_cgroup;
css_get(&mem->css);
- if (pc->flags & PAGE_CGROUP_FLAG_CACHE) {
+ if (PageCgroupCache(pc)) {
if (page_is_file_cache(page))
ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
else
* [PATCH 3.6/13] memcg: add function to move account.
2008-09-22 10:51 [PATCH 0/13] memory cgroup updates v4 KAMEZAWA Hiroyuki
` (6 preceding siblings ...)
2008-09-22 11:05 ` [PATCH 3.5/13] memcg: make page_cgroup flags to be atomic KAMEZAWA Hiroyuki
@ 2008-09-22 11:09 ` KAMEZAWA Hiroyuki
2008-09-24 6:50 ` Daisuke Nishimura
2008-09-22 11:12 ` [PATCH 9/13] memcg: lookup page cgroup (and remove pointer from struct page) KAMEZAWA Hiroyuki
` (5 subsequent siblings)
13 siblings, 1 reply; 37+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-09-22 11:09 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura, xemul, LKML
Sorry, this patch comes after "3.5" and before "4".
==
This patch provides a function to move account information of a page between
mem_cgroups.
This moving of a page_cgroup is done while
- the page is locked, and
- the lru_lock of the source/destination mem_cgroup is held.
Then, a routine which touches pc->mem_cgroup without lock_page() should
confirm that pc->mem_cgroup is still valid. Typical code looks like the following.
(while the page is not under lock_page())

	mem = pc->mem_cgroup;
	mz = page_cgroup_zoneinfo(pc);
	spin_lock_irqsave(&mz->lru_lock, flags);
	if (pc->mem_cgroup == mem)
		... /* some list handling */
	spin_unlock_irqrestore(&mz->lru_lock, flags);

If you found the page_cgroup on the mem_cgroup's LRU under mz->lru_lock, you
don't have to worry about anything.
Changelog: (v3) -> (v4)
- no changes.
Changelog: (v2) -> (v3)
- added lock_page_cgroup().
- split out from the new force_empty patch.
- added how-to-use text.
- fixed race in __mem_cgroup_uncharge_common().
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
mm/memcontrol.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 82 insertions(+), 3 deletions(-)
Index: mmotm-2.6.27-rc6+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc6+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc6+/mm/memcontrol.c
@@ -424,6 +424,7 @@ int task_in_mem_cgroup(struct task_struc
void mem_cgroup_move_lists(struct page *page, enum lru_list lru)
{
struct page_cgroup *pc;
+ struct mem_cgroup *mem;
struct mem_cgroup_per_zone *mz;
unsigned long flags;
@@ -442,9 +443,14 @@ void mem_cgroup_move_lists(struct page *
pc = page_get_page_cgroup(page);
if (pc) {
+ mem = pc->mem_cgroup;
mz = page_cgroup_zoneinfo(pc);
spin_lock_irqsave(&mz->lru_lock, flags);
- __mem_cgroup_move_lists(pc, lru);
+ /*
+ * check against the race with move_account.
+ */
+ if (likely(mem == pc->mem_cgroup))
+ __mem_cgroup_move_lists(pc, lru);
spin_unlock_irqrestore(&mz->lru_lock, flags);
}
unlock_page_cgroup(page);
@@ -565,6 +571,71 @@ unsigned long mem_cgroup_isolate_pages(u
return nr_taken;
}
+/**
+ * mem_cgroup_move_account - move account of the page
+ * @page ... the target page of being moved.
+ * @pc ... page_cgroup of the page.
+ * @from ... mem_cgroup which the page is moved from.
+ * @to ... mem_cgroup which the page is moved to.
+ *
+ * The caller must confirm following.
+ * 1. lock the page by lock_page().
+ * 2. disable irq.
+ * 3. lru_lock of old mem_cgroup should be held.
+ * 4. pc is guaranteed to be valid and on mem_cgroup's LRU.
+ *
+ * Because we cannot call try_to_free_page() here, the caller must guarantee
+ * this moving of change never fails. Currently this is called only against
+ * root cgroup, which has no limitation of resource.
+ * Returns 0 at success, returns 1 at failure.
+ */
+int mem_cgroup_move_account(struct page *page, struct page_cgroup *pc,
+ struct mem_cgroup *from, struct mem_cgroup *to)
+{
+ struct mem_cgroup_per_zone *from_mz, *to_mz;
+ int nid, zid;
+ int ret = 1;
+
+ VM_BUG_ON(!irqs_disabled());
+ VM_BUG_ON(!PageLocked(page));
+
+ nid = page_to_nid(page);
+ zid = page_zonenum(page);
+ from_mz = mem_cgroup_zoneinfo(from, nid, zid);
+ to_mz = mem_cgroup_zoneinfo(to, nid, zid);
+
+ if (res_counter_charge(&to->res, PAGE_SIZE)) {
+ /* Now, we assume no_limit...no failure here. */
+ return ret;
+ }
+ if (!try_lock_page_cgroup(page)) {
+ res_counter_uncharge(&to->res, PAGE_SIZE);
+ return ret;
+ }
+
+ if (page_get_page_cgroup(page) != pc) {
+ res_counter_uncharge(&to->res, PAGE_SIZE);
+ goto out;
+ }
+
+ if (spin_trylock(&to_mz->lru_lock)) {
+ __mem_cgroup_remove_list(from_mz, pc);
+ css_put(&from->css);
+ res_counter_uncharge(&from->res, PAGE_SIZE);
+ pc->mem_cgroup = to;
+ css_get(&to->css);
+ __mem_cgroup_add_list(to_mz, pc);
+ ret = 0;
+ spin_unlock(&to_mz->lru_lock);
+ } else {
+ res_counter_uncharge(&to->res, PAGE_SIZE);
+ }
+out:
+ unlock_page_cgroup(page);
+
+ return ret;
+}
+
/*
* Charge the memory controller for page usage.
* Return
@@ -752,16 +823,24 @@ __mem_cgroup_uncharge_common(struct page
if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
&& ((PageCgroupCache(pc) || page_mapped(page))))
goto unlock;
-
+retry:
+ mem = pc->mem_cgroup;
mz = page_cgroup_zoneinfo(pc);
spin_lock_irqsave(&mz->lru_lock, flags);
+ if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED &&
+ unlikely(mem != pc->mem_cgroup)) {
+ /* MAPPED account can be done without lock_page().
+ Check race with mem_cgroup_move_account() */
+ spin_unlock_irqrestore(&mz->lru_lock, flags);
+ goto retry;
+ }
__mem_cgroup_remove_list(mz, pc);
spin_unlock_irqrestore(&mz->lru_lock, flags);
page_assign_page_cgroup(page, NULL);
unlock_page_cgroup(page);
- mem = pc->mem_cgroup;
+
res_counter_uncharge(&mem->res, PAGE_SIZE);
css_put(&mem->css);
* [PATCH 9/13] memcg: lookup page cgroup (and remove pointer from struct page)
2008-09-22 10:51 [PATCH 0/13] memory cgroup updates v4 KAMEZAWA Hiroyuki
` (7 preceding siblings ...)
2008-09-22 11:09 ` [PATCH 3.6/13] memcg: add function to move account KAMEZAWA Hiroyuki
@ 2008-09-22 11:12 ` KAMEZAWA Hiroyuki
2008-09-22 14:52 ` Dave Hansen
` (4 more replies)
2008-09-22 11:13 ` [PATCH 10/13] memcg: page_cgroup look aside table KAMEZAWA Hiroyuki
` (4 subsequent siblings)
13 siblings, 5 replies; 37+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-09-22 11:12 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura, xemul, LKML
Remove page_cgroup pointer from struct page.
This patch removes the page_cgroup pointer from struct page and makes the
page_cgroup reachable from the pfn instead. The relationship becomes:
Before this:
pfn <-> struct page <-> struct page_cgroup
After this:
struct page <-> pfn -> struct page_cgroup -> struct page
The benefit of this approach is that we can remove 8 bytes from struct page.
Other changes are:
- lock/unlock_page_cgroup() uses its own bit on struct page_cgroup.
- all necessary page_cgroups are allocated at boot.
Characteristics:
- page_cgroups are allocated in chunks.
This patch uses the SECTION_SIZE as the chunk size when 64-bit/SPARSEMEM is
enabled. If not, an appropriate default is selected.
- all page_cgroup chunks are maintained in a hash.
In general we have two ways to handle a sparse index: radix-tree and hash.
This uses a hash because a radix-tree's layout is affected by the memory
map's layout. (A note on the lookup's pointer arithmetic follows this list.)
- page_cgroup.h/page_cgroup.c are added.
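One detail of the lookup (see mm/page_cgroup.c below) worth spelling out:
each hash entry stores its map pointer pre-biased by the chunk's base pfn,

	ent->map = pc - (index << ENTS_PER_CHUNK_SHIFT);

so lookup_page_cgroup() only needs to match index == (pfn >> ENTS_PER_CHUNK_SHIFT)
on the hash chain and return ent->map + pfn, with no extra offset arithmetic.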
Changelog: v3 -> v4.
- changed the argument of lookup_page_cgroup() back from "page" to "pfn".
Changelog: v2 -> v3
- changed arguments from pfn to struct page*.
- added a memory hotplug callback (no undo yet; needs more work).
- adjusted to new mmotm.
Changelog: v1 -> v2
- Fixed memory allocation failure at boot to do panic with good message.
- rewrote charge/uncharge path (no changes in logic.)
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
include/linux/mm_types.h | 4
include/linux/page_cgroup.h | 89 +++++++++++++++
mm/Makefile | 2
mm/memcontrol.c | 251 +++++++++++---------------------------------
mm/page_alloc.c | 9 -
mm/page_cgroup.c | 235 +++++++++++++++++++++++++++++++++++++++++
6 files changed, 394 insertions(+), 196 deletions(-)
Index: mmotm-2.6.27-rc6+/mm/page_cgroup.c
===================================================================
--- /dev/null
+++ mmotm-2.6.27-rc6+/mm/page_cgroup.c
@@ -0,0 +1,235 @@
+#include <linux/mm.h>
+#include <linux/rcupdate.h>
+#include <linux/rculist.h>
+#include <linux/bootmem.h>
+#include <linux/bit_spinlock.h>
+#include <linux/page_cgroup.h>
+#include <linux/hash.h>
+#include <linux/memory.h>
+
+
+
+struct pcg_hash_head {
+ spinlock_t lock;
+ struct hlist_head head;
+};
+
+static struct pcg_hash_head *pcg_hashtable __read_mostly;
+
+struct pcg_hash {
+ struct hlist_node node;
+ unsigned long index;
+ struct page_cgroup *map;
+};
+
+#if BITS_PER_LONG == 32 /* we use kmalloc() */
+#define ENTS_PER_CHUNK_SHIFT (7)
+const bool chunk_vmalloc = false;
+#else /* we'll use vmalloc */
+#ifdef SECTION_SIZE_BITS
+#define ENTS_PER_CHUNK_SHIFT (SECTION_SIZE_BITS - PAGE_SHIFT)
+#else
+#define ENTS_PER_CHUNK_SHIFT (14) /* covers 128MB on x86-64 */
+#endif
+const bool chunk_vmalloc = true;
+#endif
+
+#define ENTS_PER_CHUNK (1 << (ENTS_PER_CHUNK_SHIFT))
+#define ENTS_PER_CHUNK_MASK (ENTS_PER_CHUNK - 1)
+
+static int pcg_hashshift __read_mostly;
+static int pcg_hashmask __read_mostly;
+
+#define PCG_HASHSHIFT (pcg_hashshift)
+#define PCG_HASHMASK (pcg_hashmask)
+#define PCG_HASHSIZE (1 << pcg_hashshift)
+
+static int pcg_hashfun(unsigned long index)
+{
+ return hash_long(index, pcg_hashshift);
+}
+
+struct page_cgroup *lookup_page_cgroup(unsigned long pfn)
+{
+ unsigned long index = pfn >> ENTS_PER_CHUNK_SHIFT;
+ struct pcg_hash *ent;
+ struct pcg_hash_head *head;
+ struct hlist_node *node;
+ struct page_cgroup *pc = NULL;
+ int hnum;
+
+ hnum = pcg_hashfun(index);
+ head = pcg_hashtable + hnum;
+ rcu_read_lock();
+ hlist_for_each_entry(ent, node, &head->head, node) {
+ if (ent->index == index) {
+ pc = ent->map + pfn;
+ break;
+ }
+ }
+ rcu_read_unlock();
+ return pc;
+}
+
+static int __meminit alloc_page_cgroup(int node, unsigned long index)
+{
+ struct pcg_hash *ent;
+ struct pcg_hash_head *head;
+ struct page_cgroup *pc;
+ unsigned long flags, base;
+ int hnum, i;
+ int mapsize = sizeof(struct page_cgroup) * ENTS_PER_CHUNK;
+
+ if (lookup_page_cgroup(index << ENTS_PER_CHUNK_SHIFT))
+ return 0;
+
+ if (!chunk_vmalloc) {
+ int ent_size = sizeof(*ent) + mapsize;
+ ent = kmalloc_node(ent_size, GFP_KERNEL, node);
+ if (!ent)
+ return 1;
+ pc = (void *)(ent + 1);
+ } else {
+ ent = kmalloc_node(sizeof(*ent), GFP_KERNEL, node);
+ if (!ent)
+ return 1;
+ pc = vmalloc_node(mapsize, node);
+ if (!pc) {
+ kfree(ent);
+ return 1;
+ }
+ }
+ ent->map = pc - (index << ENTS_PER_CHUNK_SHIFT);
+ ent->index = index;
+ INIT_HLIST_NODE(&ent->node);
+
+ for (base = index << ENTS_PER_CHUNK_SHIFT, i = 0;
+ i < ENTS_PER_CHUNK; i++) {
+ unsigned long pfn = base + i;
+ pc = ent->map + pfn;
+ pc->page = pfn_to_page(pfn);
+ pc->mem_cgroup = NULL;
+ pc->flags = 0;
+ }
+
+ hnum = pcg_hashfun(index);
+ head = &pcg_hashtable[hnum];
+ spin_lock_irqsave(&head->lock, flags);
+ hlist_add_head_rcu(&ent->node, &head->head);
+ spin_unlock_irqrestore(&head->lock, flags);
+ return 0;
+}
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+
+int online_page_cgroup(unsigned long start_pfn,
+ unsigned long nr_pages,
+ int nid)
+{
+ unsigned long index, end_pfn, start, end;
+ int fail = 0;
+
+ end_pfn = start_pfn + nr_pages;
+ start = start_pfn >> ENTS_PER_CHUNK_SHIFT;
+ end = (end_pfn + ENTS_PER_CHUNK - 1) >> ENTS_PER_CHUNK_SHIFT;
+
+ for (index = start; (!fail) && (index < end); index++) {
+ unsigned long pfn = index << ENTS_PER_CHUNK_SHIFT;
+ if (lookup_page_cgroup(pfn))
+ continue;
+ fail = alloc_page_cgroup(nid, index);
+ }
+ return fail;
+}
+
+static int pcg_memory_callback(struct notifier_block *self,
+ unsigned long action, void *arg)
+{
+ struct memory_notify *mn = arg;
+ int ret = 0;
+ switch (action) {
+ case MEM_GOING_ONLINE:
+ ret = online_page_cgroup(mn->start_pfn,
+ mn->nr_pages, mn->status_change_nid);
+ break;
+ case MEM_GOING_OFFLINE:
+ break;
+ case MEM_CANCEL_ONLINE:
+ case MEM_OFFLINE:
+ case MEM_ONLINE:
+ case MEM_CANCEL_OFFLINE:
+ break;
+ }
+ ret = notifier_from_errno(ret);
+ return ret;
+}
+
+#endif
+
+/* Called From mem_cgroup's initilization */
+void __init page_cgroup_init(void)
+{
+ struct pcg_hash_head *head;
+ int node, i, fail;
+ unsigned long start, pfn, end, index, offset;
+ long default_pcg_hash_size;
+
+ /* we don't need too large hash */
+ default_pcg_hash_size = (max_pfn/ENTS_PER_CHUNK);
+ default_pcg_hash_size *= 2;
+ /* if too big, use automatic calclation */
+ if (default_pcg_hash_size > 1024 * 1024)
+ default_pcg_hash_size = 0;
+
+ pcg_hashtable = alloc_large_system_hash("PageCgroup Hash",
+ sizeof(struct pcg_hash_head),
+ default_pcg_hash_size,
+ 13,
+ 0,
+ &pcg_hashshift,
+ &pcg_hashmask,
+ 0);
+ if (!pcg_hashtable) {
+ fail = 1;
+ goto nomem;
+ }
+
+ for (i = 0; i < PCG_HASHSIZE; i++) {
+ head = &pcg_hashtable[i];
+ spin_lock_init(&head->lock);
+ INIT_HLIST_HEAD(&head->head);
+ }
+
+ fail = 0;
+ for_each_node(node) {
+ start = NODE_DATA(node)->node_start_pfn;
+ end = start + NODE_DATA(node)->node_spanned_pages;
+ start >>= ENTS_PER_CHUNK_SHIFT;
+ end = (end + ENTS_PER_CHUNK - 1) >> ENTS_PER_CHUNK_SHIFT;
+ for (index = start; (!fail) && (index < end); index++) {
+ pfn = index << ENTS_PER_CHUNK_SHIFT;
+ /*
+ * In usual, this loop breaks at offset=0.
+ * Handle a case a hole in MAX_ORDER (ia64 only...)
+ */
+ for (offset = 0; offset < ENTS_PER_CHUNK; offset++) {
+ if (pfn_valid(pfn + offset)) {
+ fail = alloc_page_cgroup(node, index);
+ break;
+ }
+ }
+ }
+ if (fail)
+ break;
+ }
+
+ hotplug_memory_notifier(pcg_memory_callback, 0);
+nomem:
+ if (fail) {
+ printk("Not enough memory for memory resource controller.\n");
+ panic("please try cgroup_disable=memory boot option.");
+ }
+ return;
+}
+
+
Index: mmotm-2.6.27-rc6+/include/linux/mm_types.h
===================================================================
--- mmotm-2.6.27-rc6+.orig/include/linux/mm_types.h
+++ mmotm-2.6.27-rc6+/include/linux/mm_types.h
@@ -94,10 +94,6 @@ struct page {
void *virtual; /* Kernel virtual address (NULL if
not kmapped, ie. highmem) */
#endif /* WANT_PAGE_VIRTUAL */
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
- unsigned long page_cgroup;
-#endif
-
#ifdef CONFIG_KMEMCHECK
void *shadow;
#endif
Index: mmotm-2.6.27-rc6+/mm/Makefile
===================================================================
--- mmotm-2.6.27-rc6+.orig/mm/Makefile
+++ mmotm-2.6.27-rc6+/mm/Makefile
@@ -34,6 +34,6 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_SMP) += allocpercpu.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
-obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
+obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
obj-$(CONFIG_CGROUP_MEMRLIMIT_CTLR) += memrlimitcgroup.o
obj-$(CONFIG_KMEMTRACE) += kmemtrace.o
Index: mmotm-2.6.27-rc6+/include/linux/page_cgroup.h
===================================================================
--- /dev/null
+++ mmotm-2.6.27-rc6+/include/linux/page_cgroup.h
@@ -0,0 +1,89 @@
+#ifndef __LINUX_PAGE_CGROUP_H
+#define __LINUX_PAGE_CGROUP_H
+
+/*
+ * Page Cgroup can be considered as an extended mem_map.
+ * A page_cgroup page is associated with every page descriptor. The
+ * page_cgroup helps us identify information about the cgroup
+ * All page cgroups are allocated at boot or memory hotplug event,
+ * then the page cgroup for pfn always exists.
+ */
+struct page_cgroup {
+ unsigned long flags;
+ struct mem_cgroup *mem_cgroup;
+ struct page *page;
+ struct list_head lru; /* per cgroup LRU list */
+};
+
+void __init page_cgroup_init(void);
+struct page_cgroup *lookup_page_cgroup(unsigned long pfn);
+
+enum {
+ /* flags for mem_cgroup */
+ PCG_LOCK, /* page cgroup is locked */
+ PCG_CACHE, /* charged as cache */
+ PCG_USED, /* this object is in use. */
+ /* flags for LRU placement */
+ PCG_ACTIVE, /* page is active in this cgroup */
+ PCG_FILE, /* page is file system backed */
+ PCG_UNEVICTABLE, /* page is unevictableable */
+};
+
+#define TESTPCGFLAG(uname, lname) \
+static inline int PageCgroup##uname(struct page_cgroup *pc) \
+ { return test_bit(PCG_##lname, &pc->flags); }
+
+#define SETPCGFLAG(uname, lname) \
+static inline void SetPageCgroup##uname(struct page_cgroup *pc)\
+ { set_bit(PCG_##lname, &pc->flags); }
+
+#define CLEARPCGFLAG(uname, lname) \
+static inline void ClearPageCgroup##uname(struct page_cgroup *pc) \
+ { clear_bit(PCG_##lname, &pc->flags); }
+
+/* Cache flag is set only once (at allocation) */
+TESTPCGFLAG(Cache, CACHE)
+
+TESTPCGFLAG(Used, USED)
+CLEARPCGFLAG(Used, USED)
+
+/* LRU management flags (from global-lru definition) */
+TESTPCGFLAG(File, FILE)
+SETPCGFLAG(File, FILE)
+CLEARPCGFLAG(File, FILE)
+
+TESTPCGFLAG(Active, ACTIVE)
+SETPCGFLAG(Active, ACTIVE)
+CLEARPCGFLAG(Active, ACTIVE)
+
+TESTPCGFLAG(Unevictable, UNEVICTABLE)
+SETPCGFLAG(Unevictable, UNEVICTABLE)
+CLEARPCGFLAG(Unevictable, UNEVICTABLE)
+
+static inline int page_cgroup_nid(struct page_cgroup *pc)
+{
+ return page_to_nid(pc->page);
+}
+
+static inline enum zone_type page_cgroup_zid(struct page_cgroup *pc)
+{
+ return page_zonenum(pc->page);
+}
+
+static inline void lock_page_cgroup(struct page_cgroup *pc)
+{
+ bit_spin_lock(PCG_LOCK, &pc->flags);
+}
+
+static inline int trylock_page_cgroup(struct page_cgroup *pc)
+{
+ return bit_spin_trylock(PCG_LOCK, &pc->flags);
+}
+
+static inline void unlock_page_cgroup(struct page_cgroup *pc)
+{
+ bit_spin_unlock(PCG_LOCK, &pc->flags);
+}
+
+
+#endif
Index: mmotm-2.6.27-rc6+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc6+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc6+/mm/memcontrol.c
@@ -34,11 +34,11 @@
#include <linux/seq_file.h>
#include <linux/vmalloc.h>
#include <linux/mm_inline.h>
+#include <linux/page_cgroup.h>
#include <asm/uaccess.h>
struct cgroup_subsys mem_cgroup_subsys __read_mostly;
-static struct kmem_cache *page_cgroup_cache __read_mostly;
#define MEM_CGROUP_RECLAIM_RETRIES 5
/*
@@ -138,80 +138,6 @@ static struct mem_cgroup init_mem_cgroup
#define is_root_cgroup(cgrp) ((cgrp) == &init_mem_cgroup)
-
-/*
- * We use the lower bit of the page->page_cgroup pointer as a bit spin
- * lock. We need to ensure that page->page_cgroup is at least two
- * byte aligned (based on comments from Nick Piggin). But since
- * bit_spin_lock doesn't actually set that lock bit in a non-debug
- * uniprocessor kernel, we should avoid setting it here too.
- */
-#define PAGE_CGROUP_LOCK_BIT 0x0
-#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
-#define PAGE_CGROUP_LOCK (1 << PAGE_CGROUP_LOCK_BIT)
-#else
-#define PAGE_CGROUP_LOCK 0x0
-#endif
-
-/*
- * A page_cgroup page is associated with every page descriptor. The
- * page_cgroup helps us identify information about the cgroup
- */
-struct page_cgroup {
- struct list_head lru; /* per cgroup LRU list */
- struct page *page;
- struct mem_cgroup *mem_cgroup;
- unsigned long flags;
-};
-
-enum {
- /* flags for mem_cgroup */
- PCG_CACHE, /* charged as cache */
- /* flags for LRU placement */
- PCG_ACTIVE, /* page is active in this cgroup */
- PCG_FILE, /* page is file system backed */
- PCG_UNEVICTABLE, /* page is unevictableable */
-};
-
-#define TESTPCGFLAG(uname, lname) \
-static inline int PageCgroup##uname(struct page_cgroup *pc) \
- { return test_bit(PCG_##lname, &pc->flags); }
-
-#define SETPCGFLAG(uname, lname) \
-static inline void SetPageCgroup##uname(struct page_cgroup *pc)\
- { set_bit(PCG_##lname, &pc->flags); }
-
-#define CLEARPCGFLAG(uname, lname) \
-static inline void ClearPageCgroup##uname(struct page_cgroup *pc) \
- { clear_bit(PCG_##lname, &pc->flags); }
-
-
-/* Cache flag is set only once (at allocation) */
-TESTPCGFLAG(Cache, CACHE)
-
-/* LRU management flags (from global-lru definition) */
-TESTPCGFLAG(File, FILE)
-SETPCGFLAG(File, FILE)
-CLEARPCGFLAG(File, FILE)
-
-TESTPCGFLAG(Active, ACTIVE)
-SETPCGFLAG(Active, ACTIVE)
-CLEARPCGFLAG(Active, ACTIVE)
-
-TESTPCGFLAG(Unevictable, UNEVICTABLE)
-SETPCGFLAG(Unevictable, UNEVICTABLE)
-CLEARPCGFLAG(Unevictable, UNEVICTABLE)
-
-static int page_cgroup_nid(struct page_cgroup *pc)
-{
- return page_to_nid(pc->page);
-}
-
-static enum zone_type page_cgroup_zid(struct page_cgroup *pc)
-{
- return page_zonenum(pc->page);
-}
-
enum charge_type {
MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
MEM_CGROUP_CHARGE_TYPE_MAPPED,
@@ -222,9 +148,9 @@ enum charge_type {
static const unsigned long
pcg_default_flags[NR_CHARGE_TYPE] = {
- ((1 << PCG_CACHE) | (1 << PCG_FILE)),
- ((1 << PCG_ACTIVE)),
- ((1 << PCG_ACTIVE) | (1 << PCG_CACHE)),
+ (1 << PCG_CACHE) | (1 << PCG_FILE) | (1 << PCG_USED) | (1 << PCG_LOCK),
+ (1 << PCG_ACTIVE) | (1 << PCG_LOCK) | (1 << PCG_USED),
+ (1 << PCG_ACTIVE) | (1 << PCG_CACHE) | (1 << PCG_USED)| (1 << PCG_LOCK),
0,
};
@@ -307,37 +233,6 @@ struct mem_cgroup *mem_cgroup_from_task(
struct mem_cgroup, css);
}
-static inline int page_cgroup_locked(struct page *page)
-{
- return bit_spin_is_locked(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
-static void page_assign_page_cgroup(struct page *page, struct page_cgroup *pc)
-{
- VM_BUG_ON(!page_cgroup_locked(page));
- page->page_cgroup = ((unsigned long)pc | PAGE_CGROUP_LOCK);
-}
-
-struct page_cgroup *page_get_page_cgroup(struct page *page)
-{
- return (struct page_cgroup *) (page->page_cgroup & ~PAGE_CGROUP_LOCK);
-}
-
-static void lock_page_cgroup(struct page *page)
-{
- bit_spin_lock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
-static int try_lock_page_cgroup(struct page *page)
-{
- return bit_spin_trylock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
-static void unlock_page_cgroup(struct page *page)
-{
- bit_spin_unlock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
static void __mem_cgroup_remove_list(struct mem_cgroup_per_zone *mz,
struct page_cgroup *pc)
{
@@ -441,22 +336,19 @@ void mem_cgroup_move_lists(struct page *
* safely get to page_cgroup without it, so just try_lock it:
* mem_cgroup_isolate_pages allows for page left on wrong list.
*/
- if (!try_lock_page_cgroup(page))
+ pc = lookup_page_cgroup(page_to_pfn(page));
+
+ if (!trylock_page_cgroup(pc))
return;
- pc = page_get_page_cgroup(page);
- if (pc) {
+ if (PageCgroupUsed(pc)) {
mem = pc->mem_cgroup;
mz = page_cgroup_zoneinfo(pc);
spin_lock_irqsave(&mz->lru_lock, flags);
- /*
- * check against the race with move_account.
- */
- if (likely(mem == pc->mem_cgroup))
- __mem_cgroup_move_lists(pc, lru);
+ __mem_cgroup_move_lists(pc, lru);
spin_unlock_irqrestore(&mz->lru_lock, flags);
}
- unlock_page_cgroup(page);
+ unlock_page_cgroup(pc);
}
/*
@@ -543,6 +435,8 @@ unsigned long mem_cgroup_isolate_pages(u
list_for_each_entry_safe_reverse(pc, tmp, src, lru) {
if (scan >= nr_to_scan)
break;
+ if (unlikely(!PageCgroupUsed(pc)))
+ continue;
page = pc->page;
if (unlikely(!PageLRU(page)))
@@ -611,12 +505,12 @@ int mem_cgroup_move_account(struct page
/* Now, we assume no_limit...no failure here. */
return ret;
}
- if (!try_lock_page_cgroup(page)) {
+ if (!trylock_page_cgroup(pc)) {
res_counter_uncharge(&to->res, PAGE_SIZE);
return ret;
}
- if (page_get_page_cgroup(page) != pc) {
+ if (!PageCgroupUsed(pc)) {
res_counter_uncharge(&to->res, PAGE_SIZE);
goto out;
}
@@ -634,7 +528,7 @@ int mem_cgroup_move_account(struct page
res_counter_uncharge(&to->res, PAGE_SIZE);
}
out:
- unlock_page_cgroup(page);
+ unlock_page_cgroup(pc);
return ret;
}
@@ -651,26 +545,27 @@ static int mem_cgroup_charge_common(stru
{
struct mem_cgroup *mem;
struct page_cgroup *pc;
- unsigned long flags;
unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
struct mem_cgroup_per_zone *mz;
+ unsigned long flags;
- pc = kmem_cache_alloc(page_cgroup_cache, gfp_mask);
- if (unlikely(pc == NULL))
- goto err;
-
+ pc = lookup_page_cgroup(page_to_pfn(page));
+ /* can happen at boot */
+ if (unlikely(!pc))
+ return 0;
+ prefetchw(pc);
/*
* We always charge the cgroup the mm_struct belongs to.
* The mm_struct's mem_cgroup changes on task migration if the
* thread group leader migrates. It's possible that mm is not
* set, if so charge the init_mm (happens for pagecache usage).
*/
+
if (likely(!memcg)) {
rcu_read_lock();
mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
if (unlikely(!mem)) {
rcu_read_unlock();
- kmem_cache_free(page_cgroup_cache, pc);
return 0;
}
/*
@@ -706,36 +601,34 @@ static int mem_cgroup_charge_common(stru
}
}
+ preempt_disable();
+ lock_page_cgroup(pc);
+ if (unlikely(PageCgroupUsed(pc))) {
+ unlock_page_cgroup(pc);
+ res_counter_uncharge(&mem->res, PAGE_SIZE);
+ css_put(&mem->css);
+ preempt_enable();
+ goto done;
+ }
pc->mem_cgroup = mem;
- pc->page = page;
/*
* If a page is accounted as a page cache, insert to inactive list.
* If anon, insert to active list.
*/
pc->flags = pcg_default_flags[ctype];
- lock_page_cgroup(page);
- if (unlikely(page_get_page_cgroup(page))) {
- unlock_page_cgroup(page);
- res_counter_uncharge(&mem->res, PAGE_SIZE);
- css_put(&mem->css);
- kmem_cache_free(page_cgroup_cache, pc);
- goto done;
- }
- page_assign_page_cgroup(page, pc);
-
mz = page_cgroup_zoneinfo(pc);
+
spin_lock_irqsave(&mz->lru_lock, flags);
__mem_cgroup_add_list(mz, pc);
spin_unlock_irqrestore(&mz->lru_lock, flags);
+ unlock_page_cgroup(pc);
+ preempt_enable();
- unlock_page_cgroup(page);
done:
return 0;
out:
css_put(&mem->css);
- kmem_cache_free(page_cgroup_cache, pc);
-err:
return -ENOMEM;
}
@@ -743,7 +636,8 @@ int mem_cgroup_charge(struct page *page,
{
if (mem_cgroup_subsys.disabled)
return 0;
-
+ if (PageCompound(page))
+ return 0;
/*
* If already mapped, we don't have to account.
* If page cache, page->mapping has address_space.
@@ -764,7 +658,8 @@ int mem_cgroup_cache_charge(struct page
{
if (mem_cgroup_subsys.disabled)
return 0;
-
+ if (PageCompound(page))
+ return 0;
/*
* Corner case handling. This is called from add_to_page_cache()
* in usual. But some FS (shmem) precharges this page before calling it
@@ -777,15 +672,16 @@ int mem_cgroup_cache_charge(struct page
if (!(gfp_mask & __GFP_WAIT)) {
struct page_cgroup *pc;
- lock_page_cgroup(page);
- pc = page_get_page_cgroup(page);
- if (pc) {
- VM_BUG_ON(pc->page != page);
- VM_BUG_ON(!pc->mem_cgroup);
- unlock_page_cgroup(page);
+
+ pc = lookup_page_cgroup(page_to_pfn(page));
+ if (!pc)
+ return 0;
+ lock_page_cgroup(pc);
+ if (PageCgroupUsed(pc)) {
+ unlock_page_cgroup(pc);
return 0;
}
- unlock_page_cgroup(page);
+ unlock_page_cgroup(pc);
}
if (unlikely(!mm))
@@ -808,53 +704,46 @@ __mem_cgroup_uncharge_common(struct page
struct page_cgroup *pc;
struct mem_cgroup *mem;
struct mem_cgroup_per_zone *mz;
+ unsigned long pfn = page_to_pfn(page);
unsigned long flags;
if (mem_cgroup_subsys.disabled)
return;
+ /* check the condition we can know from page */
- /*
- * Check if our page_cgroup is valid
- */
- lock_page_cgroup(page);
- pc = page_get_page_cgroup(page);
- if (unlikely(!pc))
- goto unlock;
-
- VM_BUG_ON(pc->page != page);
+ pc = lookup_page_cgroup(pfn);
+ if (unlikely(!pc || !PageCgroupUsed(pc)))
+ return;
+ preempt_disable();
+ lock_page_cgroup(pc);
+ if (unlikely(page_mapped(page))) {
+ unlock_page_cgroup(pc);
+ preempt_enable();
+ return;
+ }
+ ClearPageCgroupUsed(pc);
+ unlock_page_cgroup(pc);
- if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
- && ((PageCgroupCache(pc) || page_mapped(page))))
- goto unlock;
-retry:
mem = pc->mem_cgroup;
mz = page_cgroup_zoneinfo(pc);
+
spin_lock_irqsave(&mz->lru_lock, flags);
- if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED &&
- unlikely(mem != pc->mem_cgroup)) {
- /* MAPPED account can be done without lock_page().
- Check race with mem_cgroup_move_account() */
- spin_unlock_irqrestore(&mz->lru_lock, flags);
- goto retry;
- }
__mem_cgroup_remove_list(mz, pc);
spin_unlock_irqrestore(&mz->lru_lock, flags);
-
- page_assign_page_cgroup(page, NULL);
- unlock_page_cgroup(page);
-
-
- res_counter_uncharge(&mem->res, PAGE_SIZE);
+ pc->mem_cgroup = NULL;
css_put(&mem->css);
+ preempt_enable();
+ res_counter_uncharge(&mem->res, PAGE_SIZE);
- kmem_cache_free(page_cgroup_cache, pc);
return;
-unlock:
- unlock_page_cgroup(page);
}
void mem_cgroup_uncharge_page(struct page *page)
{
+ if (page_mapped(page))
+ return;
+ if (page->mapping && !PageAnon(page))
+ return;
__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_MAPPED);
}
@@ -878,9 +767,9 @@ int mem_cgroup_prepare_migration(struct
if (mem_cgroup_subsys.disabled)
return 0;
- lock_page_cgroup(page);
- pc = page_get_page_cgroup(page);
- if (pc) {
+ pc = lookup_page_cgroup(page_to_pfn(page));
+ lock_page_cgroup(pc);
+ if (PageCgroupUsed(pc)) {
mem = pc->mem_cgroup;
css_get(&mem->css);
if (PageCgroupCache(pc)) {
@@ -890,7 +779,7 @@ int mem_cgroup_prepare_migration(struct
ctype = MEM_CGROUP_CHARGE_TYPE_SHMEM;
}
}
- unlock_page_cgroup(page);
+ unlock_page_cgroup(pc);
if (mem) {
ret = mem_cgroup_charge_common(newpage, NULL, GFP_KERNEL,
ctype, mem);
@@ -1271,8 +1160,8 @@ mem_cgroup_create(struct cgroup_subsys *
int node;
if (unlikely((cont->parent) == NULL)) {
+ page_cgroup_init();
mem = &init_mem_cgroup;
- page_cgroup_cache = KMEM_CACHE(page_cgroup, SLAB_PANIC);
} else {
mem = mem_cgroup_alloc();
if (!mem)
Index: mmotm-2.6.27-rc6+/mm/page_alloc.c
===================================================================
--- mmotm-2.6.27-rc6+.orig/mm/page_alloc.c
+++ mmotm-2.6.27-rc6+/mm/page_alloc.c
@@ -223,17 +223,12 @@ static inline int bad_range(struct zone
static void bad_page(struct page *page)
{
- void *pc = page_get_page_cgroup(page);
-
printk(KERN_EMERG "Bad page state in process '%s'\n" KERN_EMERG
"page:%p flags:0x%0*lx mapping:%p mapcount:%d count:%d\n",
current->comm, page, (int)(2*sizeof(unsigned long)),
(unsigned long)page->flags, page->mapping,
page_mapcount(page), page_count(page));
- if (pc) {
- printk(KERN_EMERG "cgroup:%p\n", pc);
- page_reset_bad_cgroup(page);
- }
+
printk(KERN_EMERG "Trying to fix it up, but a reboot is needed\n"
KERN_EMERG "Backtrace:\n");
dump_stack();
@@ -472,7 +467,6 @@ static inline void free_pages_check(stru
free_page_mlock(page);
if (unlikely(page_mapcount(page) |
(page->mapping != NULL) |
- (page_get_page_cgroup(page) != NULL) |
(page_count(page) != 0) |
(page->flags & PAGE_FLAGS_CHECK_AT_FREE)))
bad_page(page);
@@ -609,7 +603,6 @@ static void prep_new_page(struct page *p
{
if (unlikely(page_mapcount(page) |
(page->mapping != NULL) |
- (page_get_page_cgroup(page) != NULL) |
(page_count(page) != 0) |
(page->flags & PAGE_FLAGS_CHECK_AT_PREP)))
bad_page(page);
* [PATCH 10/13] memcg: page_cgroup look aside table
2008-09-22 10:51 [PATCH 0/13] memory cgroup updates v4 KAMEZAWA Hiroyuki
` (8 preceding siblings ...)
2008-09-22 11:12 ` [PATCH 9/13] memcg: lookup page cgroup (and remove pointer from struct page) KAMEZAWA Hiroyuki
@ 2008-09-22 11:13 ` KAMEZAWA Hiroyuki
2008-09-22 11:17 ` [PATCH 11/13] memcg: lazy LRU free (NEW) KAMEZAWA Hiroyuki
` (3 subsequent siblings)
13 siblings, 0 replies; 37+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-09-22 11:13 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura, xemul, LKML
Use a per-cpu cache for fast access to page_cgroup.
This patch makes the fastpath faster.
Because page_cgroup is accessed when a page is allocated/freed,
we can assume that several contiguous page_cgroups will be accessed soon.
(Unless allocations are interleaved across NUMA nodes...but in that case,
alloc/free itself is slow.)
We cache a set of page_cgroup base pointers in a per-cpu area and
use them when we get a hit.
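To illustrate the fast path, here is a minimal userspace model (not kernel
code; CHUNK_SHIFT and the slot count are illustrative stand-ins for
ENTS_PER_CHUNK_SHIFT and PCG_CACHE_MAX_SLOT in the patch):

/* Toy model of the per-cpu lookaside cache; not kernel code. */
#include <stdio.h>

#define CHUNK_SHIFT 15                 /* illustrative: 32768 pfns per chunk */
#define CACHE_SLOTS 32
#define CACHE_MASK  (CACHE_SLOTS - 1)

struct slot { unsigned long index; unsigned long base; };
static struct slot cache[CACHE_SLOTS]; /* stands in for one CPU's pcg_cache */
/* the real code presets index to -1 per slot at cpu-up to avoid a false hit */

static unsigned long slow_lookup(unsigned long index)
{
        /* the real code walks the hash chain here */
        return index << CHUNK_SHIFT;
}

static unsigned long lookup(unsigned long pfn)
{
        unsigned long index = pfn >> CHUNK_SHIFT;
        int s = index & CACHE_MASK;

        if (cache[s].index != index) {          /* miss: refill this slot */
                cache[s].index = index;
                cache[s].base = slow_lookup(index);
        }
        return cache[s].base + pfn;             /* hit: no hash walk at all */
}

int main(void)
{
        printf("%lu %lu\n", lookup(123456), lookup(123457)); /* 2nd call hits */
        return 0;
}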
Changelog: v3 -> v4
- rewrite noinline -> noinline_for_stack.
- added cpu hotplug support.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
mm/page_cgroup.c | 73 ++++++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 70 insertions(+), 3 deletions(-)
Index: mmotm-2.6.27-rc6+/mm/page_cgroup.c
===================================================================
--- mmotm-2.6.27-rc6+.orig/mm/page_cgroup.c
+++ mmotm-2.6.27-rc6+/mm/page_cgroup.c
@@ -6,7 +6,7 @@
#include <linux/page_cgroup.h>
#include <linux/hash.h>
#include <linux/memory.h>
-
+#include <linux/cpu.h>
struct pcg_hash_head {
@@ -44,15 +44,26 @@ static int pcg_hashmask __read_mostly;
#define PCG_HASHMASK (pcg_hashmask)
#define PCG_HASHSIZE (1 << pcg_hashshift)
+#define PCG_CACHE_MAX_SLOT (32)
+#define PCG_CACHE_MASK (PCG_CACHE_MAX_SLOT - 1)
+struct percpu_page_cgroup_cache {
+ struct {
+ unsigned long index;
+ struct page_cgroup *base;
+ } slots[PCG_CACHE_MAX_SLOT];
+};
+DEFINE_PER_CPU(struct percpu_page_cgroup_cache, pcg_cache);
+
static int pcg_hashfun(unsigned long index)
{
return hash_long(index, pcg_hashshift);
}
-struct page_cgroup *lookup_page_cgroup(struct page *page)
+noinline_for_stack static struct page_cgroup *
+__lookup_page_cgroup(struct percpu_page_cgroup_cache *pcc,unsigned long pfn)
{
- unsigned long pfn = page_to_pfn(page);
unsigned long index = pfn >> ENTS_PER_CHUNK_SHIFT;
+ int s = index & PCG_CACHE_MASK;
struct pcg_hash *ent;
struct pcg_hash_head *head;
struct hlist_node *node;
@@ -65,6 +76,8 @@ struct page_cgroup *lookup_page_cgroup(s
hlist_for_each_entry(ent, node, &head->head, node) {
if (ent->index == index) {
pc = ent->map + pfn;
+ pcc->slots[s].index = ent->index;
+ pcc->slots[s].base = ent->map;
break;
}
}
@@ -123,6 +136,56 @@ static int __meminit alloc_page_cgroup(i
return 0;
}
+struct page_cgroup *lookup_page_cgroup(struct page *page)
+{
+ unsigned long pfn = page_to_pfn(page);
+ struct percpu_page_cgroup_cache *pcc;
+ struct page_cgroup *ret;
+ unsigned long index = pfn >> ENTS_PER_CHUNK_SHIFT;
+ int hnum = index & PCG_CACHE_MASK;
+
+ pcc = &get_cpu_var(pcg_cache);
+ if (likely(pcc->slots[hnum].index == index))
+ ret = pcc->slots[hnum].base + pfn;
+ else
+ ret = __lookup_page_cgroup(pcc, pfn);
+ put_cpu_var(pcg_cache);
+ return ret;
+}
+
+
+
+void __cpuinit clear_page_cgroup_cache_pcg(int cpu)
+{
+ struct percpu_page_cgroup_cache *pcc;
+ int i;
+
+ pcc = &per_cpu(pcg_cache, cpu);
+ for (i = 0; i < PCG_CACHE_MAX_SLOT; i++)
+ pcc->slots[i].index = -1;
+}
+
+static int __cpuinit cpu_page_cgroup_callback(struct notifier_block *nb,
+ unsigned long action, void *hcpu)
+{
+ int cpu = (long)hcpu;
+
+ switch(action){
+ case CPU_UP_PREPARE:
+ clear_page_cgroup_cache_pcg(cpu);
+ break;
+ default:
+ break;
+ }
+ return NOTIFY_OK;
+}
+
+struct notifier_block cpu_page_cgroup_nb = {
+ .notifier_call = cpu_page_cgroup_callback,
+};
+
+
+
#ifdef CONFIG_MEMORY_HOTPLUG
int online_page_cgroup(unsigned long start_pfn,
@@ -227,6 +290,10 @@ void __init page_cgroup_init(void)
if (fail)
break;
}
+ cpu_page_cgroup_callback(&cpu_page_cgroup_nb,
+ (unsigned long)CPU_UP_PREPARE,
+ (void *)(long)smp_processor_id());
+ register_hotcpu_notifier(&cpu_page_cgroup_nb);
hotplug_memory_notifier(pcg_memory_callback, 0);
nomem:
* [PATCH 11/13] memcg: lazy LRU free (NEW)
2008-09-22 10:51 [PATCH 0/13] memory cgroup updates v4 KAMEZAWA Hiroyuki
` (9 preceding siblings ...)
2008-09-22 11:13 ` [PATCH 10/13] memcg: page_cgroup look aside table KAMEZAWA Hiroyuki
@ 2008-09-22 11:17 ` KAMEZAWA Hiroyuki
2008-09-22 11:22 ` [PATCH 12/13] memcg: lazy LRU add KAMEZAWA Hiroyuki
` (2 subsequent siblings)
13 siblings, 0 replies; 37+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-09-22 11:17 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura, xemul, LKML
Free page_cgroups from their LRU in a batched manner.
When uncharge() is called, the page_cgroup is pushed onto a per-cpu vector and
removed from the LRU later. This routine resembles the global LRU's pagevec.
This patch is one half of the whole change and forms a set with the following
lazy LRU add patch.
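The batching works like a pagevec: collect entries in a small per-cpu array
and take mz->lru_lock once per flush instead of once per page. A minimal
userspace model of that shape (the flush body is a stub; the real one is
__release_page_cgroup() below):

/* Toy model of the per-cpu release vector; not kernel code. */
#include <stdio.h>

#define VEC_SIZE 14                    /* matches MEMCG_PCPVEC_SIZE below */

struct vec { int nr; const void *slot[VEC_SIZE]; };
static struct vec free_vec;            /* one CPU's memcg_free_vec */

static void flush(struct vec *v)
{
        /* real code: disable irqs, take mz->lru_lock once, unlink all entries */
        printf("flush %d entries under one lock acquisition\n", v->nr);
        v->nr = 0;
}

static void release(const void *pc)
{
        free_vec.slot[free_vec.nr++] = pc;
        if (free_vec.nr >= VEC_SIZE)   /* amortize the lock over 14 frees */
                flush(&free_vec);
}

int main(void)
{
        char dummy[30];

        for (int i = 0; i < 30; i++)
                release(&dummy[i]);
        flush(&free_vec);              /* the drain_* helpers do this */
        return 0;
}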
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
mm/memcontrol.c | 163 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 153 insertions(+), 10 deletions(-)
Index: mmotm-2.6.27-rc6+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc6+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc6+/mm/memcontrol.c
@@ -35,6 +35,7 @@
#include <linux/vmalloc.h>
#include <linux/mm_inline.h>
#include <linux/page_cgroup.h>
+#include <linux/cpu.h>
#include <asm/uaccess.h>
@@ -533,6 +534,116 @@ out:
return ret;
}
+
+#define MEMCG_PCPVEC_SIZE (14) /* size of pagevec */
+struct memcg_percpu_vec {
+ int nr;
+ int limit;
+ struct page_cgroup *vec[MEMCG_PCPVEC_SIZE];
+};
+static DEFINE_PER_CPU(struct memcg_percpu_vec, memcg_free_vec);
+
+static void
+__release_page_cgroup(struct memcg_percpu_vec *mpv)
+{
+ unsigned long flags;
+ struct mem_cgroup_per_zone *mz, *prev_mz;
+ struct page_cgroup *pc;
+ int i, nr;
+
+ local_irq_save(flags);
+ nr = mpv->nr;
+ mpv->nr = 0;
+ prev_mz = NULL;
+ for (i = nr - 1; i >= 0; i--) {
+ pc = mpv->vec[i];
+ VM_BUG_ON(PageCgroupUsed(pc));
+ mz = page_cgroup_zoneinfo(pc);
+ if (prev_mz != mz) {
+ if (prev_mz)
+ spin_unlock(&prev_mz->lru_lock);
+ prev_mz = mz;
+ spin_lock(&mz->lru_lock);
+ }
+ __mem_cgroup_remove_list(mz, pc);
+ css_put(&pc->mem_cgroup->css);
+ pc->mem_cgroup = NULL;
+ }
+ if (prev_mz)
+ spin_unlock(&prev_mz->lru_lock);
+ local_irq_restore(flags);
+
+}
+
+static void
+release_page_cgroup(struct page_cgroup *pc)
+{
+ struct memcg_percpu_vec *mpv;
+
+ mpv = &get_cpu_var(memcg_free_vec);
+ mpv->vec[mpv->nr++] = pc;
+ if (mpv->nr >= mpv->limit)
+ __release_page_cgroup(mpv);
+ put_cpu_var(memcg_free_vec);
+}
+
+static void page_cgroup_start_cache_cpu(int cpu)
+{
+ struct memcg_percpu_vec *mpv;
+ mpv = &per_cpu(memcg_free_vec, cpu);
+ mpv->limit = MEMCG_PCPVEC_SIZE;
+}
+
+#ifdef CONFIG_HOTPLUG_CPU
+static void page_cgroup_stop_cache_cpu(int cpu)
+{
+ struct memcg_percpu_vec *mpv;
+ mpv = &per_cpu(memcg_free_vec, cpu);
+ mpv->limit = 0;
+}
+#endif
+
+
+/*
+ * Used when freeing memory resource controller to remove all
+ * page_cgroup (in obsolete list).
+ */
+static DEFINE_MUTEX(memcg_force_drain_mutex);
+
+static void drain_page_cgroup_local(struct work_struct *work)
+{
+ struct memcg_percpu_vec *mpv;
+ mpv = &get_cpu_var(memcg_free_vec);
+ __release_page_cgroup(mpv);
+ put_cpu_var(mpv);
+}
+
+static void drain_page_cgroup_cpu(int cpu)
+{
+ int local_cpu;
+ struct work_struct work;
+
+ local_cpu = get_cpu();
+ if (local_cpu == cpu) {
+ drain_page_cgroup_local(NULL);
+ put_cpu();
+ return;
+ }
+ put_cpu();
+
+ INIT_WORK(&work, drain_page_cgroup_local);
+ schedule_work_on(cpu, &work);
+ flush_work(&work);
+}
+
+static void drain_page_cgroup_all(void)
+{
+ mutex_lock(&memcg_force_drain_mutex);
+ schedule_on_each_cpu(drain_page_cgroup_local);
+ mutex_unlock(&memcg_force_drain_mutex);
+}
+
+
/*
* Charge the memory controller for page usage.
* Return
@@ -703,8 +814,6 @@ __mem_cgroup_uncharge_common(struct page
{
struct page_cgroup *pc;
struct mem_cgroup *mem;
- struct mem_cgroup_per_zone *mz;
- unsigned long flags;
if (mem_cgroup_subsys.disabled)
return;
@@ -722,16 +831,10 @@ __mem_cgroup_uncharge_common(struct page
}
ClearPageCgroupUsed(pc);
unlock_page_cgroup(pc);
+ preempt_enable();
mem = pc->mem_cgroup;
- mz = page_cgroup_zoneinfo(pc);
-
- spin_lock_irqsave(&mz->lru_lock, flags);
- __mem_cgroup_remove_list(mz, pc);
- spin_unlock_irqrestore(&mz->lru_lock, flags);
- pc->mem_cgroup = NULL;
- css_put(&mem->css);
- preempt_enable();
+ release_page_cgroup(pc);
res_counter_uncharge(&mem->res, PAGE_SIZE);
return;
@@ -889,6 +992,7 @@ static void mem_cgroup_force_empty_list(
if (!PageLRU(page) || !get_page_unless_zero(page)) {
list_move(&pc->lru, list);
spin_unlock_irqrestore(&mz->lru_lock, flags);
+ drain_page_cgroup_all();
yield();
spin_lock_irqsave(&mz->lru_lock, flags);
continue;
@@ -907,6 +1011,7 @@ static void mem_cgroup_force_empty_list(
unlock_page(page);
put_page(page);
spin_unlock_irqrestore(&mz->lru_lock, flags);
+ drain_page_cgroup_all();
yield();
spin_lock_irqsave(&mz->lru_lock, flags);
} else {
@@ -935,6 +1040,7 @@ static int mem_cgroup_force_empty(struct
* So, we have to do loop here until all lists are empty.
*/
while (mem->res.usage > 0) {
+ drain_page_cgroup_all();
if (atomic_read(&mem->css.cgroup->count) > 0)
goto out;
for_each_node_state(node, N_POSSIBLE)
@@ -947,6 +1053,7 @@ static int mem_cgroup_force_empty(struct
}
}
ret = 0;
+ drain_page_cgroup_all();
out:
css_put(&mem->css);
return ret;
@@ -1151,6 +1258,38 @@ static void mem_cgroup_free(struct mem_c
vfree(mem);
}
+static void mem_cgroup_init_pcp(int cpu)
+{
+ page_cgroup_start_cache_cpu(cpu);
+}
+
+static int cpu_memcgroup_callback(struct notifier_block *nb,
+ unsigned long action, void *hcpu)
+{
+ int cpu = (long)hcpu;
+
+ switch(action) {
+ case CPU_UP_PREPARE:
+ case CPU_UP_PREPARE_FROZEN:
+ mem_cgroup_init_pcp(cpu);
+ break;
+#ifdef CONFIG_HOTPLUG_CPU
+ case CPU_DOWN_PREPARE:
+ case CPU_DOWN_PREPARE_FROZEN:
+ page_cgroup_stop_cache_cpu(cpu);
+ drain_page_cgroup_cpu(cpu);
+ break;
+#endif
+ default:
+ break;
+ }
+ return NOTIFY_OK;
+}
+
+static struct notifier_block __refdata memcgroup_nb =
+{
+ .notifier_call = cpu_memcgroup_callback,
+};
static struct cgroup_subsys_state *
mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
@@ -1161,6 +1300,10 @@ mem_cgroup_create(struct cgroup_subsys *
if (unlikely((cont->parent) == NULL)) {
page_cgroup_init();
mem = &init_mem_cgroup;
+ cpu_memcgroup_callback(&memcgroup_nb,
+ (unsigned long)CPU_UP_PREPARE,
+ (void *)(long)smp_processor_id());
+ register_hotcpu_notifier(&memcgroup_nb);
} else {
mem = mem_cgroup_alloc();
if (!mem)
* [PATCH 12/13] memcg: lazy LRU add
2008-09-22 10:51 [PATCH 0/13] memory cgroup updates v4 KAMEZAWA Hiroyuki
` (10 preceding siblings ...)
2008-09-22 11:17 ` [PATCH 11/13] memcg: lazy LRU free (NEW) KAMEZAWA Hiroyuki
@ 2008-09-22 11:22 ` KAMEZAWA Hiroyuki
2008-09-22 11:24 ` [PATCH 13/13] memcg: swap accounting fix KAMEZAWA Hiroyuki
2008-09-22 11:28 ` [PATCH 0/13] memory cgroup updates v4 KAMEZAWA Hiroyuki
13 siblings, 0 replies; 37+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-09-22 11:22 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura, xemul, LKML
Delay add_to_lru() and do it in a batched manner like pagevec.
For doing that, two flags are used: PCG_USED and PCG_LRU.
If PCG_LRU is set, the page is on the LRU, and it is safe to access the LRU
via the page_cgroup (under some lock).
To avoid races, this patch uses TestSetPageCgroupUsed()
and checks the PCG_USED and PCG_LRU bits in the add/free vector code.
With this, lock_page_cgroup() in mem_cgroup_charge() is removed.
(I don't want to call lock_page_cgroup() under mz->lru_lock in the
add/free vector core logic, so the TestSetPageCgroupUsed() logic is added.
TestSet is an easy way to avoid unnecessary nesting of locks.)
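The core of the race avoidance is one atomic test-and-set on the USED bit, so
nothing has to be locked under mz->lru_lock. A small userspace model (a GCC
atomic builtin stands in for the kernel's test_and_set_bit()):

/* Two racing chargers; only the one that flips USED from 0 to 1 proceeds. */
#include <stdio.h>

#define PCG_USED 0

static unsigned long pc_flags;

/* stand-in for test_and_set_bit(): returns the old value of the bit */
static int test_and_set(int nr, unsigned long *addr)
{
        unsigned long old = __atomic_fetch_or(addr, 1UL << nr, __ATOMIC_SEQ_CST);
        return (old >> nr) & 1;
}

static void try_charge(const char *who)
{
        if (test_and_set(PCG_USED, &pc_flags)) {
                /* loser: undo its res_counter charge and back off */
                printf("%s: already used, uncharge and return\n", who);
                return;
        }
        /* winner: safe to set up pc->mem_cgroup and queue it for the LRU */
        printf("%s: won, sets flags and queues for lazy LRU add\n", who);
}

int main(void)
{
        try_charge("cpu0");
        try_charge("cpu1");
        return 0;
}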
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
include/linux/page_cgroup.h | 10 +++
mm/memcontrol.c | 125 ++++++++++++++++++++++++++++++--------------
2 files changed, 98 insertions(+), 37 deletions(-)
Index: mmotm-2.6.27-rc6+/include/linux/page_cgroup.h
===================================================================
--- mmotm-2.6.27-rc6+.orig/include/linux/page_cgroup.h
+++ mmotm-2.6.27-rc6+/include/linux/page_cgroup.h
@@ -23,6 +23,7 @@ enum {
PCG_LOCK, /* page cgroup is locked */
PCG_CACHE, /* charged as cache */
PCG_USED, /* this object is in use. */
+ PCG_LRU, /* this is on LRU */
/* flags for LRU placement */
PCG_ACTIVE, /* page is active in this cgroup */
PCG_FILE, /* page is file system backed */
@@ -41,11 +42,20 @@ static inline void SetPageCgroup##uname(
static inline void ClearPageCgroup##uname(struct page_cgroup *pc) \
{ clear_bit(PCG_##lname, &pc->flags); }
+#define TESTSETPCGFLAG(uname, lname)\
+static inline int TestSetPageCgroup##uname(struct page_cgroup *pc) \
+ { return test_and_set_bit(PCG_##lname, &pc->flags); }
+
/* Cache flag is set only once (at allocation) */
TESTPCGFLAG(Cache, CACHE)
TESTPCGFLAG(Used, USED)
CLEARPCGFLAG(Used, USED)
+TESTSETPCGFLAG(Used, USED)
+
+TESTPCGFLAG(LRU, LRU)
+SETPCGFLAG(LRU, LRU)
+CLEARPCGFLAG(LRU, LRU)
/* LRU management flags (from global-lru definition) */
TESTPCGFLAG(File, FILE)
Index: mmotm-2.6.27-rc6+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc6+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc6+/mm/memcontrol.c
@@ -149,9 +149,9 @@ enum charge_type {
static const unsigned long
pcg_default_flags[NR_CHARGE_TYPE] = {
- (1 << PCG_CACHE) | (1 << PCG_FILE) | (1 << PCG_USED) | (1 << PCG_LOCK),
- (1 << PCG_ACTIVE) | (1 << PCG_LOCK) | (1 << PCG_USED),
- (1 << PCG_ACTIVE) | (1 << PCG_CACHE) | (1 << PCG_USED)| (1 << PCG_LOCK),
+ (1 << PCG_CACHE) | (1 << PCG_FILE) | (1 << PCG_USED),
+ (1 << PCG_ACTIVE) | (1 << PCG_USED),
+ (1 << PCG_ACTIVE) | (1 << PCG_CACHE) | (1 << PCG_USED),
0,
};
@@ -194,7 +194,6 @@ page_cgroup_zoneinfo(struct page_cgroup
struct mem_cgroup *mem = pc->mem_cgroup;
int nid = page_cgroup_nid(pc);
int zid = page_cgroup_zid(pc);
-
return mem_cgroup_zoneinfo(mem, nid, zid);
}
@@ -342,7 +341,7 @@ void mem_cgroup_move_lists(struct page *
if (!trylock_page_cgroup(pc))
return;
- if (PageCgroupUsed(pc)) {
+ if (PageCgroupUsed(pc) && PageCgroupLRU(pc)) {
mem = pc->mem_cgroup;
mz = page_cgroup_zoneinfo(pc);
spin_lock_irqsave(&mz->lru_lock, flags);
@@ -502,6 +501,9 @@ int mem_cgroup_move_account(struct page
from_mz = mem_cgroup_zoneinfo(from, nid, zid);
to_mz = mem_cgroup_zoneinfo(to, nid, zid);
+ if (!PageCgroupLRU(pc))
+ return ret;
+
if (res_counter_charge(&to->res, PAGE_SIZE)) {
/* Now, we assume no_limit...no failure here. */
return ret;
@@ -518,10 +520,8 @@ int mem_cgroup_move_account(struct page
if (spin_trylock(&to_mz->lru_lock)) {
__mem_cgroup_remove_list(from_mz, pc);
- css_put(&from->css);
res_counter_uncharge(&from->res, PAGE_SIZE);
pc->mem_cgroup = to;
- css_get(&to->css);
__mem_cgroup_add_list(to_mz, pc);
ret = 0;
spin_unlock(&to_mz->lru_lock);
@@ -542,6 +542,7 @@ struct memcg_percpu_vec {
struct page_cgroup *vec[MEMCG_PCPVEC_SIZE];
};
static DEFINE_PER_CPU(struct memcg_percpu_vec, memcg_free_vec);
+static DEFINE_PER_CPU(struct memcg_percpu_vec, memcg_add_vec);
static void
__release_page_cgroup(struct memcg_percpu_vec *mpv)
@@ -557,7 +558,6 @@ __release_page_cgroup(struct memcg_percp
prev_mz = NULL;
for (i = nr - 1; i >= 0; i--) {
pc = mpv->vec[i];
- VM_BUG_ON(PageCgroupUsed(pc));
mz = page_cgroup_zoneinfo(pc);
if (prev_mz != mz) {
if (prev_mz)
@@ -565,9 +565,10 @@ __release_page_cgroup(struct memcg_percp
prev_mz = mz;
spin_lock(&mz->lru_lock);
}
- __mem_cgroup_remove_list(mz, pc);
- css_put(&pc->mem_cgroup->css);
- pc->mem_cgroup = NULL;
+ if (!PageCgroupUsed(pc) && PageCgroupLRU(pc)) {
+ __mem_cgroup_remove_list(mz, pc);
+ ClearPageCgroupLRU(pc);
+ }
}
if (prev_mz)
spin_unlock(&prev_mz->lru_lock);
@@ -576,10 +577,43 @@ __release_page_cgroup(struct memcg_percp
}
static void
+__set_page_cgroup_lru(struct memcg_percpu_vec *mpv)
+{
+ unsigned long flags;
+ struct mem_cgroup_per_zone *mz, *prev_mz;
+ struct page_cgroup *pc;
+ int i, nr;
+
+ local_irq_save(flags);
+ nr = mpv->nr;
+ mpv->nr = 0;
+ prev_mz = NULL;
+
+ for (i = nr - 1; i >= 0; i--) {
+ pc = mpv->vec[i];
+ mz = page_cgroup_zoneinfo(pc);
+ if (prev_mz != mz) {
+ if (prev_mz)
+ spin_unlock(&prev_mz->lru_lock);
+ prev_mz = mz;
+ spin_lock(&mz->lru_lock);
+ }
+ if (PageCgroupUsed(pc) && !PageCgroupLRU(pc)) {
+ SetPageCgroupLRU(pc);
+ __mem_cgroup_add_list(mz, pc);
+ }
+ }
+
+ if (prev_mz)
+ spin_unlock(&prev_mz->lru_lock);
+ local_irq_restore(flags);
+
+}
+
+static void
release_page_cgroup(struct page_cgroup *pc)
{
struct memcg_percpu_vec *mpv;
-
mpv = &get_cpu_var(memcg_free_vec);
mpv->vec[mpv->nr++] = pc;
if (mpv->nr >= mpv->limit)
@@ -587,11 +621,25 @@ release_page_cgroup(struct page_cgroup *
put_cpu_var(memcg_free_vec);
}
+static void
+set_page_cgroup_lru(struct page_cgroup *pc)
+{
+ struct memcg_percpu_vec *mpv;
+
+ mpv = &get_cpu_var(memcg_add_vec);
+ mpv->vec[mpv->nr++] = pc;
+ if (mpv->nr >= mpv->limit)
+ __set_page_cgroup_lru(mpv);
+ put_cpu_var(memcg_add_vec);
+}
+
static void page_cgroup_start_cache_cpu(int cpu)
{
struct memcg_percpu_vec *mpv;
mpv = &per_cpu(memcg_free_vec, cpu);
mpv->limit = MEMCG_PCPVEC_SIZE;
+ mpv = &per_cpu(memcg_add_vec, cpu);
+ mpv->limit = MEMCG_PCPVEC_SIZE;
}
#ifdef CONFIG_HOTPLUG_CPU
@@ -600,6 +648,8 @@ static void page_cgroup_stop_cache_cpu(i
struct memcg_percpu_vec *mpv;
mpv = &per_cpu(memcg_free_vec, cpu);
mpv->limit = 0;
+ mpv = &per_cpu(memcg_add_vec, cpu);
+ mpv->limit = 0;
}
#endif
@@ -613,6 +663,9 @@ static DEFINE_MUTEX(memcg_force_drain_mu
static void drain_page_cgroup_local(struct work_struct *work)
{
struct memcg_percpu_vec *mpv;
+ mpv = &get_cpu_var(memcg_add_vec);
+ __set_page_cgroup_lru(mpv);
+ put_cpu_var(mpv);
mpv = &get_cpu_var(memcg_free_vec);
__release_page_cgroup(mpv);
put_cpu_var(mpv);
@@ -679,14 +732,9 @@ static int mem_cgroup_charge_common(stru
rcu_read_unlock();
return 0;
}
- /*
- * For every charge from the cgroup, increment reference count
- */
- css_get(&mem->css);
rcu_read_unlock();
} else {
mem = memcg;
- css_get(&memcg->css);
}
while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE))) {
@@ -713,33 +761,36 @@ static int mem_cgroup_charge_common(stru
}
preempt_disable();
- lock_page_cgroup(pc);
- if (unlikely(PageCgroupUsed(pc))) {
- unlock_page_cgroup(pc);
+ if (TestSetPageCgroupUsed(pc)) {
res_counter_uncharge(&mem->res, PAGE_SIZE);
- css_put(&mem->css);
preempt_enable();
goto done;
}
- pc->mem_cgroup = mem;
/*
- * If a page is accounted as a page cache, insert to inactive list.
- * If anon, insert to active list.
- */
- pc->flags = pcg_default_flags[ctype];
-
- mz = page_cgroup_zoneinfo(pc);
+ * page cgroup is *unused* now....but....
+ * We can assume old mem_cgroup's metadata is still available
+ * because pc is not on stale LRU after force_empty() is called.
+ */
+ if (likely(!PageCgroupLRU(pc)))
+ pc->flags = pcg_default_flags[ctype];
+ else {
+ mz = page_cgroup_zoneinfo(pc);
+ spin_lock_irqsave(&mz->lru_lock, flags);
+ if (PageCgroupLRU(pc)) {
+ __mem_cgroup_remove_list(mz, pc);
+ ClearPageCgroupLRU(pc);
+ }
+ pc->flags = pcg_default_flags[ctype];
+ spin_unlock_irqrestore(&mz->lru_lock, flags);
+ }
- spin_lock_irqsave(&mz->lru_lock, flags);
- __mem_cgroup_add_list(mz, pc);
- spin_unlock_irqrestore(&mz->lru_lock, flags);
- unlock_page_cgroup(pc);
+ pc->mem_cgroup = mem;
+ set_page_cgroup_lru(pc);
preempt_enable();
done:
return 0;
out:
- css_put(&mem->css);
return -ENOMEM;
}
@@ -830,12 +881,12 @@ __mem_cgroup_uncharge_common(struct page
return;
}
ClearPageCgroupUsed(pc);
+ mem = pc->mem_cgroup;
unlock_page_cgroup(pc);
preempt_enable();
+ res_counter_uncharge(&mem->res, PAGE_SIZE);
- mem = pc->mem_cgroup;
release_page_cgroup(pc);
- res_counter_uncharge(&mem->res, PAGE_SIZE);
return;
}
@@ -1054,6 +1105,7 @@ static int mem_cgroup_force_empty(struct
}
ret = 0;
drain_page_cgroup_all();
+ synchronize_sched();
out:
css_put(&mem->css);
return ret;
@@ -1340,8 +1392,7 @@ static void mem_cgroup_destroy(struct cg
for_each_node_state(node, N_POSSIBLE)
free_mem_cgroup_per_zone_info(mem, node);
-
- mem_cgroup_free(mem_cgroup_from_cont(cont));
+ mem_cgroup_free(mem);
}
static int mem_cgroup_populate(struct cgroup_subsys *ss,
* [PATCH 13/13] memcg: swap accounting fix
2008-09-22 10:51 [PATCH 0/13] memory cgroup updates v4 KAMEZAWA Hiroyuki
` (11 preceding siblings ...)
2008-09-22 11:22 ` [PATCH 12/13] memcg: lazy LRU add KAMEZAWA Hiroyuki
@ 2008-09-22 11:24 ` KAMEZAWA Hiroyuki
2008-09-22 11:28 ` [PATCH 0/13] memory cgroup updates v4 KAMEZAWA Hiroyuki
13 siblings, 0 replies; 37+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-09-22 11:24 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura, xemul, LKML
There is a small race in do_swap_page(). When the swapped-in page is charged,
its mapcount can be greater than 0. But at the same time some process (which
shares it) can call unmap, dropping the mapcount from 1 to 0, and the page is
uncharged.
To fix this, I added a new interface.
- precharge
charge the res_counter by PAGE_SIZE and try to free pages if necessary.
- commit
register the page_cgroup and add it to the LRU if necessary.
- cancel
uncharge PAGE_SIZE because do_swap_page() failed.
This protocol uses the PCG_USED bit on page_cgroup to avoid over-accounting.
The usual mem_cgroup_charge_common() does precharge -> commit in one step.
The precharge/commit/cancel calls can be used in other places too (shmem,
migration, etc.); we'll revisit that later.
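The intended ordering, reduced to a toy model with all memcg internals stubbed
out (the real caller is the do_swap_page() hunk at the end of this patch):

/* Toy model of the precharge -> commit/cancel ordering; not kernel code. */
#include <stdio.h>

static int precharge(int *charge)
{
        *charge = 1;            /* reserve PAGE_SIZE before the pte is mapped */
        printf("precharge: res_counter += PAGE_SIZE\n");
        return 0;
}

static void commit(int *charge)
{
        if (*charge)            /* mapcount is stable now */
                printf("commit: set PCG_USED, register pc, add to LRU\n");
}

static void cancel(int *charge)
{
        if (*charge)            /* mapping failed, nothing was registered */
                printf("cancel: res_counter -= PAGE_SIZE\n");
}

static void do_swap_page_like(int map_ok)
{
        int charge = 0;

        if (precharge(&charge))
                return;         /* VM_FAULT_OOM in the real code */
        if (map_ok)
                commit(&charge);
        else
                cancel(&charge);
}

int main(void)
{
        do_swap_page_like(1);
        do_swap_page_like(0);
        return 0;
}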
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
include/linux/memcontrol.h | 21 +++++++
mm/memcontrol.c | 135 +++++++++++++++++++++++++++++++--------------
mm/memory.c | 6 +-
3 files changed, 120 insertions(+), 42 deletions(-)
Index: mmotm-2.6.27-rc6+/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.27-rc6+.orig/include/linux/memcontrol.h
+++ mmotm-2.6.27-rc6+/include/linux/memcontrol.h
@@ -32,6 +32,13 @@ struct mm_struct;
extern struct page_cgroup *page_get_page_cgroup(struct page *page);
extern int mem_cgroup_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
+/* for swap handling */
+extern int mem_cgroup_precharge(struct mm_struct *mm,
+ gfp_t gfp_mask, struct mem_cgroup **ptr);
+extern void mem_cgroup_commit_charge_swap(struct page *page,
+ struct mem_cgroup *ptr);
+extern void mem_cgroup_cancel_charge_swap(struct mem_cgroup *ptr);
+
extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
extern void mem_cgroup_move_lists(struct page *page, enum lru_list lru);
@@ -94,6 +101,20 @@ static inline int mem_cgroup_cache_charg
return 0;
}
+static int mem_cgroup_precharge(struct mm_struct *mm,
+ gfp_t gfp_mask, struct mem_cgroup **ptr)
+{
+ return 0;
+}
+
+static void mem_cgroup_commit_charge_swap(struct page *page,
+ struct mem_cgroup *ptr)
+{
+}
+static void mem_cgroup_cancel_charge_swap(struct mem_cgroup *ptr)
+{
+}
+
static inline void mem_cgroup_uncharge_page(struct page *page)
{
}
Index: mmotm-2.6.27-rc6+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc6+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc6+/mm/memcontrol.c
@@ -698,52 +698,44 @@ static void drain_page_cgroup_all(void)
/*
- * Charge the memory controller for page usage.
- * Return
- * 0 if the charge was successful
- * < 0 if the cgroup is over its limit
+ * charge against mem_cgroup linked to this mm. (or *ptr)
+ *
+ * This just charge PAGE_SIZE and reduce memory usage if necessary.
+ *
+ * Pages on radix-tree is charged at radix-tree add/remove under lock.
+ * new pages are charged at allocation and both are guaranteed to be that
+ * there are no racy users. We does precharge->commit at once.
+ *
+ * About swapcache, we can't trust page->mapcount until it's mapped.
+ * Then we do precharge before map and commit/cancel after the mapping is
+ * established. (see below, we have commit_swap and cancel_swap)
*/
-static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
- gfp_t gfp_mask, enum charge_type ctype,
- struct mem_cgroup *memcg)
+
+int mem_cgroup_precharge(struct mm_struct *mm,
+ gfp_t mask, struct mem_cgroup **ptr)
{
struct mem_cgroup *mem;
- struct page_cgroup *pc;
- unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
- struct mem_cgroup_per_zone *mz;
- unsigned long flags;
-
- pc = lookup_page_cgroup(page);
- /* can happen at boot */
- if (unlikely(!pc))
- return 0;
- prefetchw(pc);
- /*
- * We always charge the cgroup the mm_struct belongs to.
- * The mm_struct's mem_cgroup changes on task migration if the
- * thread group leader migrates. It's possible that mm is not
- * set, if so charge the init_mm (happens for pagecache usage).
- */
+ int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
- if (likely(!memcg)) {
- rcu_read_lock();
+ rcu_read_lock();
+ if (!*ptr) {
mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
if (unlikely(!mem)) {
rcu_read_unlock();
- return 0;
+ return -ESRCH;
}
- rcu_read_unlock();
+ *ptr = mem;
} else {
- mem = memcg;
+ mem = *ptr;
}
+ rcu_read_unlock();
+ css_get(&mem->css);
while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE))) {
- if (!(gfp_mask & __GFP_WAIT))
- goto out;
-
- if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
+ if (!(mask & __GFP_WAIT))
+ goto nomem;
+ if (try_to_free_mem_cgroup_pages(mem, mask))
continue;
-
/*
* try_to_free_mem_cgroup_pages() might not give us a full
* picture of reclaim. Some pages are reclaimed and might be
@@ -755,16 +747,31 @@ static int mem_cgroup_charge_common(stru
continue;
if (!nr_retries--) {
- mem_cgroup_out_of_memory(mem, gfp_mask);
- goto out;
+ mem_cgroup_out_of_memory(mem, mask);
+ goto nomem;
}
}
+ return 0;
+nomem:
+ css_put(&mem->css);
+ return -ENOMEM;
+}
+void mem_cgroup_commit_charge(struct page_cgroup *pc,
+ struct mem_cgroup *mem,
+ enum charge_type ctype)
+{
+ struct mem_cgroup_per_zone *mz;
+ unsigned long flags;
+
+ if (!mem)
+ return;
preempt_disable();
if (TestSetPageCgroupUsed(pc)) {
res_counter_uncharge(&mem->res, PAGE_SIZE);
+ css_put(&mem->css);
preempt_enable();
- goto done;
+ return;
}
/*
* page cgroup is *unused* now....but....
@@ -786,14 +793,43 @@ static int mem_cgroup_charge_common(stru
pc->mem_cgroup = mem;
set_page_cgroup_lru(pc);
+ css_put(&mem->css);
preempt_enable();
+}
-done:
+/*
+ * Charge the memory controller for page usage.
+ * Return
+ * 0 if the charge was successful
+ * < 0 if the cgroup is over its limit
+ */
+static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
+ gfp_t gfp_mask, enum charge_type ctype,
+ struct mem_cgroup *memcg)
+{
+ struct page_cgroup *pc;
+ struct mem_cgroup *ptr = memcg;
+ int ret;
+
+ pc = lookup_page_cgroup(page);
+ /* can happen at boot */
+ if (unlikely(!pc))
+ return 0;
+ prefetchw(pc);
+
+ ret = mem_cgroup_precharge(mm, gfp_mask, &ptr);
+ if (likely(!ret)) {
+ mem_cgroup_commit_charge(pc, ptr, ctype);
+ return 0;
+ }
+ if (unlikely((ret == -ENOMEM)))
+ return ret;
+ /* ESRCH case */
return 0;
-out:
- return -ENOMEM;
}
+
+
int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
{
if (mem_cgroup_subsys.disabled)
@@ -806,7 +842,7 @@ int mem_cgroup_charge(struct page *page,
* But page->mapping may have out-of-use anon_vma pointer,
* detecit it by PageAnon() check. newly-mapped-anon's page->mapping
* is NULL.
- */
+ */
if (page_mapped(page) || (page->mapping && !PageAnon(page)))
return 0;
if (unlikely(!mm))
@@ -857,6 +893,25 @@ int mem_cgroup_cache_charge(struct page
MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
}
+
+void mem_cgroup_commit_charge_swap(struct page *page, struct mem_cgroup *ptr)
+{
+ struct page_cgroup *pc;
+ if (!ptr)
+ return;
+ pc = lookup_page_cgroup(page);
+ mem_cgroup_commit_charge(pc, ptr, MEM_CGROUP_CHARGE_TYPE_MAPPED);
+}
+
+void mem_cgroup_cancel_charge_swap(struct mem_cgroup *mem)
+{
+ if (!mem)
+ return;
+ res_counter_uncharge(&mem->res, PAGE_SIZE);
+ css_put(&mem->css);
+}
+
+
/*
* uncharge if !page_mapped(page)
*/
Index: mmotm-2.6.27-rc6+/mm/memory.c
===================================================================
--- mmotm-2.6.27-rc6+.orig/mm/memory.c
+++ mmotm-2.6.27-rc6+/mm/memory.c
@@ -2287,6 +2287,7 @@ static int do_swap_page(struct mm_struct
struct page *page;
swp_entry_t entry;
pte_t pte;
+ struct mem_cgroup *ptr = NULL;
int ret = 0;
if (!pte_unmap_same(mm, pmd, page_table, orig_pte))
@@ -2323,7 +2324,7 @@ static int do_swap_page(struct mm_struct
lock_page(page);
delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
- if (mem_cgroup_charge(page, mm, GFP_KERNEL)) {
+ if (mem_cgroup_precharge(mm, GFP_KERNEL, &ptr) == -ENOMEM) {
ret = VM_FAULT_OOM;
goto out;
}
@@ -2353,6 +2354,7 @@ static int do_swap_page(struct mm_struct
flush_icache_page(vma, page);
set_pte_at(mm, address, page_table, pte);
page_add_anon_rmap(page, vma, address);
+ mem_cgroup_commit_charge_swap(page, ptr);
swap_free(entry);
if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
@@ -2373,7 +2375,7 @@ unlock:
out:
return ret;
out_nomap:
- mem_cgroup_uncharge_page(page);
+ mem_cgroup_cancel_charge_swap(ptr);
pte_unmap_unlock(page_table, ptl);
unlock_page(page);
page_cache_release(page);
* Re: [PATCH 0/13] memory cgroup updates v4
2008-09-22 10:51 [PATCH 0/13] memory cgroup updates v4 KAMEZAWA Hiroyuki
` (12 preceding siblings ...)
2008-09-22 11:24 ` [PATCH 13/13] memcg: swap accounting fix KAMEZAWA Hiroyuki
@ 2008-09-22 11:28 ` KAMEZAWA Hiroyuki
13 siblings, 0 replies; 37+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-09-22 11:28 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura, xemul, LKML
On Mon, 22 Sep 2008 19:51:59 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> Brief description.
>
> 1/13 .... special mapping fix. (NEW)
> => avoid accounting pages not on LRU...which we cannot reclaim.
>
> 2/13 .... account swap-cache under lock.
> => move accounting of swap-cache under lock for avoiding unnecessary race.
>
> 3/13 .... make root cgroup to be unlimited.
> => fix root cgroup's memory limit to unlimited.
>
> 4/13 .... atomic-flags for page_cgroup
> => make page_cgroup->flags to be atomic.
>
> 5/13 .... implement move_account function.
> => add a function for moving page_cgroup's account to other cgroup.
>
> 6/13 ... force_empty to migrate account
> => move all account to root cgroup rather than forget all.
>
> 7/13 ...make mapping NULL (clean up)
> => ensure page->mapping to be NULL before calling mem_cgroup_uncharge_cache().
>
> 8/13 ...optimize cpustat
> => optimize access to per-cpu statistics for memcg.
>
> 9/13 ...lookup page_cgroup (CHANGED)
> => preallocate all page_cgroup at boot and remove page->page_cgroup pointer.
>
> 10/13...page_cgroup lookaside buffer
> => helps looking up page_cgroup from page.
>
> 11/13...lazy lru freeing page_cgroup (NEW)
> => do removal from LRU in a batched manner like pagevec.
>
> 12/13...lazy lru add page_cgroup (NEW)
> => do addition to LRU in a batched manner like pagevec.
>
> 13/13...swap accounting fix. (NEW)
> => fix a race in swap accounting (can happen)
> and introduce a new protocol: precharge/commit/cancel.
>
> Some patches are big but not complicated I think.
>
Sorry for crazy patch numbering...
1 -> 1
2 -> 2
3 -> 3
3.5 -> 4
3.6 -> 5
4 -> 6
5 -> 7
6 -> 8
9 -> 9
10 -> 10
11 -> 11
12 -> 12
I may not be able to respond quickly, sorry.
Thanks,
-Kame
* Re: [PATCH 4/13] memcg: force_empty moving account
2008-09-22 11:00 ` [PATCH 4/13] memcg: force_empty moving account KAMEZAWA Hiroyuki
@ 2008-09-22 14:23 ` Peter Zijlstra
2008-09-22 14:50 ` kamezawa.hiroyu
1 sibling, 0 replies; 37+ messages in thread
From: Peter Zijlstra @ 2008-09-22 14:23 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura, xemul, LKML, Ingo Molnar
On Mon, 2008-09-22 at 20:00 +0900, KAMEZAWA Hiroyuki wrote:
> + /* For avoiding race with speculative page cache handling. */
> + if (!PageLRU(page) || !get_page_unless_zero(page)) {
> + list_move(&pc->lru, list);
> + spin_unlock_irqrestore(&mz->lru_lock, flags);
> + yield();
Gah, no way!
> + spin_lock_irqsave(&mz->lru_lock, flags);
> + continue;
> + }
> + if (!trylock_page(page)) {
> + list_move(&pc->lru, list);
> put_page(page);
> - if (--count <= 0) {
> - count = FORCE_UNCHARGE_BATCH;
> - cond_resched();
> - }
> - } else
> - cond_resched();
> - spin_lock_irqsave(&mz->lru_lock, flags);
> + spin_unlock_irqrestore(&mz->lru_lock, flags);
> + yield();
Seriously?!
> + spin_lock_irqsave(&mz->lru_lock, flags);
> + continue;
> + }
> + if (mem_cgroup_move_account(page, pc, mem, &init_mem_cgroup)) {
> + /* some confliction */
> + list_move(&pc->lru, list);
> + unlock_page(page);
> + put_page(page);
> + spin_unlock_irqrestore(&mz->lru_lock, flags);
> + yield();
Inflicting pain..
> + spin_lock_irqsave(&mz->lru_lock, flags);
> + } else {
> + unlock_page(page);
> + put_page(page);
> + }
> + if (atomic_read(&mem->css.cgroup->count) > 0)
> + break;
> }
> spin_unlock_irqrestore(&mz->lru_lock, flags);
do _NOT_ use yield() ever! unless you know what you're doing, and
probably not even then.
NAK!
* Re: Re: [PATCH 4/13] memcg: force_empty moving account
2008-09-22 11:00 ` [PATCH 4/13] memcg: force_empty moving account KAMEZAWA Hiroyuki
2008-09-22 14:23 ` Peter Zijlstra
@ 2008-09-22 14:50 ` kamezawa.hiroyu
2008-09-22 14:56 ` Peter Zijlstra
2008-09-22 15:06 ` kamezawa.hiroyu
1 sibling, 2 replies; 37+ messages in thread
From: kamezawa.hiroyu @ 2008-09-22 14:50 UTC (permalink / raw)
To: Peter Zijlstra
Cc: KAMEZAWA Hiroyuki, linux-mm, balbir, nishimura, xemul, LKML, Ingo Molnar
----- Original Message -----
>> + spin_lock_irqsave(&mz->lru_lock, flags);
>> + } else {
>> + unlock_page(page);
>> + put_page(page);
>> + }
>> + if (atomic_read(&mem->css.cgroup->count) > 0)
>> + break;
>> }
>> spin_unlock_irqrestore(&mz->lru_lock, flags);
>
>do _NOT_ use yield() ever! unless you know what you're doing, and
>probably not even then.
>
>NAK!
Hmm, sorry. cond_resched() is ok ?
Thanks,
-Kame
* Re: [PATCH 9/13] memcg: lookup page cgroup (and remove pointer from struct page)
2008-09-22 11:12 ` [PATCH 9/13] memcg: lookup page cgroup (and remove pointer from struct page) KAMEZAWA Hiroyuki
@ 2008-09-22 14:52 ` Dave Hansen
2008-09-22 15:14 ` kamezawa.hiroyu
` (3 subsequent siblings)
4 siblings, 0 replies; 37+ messages in thread
From: Dave Hansen @ 2008-09-22 14:52 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura, xemul, LKML
On Mon, 2008-09-22 at 20:12 +0900, KAMEZAWA Hiroyuki wrote:
> + /* we don't need too large hash */
> + default_pcg_hash_size = (max_pfn/ENTS_PER_CHUNK);
> + default_pcg_hash_size *= 2;
> + /* if too big, use automatic calclation */
> + if (default_pcg_hash_size > 1024 * 1024)
> + default_pcg_hash_size = 0;
> +
> + pcg_hashtable = alloc_large_system_hash("PageCgroup Hash",
> + sizeof(struct pcg_hash_head),
> + default_pcg_hash_size,
> + 13,
> + 0,
> + &pcg_hashshift,
> + &pcg_hashmask,
> + 0);
The one thing I don't see here is much explanation about how large this
structure will get.
Basing it on max_pfn makes me nervous because of what it will do on
machines with very sparse memory. Is this like sparsemem where the
structure can be small enough to actually span all of physical memory,
or will it be a large memory user?
Can you lay out how much memory this will use on a machine like Dave
Miller's which has 1GB of memory at 0x0 and 1GB of memory at 1TB up in
the address space?
Also, how large do the hash buckets get in the average case?
-- Dave
* Re: Re: [PATCH 4/13] memcg: force_empty moving account
2008-09-22 14:50 ` kamezawa.hiroyu
@ 2008-09-22 14:56 ` Peter Zijlstra
2008-09-22 15:06 ` kamezawa.hiroyu
1 sibling, 0 replies; 37+ messages in thread
From: Peter Zijlstra @ 2008-09-22 14:56 UTC (permalink / raw)
To: kamezawa.hiroyu; +Cc: linux-mm, balbir, nishimura, xemul, LKML, Ingo Molnar
On Mon, 2008-09-22 at 23:50 +0900, kamezawa.hiroyu@jp.fujitsu.com wrote:
> ----- Original Message -----
> >> + spin_lock_irqsave(&mz->lru_lock, flags);
> >> + } else {
> >> + unlock_page(page);
> >> + put_page(page);
> >> + }
> >> + if (atomic_read(&mem->css.cgroup->count) > 0)
> >> + break;
> >> }
> >> spin_unlock_irqrestore(&mz->lru_lock, flags);
> >
> >do _NOT_ use yield() ever! unless you know what you're doing, and
> >probably not even then.
> >
> >NAK!
> Hmm, sorry. cond_resched() is ok ?
depends on what you want to do, please explain what you're trying to do.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Re: Re: [PATCH 4/13] memcg: force_empty moving account
2008-09-22 14:50 ` kamezawa.hiroyu
2008-09-22 14:56 ` Peter Zijlstra
@ 2008-09-22 15:06 ` kamezawa.hiroyu
2008-09-22 15:32 ` Peter Zijlstra
2008-09-22 15:43 ` kamezawa.hiroyu
1 sibling, 2 replies; 37+ messages in thread
From: kamezawa.hiroyu @ 2008-09-22 15:06 UTC (permalink / raw)
To: Peter Zijlstra
Cc: kamezawa.hiroyu, linux-mm, balbir, nishimura, xemul, LKML, Ingo Molnar
----- Original Message -----
>On Mon, 2008-09-22 at 23:50 +0900, kamezawa.hiroyu@jp.fujitsu.com wrote:
>> ----- Original Message -----
>> >> + spin_lock_irqsave(&mz->lru_lock, flags);
>> >> + } else {
>> >> + unlock_page(page);
>> >> + put_page(page);
>> >> + }
>> >> + if (atomic_read(&mem->css.cgroup->count) > 0)
>> >> + break;
>> >> }
>> >> spin_unlock_irqrestore(&mz->lru_lock, flags);
>> >
>> >do _NOT_ use yield() ever! unless you know what you're doing, and
>> >probably not even then.
>> >
>> >NAK!
>> Hmm, sorry. cond_resched() is ok ?
>
>depends on what you want to do, please explain what you're trying to do.
>
Sorry again.
This force_empty is called only in the following situation:
- there are no user threads in this cgroup.
- a user tries to rmdir() this cgroup or explicitly types
echo 1 > ../memory.force_empty.
force_empty() scans the LRU list of this cgroup and checks the page_cgroups on
the list one by one. Because there are no tasks in this group, force_empty can
see the following racy conditions while scanning.
- the global LRU tries to remove the page pointed to by the page_cgroup,
and the page is not on the LRU.
- the page is locked by someone.
....we find some lock contention with invalidation/truncate.
- in a later patch, a page_cgroup can be on a pagevec (which I added) and we
have to drain it to remove the page_cgroup from the LRU.
In the above situations, force_empty() has to wait for some event to proceed.
Hmm...is detecting the busy situation in the loop and sleeping outside of the
loop better? Anyway, OK, I'll rewrite this.
BTW, what is sched.c::yield() for now?
Thanks,
-Kame
* Re: Re: [PATCH 9/13] memcg: lookup page cgroup (and remove pointer from struct page)
2008-09-22 11:12 ` [PATCH 9/13] memcg: lookup page cgroup (and remove pointer from struct page) KAMEZAWA Hiroyuki
2008-09-22 14:52 ` Dave Hansen
@ 2008-09-22 15:14 ` kamezawa.hiroyu
2008-09-22 15:47 ` Dave Hansen
2008-09-22 15:57 ` kamezawa.hiroyu
2008-09-22 15:47 ` Peter Zijlstra
` (2 subsequent siblings)
4 siblings, 2 replies; 37+ messages in thread
From: kamezawa.hiroyu @ 2008-09-22 15:14 UTC (permalink / raw)
To: Dave Hansen; +Cc: KAMEZAWA Hiroyuki, linux-mm, balbir, nishimura, xemul, LKML
----- Original Message -----
>On Mon, 2008-09-22 at 20:12 +0900, KAMEZAWA Hiroyuki wrote:
>> + /* we don't need too large hash */
>> + default_pcg_hash_size = (max_pfn/ENTS_PER_CHUNK);
>> + default_pcg_hash_size *= 2;
>> + /* if too big, use automatic calclation */
>> + if (default_pcg_hash_size > 1024 * 1024)
>> + default_pcg_hash_size = 0;
>> +
>> + pcg_hashtable = alloc_large_system_hash("PageCgroup Hash",
>> + sizeof(struct pcg_hash_head),
>> + default_pcg_hash_size,
>> + 13,
>> + 0,
>> + &pcg_hashshift,
>> + &pcg_hashmask,
>> + 0);
>
>The one thing I don't see here is much explanation about how large this
>structure will get.
>
Max 8MB (1024 * 1024 * 8)... I'll reduce this.
>Basing it on max_pfn makes me nervous because of what it will do on
>machines with very sparse memory. Is this like sparsemem where the
>structure can be small enough to actually span all of physical memory,
>or will it be a large memory user?
>
I admit this calculation is too simplistic. Hmm, basing it on totalram_pages
is better. OK.
>Can you lay out how much memory this will use on a machine like Dave
>Miller's which has 1GB of memory at 0x0 and 1GB of memory at 1TB up in
>the address space?
>
>Also, how large do the hash buckets get in the average case?
>
On my 48GB box, the hash table was 16384 bytes (from the dmesg log;
the section size was 128MB.)
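That matches a back-of-envelope check, assuming one chunk covers a 128MB
section of 4KB pages and sizeof(struct pcg_hash_head) is 16 bytes on 64-bit
(both are just my assumptions for this estimate):

/* Back-of-envelope check of the 16384-byte figure above; not kernel code. */
#include <stdio.h>

int main(void)
{
        unsigned long long max_pfn   = (48ULL << 30) >> 12;  /* ~12.6M pfns  */
        unsigned long long per_chunk = (128ULL << 20) >> 12; /* 32768 pfns   */
        unsigned long long want = (max_pfn / per_chunk) * 2; /* 768 requested */
        unsigned long long buckets = 1;

        while (buckets < want)          /* the hash allocator rounds up */
                buckets <<= 1;          /* -> 1024 buckets              */
        printf("%llu buckets -> %llu bytes\n", buckets, buckets * 16);
        return 0;
}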
I'll rewrite this based on totalram_pages.
BTW, do you know the difference between num_physpages and totalram_pages?
Thanks,
-Kame
* Re: Re: Re: [PATCH 4/13] memcg: force_empty moving account
2008-09-22 15:06 ` kamezawa.hiroyu
@ 2008-09-22 15:32 ` Peter Zijlstra
2008-09-22 15:43 ` kamezawa.hiroyu
1 sibling, 0 replies; 37+ messages in thread
From: Peter Zijlstra @ 2008-09-22 15:32 UTC (permalink / raw)
To: kamezawa.hiroyu; +Cc: linux-mm, balbir, nishimura, xemul, LKML, Ingo Molnar
On Tue, 2008-09-23 at 00:06 +0900, kamezawa.hiroyu@jp.fujitsu.com wrote:
> ----- Original Message -----
> >On Mon, 2008-09-22 at 23:50 +0900, kamezawa.hiroyu@jp.fujitsu.com wrote:
> >> ----- Original Message -----
> >> >> + spin_lock_irqsave(&mz->lru_lock, flags);
> >> >> + } else {
> >> >> + unlock_page(page);
> >> >> + put_page(page);
> >> >> + }
> >> >> + if (atomic_read(&mem->css.cgroup->count) > 0)
> >> >> + break;
> >> >> }
> >> >> spin_unlock_irqrestore(&mz->lru_lock, flags);
> >> >
> >> >do _NOT_ use yield() ever! unless you know what you're doing, and
> >> >probably not even then.
> >> >
> >> >NAK!
> >> Hmm, sorry. cond_resched() is ok ?
> >
> >depends on what you want to do, please explain what you're trying to do.
> >
> Sorry again.
>
> This force_empty is called only in following situation
> - there is no user threas in this cgroup.
> - a user tries to rmdir() this cgroup or explicitly type
> echo 1 > ../memory.force_empty.
>
> force_empty() scans lru list of this cgroup and check page_cgroup on the
> list one by one. Because there are no tasks in this group, force_empty can
> see following racy condtions while scanning.
>
> - global lru tries to remove the page which pointed by page_cgroup
> and it is not-on-LRU.
So you either skip the page because it already got un-accounted, or you
retry because its state is already updated to some new state.
> - the page is locked by someone.
> ....find some lock contetion with invalidation/truncate.
Then you just contend the lock and get woken when you obtain?
> - in later patch, page_cgroup can be on pagevec(i added) and we have to drain
> it to remove from LRU.
Then unlock, drain, lock, no need to sleep some arbitrary amount of time
[0-inf).
> In above situation, force_empty() have to wait for some event proceeds.
>
> Hmm...detecting busy situation in loop and sleep in out-side-of-loop
> is better ? Anyway, ok, I'll rewrite this.
The better solution is to wait for events in a non-polling fashion, for
example by using wait_event().
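Roughly this shape (an illustrative sketch only; the waitqueue name and the
condition are made up, not taken from this series):

/* illustrative kernel-style sketch, not part of the patch set */
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(pcg_drain_wq);

/* waiter side: sleeps until the condition becomes true, no polling */
static void wait_for_drain(struct mem_cgroup *mem)
{
        wait_event(pcg_drain_wq, mem->res.usage == 0);
}

/* waker side: whoever changes the condition calls this */
static void drain_done(void)
{
        wake_up(&pcg_drain_wq);
}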
yield() might not actually wait at all: suppose you're the highest
priority FIFO task on the system - if you use yield() and rely on someone
else to run, you'll deadlock.
Also, depending on sysctl_sched_compat_yield, SCHED_OTHER tasks using
yield() can behave radically different.
> BTW, sched.c::yield() is for what purpose now ?
There are some (legacy) users of yield(); sadly they are all incorrect,
but removing them is non-trivial for various reasons.
The -rt kernel has 2 sites where yield() is the correct thing to do. In
both cases its where 2 SCHED_FIFO-99 tasks (migration and stop_machine)
depend on each-other.
* Re: Re: Re: Re: [PATCH 4/13] memcg: force_empty moving account
2008-09-22 15:06 ` kamezawa.hiroyu
2008-09-22 15:32 ` Peter Zijlstra
@ 2008-09-22 15:43 ` kamezawa.hiroyu
1 sibling, 0 replies; 37+ messages in thread
From: kamezawa.hiroyu @ 2008-09-22 15:43 UTC (permalink / raw)
To: Peter Zijlstra
Cc: kamezawa.hiroyu, linux-mm, balbir, nishimura, xemul, LKML, Ingo Molnar
----- Original Message -----
>> This force_empty is called only in following situation
>> - there is no user threas in this cgroup.
>> - a user tries to rmdir() this cgroup or explicitly type
>> echo 1 > ../memory.force_empty.
>>
>> force_empty() scans lru list of this cgroup and check page_cgroup on the
>> list one by one. Because there are no tasks in this group, force_empty can
>> see following racy condtions while scanning.
>>
>> - global lru tries to remove the page which pointed by page_cgroup
>> and it is not-on-LRU.
>
>So you either skip the page because it already got un-accounted, or you
>retry because its state is already updated to some new state.
>
>> - the page is locked by someone.
>> ....find some lock contetion with invalidation/truncate.
>
>Then you just contend the lock and get woken when you obtain?
>
>> - in later patch, page_cgroup can be on pagevec(i added) and we have to drain
>> it to remove from LRU.
>
>Then unlock, drain, lock, no need to sleep some arbitrary amount of time
>[0-inf).
>
>> In above situation, force_empty() have to wait for some event proceeds.
>>
>> Hmm...detecting busy situation in loop and sleep in out-side-of-loop
>> is better ? Anyway, ok, I'll rewrite this.
>
>The better solution is to wait for events in a non-polling fashion, for
>example by using wait_event().
>
Hmm,
spin_unlock -> wait_on_page_locked() -> break loop or spin_lock and retry
are the candidates. I'll see how it looks.
>yield() might not actually wait at all, suppose you're the highest
>priority FIFO task on the system - if you used yield and rely on someone
>else to run you'll deadlock.
>
Oh, I missed that. ok. yield() here is bad.
>Also, depending on sysctl_sched_compat_yield, SCHED_OTHER tasks using
>yield() can behave radically different.
>
>> BTW, sched.c::yield() is for what purpose now ?
>
>There are some (lagacy) users of yield, sadly they are all incorrect,
>but removing them is non-trivial for various reasons.
>
>The -rt kernel has 2 sites where yield() is the correct thing to do. In
>both cases its where 2 SCHED_FIFO-99 tasks (migration and stop_machine)
>depend on each-other.
>
Thank you for the kind advice. I'll rewrite.
Regards,
-Kame
* Re: Re: [PATCH 9/13] memcg: lookup page cgroup (and remove pointer from struct page)
2008-09-22 15:14 ` kamezawa.hiroyu
@ 2008-09-22 15:47 ` Dave Hansen
2008-09-22 15:57 ` kamezawa.hiroyu
1 sibling, 0 replies; 37+ messages in thread
From: Dave Hansen @ 2008-09-22 15:47 UTC (permalink / raw)
To: kamezawa.hiroyu; +Cc: linux-mm, balbir, nishimura, xemul, LKML
On Tue, 2008-09-23 at 00:14 +0900, kamezawa.hiroyu@jp.fujitsu.com wrote:
> >Basing it on max_pfn makes me nervous because of what it will do on
> >machines with very sparse memory. Is this like sparsemem where the
> >structure can be small enough to actually span all of physical memory,
> >or will it be a large memory user?
> >
> I admit this calculation is too simplistic. Hmm, basing it on totalram_pages is
> better. OK.
No, I was setting a trap. ;)
If you use totalram_pages, I'll just complain that it doesn't work if a
memory hotplug machine drastically changes its size. You'll end up with
pretty darn big hash buckets.
You basically can't get away from the fact that you (potentially) have
really sparse addresses to play with here. Using a hash table is
exactly the same as using an array such as sparsemem, except you randomly
index into it instead of using straight arithmetic.
My gut says that you'll need to do exactly the same things sparsemem did
here, which is at *least* have a two-level lookup before you get to the
linear search. The two-level lookup also makes the hotplug problem
easier.
As I look at this, I always have to bounce between these tradeoffs:
1. deal with sparse address spaces (keeps you from using max_pfn)
2. scale as that sparse address space has memory hotplugged into it
(keeps you from using boot-time present_pages)
3. deal with performance impacts from new data structures created to
deal with the other two :)
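(Editorial sketch of the two-level lookup Dave describes, modeled loosely on SPARSEMEM_EXTREME; all names and sizes below -- PCG_CHUNK_SHIFT, PCG_L2_SHIFT, PCG_ROOT_SIZE, pcg_root, pcg_two_level_lookup -- are invented for illustration and are not from the posted patches. struct page_cgroup is the one patch 9's header adds.)
==
#include <linux/mm.h>
#include <linux/page_cgroup.h>	/* struct page_cgroup, as added by patch 9 */

#define PCG_CHUNK_SHIFT	14		/* pfns per page_cgroup chunk (64MB with 4K pages) */
#define PCG_L2_SHIFT	10		/* chunks per second-level block */
#define PCG_ROOT_SIZE	(1UL << 12)	/* second-level blocks in the root */
/* together these cover 2^(12+10+14) = 2^36 pfns, i.e. 256TB of 4K pages */

/* small root table allocated at boot; second-level blocks and the chunks
 * themselves are allocated only where memory actually exists */
static struct page_cgroup **pcg_root[PCG_ROOT_SIZE] __read_mostly;

static struct page_cgroup *pcg_two_level_lookup(unsigned long pfn)
{
	unsigned long chunk = pfn >> PCG_CHUNK_SHIFT;
	struct page_cgroup **l2;
	struct page_cgroup *base;

	if (chunk >> PCG_L2_SHIFT >= PCG_ROOT_SIZE)
		return NULL;
	l2 = pcg_root[chunk >> PCG_L2_SHIFT];
	if (!l2)
		return NULL;
	base = l2[chunk & ((1UL << PCG_L2_SHIFT) - 1)];
	if (!base)
		return NULL;
	return base + (pfn & ((1UL << PCG_CHUNK_SHIFT) - 1));
}
==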
> >Can you lay out how much memory this will use on a machine like Dave
> >Miller's which has 1GB of memory at 0x0 and 1GB of memory at 1TB up in
> >the address space?
>
> >Also, how large do the hash buckets get in the average case?
> >
> on my 48GB box, the hashtable was 16384 bytes (in the dmesg log).
> (The section size was 128MB.)
I'm wondering how long the linear searches of those hlists get.
> I'll rewrite this based on totalram_pages.
>
> BTW, do you know the difference between num_physpages and totalram_pages?
num_physpages appears to be linked to the size of the address space and
totalram_pages looks like the amount of RAM present. Kind of like
spanned_pages vs. present_pages. But who knows how consistent they are
these days. :)
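(Illustration only, not from the thread: toy numbers showing how the spanned vs. present distinction plays out for a node with a hole in its pfn range.)
==
/* A node spanning pfns 0x100000-0x1fffff with a 0x10000-pfn hole inside: */
unsigned long start_pfn  = 0x100000;
unsigned long end_pfn    = 0x200000;
unsigned long hole_pages = 0x010000;	/* pfns in the span with no memory behind them */

unsigned long spanned_pages = end_pfn - start_pfn;		/* 0x100000: includes the hole */
unsigned long present_pages = spanned_pages - hole_pages;	/* 0x0f0000: only real RAM     */
==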
-- Dave
* Re: [PATCH 9/13] memcg: lookup page cgroup (and remove pointer from struct page)
2008-09-22 11:12 ` [PATCH 9/13] memcg: lookup page cgroup (and remove pointer from struct page) KAMEZAWA Hiroyuki
2008-09-22 14:52 ` Dave Hansen
2008-09-22 15:14 ` kamezawa.hiroyu
@ 2008-09-22 15:47 ` Peter Zijlstra
2008-09-22 16:04 ` kamezawa.hiroyu
2008-09-23 23:48 ` KAMEZAWA Hiroyuki
4 siblings, 0 replies; 37+ messages in thread
From: Peter Zijlstra @ 2008-09-22 15:47 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura, xemul, LKML
On Mon, 2008-09-22 at 20:12 +0900, KAMEZAWA Hiroyuki wrote:
> - all page_cgroup struct is maintained by hash.
> I think we have 2 ways to handle sparse index in general
> ...radix-tree and hash. This uses hash because radix-tree's layout is
> affected by memory map's layout.
Could you provide further detail? That is, is this solely because our
radix tree implementation is sucky for large indexes?
>If so, I did most of the work of fixing that; I just need to spend a
>little more time to stabilize the code.
* Re: Re: Re: [PATCH 9/13] memcg: lookup page cgroup (and remove pointer from struct page)
2008-09-22 15:14 ` kamezawa.hiroyu
2008-09-22 15:47 ` Dave Hansen
@ 2008-09-22 15:57 ` kamezawa.hiroyu
2008-09-22 16:10 ` Dave Hansen
2008-09-22 17:34 ` kamezawa.hiroyu
1 sibling, 2 replies; 37+ messages in thread
From: kamezawa.hiroyu @ 2008-09-22 15:57 UTC (permalink / raw)
To: Dave Hansen; +Cc: kamezawa.hiroyu, linux-mm, balbir, nishimura, xemul, LKML
----- Original Message -----
>> >
>> I admit this calculation is too simplistic. Hmm, basing it on totalram_pages is
>> better. OK.
>
>No, I was setting a trap. ;)
>
Bomb!
>If you use totalram_pages, I'll just complain that it doesn't work if a
>memory hotplug machine drastically changes its size. You'll end up with
>pretty darn big hash buckets.
>
As I wrote, this is just a _generic_ one.
I'll add FLATMEM and SPARSEMEM support later.
I don't want to write SPARSEMEM_EXTREME by myself; I'd rather depend
on SPARSEMEM's internal implementation, which I know well.
>You basically can't get away with the fact that you (potentially) have
>really sparse addresses to play with here. Using a hash table is
>exactly the same as using an array such as sparsemem except you randomly
>index into it instead of using straight arithmetic.
>
See the next patch; the per-cpu look-aside cache works well.
>My gut says that you'll need to do exactly the same things sparsemem did
>here, which is at *least* have a two-level lookup before you get to the
>linear search. The two-level lookup also makes the hotplug problem
>easier.
>
>As I look at this, I always have to bounce between these tradeoffs:
>
>1. deal with sparse address spaces (keeps you from using max_pfn)
>2. scale as that sparse address space has memory hotplugged into it
> (keeps you from using boot-time present_pages)
>3. deal with performance impacts from new data structures created to
> deal with the other two :)
>
>> >Can you lay out how much memory this will use on a machine like Dave
>> >Miller's which has 1GB of memory at 0x0 and 1GB of memory at 1TB up in
>> >the address space?
>>
>> >Also, how large do the hash buckets get in the average case?
>> >
>> on my 48GB box, hashtable was 16384bytes. (in dmesg log.)
>> (section size was 128MB.)
>
>I'm wondering how long the linear searches of those hlists get.
>
In the above case, just one step (16384/8 buckets * 128MB each).
On ppc, which has 16MB sections, the hash table will be bigger, but the
walk is still not very long.
Anyway, how long the walk is isn't a big problem, because the look-aside
buffer helps.
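(Editorial sketch of the per-cpu look-aside idea referred to here; patch 10 itself is not shown in this part of the thread, so the names pcg_lookaside, pcg_lookaside_cache and lookup_page_cgroup_cached are invented. It assumes the pfn-based lookup_page_cgroup() and ENTS_PER_CHUNK_SHIFT from patch 9's mm/page_cgroup.c are visible.)
==
#include <linux/percpu.h>
#include <linux/page_cgroup.h>

struct pcg_lookaside {
	unsigned long		index;	/* chunk index: pfn >> ENTS_PER_CHUNK_SHIFT */
	struct page_cgroup	*map;	/* base map for that chunk (map + pfn is valid) */
};
static DEFINE_PER_CPU(struct pcg_lookaside, pcg_lookaside_cache);

static struct page_cgroup *lookup_page_cgroup_cached(unsigned long pfn)
{
	unsigned long index = pfn >> ENTS_PER_CHUNK_SHIFT;
	struct pcg_lookaside *la = &get_cpu_var(pcg_lookaside_cache);
	struct page_cgroup *pc;

	if (la->map && la->index == index) {
		pc = la->map + pfn;		/* hit: no hash bucket walk */
	} else {
		pc = lookup_page_cgroup(pfn);	/* miss: fall back to the hash */
		if (pc) {
			la->index = index;
			la->map = pc - pfn;	/* same base convention as ent->map */
		}
	}
	put_cpu_var(pcg_lookaside_cache);
	return pc;
}
==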
I'll add FLATMEM/SPARSEMEM support later. Could you wait a while?
Because we have lookup_page_cgroup() after this, we can change the internals to anything.
Thanks,
-Kame
* Re: Re: [PATCH 9/13] memcg: lookup page cgroup (and remove pointer from struct page)
2008-09-22 11:12 ` [PATCH 9/13] memcg: lookup page cgroup (and remove pointer from struct page) KAMEZAWA Hiroyuki
` (2 preceding siblings ...)
2008-09-22 15:47 ` Peter Zijlstra
@ 2008-09-22 16:04 ` kamezawa.hiroyu
2008-09-22 16:06 ` Peter Zijlstra
2008-09-23 23:48 ` KAMEZAWA Hiroyuki
4 siblings, 1 reply; 37+ messages in thread
From: kamezawa.hiroyu @ 2008-09-22 16:04 UTC (permalink / raw)
To: Peter Zijlstra
Cc: KAMEZAWA Hiroyuki, linux-mm, balbir, nishimura, xemul, LKML
----- Original Message -----
>On Mon, 2008-09-22 at 20:12 +0900, KAMEZAWA Hiroyuki wrote:
>
>> - all page_cgroup struct is maintained by hash.
>> I think we have 2 ways to handle sparse index in general
>> ...radix-tree and hash. This uses hash because radix-tree's layout is
>> affected by memory map's layout.
>
>Could you provide further detail? That is, is this solely because our
>radix tree implementation is sucky for large indexes?
>
No, it's the sparse, large index space.
>If so, I did most of the work of fixing that, just need to spend a
>little more time to stabalize the code.
>
IIUC, the radix tree's height is determined by how sparse the space is.
In big servers, each node's memory tends to be placed at some highly aligned
address, like the following (an extreme example):
256M ..... node 0 has 4GB mem = 32 sections
<very big hole>
256T ..... node 1 has 4GB mem = 32 sections
<very big hole>
512T ..... node 2 has 4GB mem = 32 sections
<very big hole>
.....
Then the number of steps to reach an entry tends to be larger than with a hash.
I'm sorry if I misunderstood.
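(Editorial arithmetic to make the height argument concrete; numbers assume the 2.6.27-era radix tree with 64 slots per node and 4K pages.)
==
/*
 * RADIX_TREE_MAP_SHIFT = 6  ->  64 slots per node.
 * A page_cgroup for memory at 512TB means a pfn index around
 * 2^49 / 2^12 = 2^37, so the tree needs ceil(37 / 6) = 7 levels, and
 * every lookup chases ~7 node pointers even though only a handful of
 * entries exist.  The hash pays one bucket lookup plus a short hlist
 * walk regardless of how sparse the physical address space is.
 */
==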
Thanks,
-Kame
* Re: Re: [PATCH 9/13] memcg: lookup page cgroup (and remove pointer from struct page)
2008-09-22 16:04 ` kamezawa.hiroyu
@ 2008-09-22 16:06 ` Peter Zijlstra
0 siblings, 0 replies; 37+ messages in thread
From: Peter Zijlstra @ 2008-09-22 16:06 UTC (permalink / raw)
To: kamezawa.hiroyu; +Cc: linux-mm, balbir, nishimura, xemul, LKML
On Tue, 2008-09-23 at 01:04 +0900, kamezawa.hiroyu@jp.fujitsu.com wrote:
> ----- Original Message -----
> >On Mon, 2008-09-22 at 20:12 +0900, KAMEZAWA Hiroyuki wrote:
> >
> >> - all page_cgroup struct is maintained by hash.
> >> I think we have 2 ways to handle sparse index in general
> >> ...radix-tree and hash. This uses hash because radix-tree's layout is
> >> affected by memory map's layout.
> >
> >Could you provide further detail? That is, is this solely because our
> >radix tree implementation is sucky for large indexes?
> >
> no, sparse-large index.
>
> >If so, I did most of the work of fixing that, just need to spend a
> >little more time to stabalize the code.
> >
>
> IIUC, radix tree's height is determined by how sparse the space is.
Right, so yes. That's exactly what I fixed.
> Then, steps to reach entries is tend to be larger than hash.
> I'm sorry if I misunderstood.
No problem. I'll try to brush up that radix tree code and post it
sometime soon.
* Re: Re: Re: [PATCH 9/13] memcg: lookup page cgroup (and remove pointer from struct page)
2008-09-22 15:57 ` kamezawa.hiroyu
@ 2008-09-22 16:10 ` Dave Hansen
2008-09-22 17:34 ` kamezawa.hiroyu
1 sibling, 0 replies; 37+ messages in thread
From: Dave Hansen @ 2008-09-22 16:10 UTC (permalink / raw)
To: kamezawa.hiroyu; +Cc: linux-mm, balbir, nishimura, xemul, LKML
On Tue, 2008-09-23 at 00:57 +0900, kamezawa.hiroyu@jp.fujitsu.com wrote:
> I'll add FLATMEM/SPARSEMEM support later. Could you wait for a while ?
> Because we have lookup_page_cgroup() after this, we can do anything.
OK, I'll stop harassing for the moment, and take a look at the cache. :)
-- Dave
* Re: Re: Re: Re: [PATCH 9/13] memcg: lookup page cgroup (and remove pointer from struct page)
2008-09-22 15:57 ` kamezawa.hiroyu
2008-09-22 16:10 ` Dave Hansen
@ 2008-09-22 17:34 ` kamezawa.hiroyu
1 sibling, 0 replies; 37+ messages in thread
From: kamezawa.hiroyu @ 2008-09-22 17:34 UTC (permalink / raw)
To: Dave Hansen; +Cc: kamezawa.hiroyu, linux-mm, balbir, nishimura, xemul, LKML
>On Tue, 2008-09-23 at 00:57 +0900, kamezawa.hiroyu@jp.fujitsu.com wrote:
>> I'll add FLATMEM/SPARSEMEM support later. Could you wait for a while ?
>> Because we have lookup_page_cgroup() after this, we can do anything.
>
>OK, I'll stop harassing for the moment, and take a look at the cache. :)
>
The reason I don't say "optimize this! now! more!" is that the places this is
called from are limited now: only at charge/uncharge. This is not memmap.
charge ... the first page fault to the page
-> add to the radix-tree.
uncharge ... the last unmap against the page
-> remove from the radix-tree.
I can make this faster by using the characteristics of FLATMEM and the others
(with more #ifdefs and code).
But I would like to start from the generic one, because adding the interface is
the first thing I have to do here.
BTW, to be honest, I don't like a 2-level table lookup like
SPARSEMEM_EXTREME here. A style like SPARSEMEM_VMEMMAP, using a
linear virtual address map, is my eventual goal.
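(Editorial sketch of the SPARSEMEM_VMEMMAP-like layout aimed at here; pcg_vmemmap_base and lookup_page_cgroup_vmemmap are invented placeholders, not from any posted patch. Boot code would map pcg_vmemmap_base to a dedicated virtual range, backing only the parts where RAM exists.)
==
#include <linux/page_cgroup.h>

/* populated at boot: a virtual array backed only where RAM actually exists */
extern struct page_cgroup *pcg_vmemmap_base;

static inline struct page_cgroup *lookup_page_cgroup_vmemmap(unsigned long pfn)
{
	/* O(1), exactly like vmemmap's pfn_to_page(); callers must not pass
	 * pfns that were never populated */
	return pcg_vmemmap_base + pfn;
}
==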
Thanks,
-Kame
* Re: [PATCH 9/13] memcg: lookup page cgroup (and remove pointer from struct page)
2008-09-22 11:12 ` [PATCH 9/13] memcg: lookup page cgroup (and remove pointer from struct page) KAMEZAWA Hiroyuki
` (3 preceding siblings ...)
2008-09-22 16:04 ` kamezawa.hiroyu
@ 2008-09-23 23:48 ` KAMEZAWA Hiroyuki
2008-09-24 2:09 ` Balbir Singh
4 siblings, 1 reply; 37+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-09-23 23:48 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura, xemul, LKML
After sleeping on it all day, I changed my mind and decided to drop this.
It seems no one likes this.
I'll add FLATMEM/DISCONTIGMEM/SPARSEMEM support directly.
I have already wasted a month on this uninteresting work and want to finish
it soon.
I'd be glad if people could help me test FLATMEM/DISCONTIGMEM/SPARSEMEM, because
there are various kinds of memory maps. I only have an x86-64 box.
Thanks,
-Kame
On Mon, 22 Sep 2008 20:12:06 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> Remove page_cgroup pointer from struct page.
>
> This patch removes the page_cgroup pointer from struct page and makes the
> page_cgroup reachable from the pfn. The relationship between them becomes:
>
> Before this:
> pfn <-> struct page <-> struct page_cgroup.
> After this:
> struct page <-> pfn -> struct page_cgroup -> struct page.
>
> The benefit of this approach is that we can remove 8 bytes from struct page.
>
> Other changes are:
> - lock/unlock_page_cgroup() uses its own bit on struct page_cgroup.
> - all necessary page_cgroups are allocated at boot.
>
> Characteristics:
> - page_cgroups are allocated in chunks.
> This patch uses SECTION_SIZE as the chunk size if 64bit/SPARSEMEM is enabled.
> If not, an appropriate default is selected.
> - all page_cgroup structs are maintained in a hash.
> I think we have 2 ways to handle a sparse index in general
> ...radix-tree and hash. This uses a hash because a radix-tree's layout is
> affected by the memory map's layout.
> - page_cgroup.h/page_cgroup.c is added.
>
> Changelog: v3 -> v4.
> - changed arguments to lookup_page_cgroup() from "pfn" to "page",
>
> Changelog: v2 -> v3
> - changed arguments from pfn to struct page*.
> - added memory hotplug callback (no undo... needs more work.)
> - adjusted to new mmotm.
>
> Changelog: v1 -> v2
> - Fixed memory allocation failure at boot to do panic with good message.
> - rewrote charge/uncharge path (no changes in logic.)
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> include/linux/mm_types.h | 4
> include/linux/page_cgroup.h | 89 +++++++++++++++
> mm/Makefile | 2
> mm/memcontrol.c | 251 +++++++++++---------------------------------
> mm/page_alloc.c | 9 -
> mm/page_cgroup.c | 235 +++++++++++++++++++++++++++++++++++++++++
> 6 files changed, 394 insertions(+), 196 deletions(-)
>
> Index: mmotm-2.6.27-rc6+/mm/page_cgroup.c
> ===================================================================
> --- /dev/null
> +++ mmotm-2.6.27-rc6+/mm/page_cgroup.c
> @@ -0,0 +1,235 @@
> +#include <linux/mm.h>
> +#include <linux/rcupdate.h>
> +#include <linux/rculist.h>
> +#include <linux/bootmem.h>
> +#include <linux/bit_spinlock.h>
> +#include <linux/page_cgroup.h>
> +#include <linux/hash.h>
> +#include <linux/memory.h>
> +
> +
> +
> +struct pcg_hash_head {
> + spinlock_t lock;
> + struct hlist_head head;
> +};
> +
> +static struct pcg_hash_head *pcg_hashtable __read_mostly;
> +
> +struct pcg_hash {
> + struct hlist_node node;
> + unsigned long index;
> + struct page_cgroup *map;
> +};
> +
> +#if BITS_PER_LONG == 32 /* we use kmalloc() */
> +#define ENTS_PER_CHUNK_SHIFT (7)
> +const bool chunk_vmalloc = false;
> +#else /* we'll use vmalloc */
> +#ifdef SECTION_SIZE_BITS
> +#define ENTS_PER_CHUNK_SHIFT (SECTION_SIZE_BITS - PAGE_SHIFT)
> +#else
> +#define ENTS_PER_CHUNK_SHIFT (14) /* covers 128MB on x86-64 */
> +#endif
> +const bool chunk_vmalloc = true;
> +#endif
> +
> +#define ENTS_PER_CHUNK (1 << (ENTS_PER_CHUNK_SHIFT))
> +#define ENTS_PER_CHUNK_MASK (ENTS_PER_CHUNK - 1)
> +
> +static int pcg_hashshift __read_mostly;
> +static int pcg_hashmask __read_mostly;
> +
> +#define PCG_HASHSHIFT (pcg_hashshift)
> +#define PCG_HASHMASK (pcg_hashmask)
> +#define PCG_HASHSIZE (1 << pcg_hashshift)
> +
> +static int pcg_hashfun(unsigned long index)
> +{
> + return hash_long(index, pcg_hashshift);
> +}
> +
> +struct page_cgroup *lookup_page_cgroup(unsigned long pfn)
> +{
> + unsigned long index = pfn >> ENTS_PER_CHUNK_SHIFT;
> + struct pcg_hash *ent;
> + struct pcg_hash_head *head;
> + struct hlist_node *node;
> + struct page_cgroup *pc = NULL;
> + int hnum;
> +
> + hnum = pcg_hashfun(index);
> + head = pcg_hashtable + hnum;
> + rcu_read_lock();
> + hlist_for_each_entry(ent, node, &head->head, node) {
> + if (ent->index == index) {
> + pc = ent->map + pfn;
> + break;
> + }
> + }
> + rcu_read_unlock();
> + return pc;
> +}
> +
> +static int __meminit alloc_page_cgroup(int node, unsigned long index)
> +{
> + struct pcg_hash *ent;
> + struct pcg_hash_head *head;
> + struct page_cgroup *pc;
> + unsigned long flags, base;
> + int hnum, i;
> + int mapsize = sizeof(struct page_cgroup) * ENTS_PER_CHUNK;
> +
> + if (lookup_page_cgroup(index << ENTS_PER_CHUNK_SHIFT))
> + return 0;
> +
> + if (!chunk_vmalloc) {
> + int ent_size = sizeof(*ent) + mapsize;
> + ent = kmalloc_node(ent_size, GFP_KERNEL, node);
> + if (!ent)
> + return 1;
> + pc = (void *)(ent + 1);
> + } else {
> + ent = kmalloc_node(sizeof(*ent), GFP_KERNEL, node);
> + if (!ent)
> + return 1;
> + pc = vmalloc_node(mapsize, node);
> + if (!pc) {
> + kfree(ent);
> + return 1;
> + }
> + }
> + ent->map = pc - (index << ENTS_PER_CHUNK_SHIFT);
> + ent->index = index;
> + INIT_HLIST_NODE(&ent->node);
> +
> + for (base = index << ENTS_PER_CHUNK_SHIFT, i = 0;
> + i < ENTS_PER_CHUNK; i++) {
> + unsigned long pfn = base + i;
> + pc = ent->map + pfn;
> + pc->page = pfn_to_page(pfn);
> + pc->mem_cgroup = NULL;
> + pc->flags = 0;
> + }
> +
> + hnum = pcg_hashfun(index);
> + head = &pcg_hashtable[hnum];
> + spin_lock_irqsave(&head->lock, flags);
> + hlist_add_head_rcu(&ent->node, &head->head);
> + spin_unlock_irqrestore(&head->lock, flags);
> + return 0;
> +}
> +
> +#ifdef CONFIG_MEMORY_HOTPLUG
> +
> +int online_page_cgroup(unsigned long start_pfn,
> + unsigned long nr_pages,
> + int nid)
> +{
> + unsigned long index, end_pfn, start, end;
> + int fail = 0;
> +
> + end_pfn = start_pfn + nr_pages;
> + start = start_pfn >> ENTS_PER_CHUNK_SHIFT;
> + end = (end_pfn + ENTS_PER_CHUNK - 1) >> ENTS_PER_CHUNK_SHIFT;
> +
> + for (index = start; (!fail) && (index < end); index++) {
> + unsigned long pfn = index << ENTS_PER_CHUNK_SHIFT;
> + if (lookup_page_cgroup(pfn))
> + continue;
> + fail = alloc_page_cgroup(nid, index);
> + }
> + return fail;
> +}
> +
> +static int pcg_memory_callback(struct notifier_block *self,
> + unsigned long action, void *arg)
> +{
> + struct memory_notify *mn = arg;
> + int ret = 0;
> + switch (action) {
> + case MEM_GOING_ONLINE:
> + ret = online_page_cgroup(mn->start_pfn,
> + mn->nr_pages, mn->status_change_nid);
> + break;
> + case MEM_GOING_OFFLINE:
> + break;
> + case MEM_CANCEL_ONLINE:
> + case MEM_OFFLINE:
> + case MEM_ONLINE:
> + case MEM_CANCEL_OFFLINE:
> + break;
> + }
> + ret = notifier_from_errno(ret);
> + return ret;
> +}
> +
> +#endif
> +
> +/* Called from mem_cgroup's initialization */
> +void __init page_cgroup_init(void)
> +{
> + struct pcg_hash_head *head;
> + int node, i, fail;
> + unsigned long start, pfn, end, index, offset;
> + long default_pcg_hash_size;
> +
> + /* we don't need too large hash */
> + default_pcg_hash_size = (max_pfn/ENTS_PER_CHUNK);
> + default_pcg_hash_size *= 2;
> + /* if too big, use automatic calculation */
> + if (default_pcg_hash_size > 1024 * 1024)
> + default_pcg_hash_size = 0;
> +
> + pcg_hashtable = alloc_large_system_hash("PageCgroup Hash",
> + sizeof(struct pcg_hash_head),
> + default_pcg_hash_size,
> + 13,
> + 0,
> + &pcg_hashshift,
> + &pcg_hashmask,
> + 0);
> + if (!pcg_hashtable) {
> + fail = 1;
> + goto nomem;
> + }
> +
> + for (i = 0; i < PCG_HASHSIZE; i++) {
> + head = &pcg_hashtable[i];
> + spin_lock_init(&head->lock);
> + INIT_HLIST_HEAD(&head->head);
> + }
> +
> + fail = 0;
> + for_each_node(node) {
> + start = NODE_DATA(node)->node_start_pfn;
> + end = start + NODE_DATA(node)->node_spanned_pages;
> + start >>= ENTS_PER_CHUNK_SHIFT;
> + end = (end + ENTS_PER_CHUNK - 1) >> ENTS_PER_CHUNK_SHIFT;
> + for (index = start; (!fail) && (index < end); index++) {
> + pfn = index << ENTS_PER_CHUNK_SHIFT;
> + /*
> + * In usual, this loop breaks at offset=0.
> + * Handle a case a hole in MAX_ORDER (ia64 only...)
> + */
> + for (offset = 0; offset < ENTS_PER_CHUNK; offset++) {
> + if (pfn_valid(pfn + offset)) {
> + fail = alloc_page_cgroup(node, index);
> + break;
> + }
> + }
> + }
> + if (fail)
> + break;
> + }
> +
> + hotplug_memory_notifier(pcg_memory_callback, 0);
> +nomem:
> + if (fail) {
> + printk("Not enough memory for memory resource controller.\n");
> + panic("please try cgroup_disable=memory boot option.");
> + }
> + return;
> +}
> +
> +
> Index: mmotm-2.6.27-rc6+/include/linux/mm_types.h
> ===================================================================
> --- mmotm-2.6.27-rc6+.orig/include/linux/mm_types.h
> +++ mmotm-2.6.27-rc6+/include/linux/mm_types.h
> @@ -94,10 +94,6 @@ struct page {
> void *virtual; /* Kernel virtual address (NULL if
> not kmapped, ie. highmem) */
> #endif /* WANT_PAGE_VIRTUAL */
> -#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> - unsigned long page_cgroup;
> -#endif
> -
> #ifdef CONFIG_KMEMCHECK
> void *shadow;
> #endif
> Index: mmotm-2.6.27-rc6+/mm/Makefile
> ===================================================================
> --- mmotm-2.6.27-rc6+.orig/mm/Makefile
> +++ mmotm-2.6.27-rc6+/mm/Makefile
> @@ -34,6 +34,6 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
> obj-$(CONFIG_MIGRATION) += migrate.o
> obj-$(CONFIG_SMP) += allocpercpu.o
> obj-$(CONFIG_QUICKLIST) += quicklist.o
> -obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
> +obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
> obj-$(CONFIG_CGROUP_MEMRLIMIT_CTLR) += memrlimitcgroup.o
> obj-$(CONFIG_KMEMTRACE) += kmemtrace.o
> Index: mmotm-2.6.27-rc6+/include/linux/page_cgroup.h
> ===================================================================
> --- /dev/null
> +++ mmotm-2.6.27-rc6+/include/linux/page_cgroup.h
> @@ -0,0 +1,89 @@
> +#ifndef __LINUX_PAGE_CGROUP_H
> +#define __LINUX_PAGE_CGROUP_H
> +
> +/*
> + * Page Cgroup can be considered as an extended mem_map.
> + * A page_cgroup page is associated with every page descriptor. The
> + * page_cgroup helps us identify information about the cgroup
> + * All page cgroups are allocated at boot or memory hotplug event,
> + * then the page cgroup for pfn always exists.
> + */
> +struct page_cgroup {
> + unsigned long flags;
> + struct mem_cgroup *mem_cgroup;
> + struct page *page;
> + struct list_head lru; /* per cgroup LRU list */
> +};
> +
> +void __init page_cgroup_init(void);
> +struct page_cgroup *lookup_page_cgroup(unsigned long pfn);
> +
> +enum {
> + /* flags for mem_cgroup */
> + PCG_LOCK, /* page cgroup is locked */
> + PCG_CACHE, /* charged as cache */
> + PCG_USED, /* this object is in use. */
> + /* flags for LRU placement */
> + PCG_ACTIVE, /* page is active in this cgroup */
> + PCG_FILE, /* page is file system backed */
> + PCG_UNEVICTABLE, /* page is unevictableable */
> +};
> +
> +#define TESTPCGFLAG(uname, lname) \
> +static inline int PageCgroup##uname(struct page_cgroup *pc) \
> + { return test_bit(PCG_##lname, &pc->flags); }
> +
> +#define SETPCGFLAG(uname, lname) \
> +static inline void SetPageCgroup##uname(struct page_cgroup *pc)\
> + { set_bit(PCG_##lname, &pc->flags); }
> +
> +#define CLEARPCGFLAG(uname, lname) \
> +static inline void ClearPageCgroup##uname(struct page_cgroup *pc) \
> + { clear_bit(PCG_##lname, &pc->flags); }
> +
> +/* Cache flag is set only once (at allocation) */
> +TESTPCGFLAG(Cache, CACHE)
> +
> +TESTPCGFLAG(Used, USED)
> +CLEARPCGFLAG(Used, USED)
> +
> +/* LRU management flags (from global-lru definition) */
> +TESTPCGFLAG(File, FILE)
> +SETPCGFLAG(File, FILE)
> +CLEARPCGFLAG(File, FILE)
> +
> +TESTPCGFLAG(Active, ACTIVE)
> +SETPCGFLAG(Active, ACTIVE)
> +CLEARPCGFLAG(Active, ACTIVE)
> +
> +TESTPCGFLAG(Unevictable, UNEVICTABLE)
> +SETPCGFLAG(Unevictable, UNEVICTABLE)
> +CLEARPCGFLAG(Unevictable, UNEVICTABLE)
> +
> +static inline int page_cgroup_nid(struct page_cgroup *pc)
> +{
> + return page_to_nid(pc->page);
> +}
> +
> +static inline enum zone_type page_cgroup_zid(struct page_cgroup *pc)
> +{
> + return page_zonenum(pc->page);
> +}
> +
> +static inline void lock_page_cgroup(struct page_cgroup *pc)
> +{
> + bit_spin_lock(PCG_LOCK, &pc->flags);
> +}
> +
> +static inline int trylock_page_cgroup(struct page_cgroup *pc)
> +{
> + return bit_spin_trylock(PCG_LOCK, &pc->flags);
> +}
> +
> +static inline void unlock_page_cgroup(struct page_cgroup *pc)
> +{
> + bit_spin_unlock(PCG_LOCK, &pc->flags);
> +}
> +
> +
> +#endif
> Index: mmotm-2.6.27-rc6+/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.27-rc6+.orig/mm/memcontrol.c
> +++ mmotm-2.6.27-rc6+/mm/memcontrol.c
> @@ -34,11 +34,11 @@
> #include <linux/seq_file.h>
> #include <linux/vmalloc.h>
> #include <linux/mm_inline.h>
> +#include <linux/page_cgroup.h>
>
> #include <asm/uaccess.h>
>
> struct cgroup_subsys mem_cgroup_subsys __read_mostly;
> -static struct kmem_cache *page_cgroup_cache __read_mostly;
> #define MEM_CGROUP_RECLAIM_RETRIES 5
>
> /*
> @@ -138,80 +138,6 @@ static struct mem_cgroup init_mem_cgroup
>
> #define is_root_cgroup(cgrp) ((cgrp) == &init_mem_cgroup)
>
> -
> -/*
> - * We use the lower bit of the page->page_cgroup pointer as a bit spin
> - * lock. We need to ensure that page->page_cgroup is at least two
> - * byte aligned (based on comments from Nick Piggin). But since
> - * bit_spin_lock doesn't actually set that lock bit in a non-debug
> - * uniprocessor kernel, we should avoid setting it here too.
> - */
> -#define PAGE_CGROUP_LOCK_BIT 0x0
> -#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
> -#define PAGE_CGROUP_LOCK (1 << PAGE_CGROUP_LOCK_BIT)
> -#else
> -#define PAGE_CGROUP_LOCK 0x0
> -#endif
> -
> -/*
> - * A page_cgroup page is associated with every page descriptor. The
> - * page_cgroup helps us identify information about the cgroup
> - */
> -struct page_cgroup {
> - struct list_head lru; /* per cgroup LRU list */
> - struct page *page;
> - struct mem_cgroup *mem_cgroup;
> - unsigned long flags;
> -};
> -
> -enum {
> - /* flags for mem_cgroup */
> - PCG_CACHE, /* charged as cache */
> - /* flags for LRU placement */
> - PCG_ACTIVE, /* page is active in this cgroup */
> - PCG_FILE, /* page is file system backed */
> - PCG_UNEVICTABLE, /* page is unevictableable */
> -};
> -
> -#define TESTPCGFLAG(uname, lname) \
> -static inline int PageCgroup##uname(struct page_cgroup *pc) \
> - { return test_bit(PCG_##lname, &pc->flags); }
> -
> -#define SETPCGFLAG(uname, lname) \
> -static inline void SetPageCgroup##uname(struct page_cgroup *pc)\
> - { set_bit(PCG_##lname, &pc->flags); }
> -
> -#define CLEARPCGFLAG(uname, lname) \
> -static inline void ClearPageCgroup##uname(struct page_cgroup *pc) \
> - { clear_bit(PCG_##lname, &pc->flags); }
> -
> -
> -/* Cache flag is set only once (at allocation) */
> -TESTPCGFLAG(Cache, CACHE)
> -
> -/* LRU management flags (from global-lru definition) */
> -TESTPCGFLAG(File, FILE)
> -SETPCGFLAG(File, FILE)
> -CLEARPCGFLAG(File, FILE)
> -
> -TESTPCGFLAG(Active, ACTIVE)
> -SETPCGFLAG(Active, ACTIVE)
> -CLEARPCGFLAG(Active, ACTIVE)
> -
> -TESTPCGFLAG(Unevictable, UNEVICTABLE)
> -SETPCGFLAG(Unevictable, UNEVICTABLE)
> -CLEARPCGFLAG(Unevictable, UNEVICTABLE)
> -
> -static int page_cgroup_nid(struct page_cgroup *pc)
> -{
> - return page_to_nid(pc->page);
> -}
> -
> -static enum zone_type page_cgroup_zid(struct page_cgroup *pc)
> -{
> - return page_zonenum(pc->page);
> -}
> -
> enum charge_type {
> MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
> MEM_CGROUP_CHARGE_TYPE_MAPPED,
> @@ -222,9 +148,9 @@ enum charge_type {
>
> static const unsigned long
> pcg_default_flags[NR_CHARGE_TYPE] = {
> - ((1 << PCG_CACHE) | (1 << PCG_FILE)),
> - ((1 << PCG_ACTIVE)),
> - ((1 << PCG_ACTIVE) | (1 << PCG_CACHE)),
> + (1 << PCG_CACHE) | (1 << PCG_FILE) | (1 << PCG_USED) | (1 << PCG_LOCK),
> + (1 << PCG_ACTIVE) | (1 << PCG_LOCK) | (1 << PCG_USED),
> + (1 << PCG_ACTIVE) | (1 << PCG_CACHE) | (1 << PCG_USED)| (1 << PCG_LOCK),
> 0,
> };
>
> @@ -307,37 +233,6 @@ struct mem_cgroup *mem_cgroup_from_task(
> struct mem_cgroup, css);
> }
>
> -static inline int page_cgroup_locked(struct page *page)
> -{
> - return bit_spin_is_locked(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
> -}
> -
> -static void page_assign_page_cgroup(struct page *page, struct page_cgroup *pc)
> -{
> - VM_BUG_ON(!page_cgroup_locked(page));
> - page->page_cgroup = ((unsigned long)pc | PAGE_CGROUP_LOCK);
> -}
> -
> -struct page_cgroup *page_get_page_cgroup(struct page *page)
> -{
> - return (struct page_cgroup *) (page->page_cgroup & ~PAGE_CGROUP_LOCK);
> -}
> -
> -static void lock_page_cgroup(struct page *page)
> -{
> - bit_spin_lock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
> -}
> -
> -static int try_lock_page_cgroup(struct page *page)
> -{
> - return bit_spin_trylock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
> -}
> -
> -static void unlock_page_cgroup(struct page *page)
> -{
> - bit_spin_unlock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
> -}
> -
> static void __mem_cgroup_remove_list(struct mem_cgroup_per_zone *mz,
> struct page_cgroup *pc)
> {
> @@ -441,22 +336,19 @@ void mem_cgroup_move_lists(struct page *
> * safely get to page_cgroup without it, so just try_lock it:
> * mem_cgroup_isolate_pages allows for page left on wrong list.
> */
> - if (!try_lock_page_cgroup(page))
> + pc = lookup_page_cgroup(page_to_pfn(page));
> +
> + if (!trylock_page_cgroup(pc))
> return;
>
> - pc = page_get_page_cgroup(page);
> - if (pc) {
> + if (PageCgroupUsed(pc)) {
> mem = pc->mem_cgroup;
> mz = page_cgroup_zoneinfo(pc);
> spin_lock_irqsave(&mz->lru_lock, flags);
> - /*
> - * check against the race with move_account.
> - */
> - if (likely(mem == pc->mem_cgroup))
> - __mem_cgroup_move_lists(pc, lru);
> + __mem_cgroup_move_lists(pc, lru);
> spin_unlock_irqrestore(&mz->lru_lock, flags);
> }
> - unlock_page_cgroup(page);
> + unlock_page_cgroup(pc);
> }
>
> /*
> @@ -543,6 +435,8 @@ unsigned long mem_cgroup_isolate_pages(u
> list_for_each_entry_safe_reverse(pc, tmp, src, lru) {
> if (scan >= nr_to_scan)
> break;
> + if (unlikely(!PageCgroupUsed(pc)))
> + continue;
> page = pc->page;
>
> if (unlikely(!PageLRU(page)))
> @@ -611,12 +505,12 @@ int mem_cgroup_move_account(struct page
> /* Now, we assume no_limit...no failure here. */
> return ret;
> }
> - if (!try_lock_page_cgroup(page)) {
> + if (!trylock_page_cgroup(pc)) {
> res_counter_uncharge(&to->res, PAGE_SIZE);
> return ret;
> }
>
> - if (page_get_page_cgroup(page) != pc) {
> + if (!PageCgroupUsed(pc)) {
> res_counter_uncharge(&to->res, PAGE_SIZE);
> goto out;
> }
> @@ -634,7 +528,7 @@ int mem_cgroup_move_account(struct page
> res_counter_uncharge(&to->res, PAGE_SIZE);
> }
> out:
> - unlock_page_cgroup(page);
> + unlock_page_cgroup(pc);
>
> return ret;
> }
> @@ -651,26 +545,27 @@ static int mem_cgroup_charge_common(stru
> {
> struct mem_cgroup *mem;
> struct page_cgroup *pc;
> - unsigned long flags;
> unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> struct mem_cgroup_per_zone *mz;
> + unsigned long flags;
>
> - pc = kmem_cache_alloc(page_cgroup_cache, gfp_mask);
> - if (unlikely(pc == NULL))
> - goto err;
> -
> + pc = lookup_page_cgroup(page_to_pfn(page));
> + /* can happen at boot */
> + if (unlikely(!pc))
> + return 0;
> + prefetchw(pc);
> /*
> * We always charge the cgroup the mm_struct belongs to.
> * The mm_struct's mem_cgroup changes on task migration if the
> * thread group leader migrates. It's possible that mm is not
> * set, if so charge the init_mm (happens for pagecache usage).
> */
> +
> if (likely(!memcg)) {
> rcu_read_lock();
> mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
> if (unlikely(!mem)) {
> rcu_read_unlock();
> - kmem_cache_free(page_cgroup_cache, pc);
> return 0;
> }
> /*
> @@ -706,36 +601,34 @@ static int mem_cgroup_charge_common(stru
> }
> }
>
> + preempt_disable();
> + lock_page_cgroup(pc);
> + if (unlikely(PageCgroupUsed(pc))) {
> + unlock_page_cgroup(pc);
> + res_counter_uncharge(&mem->res, PAGE_SIZE);
> + css_put(&mem->css);
> + preempt_enable();
> + goto done;
> + }
> pc->mem_cgroup = mem;
> - pc->page = page;
> /*
> * If a page is accounted as a page cache, insert to inactive list.
> * If anon, insert to active list.
> */
> pc->flags = pcg_default_flags[ctype];
>
> - lock_page_cgroup(page);
> - if (unlikely(page_get_page_cgroup(page))) {
> - unlock_page_cgroup(page);
> - res_counter_uncharge(&mem->res, PAGE_SIZE);
> - css_put(&mem->css);
> - kmem_cache_free(page_cgroup_cache, pc);
> - goto done;
> - }
> - page_assign_page_cgroup(page, pc);
> -
> mz = page_cgroup_zoneinfo(pc);
> +
> spin_lock_irqsave(&mz->lru_lock, flags);
> __mem_cgroup_add_list(mz, pc);
> spin_unlock_irqrestore(&mz->lru_lock, flags);
> + unlock_page_cgroup(pc);
> + preempt_enable();
>
> - unlock_page_cgroup(page);
> done:
> return 0;
> out:
> css_put(&mem->css);
> - kmem_cache_free(page_cgroup_cache, pc);
> -err:
> return -ENOMEM;
> }
>
> @@ -743,7 +636,8 @@ int mem_cgroup_charge(struct page *page,
> {
> if (mem_cgroup_subsys.disabled)
> return 0;
> -
> + if (PageCompound(page))
> + return 0;
> /*
> * If already mapped, we don't have to account.
> * If page cache, page->mapping has address_space.
> @@ -764,7 +658,8 @@ int mem_cgroup_cache_charge(struct page
> {
> if (mem_cgroup_subsys.disabled)
> return 0;
> -
> + if (PageCompound(page))
> + return 0;
> /*
> * Corner case handling. This is called from add_to_page_cache()
> * in usual. But some FS (shmem) precharges this page before calling it
> @@ -777,15 +672,16 @@ int mem_cgroup_cache_charge(struct page
> if (!(gfp_mask & __GFP_WAIT)) {
> struct page_cgroup *pc;
>
> - lock_page_cgroup(page);
> - pc = page_get_page_cgroup(page);
> - if (pc) {
> - VM_BUG_ON(pc->page != page);
> - VM_BUG_ON(!pc->mem_cgroup);
> - unlock_page_cgroup(page);
> +
> + pc = lookup_page_cgroup(page_to_pfn(page));
> + if (!pc)
> + return 0;
> + lock_page_cgroup(pc);
> + if (PageCgroupUsed(pc)) {
> + unlock_page_cgroup(pc);
> return 0;
> }
> - unlock_page_cgroup(page);
> + unlock_page_cgroup(pc);
> }
>
> if (unlikely(!mm))
> @@ -808,53 +704,46 @@ __mem_cgroup_uncharge_common(struct page
> struct page_cgroup *pc;
> struct mem_cgroup *mem;
> struct mem_cgroup_per_zone *mz;
> + unsigned long pfn = page_to_pfn(page);
> unsigned long flags;
>
> if (mem_cgroup_subsys.disabled)
> return;
> + /* check the condition we can know from page */
>
> - /*
> - * Check if our page_cgroup is valid
> - */
> - lock_page_cgroup(page);
> - pc = page_get_page_cgroup(page);
> - if (unlikely(!pc))
> - goto unlock;
> -
> - VM_BUG_ON(pc->page != page);
> + pc = lookup_page_cgroup(pfn);
> + if (unlikely(!pc || !PageCgroupUsed(pc)))
> + return;
> + preempt_disable();
> + lock_page_cgroup(pc);
> + if (unlikely(page_mapped(page))) {
> + unlock_page_cgroup(pc);
> + preempt_enable();
> + return;
> + }
> + ClearPageCgroupUsed(pc);
> + unlock_page_cgroup(pc);
>
> - if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
> - && ((PageCgroupCache(pc) || page_mapped(page))))
> - goto unlock;
> -retry:
> mem = pc->mem_cgroup;
> mz = page_cgroup_zoneinfo(pc);
> +
> spin_lock_irqsave(&mz->lru_lock, flags);
> - if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED &&
> - unlikely(mem != pc->mem_cgroup)) {
> - /* MAPPED account can be done without lock_page().
> - Check race with mem_cgroup_move_account() */
> - spin_unlock_irqrestore(&mz->lru_lock, flags);
> - goto retry;
> - }
> __mem_cgroup_remove_list(mz, pc);
> spin_unlock_irqrestore(&mz->lru_lock, flags);
> -
> - page_assign_page_cgroup(page, NULL);
> - unlock_page_cgroup(page);
> -
> -
> - res_counter_uncharge(&mem->res, PAGE_SIZE);
> + pc->mem_cgroup = NULL;
> css_put(&mem->css);
> + preempt_enable();
> + res_counter_uncharge(&mem->res, PAGE_SIZE);
>
> - kmem_cache_free(page_cgroup_cache, pc);
> return;
> -unlock:
> - unlock_page_cgroup(page);
> }
>
> void mem_cgroup_uncharge_page(struct page *page)
> {
> + if (page_mapped(page))
> + return;
> + if (page->mapping && !PageAnon(page))
> + return;
> __mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_MAPPED);
> }
>
> @@ -878,9 +767,9 @@ int mem_cgroup_prepare_migration(struct
> if (mem_cgroup_subsys.disabled)
> return 0;
>
> - lock_page_cgroup(page);
> - pc = page_get_page_cgroup(page);
> - if (pc) {
> + pc = lookup_page_cgroup(page_to_pfn(page));
> + lock_page_cgroup(pc);
> + if (PageCgroupUsed(pc)) {
> mem = pc->mem_cgroup;
> css_get(&mem->css);
> if (PageCgroupCache(pc)) {
> @@ -890,7 +779,7 @@ int mem_cgroup_prepare_migration(struct
> ctype = MEM_CGROUP_CHARGE_TYPE_SHMEM;
> }
> }
> - unlock_page_cgroup(page);
> + unlock_page_cgroup(pc);
> if (mem) {
> ret = mem_cgroup_charge_common(newpage, NULL, GFP_KERNEL,
> ctype, mem);
> @@ -1271,8 +1160,8 @@ mem_cgroup_create(struct cgroup_subsys *
> int node;
>
> if (unlikely((cont->parent) == NULL)) {
> + page_cgroup_init();
> mem = &init_mem_cgroup;
> - page_cgroup_cache = KMEM_CACHE(page_cgroup, SLAB_PANIC);
> } else {
> mem = mem_cgroup_alloc();
> if (!mem)
> Index: mmotm-2.6.27-rc6+/mm/page_alloc.c
> ===================================================================
> --- mmotm-2.6.27-rc6+.orig/mm/page_alloc.c
> +++ mmotm-2.6.27-rc6+/mm/page_alloc.c
> @@ -223,17 +223,12 @@ static inline int bad_range(struct zone
>
> static void bad_page(struct page *page)
> {
> - void *pc = page_get_page_cgroup(page);
> -
> printk(KERN_EMERG "Bad page state in process '%s'\n" KERN_EMERG
> "page:%p flags:0x%0*lx mapping:%p mapcount:%d count:%d\n",
> current->comm, page, (int)(2*sizeof(unsigned long)),
> (unsigned long)page->flags, page->mapping,
> page_mapcount(page), page_count(page));
> - if (pc) {
> - printk(KERN_EMERG "cgroup:%p\n", pc);
> - page_reset_bad_cgroup(page);
> - }
> +
> printk(KERN_EMERG "Trying to fix it up, but a reboot is needed\n"
> KERN_EMERG "Backtrace:\n");
> dump_stack();
> @@ -472,7 +467,6 @@ static inline void free_pages_check(stru
> free_page_mlock(page);
> if (unlikely(page_mapcount(page) |
> (page->mapping != NULL) |
> - (page_get_page_cgroup(page) != NULL) |
> (page_count(page) != 0) |
> (page->flags & PAGE_FLAGS_CHECK_AT_FREE)))
> bad_page(page);
> @@ -609,7 +603,6 @@ static void prep_new_page(struct page *p
> {
> if (unlikely(page_mapcount(page) |
> (page->mapping != NULL) |
> - (page_get_page_cgroup(page) != NULL) |
> (page_count(page) != 0) |
> (page->flags & PAGE_FLAGS_CHECK_AT_PREP)))
> bad_page(page);
>
>
* Re: [PATCH 9/13] memcg: lookup page cgroup (and remove pointer from struct page)
2008-09-23 23:48 ` KAMEZAWA Hiroyuki
@ 2008-09-24 2:09 ` Balbir Singh
2008-09-24 3:09 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 37+ messages in thread
From: Balbir Singh @ 2008-09-24 2:09 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, nishimura, xemul, LKML
On Wed, Sep 24, 2008 at 5:18 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> After sleeping all day, I changed my mind and decided to drop this.
> It seems no one like this.
>
I've not yet looked at the patch in detail; I just got back from a long trip.
> I'll add FLATMEM/DISCONTIGMEM/SPARSEMEM support directly.
> I already have wasted a month on this not-interesting work and want to fix
> this soon.
>
Let's look at the basic requirement: make the memory resource controller
not suck on 32-bit systems. I have been thinking about removing
page_cgroup from struct page only for 32-bit systems (using a radix tree);
32-bit systems can have a maximum of 64GB if PAE is enabled, and I suspect
a radix tree should work there, while the 64-bit systems keep working as is. If
performance is an issue, I would recommend the 32-bit folks upgrade to
64-bit :) Can we build consensus around this approach?
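(Editorial sketch of the 32-bit radix-tree variant floated above, not code from the series. The tree is keyed by chunk index so each slot covers a block of page_cgroups rather than a single page; PCG32_CHUNK_SHIFT, pcg32_tree and pcg32_lookup are invented names. Insertion at boot or hotplug would happen under a separate lock, not shown here.)
==
#include <linux/radix-tree.h>
#include <linux/rcupdate.h>
#include <linux/page_cgroup.h>

#define PCG32_CHUNK_SHIFT	7	/* 128 page_cgroups per tree slot */

static RADIX_TREE(pcg32_tree, GFP_ATOMIC);	/* chunk index -> chunk base */

static struct page_cgroup *pcg32_lookup(unsigned long pfn)
{
	struct page_cgroup *base;

	rcu_read_lock();
	base = radix_tree_lookup(&pcg32_tree, pfn >> PCG32_CHUNK_SHIFT);
	rcu_read_unlock();
	if (!base)
		return NULL;
	return base + (pfn & ((1UL << PCG32_CHUNK_SHIFT) - 1));
}
==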
> I'm glad if people help me to test FLATMEM/DISCONTIGMEM/SPARSEMEM because
> there are various kinds of memory map. I have only x86-64 box.
I can help test your patches on powerpc 64 bit and find a 32 bit
system to test it as well. What do you think about the points above?
Balbir
* Re: [PATCH 9/13] memcg: lookup page cgroup (and remove pointer from struct page)
2008-09-24 2:09 ` Balbir Singh
@ 2008-09-24 3:09 ` KAMEZAWA Hiroyuki
2008-09-24 8:31 ` Balbir Singh
0 siblings, 1 reply; 37+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-09-24 3:09 UTC (permalink / raw)
To: Balbir Singh; +Cc: linux-mm, nishimura, xemul, LKML
On Wed, 24 Sep 2008 07:39:58 +0530
"Balbir Singh" <balbir@linux.vnet.ibm.com> wrote:
> > I'll add FLATMEM/DISCONTIGMEM/SPARSEMEM support directly.
> > I already have wasted a month on this not-interesting work and want to fix
> > this soon.
> >
>
> Let's look at the basic requirement, make memory resource controller
> not suck with 32 bit systems. I have been thinking of about removing
> page_cgroup from struct page only for 32 bit systems (use radix tree),
> 32 bit systems can have a maximum of 64GB if PAE is enabled, I suspect
> radix tree should work there and let the 64 bit systems work as is. If
> performance is an issue, I would recommend the 32 bit folks upgrade to
> 64 bit :) Can we build consensus around this approach?
>
My thinking is below. (Assume 64-bit.)
- removing the page_cgroup pointer from struct page allows us to reduce
static memory usage at boot by 8 bytes per 4096-byte page when the memory cgroup is disabled.
This reaches 96MB on my 48GB box. I think this is big.
- pre-allocation of page_cgroup gives us the following.
Pros.
- We no longer need to be afraid of "failure of kmalloc" or
"going down into memory reclaim from kmalloc".
This makes the memory resource controller much simpler and more robust.
- We know in advance what amount of kernel memory will be used for
LRU page management.
Cons.
- All page_cgroups are allocated at boot.
This reaches 480MB on my 48GB box.
But I think we can ignore the "Cons.": if we use up memory, we'll use tons of
page_cgroups anyway. Considering the memory fragmentation caused by allocating lots of
very small objects, pre-allocation makes memcg better.
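(The arithmetic behind the two figures, spelled out; sizeof(struct page_cgroup) = 40 bytes comes from the fields in the posted patch on 64-bit: flags 8 + two pointers 16 + list_head 16.)
==
/*
 *   pages on a 48GB box   : 48 * 2^30 / 4096       = 12,582,912
 *   pointer in struct page: 12,582,912 *  8 bytes  =  96 MB  (saved when memcg is disabled)
 *   pre-allocated array   : 12,582,912 * 40 bytes  = 480 MB  (paid when memcg is enabled)
 */
==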
> > I'm glad if people help me to test FLATMEM/DISCONTIGMEM/SPARSEMEM because
> > there are various kinds of memory map. I have only x86-64 box.
>
> I can help test your patches on powerpc 64 bit and find a 32 bit
> system to test it as well. What do you think about the points above?
>
In Kconfig, x86-64 just uses SPARSEMEM; FLATMEM/DISCONTIGMEM cannot be selected.
I can compile those configurations by hand but cannot do real testing.
I already wrote a replacement, quite easy to read.
(Now under test.)
==
pre-allocate all page_cgroups at boot and remove the page_cgroup pointer
from struct page. This patch adds an interface:
struct page_cgroup *lookup_page_cgroup(struct page*)
All of FLATMEM/DISCONTIGMEM/SPARSEMEM and MEMORY_HOTPLUG are supported.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
include/linux/memcontrol.h | 11 -
include/linux/mm_types.h | 4
include/linux/mmzone.h | 8 +
include/linux/page_cgroup.h | 90 +++++++++++++++
mm/Makefile | 2
mm/memcontrol.c | 256 ++++++++++++--------------------------------
mm/page_alloc.c | 10 -
mm/page_cgroup.c | 200 ++++++++++++++++++++++++++++++++++
8 files changed, 374 insertions(+), 207 deletions(-)
Index: mmotm-2.6.27-rc6+/mm/page_cgroup.c
===================================================================
--- /dev/null
+++ mmotm-2.6.27-rc6+/mm/page_cgroup.c
@@ -0,0 +1,243 @@
+#include <linux/mm.h>
+#include <linux/mmzone.h>
+#include <linux/bootmem.h>
+#include <linux/bit_spinlock.h>
+#include <linux/page_cgroup.h>
+#include <linux/hash.h>
+#include <linux/memory.h>
+
+static void __meminit
+__init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
+{
+ pc->flags = 0;
+ pc->mem_cgroup = NULL;
+ pc->page = pfn_to_page(pfn);
+}
+
+#ifdef CONFIG_FLAT_NODE_MEM_MAP
+
+static unsigned long total_usage = 0;
+
+void __init pgdat_page_cgroup_init(struct pglist_data *pgdat)
+{
+ pgdat->node_page_cgroup = NULL;
+}
+
+struct page_cgroup *lookup_page_cgroup(struct page *page)
+{
+ unsigned long pfn = page_to_pfn(page);
+ unsigned long offset;
+ struct page_cgroup *base;
+
+ base = NODE_DATA(page_to_nid(page))->node_page_cgroup;
+ if (unlikely(!base))
+ return NULL;
+
+ offset = pfn - NODE_DATA(page_to_nid(page))->node_start_pfn;
+ return base + offset;
+}
+
+static int __init alloc_node_page_cgroup(int nid)
+{
+ struct page_cgroup *base, *pc;
+ unsigned long table_size;
+ unsigned long start_pfn, nr_pages, index;
+
+ start_pfn = NODE_DATA(nid)->node_start_pfn;
+ nr_pages = NODE_DATA(nid)->node_spanned_pages;
+
+ table_size = sizeof(struct page_cgroup) * nr_pages;
+
+ base = __alloc_bootmem_node_nopanic(NODE_DATA(nid),
+ table_size, PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
+ if (!base)
+ return -ENOMEM;
+ for (index = 0; index < nr_pages; index++) {
+ pc = base + index;
+ __init_page_cgroup(pc, start_pfn + index);
+ }
+ NODE_DATA(nid)->node_page_cgroup = base;
+ total_usage += table_size;
+ return 0;
+}
+
+void __init free_node_page_cgroup(int nid)
+{
+ unsigned long table_size;
+ unsigned long nr_pages;
+ struct page_cgroup *base;
+
+ base = NODE_DATA(nid)->node_page_cgroup;
+ if (!base)
+ return;
+ nr_pages = NODE_DATA(nid)->node_spanned_pages;
+
+ table_size = sizeof(struct page_cgroup) * nr_pages;
+
+ free_bootmem_node(NODE_DATA(nid),
+ (unsigned long)base, table_size);
+ NODE_DATA(nid)->node_page_cgroup = NULL;
+}
+
+void __init page_cgroup_init(void)
+{
+
+ int nid, fail;
+
+ for_each_online_node(nid) {
+ fail = alloc_node_page_cgroup(nid);
+ if (fail)
+ goto fail;
+ }
+ printk("allocate page_cgroup at boot for %ld bytes", total_usage);
+ printk("please try cgroup_disable=memory option if you don't want\n");
+ return;
+fail:
+ printk("allocation of page_cgroup was failed.\n");
+ printk("please try cgroup_disable=memory boot option\n");
+ panic("Out of memory");
+}
+
+#else /* CONFIG_FLAT_NODE_MEM_MAP */
+
+struct page_cgroup *lookup_page_cgroup(struct page *page)
+{
+ unsigned long pfn = page_to_pfn(page);
+ struct mem_section *section = __pfn_to_section(pfn);
+
+ return section->page_cgroup + pfn;
+}
+
+int __meminit init_section_page_cgroup(unsigned long pfn)
+{
+ struct mem_section *section;
+ struct page_cgroup *base, *pc;
+ unsigned long total_size;
+ int nid, index;
+
+ section = __pfn_to_section(pfn);
+
+ if (section->page_cgroup)
+ return 0;
+
+ nid = page_to_nid(pfn_to_page(pfn));
+
+ total_size = sizeof(struct page_cgroup) * PAGES_PER_SECTION;
+ base = vmalloc_node(total_size, nid);
+ if (!base) {
+ printk("page cgroup allocation failure\n");
+ return -ENOMEM;
+ }
+ for (index = 0; index < PAGES_PER_SECTION; index++) {
+ pc = base + index;
+ __init_page_cgroup(pc, pfn + index);
+ }
+
+ section = __pfn_to_section(pfn);
+ section->page_cgroup = base - pfn;
+ return 0;
+}
+#ifdef CONFIG_MEMORY_HOTPLUG
+void __free_page_cgroup(unsigned long pfn)
+{
+ struct mem_section *ms;
+ struct page_cgroup *base;
+
+ ms = __pfn_to_section(pfn);
+ if (!ms || !ms->page_cgroup)
+ return;
+ base = ms->page_cgroup + pfn;
+ ms->page_cgroup = NULL;
+ vfree(base);
+}
+
+int online_page_cgroup(unsigned long start_pfn,
+ unsigned long nr_pages,
+ int nid)
+{
+ unsigned long start, end, pfn;
+ int fail = 0;
+
+ start = start_pfn & ~(PAGES_PER_SECTION - 1);
+ end = ALIGN(start_pfn + nr_pages, PAGES_PER_SECTION);
+
+ for (pfn = start; !fail && pfn < end; pfn += PAGES_PER_SECTION) {
+ if (!pfn_present(pfn))
+ continue;
+ fail = init_section_page_cgroup(pfn);
+ }
+ if (!fail)
+ return 0;
+
+ /* rollback */
+ for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION)
+ __free_page_cgroup(pfn);
+
+ return -ENOMEM;
+}
+
+int offline_page_cgroup(unsigned long start_pfn,
+ unsigned long nr_pages, int nid)
+{
+ unsigned long start, end, pfn;
+
+ start = start_pfn & ~(PAGES_PER_SECTION - 1);
+ end = ALIGN(start_pfn + nr_pages, PAGES_PER_SECTION);
+
+ for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION)
+ __free_page_cgroup(pfn);
+ return 0;
+
+}
+
+static int page_cgroup_callback(struct notifier_block *self,
+ unsigned long action, void *arg)
+{
+ struct memory_notify *mn = arg;
+ int ret = 0;
+ switch (action) {
+ case MEM_GOING_ONLINE:
+ ret = online_page_cgroup(mn->start_pfn,
+ mn->nr_pages, mn->status_change_nid);
+ break;
+ case MEM_CANCEL_ONLINE:
+ case MEM_OFFLINE:
+ offline_page_cgroup(mn->start_pfn,
+ mn->nr_pages, mn->status_change_nid);
+ break;
+ case MEM_GOING_OFFLINE:
+ break;
+ case MEM_ONLINE:
+ case MEM_CANCEL_OFFLINE:
+ break;
+ }
+ ret = notifier_from_errno(ret);
+ return ret;
+}
+
+#endif
+
+void __init page_cgroup_init(void)
+{
+ unsigned long pfn;
+ int fail = 0;
+
+ for (pfn = 0; !fail && pfn < max_pfn; pfn += PAGES_PER_SECTION) {
+ if (!pfn_present(pfn))
+ continue;
+ fail = init_section_page_cgroup(pfn);
+ }
+ if (fail) {
+ printk("please try cgroup_disable=memory boot option\n");
+ panic("Out of memory");
+ } else {
+ hotplug_memory_notifier(page_cgroup_callback, 0);
+ }
+}
+
+void __init pgdat_page_cgroup_init(struct pglist_data *pgdat)
+{
+ return;
+}
+
+#endif
Index: mmotm-2.6.27-rc6+/include/linux/mm_types.h
===================================================================
--- mmotm-2.6.27-rc6+.orig/include/linux/mm_types.h
+++ mmotm-2.6.27-rc6+/include/linux/mm_types.h
@@ -94,10 +94,6 @@ struct page {
void *virtual; /* Kernel virtual address (NULL if
not kmapped, ie. highmem) */
#endif /* WANT_PAGE_VIRTUAL */
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
- unsigned long page_cgroup;
-#endif
-
#ifdef CONFIG_KMEMCHECK
void *shadow;
#endif
Index: mmotm-2.6.27-rc6+/mm/Makefile
===================================================================
--- mmotm-2.6.27-rc6+.orig/mm/Makefile
+++ mmotm-2.6.27-rc6+/mm/Makefile
@@ -34,6 +34,6 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_SMP) += allocpercpu.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
-obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
+obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
obj-$(CONFIG_CGROUP_MEMRLIMIT_CTLR) += memrlimitcgroup.o
obj-$(CONFIG_KMEMTRACE) += kmemtrace.o
Index: mmotm-2.6.27-rc6+/include/linux/page_cgroup.h
===================================================================
--- /dev/null
+++ mmotm-2.6.27-rc6+/include/linux/page_cgroup.h
@@ -0,0 +1,90 @@
+#ifndef __LINUX_PAGE_CGROUP_H
+#define __LINUX_PAGE_CGROUP_H
+
+/*
+ * Page Cgroup can be considered as an extended mem_map.
+ * A page_cgroup page is associated with every page descriptor. The
+ * page_cgroup helps us identify information about the cgroup
+ * All page cgroups are allocated at boot or memory hotplug event,
+ * then the page cgroup for pfn always exists.
+ */
+struct page_cgroup {
+ unsigned long flags;
+ struct mem_cgroup *mem_cgroup;
+ struct page *page;
+ struct list_head lru; /* per cgroup LRU list */
+};
+
+void __init pgdat_page_cgroup_init(struct pglist_data *pgdat);
+void __init page_cgroup_init(void);
+struct page_cgroup *lookup_page_cgroup(struct page *page);
+
+enum {
+ /* flags for mem_cgroup */
+ PCG_LOCK, /* page cgroup is locked */
+ PCG_CACHE, /* charged as cache */
+ PCG_USED, /* this object is in use. */
+ /* flags for LRU placement */
+ PCG_ACTIVE, /* page is active in this cgroup */
+ PCG_FILE, /* page is file system backed */
+ PCG_UNEVICTABLE, /* page is unevictableable */
+};
+
+#define TESTPCGFLAG(uname, lname) \
+static inline int PageCgroup##uname(struct page_cgroup *pc) \
+ { return test_bit(PCG_##lname, &pc->flags); }
+
+#define SETPCGFLAG(uname, lname) \
+static inline void SetPageCgroup##uname(struct page_cgroup *pc)\
+ { set_bit(PCG_##lname, &pc->flags); }
+
+#define CLEARPCGFLAG(uname, lname) \
+static inline void ClearPageCgroup##uname(struct page_cgroup *pc) \
+ { clear_bit(PCG_##lname, &pc->flags); }
+
+/* Cache flag is set only once (at allocation) */
+TESTPCGFLAG(Cache, CACHE)
+
+TESTPCGFLAG(Used, USED)
+CLEARPCGFLAG(Used, USED)
+
+/* LRU management flags (from global-lru definition) */
+TESTPCGFLAG(File, FILE)
+SETPCGFLAG(File, FILE)
+CLEARPCGFLAG(File, FILE)
+
+TESTPCGFLAG(Active, ACTIVE)
+SETPCGFLAG(Active, ACTIVE)
+CLEARPCGFLAG(Active, ACTIVE)
+
+TESTPCGFLAG(Unevictable, UNEVICTABLE)
+SETPCGFLAG(Unevictable, UNEVICTABLE)
+CLEARPCGFLAG(Unevictable, UNEVICTABLE)
+
+static inline int page_cgroup_nid(struct page_cgroup *pc)
+{
+ return page_to_nid(pc->page);
+}
+
+static inline enum zone_type page_cgroup_zid(struct page_cgroup *pc)
+{
+ return page_zonenum(pc->page);
+}
+
+static inline void lock_page_cgroup(struct page_cgroup *pc)
+{
+ bit_spin_lock(PCG_LOCK, &pc->flags);
+}
+
+static inline int trylock_page_cgroup(struct page_cgroup *pc)
+{
+ return bit_spin_trylock(PCG_LOCK, &pc->flags);
+}
+
+static inline void unlock_page_cgroup(struct page_cgroup *pc)
+{
+ bit_spin_unlock(PCG_LOCK, &pc->flags);
+}
+
+
+#endif
Index: mmotm-2.6.27-rc6+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc6+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc6+/mm/memcontrol.c
@@ -34,11 +34,11 @@
#include <linux/seq_file.h>
#include <linux/vmalloc.h>
#include <linux/mm_inline.h>
+#include <linux/page_cgroup.h>
#include <asm/uaccess.h>
struct cgroup_subsys mem_cgroup_subsys __read_mostly;
-static struct kmem_cache *page_cgroup_cache __read_mostly;
#define MEM_CGROUP_RECLAIM_RETRIES 5
/*
@@ -138,80 +138,6 @@ static struct mem_cgroup init_mem_cgroup
#define is_root_cgroup(cgrp) ((cgrp) == &init_mem_cgroup)
-
-/*
- * We use the lower bit of the page->page_cgroup pointer as a bit spin
- * lock. We need to ensure that page->page_cgroup is at least two
- * byte aligned (based on comments from Nick Piggin). But since
- * bit_spin_lock doesn't actually set that lock bit in a non-debug
- * uniprocessor kernel, we should avoid setting it here too.
- */
-#define PAGE_CGROUP_LOCK_BIT 0x0
-#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
-#define PAGE_CGROUP_LOCK (1 << PAGE_CGROUP_LOCK_BIT)
-#else
-#define PAGE_CGROUP_LOCK 0x0
-#endif
-
-/*
- * A page_cgroup page is associated with every page descriptor. The
- * page_cgroup helps us identify information about the cgroup
- */
-struct page_cgroup {
- struct list_head lru; /* per cgroup LRU list */
- struct page *page;
- struct mem_cgroup *mem_cgroup;
- unsigned long flags;
-};
-
-enum {
- /* flags for mem_cgroup */
- PCG_CACHE, /* charged as cache */
- /* flags for LRU placement */
- PCG_ACTIVE, /* page is active in this cgroup */
- PCG_FILE, /* page is file system backed */
- PCG_UNEVICTABLE, /* page is unevictableable */
-};
-
-#define TESTPCGFLAG(uname, lname) \
-static inline int PageCgroup##uname(struct page_cgroup *pc) \
- { return test_bit(PCG_##lname, &pc->flags); }
-
-#define SETPCGFLAG(uname, lname) \
-static inline void SetPageCgroup##uname(struct page_cgroup *pc)\
- { set_bit(PCG_##lname, &pc->flags); }
-
-#define CLEARPCGFLAG(uname, lname) \
-static inline void ClearPageCgroup##uname(struct page_cgroup *pc) \
- { clear_bit(PCG_##lname, &pc->flags); }
-
-
-/* Cache flag is set only once (at allocation) */
-TESTPCGFLAG(Cache, CACHE)
-
-/* LRU management flags (from global-lru definition) */
-TESTPCGFLAG(File, FILE)
-SETPCGFLAG(File, FILE)
-CLEARPCGFLAG(File, FILE)
-
-TESTPCGFLAG(Active, ACTIVE)
-SETPCGFLAG(Active, ACTIVE)
-CLEARPCGFLAG(Active, ACTIVE)
-
-TESTPCGFLAG(Unevictable, UNEVICTABLE)
-SETPCGFLAG(Unevictable, UNEVICTABLE)
-CLEARPCGFLAG(Unevictable, UNEVICTABLE)
-
-static int page_cgroup_nid(struct page_cgroup *pc)
-{
- return page_to_nid(pc->page);
-}
-
-static enum zone_type page_cgroup_zid(struct page_cgroup *pc)
-{
- return page_zonenum(pc->page);
-}
-
enum charge_type {
MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
MEM_CGROUP_CHARGE_TYPE_MAPPED,
@@ -222,9 +148,9 @@ enum charge_type {
static const unsigned long
pcg_default_flags[NR_CHARGE_TYPE] = {
- ((1 << PCG_CACHE) | (1 << PCG_FILE)),
- ((1 << PCG_ACTIVE)),
- ((1 << PCG_ACTIVE) | (1 << PCG_CACHE)),
+ (1 << PCG_CACHE) | (1 << PCG_FILE) | (1 << PCG_USED) | (1 << PCG_LOCK),
+ (1 << PCG_ACTIVE) | (1 << PCG_LOCK) | (1 << PCG_USED),
+ (1 << PCG_ACTIVE) | (1 << PCG_CACHE) | (1 << PCG_USED)| (1 << PCG_LOCK),
0,
};
@@ -307,37 +233,6 @@ struct mem_cgroup *mem_cgroup_from_task(
struct mem_cgroup, css);
}
-static inline int page_cgroup_locked(struct page *page)
-{
- return bit_spin_is_locked(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
-static void page_assign_page_cgroup(struct page *page, struct page_cgroup *pc)
-{
- VM_BUG_ON(!page_cgroup_locked(page));
- page->page_cgroup = ((unsigned long)pc | PAGE_CGROUP_LOCK);
-}
-
-struct page_cgroup *page_get_page_cgroup(struct page *page)
-{
- return (struct page_cgroup *) (page->page_cgroup & ~PAGE_CGROUP_LOCK);
-}
-
-static void lock_page_cgroup(struct page *page)
-{
- bit_spin_lock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
-static int try_lock_page_cgroup(struct page *page)
-{
- return bit_spin_trylock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
-static void unlock_page_cgroup(struct page *page)
-{
- bit_spin_unlock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
static void __mem_cgroup_remove_list(struct mem_cgroup_per_zone *mz,
struct page_cgroup *pc)
{
@@ -441,22 +336,19 @@ void mem_cgroup_move_lists(struct page *
* safely get to page_cgroup without it, so just try_lock it:
* mem_cgroup_isolate_pages allows for page left on wrong list.
*/
- if (!try_lock_page_cgroup(page))
+ pc = lookup_page_cgroup(page);
+
+ if (!trylock_page_cgroup(pc))
return;
- pc = page_get_page_cgroup(page);
- if (pc) {
+ if (PageCgroupUsed(pc)) {
mem = pc->mem_cgroup;
mz = page_cgroup_zoneinfo(pc);
spin_lock_irqsave(&mz->lru_lock, flags);
- /*
- * check against the race with move_account.
- */
- if (likely(mem == pc->mem_cgroup))
- __mem_cgroup_move_lists(pc, lru);
+ __mem_cgroup_move_lists(pc, lru);
spin_unlock_irqrestore(&mz->lru_lock, flags);
}
- unlock_page_cgroup(page);
+ unlock_page_cgroup(pc);
}
/*
@@ -543,6 +435,8 @@ unsigned long mem_cgroup_isolate_pages(u
list_for_each_entry_safe_reverse(pc, tmp, src, lru) {
if (scan >= nr_to_scan)
break;
+ if (unlikely(!PageCgroupUsed(pc)))
+ continue;
page = pc->page;
if (unlikely(!PageLRU(page)))
@@ -611,12 +505,12 @@ int mem_cgroup_move_account(struct page
/* Now, we assume no_limit...no failure here. */
return ret;
}
- if (!try_lock_page_cgroup(page)) {
+ if (!trylock_page_cgroup(pc)) {
res_counter_uncharge(&to->res, PAGE_SIZE);
return ret;
}
- if (page_get_page_cgroup(page) != pc) {
+ if (!PageCgroupUsed(pc)) {
res_counter_uncharge(&to->res, PAGE_SIZE);
goto out;
}
@@ -634,7 +528,7 @@ int mem_cgroup_move_account(struct page
res_counter_uncharge(&to->res, PAGE_SIZE);
}
out:
- unlock_page_cgroup(page);
+ unlock_page_cgroup(pc);
return ret;
}
@@ -651,26 +545,27 @@ static int mem_cgroup_charge_common(stru
{
struct mem_cgroup *mem;
struct page_cgroup *pc;
- unsigned long flags;
unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
struct mem_cgroup_per_zone *mz;
+ unsigned long flags;
- pc = kmem_cache_alloc(page_cgroup_cache, gfp_mask);
- if (unlikely(pc == NULL))
- goto err;
-
+ pc = lookup_page_cgroup(page);
+ /* can happen at boot */
+ if (unlikely(!pc))
+ return 0;
+ prefetchw(pc);
/*
* We always charge the cgroup the mm_struct belongs to.
* The mm_struct's mem_cgroup changes on task migration if the
* thread group leader migrates. It's possible that mm is not
* set, if so charge the init_mm (happens for pagecache usage).
*/
+
if (likely(!memcg)) {
rcu_read_lock();
mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
if (unlikely(!mem)) {
rcu_read_unlock();
- kmem_cache_free(page_cgroup_cache, pc);
return 0;
}
/*
@@ -706,36 +601,34 @@ static int mem_cgroup_charge_common(stru
}
}
+ preempt_disable();
+ lock_page_cgroup(pc);
+ if (unlikely(PageCgroupUsed(pc))) {
+ unlock_page_cgroup(pc);
+ res_counter_uncharge(&mem->res, PAGE_SIZE);
+ css_put(&mem->css);
+ preempt_enable();
+ goto done;
+ }
pc->mem_cgroup = mem;
- pc->page = page;
/*
* If a page is accounted as a page cache, insert to inactive list.
* If anon, insert to active list.
*/
pc->flags = pcg_default_flags[ctype];
- lock_page_cgroup(page);
- if (unlikely(page_get_page_cgroup(page))) {
- unlock_page_cgroup(page);
- res_counter_uncharge(&mem->res, PAGE_SIZE);
- css_put(&mem->css);
- kmem_cache_free(page_cgroup_cache, pc);
- goto done;
- }
- page_assign_page_cgroup(page, pc);
-
mz = page_cgroup_zoneinfo(pc);
+
spin_lock_irqsave(&mz->lru_lock, flags);
__mem_cgroup_add_list(mz, pc);
spin_unlock_irqrestore(&mz->lru_lock, flags);
+ unlock_page_cgroup(pc);
+ preempt_enable();
- unlock_page_cgroup(page);
done:
return 0;
out:
css_put(&mem->css);
- kmem_cache_free(page_cgroup_cache, pc);
-err:
return -ENOMEM;
}
@@ -743,7 +636,8 @@ int mem_cgroup_charge(struct page *page,
{
if (mem_cgroup_subsys.disabled)
return 0;
-
+ if (PageCompound(page))
+ return 0;
/*
* If already mapped, we don't have to account.
* If page cache, page->mapping has address_space.
@@ -764,7 +658,8 @@ int mem_cgroup_cache_charge(struct page
{
if (mem_cgroup_subsys.disabled)
return 0;
-
+ if (PageCompound(page))
+ return 0;
/*
* Corner case handling. This is called from add_to_page_cache()
* in usual. But some FS (shmem) precharges this page before calling it
@@ -777,15 +672,16 @@ int mem_cgroup_cache_charge(struct page
if (!(gfp_mask & __GFP_WAIT)) {
struct page_cgroup *pc;
- lock_page_cgroup(page);
- pc = page_get_page_cgroup(page);
- if (pc) {
- VM_BUG_ON(pc->page != page);
- VM_BUG_ON(!pc->mem_cgroup);
- unlock_page_cgroup(page);
+
+ pc = lookup_page_cgroup(page);
+ if (!pc)
+ return 0;
+ lock_page_cgroup(pc);
+ if (PageCgroupUsed(pc)) {
+ unlock_page_cgroup(pc);
return 0;
}
- unlock_page_cgroup(page);
+ unlock_page_cgroup(pc);
}
if (unlikely(!mm))
@@ -812,49 +708,41 @@ __mem_cgroup_uncharge_common(struct page
if (mem_cgroup_subsys.disabled)
return;
+ /* check the condition we can know from page */
- /*
- * Check if our page_cgroup is valid
- */
- lock_page_cgroup(page);
- pc = page_get_page_cgroup(page);
- if (unlikely(!pc))
- goto unlock;
-
- VM_BUG_ON(pc->page != page);
+ pc = lookup_page_cgroup(page);
+ if (unlikely(!pc || !PageCgroupUsed(pc)))
+ return;
+ preempt_disable();
+ lock_page_cgroup(pc);
+ if (unlikely(page_mapped(page))) {
+ unlock_page_cgroup(pc);
+ preempt_enable();
+ return;
+ }
+ ClearPageCgroupUsed(pc);
+ unlock_page_cgroup(pc);
- if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
- && ((PageCgroupCache(pc) || page_mapped(page))))
- goto unlock;
-retry:
mem = pc->mem_cgroup;
mz = page_cgroup_zoneinfo(pc);
+
spin_lock_irqsave(&mz->lru_lock, flags);
- if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED &&
- unlikely(mem != pc->mem_cgroup)) {
- /* MAPPED account can be done without lock_page().
- Check race with mem_cgroup_move_account() */
- spin_unlock_irqrestore(&mz->lru_lock, flags);
- goto retry;
- }
__mem_cgroup_remove_list(mz, pc);
spin_unlock_irqrestore(&mz->lru_lock, flags);
-
- page_assign_page_cgroup(page, NULL);
- unlock_page_cgroup(page);
-
-
- res_counter_uncharge(&mem->res, PAGE_SIZE);
+ pc->mem_cgroup = NULL;
css_put(&mem->css);
+ preempt_enable();
+ res_counter_uncharge(&mem->res, PAGE_SIZE);
- kmem_cache_free(page_cgroup_cache, pc);
return;
-unlock:
- unlock_page_cgroup(page);
}
void mem_cgroup_uncharge_page(struct page *page)
{
+ if (page_mapped(page))
+ return;
+ if (page->mapping && !PageAnon(page))
+ return;
__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_MAPPED);
}
@@ -878,9 +766,9 @@ int mem_cgroup_prepare_migration(struct
if (mem_cgroup_subsys.disabled)
return 0;
- lock_page_cgroup(page);
- pc = page_get_page_cgroup(page);
- if (pc) {
+ pc = lookup_page_cgroup(page);
+ lock_page_cgroup(pc);
+ if (PageCgroupUsed(pc)) {
mem = pc->mem_cgroup;
css_get(&mem->css);
if (PageCgroupCache(pc)) {
@@ -890,7 +778,7 @@ int mem_cgroup_prepare_migration(struct
ctype = MEM_CGROUP_CHARGE_TYPE_SHMEM;
}
}
- unlock_page_cgroup(page);
+ unlock_page_cgroup(pc);
if (mem) {
ret = mem_cgroup_charge_common(newpage, NULL, GFP_KERNEL,
ctype, mem);
@@ -1271,8 +1159,8 @@ mem_cgroup_create(struct cgroup_subsys *
int node;
if (unlikely((cont->parent) == NULL)) {
+ page_cgroup_init();
mem = &init_mem_cgroup;
- page_cgroup_cache = KMEM_CACHE(page_cgroup, SLAB_PANIC);
} else {
mem = mem_cgroup_alloc();
if (!mem)
Index: mmotm-2.6.27-rc6+/mm/page_alloc.c
===================================================================
--- mmotm-2.6.27-rc6+.orig/mm/page_alloc.c
+++ mmotm-2.6.27-rc6+/mm/page_alloc.c
@@ -223,17 +223,12 @@ static inline int bad_range(struct zone
static void bad_page(struct page *page)
{
- void *pc = page_get_page_cgroup(page);
-
printk(KERN_EMERG "Bad page state in process '%s'\n" KERN_EMERG
"page:%p flags:0x%0*lx mapping:%p mapcount:%d count:%d\n",
current->comm, page, (int)(2*sizeof(unsigned long)),
(unsigned long)page->flags, page->mapping,
page_mapcount(page), page_count(page));
- if (pc) {
- printk(KERN_EMERG "cgroup:%p\n", pc);
- page_reset_bad_cgroup(page);
- }
+
printk(KERN_EMERG "Trying to fix it up, but a reboot is needed\n"
KERN_EMERG "Backtrace:\n");
dump_stack();
@@ -472,7 +467,6 @@ static inline void free_pages_check(stru
free_page_mlock(page);
if (unlikely(page_mapcount(page) |
(page->mapping != NULL) |
- (page_get_page_cgroup(page) != NULL) |
(page_count(page) != 0) |
(page->flags & PAGE_FLAGS_CHECK_AT_FREE)))
bad_page(page);
@@ -609,7 +603,6 @@ static void prep_new_page(struct page *p
{
if (unlikely(page_mapcount(page) |
(page->mapping != NULL) |
- (page_get_page_cgroup(page) != NULL) |
(page_count(page) != 0) |
(page->flags & PAGE_FLAGS_CHECK_AT_PREP)))
bad_page(page);
@@ -3495,6 +3488,7 @@ static void __paginginit free_area_init_
pgdat->nr_zones = 0;
init_waitqueue_head(&pgdat->kswapd_wait);
pgdat->kswapd_max_order = 0;
+ pgdat_page_cgroup_init(pgdat);
for (j = 0; j < MAX_NR_ZONES; j++) {
struct zone *zone = pgdat->node_zones + j;
Index: mmotm-2.6.27-rc6+/include/linux/mmzone.h
===================================================================
--- mmotm-2.6.27-rc6+.orig/include/linux/mmzone.h
+++ mmotm-2.6.27-rc6+/include/linux/mmzone.h
@@ -604,6 +604,9 @@ typedef struct pglist_data {
int nr_zones;
#ifdef CONFIG_FLAT_NODE_MEM_MAP
struct page *node_mem_map;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+ struct page_cgroup *node_page_cgroup;
+#endif
#endif
struct bootmem_data *bdata;
#ifdef CONFIG_MEMORY_HOTPLUG
@@ -932,6 +935,7 @@ static inline unsigned long early_pfn_to
#endif
struct page;
+struct page_cgroup;
struct mem_section {
/*
* This is, logically, a pointer to an array of struct
@@ -949,6 +953,10 @@ struct mem_section {
/* See declaration of similar field in struct zone */
unsigned long *pageblock_flags;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+ /* see page_cgroup.h */
+ struct page_cgroup *page_cgroup;
+#endif
};
#ifdef CONFIG_SPARSEMEM_EXTREME
Index: mmotm-2.6.27-rc6+/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.27-rc6+.orig/include/linux/memcontrol.h
+++ mmotm-2.6.27-rc6+/include/linux/memcontrol.h
@@ -29,7 +29,6 @@ struct mm_struct;
#define page_reset_bad_cgroup(page) ((page)->page_cgroup = 0)
-extern struct page_cgroup *page_get_page_cgroup(struct page *page);
extern int mem_cgroup_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
@@ -72,16 +71,8 @@ extern void mem_cgroup_record_reclaim_pr
extern long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, struct zone *zone,
int priority, enum lru_list lru);
-#else /* CONFIG_CGROUP_MEM_RES_CTLR */
-static inline void page_reset_bad_cgroup(struct page *page)
-{
-}
-
-static inline struct page_cgroup *page_get_page_cgroup(struct page *page)
-{
- return NULL;
-}
+#else /* CONFIG_CGROUP_MEM_RES_CTLR */
static inline int mem_cgroup_charge(struct page *page,
struct mm_struct *mm, gfp_t gfp_mask)
{
* Re: [PATCH 3.6/13] memcg: add function to move account.
2008-09-22 11:09 ` [PATCH 3.6/13] memcg: add function to move account KAMEZAWA Hiroyuki
@ 2008-09-24 6:50 ` Daisuke Nishimura
2008-09-24 7:11 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 37+ messages in thread
From: Daisuke Nishimura @ 2008-09-24 6:50 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: nishimura, linux-mm, balbir, xemul, LKML
On Mon, 22 Sep 2008 20:09:48 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> Sorry, this patch is after "3.5", before "4"....
>
> ==
> This patch provides a function to move account information of a page between
> mem_cgroups.
>
> This moving of a page_cgroup is done while
> - the page is locked, and
> - the lru_lock of the source/destination mem_cgroup is held.
>
> Then, a routine which touches pc->mem_cgroup without lock_page() should
> confirm that pc->mem_cgroup is still valid. Typical code looks like the following.
>
> (while page is not under lock_page())
> mem = pc->mem_cgroup;
> mz = page_cgroup_zoneinfo(pc)
> spin_lock_irqsave(&mz->lru_lock);
> if (pc->mem_cgroup == mem)
> ...../* some list handling */
> spin_unlock_irq(&mz->lru_lock);
>
Is this check needed?
Both move_lists and move_account take the page_cgroup lock.
Thanks,
Daisuke Nishimura.
> If you find page_cgroup from mem_cgroup's LRU under mz->lru_lock, you don't
> have to worry about anything.
>
> Changelog: (v3) -> (v4)
> - no changes.
> Changelog: (v2) -> (v3)
> - added lock_page_cgroup().
> - split out from the new force_empty patch.
> - added how-to-use text.
> - fixed race in __mem_cgroup_uncharge_common().
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> mm/memcontrol.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 82 insertions(+), 3 deletions(-)
>
> Index: mmotm-2.6.27-rc6+/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.27-rc6+.orig/mm/memcontrol.c
> +++ mmotm-2.6.27-rc6+/mm/memcontrol.c
> @@ -424,6 +424,7 @@ int task_in_mem_cgroup(struct task_struc
> void mem_cgroup_move_lists(struct page *page, enum lru_list lru)
> {
> struct page_cgroup *pc;
> + struct mem_cgroup *mem;
> struct mem_cgroup_per_zone *mz;
> unsigned long flags;
>
> @@ -442,9 +443,14 @@ void mem_cgroup_move_lists(struct page *
>
> pc = page_get_page_cgroup(page);
> if (pc) {
> + mem = pc->mem_cgroup;
> mz = page_cgroup_zoneinfo(pc);
> spin_lock_irqsave(&mz->lru_lock, flags);
> - __mem_cgroup_move_lists(pc, lru);
> + /*
> + * check against the race with move_account.
> + */
> + if (likely(mem == pc->mem_cgroup))
> + __mem_cgroup_move_lists(pc, lru);
> spin_unlock_irqrestore(&mz->lru_lock, flags);
> }
> unlock_page_cgroup(page);
> @@ -565,6 +571,71 @@ unsigned long mem_cgroup_isolate_pages(u
> return nr_taken;
> }
>
> +/**
> + * mem_cgroup_move_account - move account of the page
> + * @page ... the target page of being moved.
> + * @pc ... page_cgroup of the page.
> + * @from ... mem_cgroup which the page is moved from.
> + * @to ... mem_cgroup which the page is moved to.
> + *
> + * The caller must confirm following.
> + * 1. lock the page by lock_page().
> + * 2. disable irq.
> + * 3. lru_lock of old mem_cgroup should be held.
> + * 4. pc is guaranteed to be valid and on mem_cgroup's LRU.
> + *
> + * Because we cannot call try_to_free_page() here, the caller must guarantee
> + * this moving of change never fails. Currently this is called only against
> + * root cgroup, which has no limitation of resource.
> + * Returns 0 at success, returns 1 at failure.
> + */
> +int mem_cgroup_move_account(struct page *page, struct page_cgroup *pc,
> + struct mem_cgroup *from, struct mem_cgroup *to)
> +{
> + struct mem_cgroup_per_zone *from_mz, *to_mz;
> + int nid, zid;
> + int ret = 1;
> +
> + VM_BUG_ON(!irqs_disabled());
> + VM_BUG_ON(!PageLocked(page));
> +
> + nid = page_to_nid(page);
> + zid = page_zonenum(page);
> + from_mz = mem_cgroup_zoneinfo(from, nid, zid);
> + to_mz = mem_cgroup_zoneinfo(to, nid, zid);
> +
> + if (res_counter_charge(&to->res, PAGE_SIZE)) {
> + /* Now, we assume no_limit...no failure here. */
> + return ret;
> + }
> + if (!try_lock_page_cgroup(page)) {
> + res_counter_uncharge(&to->res, PAGE_SIZE);
> + return ret;
> + }
> +
> + if (page_get_page_cgroup(page) != pc) {
> + res_counter_uncharge(&to->res, PAGE_SIZE);
> + goto out;
> + }
> +
> + if (spin_trylock(&to_mz->lru_lock)) {
> + __mem_cgroup_remove_list(from_mz, pc);
> + css_put(&from->css);
> + res_counter_uncharge(&from->res, PAGE_SIZE);
> + pc->mem_cgroup = to;
> + css_get(&to->css);
> + __mem_cgroup_add_list(to_mz, pc);
> + ret = 0;
> + spin_unlock(&to_mz->lru_lock);
> + } else {
> + res_counter_uncharge(&to->res, PAGE_SIZE);
> + }
> +out:
> + unlock_page_cgroup(page);
> +
> + return ret;
> +}
> +
> /*
> * Charge the memory controller for page usage.
> * Return
> @@ -752,16 +823,24 @@ __mem_cgroup_uncharge_common(struct page
> if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
> && ((PageCgroupCache(pc) || page_mapped(page))))
> goto unlock;
> -
> +retry:
> + mem = pc->mem_cgroup;
> mz = page_cgroup_zoneinfo(pc);
> spin_lock_irqsave(&mz->lru_lock, flags);
> + if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED &&
> + unlikely(mem != pc->mem_cgroup)) {
> + /* MAPPED account can be done without lock_page().
> + Check race with mem_cgroup_move_account() */
> + spin_unlock_irqrestore(&mz->lru_lock, flags);
> + goto retry;
> + }
> __mem_cgroup_remove_list(mz, pc);
> spin_unlock_irqrestore(&mz->lru_lock, flags);
>
> page_assign_page_cgroup(page, NULL);
> unlock_page_cgroup(page);
>
> - mem = pc->mem_cgroup;
> +
> res_counter_uncharge(&mem->res, PAGE_SIZE);
> css_put(&mem->css);
>
>
* Re: [PATCH 3.6/13] memcg: add function to move account.
2008-09-24 6:50 ` Daisuke Nishimura
@ 2008-09-24 7:11 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 37+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-09-24 7:11 UTC (permalink / raw)
To: Daisuke Nishimura; +Cc: linux-mm, balbir, xemul, LKML
On Wed, 24 Sep 2008 15:50:11 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> On Mon, 22 Sep 2008 20:09:48 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > Sorry, this patch is after "3.5", before "4"....
> >
> > ==
> > This patch provides a function to move account information of a page between
> > mem_cgroups.
> >
> > This moving of a page_cgroup is done while
> > - the page is locked, and
> > - the lru_lock of the source/destination mem_cgroup is held.
> >
> > Then, a routine which touches pc->mem_cgroup without lock_page() should
> > confirm that pc->mem_cgroup is still valid. Typical code looks like the following.
> >
> > (while page is not under lock_page())
> > mem = pc->mem_cgroup;
> > mz = page_cgroup_zoneinfo(pc)
> > spin_lock_irqsave(&mz->lru_lock);
> > if (pc->mem_cgroup == mem)
> > ...../* some list handling */
> > spin_unlock_irq(&mz->lru_lock);
> >
>
> Is this check needed?
> Both move_lists and move_account take the page_cgroup lock.
>
__mem_cgroup_move_lists() itself doesn't take it.
But yes, if you know what each caller does, you can reduce the checks.
The above is just an example.
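
For instance, when the caller already holds the page_cgroup bit lock, the
recheck can be dropped. A condensed sketch of what mem_cgroup_move_lists()
does in this series (error handling trimmed):

	pc = lookup_page_cgroup(page);
	if (!trylock_page_cgroup(pc))
		return;
	if (PageCgroupUsed(pc)) {
		mz = page_cgroup_zoneinfo(pc);
		spin_lock_irqsave(&mz->lru_lock, flags);
		/*
		 * No "mem == pc->mem_cgroup" recheck needed here:
		 * move_account() is excluded by the bit lock we hold.
		 */
		__mem_cgroup_move_lists(pc, lru);
		spin_unlock_irqrestore(&mz->lru_lock, flags);
	}
	unlock_page_cgroup(pc);
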
Thanks,
-Kame
>
> Thanks,
> Daisuke Nishimura.
>
> > If you find page_cgroup from mem_cgroup's LRU under mz->lru_lock, you don't
> > have to worry about anything.
> >
> > Changelog: (v3) -> (v4)
> > - no changes.
> > Changelog: (v2) -> (v3)
> > - added lock_page_cgroup().
> > - split out from the new force_empty patch.
> > - added how-to-use text.
> > - fixed race in __mem_cgroup_uncharge_common().
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >
> > mm/memcontrol.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
> > 1 file changed, 82 insertions(+), 3 deletions(-)
> >
> > Index: mmotm-2.6.27-rc6+/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-2.6.27-rc6+.orig/mm/memcontrol.c
> > +++ mmotm-2.6.27-rc6+/mm/memcontrol.c
> > @@ -424,6 +424,7 @@ int task_in_mem_cgroup(struct task_struc
> > void mem_cgroup_move_lists(struct page *page, enum lru_list lru)
> > {
> > struct page_cgroup *pc;
> > + struct mem_cgroup *mem;
> > struct mem_cgroup_per_zone *mz;
> > unsigned long flags;
> >
> > @@ -442,9 +443,14 @@ void mem_cgroup_move_lists(struct page *
> >
> > pc = page_get_page_cgroup(page);
> > if (pc) {
> > + mem = pc->mem_cgroup;
> > mz = page_cgroup_zoneinfo(pc);
> > spin_lock_irqsave(&mz->lru_lock, flags);
> > - __mem_cgroup_move_lists(pc, lru);
> > + /*
> > + * check against the race with move_account.
> > + */
> > + if (likely(mem == pc->mem_cgroup))
> > + __mem_cgroup_move_lists(pc, lru);
> > spin_unlock_irqrestore(&mz->lru_lock, flags);
> > }
> > unlock_page_cgroup(page);
> > @@ -565,6 +571,71 @@ unsigned long mem_cgroup_isolate_pages(u
> > return nr_taken;
> > }
> >
> > +/**
> > + * mem_cgroup_move_account - move account of the page
> > + * @page ... the target page of being moved.
> > + * @pc ... page_cgroup of the page.
> > + * @from ... mem_cgroup which the page is moved from.
> > + * @to ... mem_cgroup which the page is moved to.
> > + *
> > + * The caller must confirm following.
> > + * 1. lock the page by lock_page().
> > + * 2. disable irq.
> > + * 3. lru_lock of old mem_cgroup should be held.
> > + * 4. pc is guaranteed to be valid and on mem_cgroup's LRU.
> > + *
> > + * Because we cannot call try_to_free_page() here, the caller must guarantee
> > + * this moving of change never fails. Currently this is called only against
> > + * root cgroup, which has no limitation of resource.
> > + * Returns 0 at success, returns 1 at failure.
> > + */
> > +int mem_cgroup_move_account(struct page *page, struct page_cgroup *pc,
> > + struct mem_cgroup *from, struct mem_cgroup *to)
> > +{
> > + struct mem_cgroup_per_zone *from_mz, *to_mz;
> > + int nid, zid;
> > + int ret = 1;
> > +
> > + VM_BUG_ON(!irqs_disabled());
> > + VM_BUG_ON(!PageLocked(page));
> > +
> > + nid = page_to_nid(page);
> > + zid = page_zonenum(page);
> > + from_mz = mem_cgroup_zoneinfo(from, nid, zid);
> > + to_mz = mem_cgroup_zoneinfo(to, nid, zid);
> > +
> > + if (res_counter_charge(&to->res, PAGE_SIZE)) {
> > + /* Now, we assume no_limit...no failure here. */
> > + return ret;
> > + }
> > + if (!try_lock_page_cgroup(page)) {
> > + res_counter_uncharge(&to->res, PAGE_SIZE);
> > + return ret;
> > + }
> > +
> > + if (page_get_page_cgroup(page) != pc) {
> > + res_counter_uncharge(&to->res, PAGE_SIZE);
> > + goto out;
> > + }
> > +
> > + if (spin_trylock(&to_mz->lru_lock)) {
> > + __mem_cgroup_remove_list(from_mz, pc);
> > + css_put(&from->css);
> > + res_counter_uncharge(&from->res, PAGE_SIZE);
> > + pc->mem_cgroup = to;
> > + css_get(&to->css);
> > + __mem_cgroup_add_list(to_mz, pc);
> > + ret = 0;
> > + spin_unlock(&to_mz->lru_lock);
> > + } else {
> > + res_counter_uncharge(&to->res, PAGE_SIZE);
> > + }
> > +out:
> > + unlock_page_cgroup(page);
> > +
> > + return ret;
> > +}
> > +
> > /*
> > * Charge the memory controller for page usage.
> > * Return
> > @@ -752,16 +823,24 @@ __mem_cgroup_uncharge_common(struct page
> > if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
> > && ((PageCgroupCache(pc) || page_mapped(page))))
> > goto unlock;
> > -
> > +retry:
> > + mem = pc->mem_cgroup;
> > mz = page_cgroup_zoneinfo(pc);
> > spin_lock_irqsave(&mz->lru_lock, flags);
> > + if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED &&
> > + unlikely(mem != pc->mem_cgroup)) {
> > + /* MAPPED account can be done without lock_page().
> > + Check race with mem_cgroup_move_account() */
> > + spin_unlock_irqrestore(&mz->lru_lock, flags);
> > + goto retry;
> > + }
> > __mem_cgroup_remove_list(mz, pc);
> > spin_unlock_irqrestore(&mz->lru_lock, flags);
> >
> > page_assign_page_cgroup(page, NULL);
> > unlock_page_cgroup(page);
> >
> > - mem = pc->mem_cgroup;
> > +
> > res_counter_uncharge(&mem->res, PAGE_SIZE);
> > css_put(&mem->css);
> >
> >
>
* Re: [PATCH 9/13] memcg: lookup page cgroup (and remove pointer from struct page)
2008-09-24 3:09 ` KAMEZAWA Hiroyuki
@ 2008-09-24 8:31 ` Balbir Singh
2008-09-24 8:46 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 37+ messages in thread
From: Balbir Singh @ 2008-09-24 8:31 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, nishimura, xemul, LKML
KAMEZAWA Hiroyuki wrote:
> On Wed, 24 Sep 2008 07:39:58 +0530
> "Balbir Singh" <balbir@linux.vnet.ibm.com> wrote:
>>> I'll add FLATMEM/DISCONTIGMEM/SPARSEMEM support directly.
>>> I have already wasted a month on this uninteresting work and want to fix
>>> this soon.
>>>
>> Let's look at the basic requirement: make the memory resource controller
>> not suck on 32-bit systems. I have been thinking about removing
>> page_cgroup from struct page only for 32-bit systems (and using a radix tree);
>> 32-bit systems can have a maximum of 64GB if PAE is enabled, and I suspect
>> a radix tree should work there, while 64-bit systems keep working as is. If
>> performance is an issue, I would recommend the 32-bit folks upgrade to
>> 64 bit :) Can we build consensus around this approach?
>>
> My thinking is below. (assume 64bit)
>
assume 64 bit for the calculations below?
> - Removing the page_cgroup pointer from struct page lets us reduce
> static memory usage at boot by 8 bytes per 4096-byte page when memcg is disabled.
> This reaches 96MB on my 48GB box. I think this is big.
> - Pre-allocation of page_cgroup gives us the following.
> Pros.
> - We no longer have to worry about kmalloc failure or about
> falling into memory reclaim from kmalloc.
> This makes the memory resource controller much simpler and more robust.
> - We know in advance how much kernel memory will be used for
> LRU page management.
> Cons.
> - All page_cgroups are allocated at boot.
> This reaches 480MB on my 48GB box.
>
> But I think we can ignore the "Cons.". If we use up memory, we'll use tons of
> page_cgroups. Considering the memory fragmentation caused by allocating lots of
> very small objects, pre-allocation makes memcg better.
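
Just to make those numbers concrete (back-of-the-envelope, assuming 4KB pages
and sizeof(struct page_cgroup) == 40 bytes on 64-bit):

	48GB / 4KB page size             = 12,582,912 pages
	pointer removed from struct page = 12,582,912 *  8 bytes =  96MB saved
	pre-allocated page_cgroup array  = 12,582,912 * 40 bytes = 480MB used
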
This looks like a good patch. I'll review and test it.
--
Balbir
* Re: [PATCH 9/13] memcg: lookup page cgroup (and remove pointer from struct page)
2008-09-24 8:31 ` Balbir Singh
@ 2008-09-24 8:46 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 37+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-09-24 8:46 UTC (permalink / raw)
To: balbir; +Cc: linux-mm, nishimura, xemul, LKML
On Wed, 24 Sep 2008 01:31:59 -0700
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> KAMEZAWA Hiroyuki wrote:
> > On Wed, 24 Sep 2008 07:39:58 +0530
> > "Balbir Singh" <balbir@linux.vnet.ibm.com> wrote:
> >>> I'll add FLATMEM/DISCONTIGMEM/SPARSEMEM support directly.
> >>> I have already wasted a month on this uninteresting work and want to fix
> >>> this soon.
> >>>
> >> Let's look at the basic requirement: make the memory resource controller
> >> not suck on 32-bit systems. I have been thinking about removing
> >> page_cgroup from struct page only for 32-bit systems (and using a radix tree);
> >> 32-bit systems can have a maximum of 64GB if PAE is enabled, and I suspect
> >> a radix tree should work there, while 64-bit systems keep working as is. If
> >> performance is an issue, I would recommend the 32-bit folks upgrade to
> >> 64 bit :) Can we build consensus around this approach?
> >>
> > My thinking is below. (assume 64bit)
> >
>
> assume 64 bit for the calculations below?
>
yes.
> > - Removing the page_cgroup pointer from struct page lets us reduce
> > static memory usage at boot by 8 bytes per 4096-byte page when memcg is disabled.
> > This reaches 96MB on my 48GB box. I think this is big.
> > - Pre-allocation of page_cgroup gives us the following.
> > Pros.
> > - We no longer have to worry about kmalloc failure or about
> > falling into memory reclaim from kmalloc.
> > This makes the memory resource controller much simpler and more robust.
> > - We know in advance how much kernel memory will be used for
> > LRU page management.
> > Cons.
> > - All page_cgroups are allocated at boot.
> > This reaches 480MB on my 48GB box.
> >
> > But I think we can ignore the "Cons.". If we use up memory, we'll use tons of
> > page_cgroups. Considering the memory fragmentation caused by allocating lots of
> > very small objects, pre-allocation makes memcg better.
>
> This looks like a good patch. I'll review and test it.
>
At least, I should handle the "use vmalloc if kmalloc fails" case.
But there will be no other major updates. I'll rebase the whole series onto the newest
mmotm and post it tomorrow if I can start testing tonight.
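
For the vmalloc fallback, something like this (a rough sketch; the helper
name is made up and the per-node/per-section wiring is omitted):

	/* Sketch: allocate one node's page_cgroup map, falling back to vmalloc. */
	static struct page_cgroup *alloc_page_cgroup_map(int nid, unsigned long nr_pages)
	{
		unsigned long size = nr_pages * sizeof(struct page_cgroup);
		struct page_cgroup *base;

		base = kmalloc_node(size, GFP_KERNEL | __GFP_NOWARN, nid);
		if (!base)
			base = vmalloc_node(size, nid);
		if (base)
			memset(base, 0, size);
		return base;
	}
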
Thanks,
-Kame
end of thread
Thread overview: 37+ messages
2008-09-22 10:51 [PATCH 0/13] memory cgroup updates v4 KAMEZAWA Hiroyuki
2008-09-22 10:55 ` [PATCH 1/13] memcg: avoid accounting special mapping KAMEZAWA Hiroyuki
2008-09-22 10:57 ` [PATCH 2/13] memcg: account fault-in swap under lock KAMEZAWA Hiroyuki
2008-09-22 10:58 ` [PATCH 3/13] memcg: nolimit root cgroup KAMEZAWA Hiroyuki
2008-09-22 11:00 ` [PATCH 4/13] memcg: force_empty moving account KAMEZAWA Hiroyuki
2008-09-22 14:23 ` Peter Zijlstra
2008-09-22 14:50 ` kamezawa.hiroyu
2008-09-22 14:56 ` Peter Zijlstra
2008-09-22 15:06 ` kamezawa.hiroyu
2008-09-22 15:32 ` Peter Zijlstra
2008-09-22 15:43 ` kamezawa.hiroyu
2008-09-22 11:02 ` [PATCH 5/13] memcg: cleanup to make mapping null before unchage KAMEZAWA Hiroyuki
2008-09-22 11:03 ` [PATCH 6/13] memcg: optimze per cpu accounting for memcg KAMEZAWA Hiroyuki
2008-09-22 11:05 ` [PATCH 3.5/13] memcg: make page_cgroup flags to be atomic KAMEZAWA Hiroyuki
2008-09-22 11:09 ` [PATCH 3.6/13] memcg: add function to move account KAMEZAWA Hiroyuki
2008-09-24 6:50 ` Daisuke Nishimura
2008-09-24 7:11 ` KAMEZAWA Hiroyuki
2008-09-22 11:12 ` [PATCH 9/13] memcg: lookup page cgroup (and remove pointer from struct page) KAMEZAWA Hiroyuki
2008-09-22 14:52 ` Dave Hansen
2008-09-22 15:14 ` kamezawa.hiroyu
2008-09-22 15:47 ` Dave Hansen
2008-09-22 15:57 ` kamezawa.hiroyu
2008-09-22 16:10 ` Dave Hansen
2008-09-22 17:34 ` kamezawa.hiroyu
2008-09-22 15:47 ` Peter Zijlstra
2008-09-22 16:04 ` kamezawa.hiroyu
2008-09-22 16:06 ` Peter Zijlstra
2008-09-23 23:48 ` KAMEZAWA Hiroyuki
2008-09-24 2:09 ` Balbir Singh
2008-09-24 3:09 ` KAMEZAWA Hiroyuki
2008-09-24 8:31 ` Balbir Singh
2008-09-24 8:46 ` KAMEZAWA Hiroyuki
2008-09-22 11:13 ` [PATCH 10/13] memcg: page_cgroup look aside table KAMEZAWA Hiroyuki
2008-09-22 11:17 ` [PATCH 11/13] memcg: lazy LRU free (NEW) KAMEZAWA Hiroyuki
2008-09-22 11:22 ` [PATCH 12/13] memcg: lazy LRU add KAMEZAWA Hiroyuki
2008-09-22 11:24 ` [PATCH 13/13] memcg: swap accounting fix KAMEZAWA Hiroyuki
2008-09-22 11:28 ` [PATCH 0/13] memory cgroup updates v4 KAMEZAWA Hiroyuki