* [-mm][PATCH 0/6] memcg: radix tree page_cgroup v3.
@ 2008-04-01 8:28 KAMEZAWA Hiroyuki
2008-04-01 8:30 ` [-mm][PATCH 1/6] memcg: radix-tree lookup for page_cgroup KAMEZAWA Hiroyuki
` (5 more replies)
0 siblings, 6 replies; 7+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-04-01 8:28 UTC (permalink / raw)
To: linux-mm; +Cc: balbir, xemul, kamezawa.hiroyu, yamamoto, menage
Hi, this is v3. It includes some fixes and some experimental changes.
Still on -rc5-mm1. I'm now wondering how to merge and test this set...
Major Changes from previous one:
- fixed typos.
- merged prefetch in page_cgroup.
- added some experimental patches.
Patches 1-3 have already been posted. Patch 4 removes redundant code.
I think patches 1-4 are ready to be tested widely.
Patches 5/6 are experimental but show good numbers.
Here are the performance results.
(The config was changed to use sparsemem_vmemmap.)
==
BYTE UNIX Benchmarks (Version 4.1.0) Run execl. x86_64/SMP/2CPUs.
Higher value is better. All are measured just after boot.
ENVIRON : TEST BASELINE RESULT INDEX
mem_cgroup=off : Execl Throughput 43.0 3150.1 732.6
before this set: Execl Throughput 43.0 2932.6 682.0
after patch 1-4: Execl Throughput 43.0 2899.1 674.2
after patch 1-6: Execl Throughput 43.0 3044.2 708.0
==
Because patches 5/6 are aggressive but attractive,
I'd like to hear review and test comments.
Thanks,
-Kame
--
* [-mm][PATCH 1/6] memcg: radix-tree lookup for page_cgroup.
2008-04-01 8:28 [-mm][PATCH 0/6] memcg: radix tree page_cgroup v3 KAMEZAWA Hiroyuki
@ 2008-04-01 8:30 ` KAMEZAWA Hiroyuki
2008-04-01 8:31 ` [-mm][PATCH 2/6] boost by per_cpu KAMEZAWA Hiroyuki
` (4 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-04-01 8:30 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, xemul, yamamoto, menage
This patch implements a radix-tree based page_cgroup.
This patch does
* add radix-tree based page_cgroup look up subsystem.
* remove bit_spin_lock used by page_cgroup.
Logic changes:
Before patch
* struct page had a pointer to page_cgroup. The relationship between objects
was pfn <-> struct page <-> struct page_cgroup.
* (spin) lock for page_cgroup was in struct page.
* page_cgroup->refcnt is incremented before charge is done.
* page migration does complicated page_cgroup migration under locks.
After patch
* struct page has no pointer to page_cgroup. The relationship between objects
is now struct page <-> pfn -> struct page_cgroup -> struct page.
* page_cgroup has its own spin lock.
* page_cgroup->refcnt is incremented after the charge is done.
* page migration accounts the new page before migration. This lets us
avoid complicated locking.
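To make the new lookup direction concrete, here is a minimal sketch (illustration
only, not part of the patch; it just wraps the get_page_cgroup() helper added below):

/* struct page -> pfn -> radix-tree chunk -> page_cgroup -> struct page */
static struct page_cgroup *lookup_example(struct page *page)
{
	struct page_cgroup *pc = get_page_cgroup(page);	/* may be NULL */

	if (pc)
		VM_BUG_ON(pc->page != page);	/* pc points back to its page */
	return pc;
}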
Tested on ia64/NUMA and x86_64/SMP.
Changelog v2 -> v3:
* changed get_alloc_page_cgroup() to return -EBUSY during boot.
* fixed typos
* added a PageLRU check to force_empty(). (This can be a good guard against
migration.)
Changelog v1 -> v2:
* created a folded patch. Maybe good for bisect.
* removed special handling code for new pages under migration.
Added a PG_LRU check to force_empty.
* reflected comments.
* Added comments in the head of page_cgroup.c
* order of page_cgroup is automatically calculated.
* fixed handling of root_node[] entries in page_cgroup_init().
* rewrite init_page_cgroup_head() to do minimum work.
* fixed N_NORMAL_MEMORY handling.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
include/linux/memcontrol.h | 20 --
include/linux/mm_types.h | 3
include/linux/page_cgroup.h | 55 +++++++
mm/Makefile | 2
mm/memcontrol.c | 331 ++++++++++++++++----------------------------
mm/migrate.c | 22 +-
mm/page_alloc.c | 8 -
mm/page_cgroup.c | 259 ++++++++++++++++++++++++++++++++++
8 files changed, 462 insertions(+), 238 deletions(-)
Index: mm-2.6.25-rc5-mm1-k/include/linux/page_cgroup.h
===================================================================
--- /dev/null
+++ mm-2.6.25-rc5-mm1-k/include/linux/page_cgroup.h
@@ -0,0 +1,55 @@
+#ifndef __LINUX_PAGE_CGROUP_H
+#define __LINUX_PAGE_CGROUP_H
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+/*
+ * page_cgroup is yet another mem_map-like structure for accounting usage,
+ * but unlike mem_map it is allocated on demand for accounted pages.
+ * See also memcontrol.h.
+ * By nature, this can consume a large amount of memory.
+ */
+
+struct mem_cgroup;
+
+struct page_cgroup {
+ spinlock_t lock; /* lock for all members */
+ int refcnt; /* reference count */
+ struct mem_cgroup *mem_cgroup; /* current cgroup subsys */
+ struct list_head lru; /* for per cgroup LRU */
+ int flags; /* See below */
+ struct page *page; /* the page this accounts for*/
+};
+
+/* flags */
+#define PAGE_CGROUP_FLAG_CACHE (0x1) /* charged as cache. */
+#define PAGE_CGROUP_FLAG_ACTIVE (0x2) /* is on active list */
+
+/*
+ * Look up page_cgroup. Returns NULL if it does not exist.
+ */
+extern struct page_cgroup *get_page_cgroup(struct page *page);
+
+
+/*
+ * Look up page_cgroup, allocating a new one if it doesn't exist.
+ * Return value is
+ * 1. a page_cgroup on success.
+ * 2. ERR_PTR(-Exxx) on failure (-EBUSY during boot).
+ */
+extern struct page_cgroup *
+get_alloc_page_cgroup(struct page *page, gfp_t gfpmask);
+
+#else
+
+static inline struct page_cgroup *get_page_cgroup(struct page *page)
+{
+ return NULL;
+}
+
+static inline struct page_cgroup *
+get_alloc_page_cgroup(struct page *page, gfp_t gfpmask)
+{
+ return NULL;
+}
+#endif
+#endif
Index: mm-2.6.25-rc5-mm1-k/mm/memcontrol.c
===================================================================
--- mm-2.6.25-rc5-mm1-k.orig/mm/memcontrol.c
+++ mm-2.6.25-rc5-mm1-k/mm/memcontrol.c
@@ -30,6 +30,7 @@
#include <linux/spinlock.h>
#include <linux/fs.h>
#include <linux/seq_file.h>
+#include <linux/page_cgroup.h>
#include <asm/uaccess.h>
@@ -92,7 +93,7 @@ struct mem_cgroup_per_zone {
/*
* spin_lock to protect the per cgroup LRU
*/
- spinlock_t lru_lock;
+ spinlock_t lru_lock; /* irq should be off. */
struct list_head active_list;
struct list_head inactive_list;
unsigned long count[NR_MEM_CGROUP_ZSTAT];
@@ -139,33 +140,6 @@ struct mem_cgroup {
};
static struct mem_cgroup init_mem_cgroup;
-/*
- * We use the lower bit of the page->page_cgroup pointer as a bit spin
- * lock. We need to ensure that page->page_cgroup is at least two
- * byte aligned (based on comments from Nick Piggin). But since
- * bit_spin_lock doesn't actually set that lock bit in a non-debug
- * uniprocessor kernel, we should avoid setting it here too.
- */
-#define PAGE_CGROUP_LOCK_BIT 0x0
-#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
-#define PAGE_CGROUP_LOCK (1 << PAGE_CGROUP_LOCK_BIT)
-#else
-#define PAGE_CGROUP_LOCK 0x0
-#endif
-
-/*
- * A page_cgroup page is associated with every page descriptor. The
- * page_cgroup helps us identify information about the cgroup
- */
-struct page_cgroup {
- struct list_head lru; /* per cgroup LRU list */
- struct page *page;
- struct mem_cgroup *mem_cgroup;
- int ref_cnt; /* cached, mapped, migrating */
- int flags;
-};
-#define PAGE_CGROUP_FLAG_CACHE (0x1) /* charged as cache */
-#define PAGE_CGROUP_FLAG_ACTIVE (0x2) /* page is active in this cgroup */
static int page_cgroup_nid(struct page_cgroup *pc)
{
@@ -256,37 +230,6 @@ void mm_free_cgroup(struct mm_struct *mm
css_put(&mm->mem_cgroup->css);
}
-static inline int page_cgroup_locked(struct page *page)
-{
- return bit_spin_is_locked(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
-static void page_assign_page_cgroup(struct page *page, struct page_cgroup *pc)
-{
- VM_BUG_ON(!page_cgroup_locked(page));
- page->page_cgroup = ((unsigned long)pc | PAGE_CGROUP_LOCK);
-}
-
-struct page_cgroup *page_get_page_cgroup(struct page *page)
-{
- return (struct page_cgroup *) (page->page_cgroup & ~PAGE_CGROUP_LOCK);
-}
-
-static void lock_page_cgroup(struct page *page)
-{
- bit_spin_lock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
-static int try_lock_page_cgroup(struct page *page)
-{
- return bit_spin_trylock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
-static void unlock_page_cgroup(struct page *page)
-{
- bit_spin_unlock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
static void __mem_cgroup_remove_list(struct page_cgroup *pc)
{
int from = pc->flags & PAGE_CGROUP_FLAG_ACTIVE;
@@ -356,6 +299,10 @@ void mem_cgroup_move_lists(struct page *
struct mem_cgroup_per_zone *mz;
unsigned long flags;
+ /* lookup only; this never allocates */
+ pc = get_page_cgroup(page);
+ if (!pc)
+ return;
/*
* We cannot lock_page_cgroup while holding zone's lru_lock,
* because other holders of lock_page_cgroup can be interrupted
@@ -363,17 +310,15 @@ void mem_cgroup_move_lists(struct page *
* safely get to page_cgroup without it, so just try_lock it:
* mem_cgroup_isolate_pages allows for page left on wrong list.
*/
- if (!try_lock_page_cgroup(page))
+ if (!spin_trylock_irqsave(&pc->lock, flags))
return;
-
- pc = page_get_page_cgroup(page);
- if (pc) {
+ if (pc->refcnt) {
mz = page_cgroup_zoneinfo(pc);
- spin_lock_irqsave(&mz->lru_lock, flags);
+ spin_lock(&mz->lru_lock);
__mem_cgroup_move_lists(pc, active);
- spin_unlock_irqrestore(&mz->lru_lock, flags);
+ spin_unlock(&mz->lru_lock);
}
- unlock_page_cgroup(page);
+ spin_unlock_irqrestore(&pc->lock, flags);
}
/*
@@ -525,7 +470,8 @@ unsigned long mem_cgroup_isolate_pages(u
* < 0 if the cgroup is over its limit
*/
static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
- gfp_t gfp_mask, enum charge_type ctype)
+ gfp_t gfp_mask, enum charge_type ctype,
+ struct mem_cgroup *memcg)
{
struct mem_cgroup *mem;
struct page_cgroup *pc;
@@ -536,33 +482,26 @@ static int mem_cgroup_charge_common(stru
if (mem_cgroup_subsys.disabled)
return 0;
+ pc = get_alloc_page_cgroup(page, gfp_mask);
+ /* Before kmalloc initialization, get_page_cgroup can return EBUSY */
+ if (unlikely(IS_ERR(pc))) {
+ if (PTR_ERR(pc) == -EBUSY)
+ return NULL;
+ return PTR_ERR(pc);
+ }
+
+ spin_lock_irqsave(&pc->lock, flags);
/*
- * Should page_cgroup's go to their own slab?
- * One could optimize the performance of the charging routine
- * by saving a bit in the page_flags and using it as a lock
- * to see if the cgroup page already has a page_cgroup associated
- * with it
- */
-retry:
- lock_page_cgroup(page);
- pc = page_get_page_cgroup(page);
- /*
- * The page_cgroup exists and
- * the page has already been accounted.
+ * Has the page already been accounted ?
*/
- if (pc) {
- VM_BUG_ON(pc->page != page);
- VM_BUG_ON(pc->ref_cnt <= 0);
-
- pc->ref_cnt++;
- unlock_page_cgroup(page);
- goto done;
+ if (pc->refcnt > 0) {
+ pc->refcnt++;
+ spin_unlock_irqrestore(&pc->lock, flags);
+ goto success;
}
- unlock_page_cgroup(page);
+ spin_unlock_irqrestore(&pc->lock, flags);
- pc = kzalloc(sizeof(struct page_cgroup), gfp_mask);
- if (pc == NULL)
- goto err;
+ /* Note: *new* pc's refcnt is still 0 here. */
/*
* We always charge the cgroup the mm_struct belongs to.
@@ -570,20 +509,24 @@ retry:
* thread group leader migrates. It's possible that mm is not
* set, if so charge the init_mm (happens for pagecache usage).
*/
- if (!mm)
- mm = &init_mm;
-
- rcu_read_lock();
- mem = rcu_dereference(mm->mem_cgroup);
- /*
- * For every charge from the cgroup, increment reference count
- */
- css_get(&mem->css);
- rcu_read_unlock();
+ if (memcg) {
+ mem = memcg;
+ css_get(&mem->css);
+ } else {
+ if (!mm)
+ mm = &init_mm;
+ rcu_read_lock();
+ mem = rcu_dereference(mm->mem_cgroup);
+ /*
+ * For every charge from the cgroup, increment reference count
+ */
+ css_get(&mem->css);
+ rcu_read_unlock();
+ }
while (res_counter_charge(&mem->res, PAGE_SIZE)) {
if (!(gfp_mask & __GFP_WAIT))
- goto out;
+ goto nomem;
if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
continue;
@@ -600,52 +543,51 @@ retry:
if (!nr_retries--) {
mem_cgroup_out_of_memory(mem, gfp_mask);
- goto out;
+ goto nomem;
}
congestion_wait(WRITE, HZ/10);
}
-
- pc->ref_cnt = 1;
- pc->mem_cgroup = mem;
- pc->page = page;
- pc->flags = PAGE_CGROUP_FLAG_ACTIVE;
- if (ctype == MEM_CGROUP_CHARGE_TYPE_CACHE)
- pc->flags |= PAGE_CGROUP_FLAG_CACHE;
-
- lock_page_cgroup(page);
- if (page_get_page_cgroup(page)) {
- unlock_page_cgroup(page);
- /*
- * Another charge has been added to this page already.
- * We take lock_page_cgroup(page) again and read
- * page->cgroup, increment refcnt.... just retry is OK.
- */
+ /*
+ * We have to acquire 2 spinlocks.
+ */
+ spin_lock_irqsave(&pc->lock, flags);
+ /* Is anyone charged ? */
+ if (unlikely(pc->refcnt)) {
+ /* Someone charged this page while we released the lock */
+ pc->refcnt++;
+ spin_unlock_irqrestore(&pc->lock, flags);
res_counter_uncharge(&mem->res, PAGE_SIZE);
css_put(&mem->css);
- kfree(pc);
- goto retry;
+ goto success;
}
- page_assign_page_cgroup(page, pc);
+ /* No one else touches this pc yet. */
+ VM_BUG_ON(pc->mem_cgroup);
+
+ if (ctype == MEM_CGROUP_CHARGE_TYPE_CACHE)
+ pc->flags = PAGE_CGROUP_FLAG_ACTIVE | PAGE_CGROUP_FLAG_CACHE;
+ else
+ pc->flags = PAGE_CGROUP_FLAG_ACTIVE;
+ pc->refcnt = 1;
+ pc->mem_cgroup = mem;
mz = page_cgroup_zoneinfo(pc);
- spin_lock_irqsave(&mz->lru_lock, flags);
+
+ spin_lock(&mz->lru_lock);
__mem_cgroup_add_list(pc);
- spin_unlock_irqrestore(&mz->lru_lock, flags);
+ spin_unlock(&mz->lru_lock);
+ spin_unlock_irqrestore(&pc->lock, flags);
- unlock_page_cgroup(page);
-done:
+success:
return 0;
-out:
+nomem:
css_put(&mem->css);
- kfree(pc);
-err:
return -ENOMEM;
}
int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
{
return mem_cgroup_charge_common(page, mm, gfp_mask,
- MEM_CGROUP_CHARGE_TYPE_MAPPED);
+ MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
}
int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
@@ -654,7 +596,7 @@ int mem_cgroup_cache_charge(struct page
if (!mm)
mm = &init_mm;
return mem_cgroup_charge_common(page, mm, gfp_mask,
- MEM_CGROUP_CHARGE_TYPE_CACHE);
+ MEM_CGROUP_CHARGE_TYPE_CACHE, NULL);
}
/*
@@ -664,105 +606,83 @@ int mem_cgroup_cache_charge(struct page
void mem_cgroup_uncharge_page(struct page *page)
{
struct page_cgroup *pc;
- struct mem_cgroup *mem;
- struct mem_cgroup_per_zone *mz;
- unsigned long flags;
if (mem_cgroup_subsys.disabled)
return;
-
/*
* Check if our page_cgroup is valid
*/
- lock_page_cgroup(page);
- pc = page_get_page_cgroup(page);
- if (!pc)
- goto unlock;
-
- VM_BUG_ON(pc->page != page);
- VM_BUG_ON(pc->ref_cnt <= 0);
+ pc = get_page_cgroup(page);
+ if (likely(pc)) {
+ unsigned long flags;
+ struct mem_cgroup *mem;
+ struct mem_cgroup_per_zone *mz;
- if (--(pc->ref_cnt) == 0) {
+ spin_lock_irqsave(&pc->lock, flags);
+ if (!pc->refcnt || --pc->refcnt > 0) {
+ spin_unlock_irqrestore(&pc->lock, flags);
+ return;
+ }
+ mem = pc->mem_cgroup;
mz = page_cgroup_zoneinfo(pc);
- spin_lock_irqsave(&mz->lru_lock, flags);
+ spin_lock(&mz->lru_lock);
__mem_cgroup_remove_list(pc);
- spin_unlock_irqrestore(&mz->lru_lock, flags);
-
- page_assign_page_cgroup(page, NULL);
- unlock_page_cgroup(page);
+ spin_unlock(&mz->lru_lock);
+ pc->flags = 0;
+ pc->mem_cgroup = 0;
+ spin_unlock_irqrestore(&pc->lock, flags);
- mem = pc->mem_cgroup;
res_counter_uncharge(&mem->res, PAGE_SIZE);
css_put(&mem->css);
-
- kfree(pc);
- return;
}
-
-unlock:
- unlock_page_cgroup(page);
}
/*
- * Returns non-zero if a page (under migration) has valid page_cgroup member.
- * Refcnt of page_cgroup is incremented.
+ * Pre-charge against newpage while moving a page.
+ * This function is called before taking page locks.
*/
-int mem_cgroup_prepare_migration(struct page *page)
+int mem_cgroup_prepare_migration(struct page *page, struct page *newpage)
{
struct page_cgroup *pc;
+ struct mem_cgroup *mem = NULL;
+ int ret = 0;
+ enum charge_type type = MEM_CGROUP_CHARGE_TYPE_MAPPED;
+ unsigned long flags;
if (mem_cgroup_subsys.disabled)
- return 0;
+ return ret;
- lock_page_cgroup(page);
- pc = page_get_page_cgroup(page);
- if (pc)
- pc->ref_cnt++;
- unlock_page_cgroup(page);
- return pc != NULL;
-}
+ pc = get_page_cgroup(page);
-void mem_cgroup_end_migration(struct page *page)
-{
- mem_cgroup_uncharge_page(page);
+ if (pc) {
+ spin_lock_irqsave(&pc->lock, flags);
+ if (pc->refcnt) {
+ mem = pc->mem_cgroup;
+ css_get(&mem->css);
+ if (pc->flags & PAGE_CGROUP_FLAG_CACHE)
+ type = MEM_CGROUP_CHARGE_TYPE_CACHE;
+ else
+ type = MEM_CGROUP_CHARGE_TYPE_MAPPED;
+ }
+ spin_unlock_irqrestore(&pc->lock, flags);
+ if (mem) {
+ ret = mem_cgroup_charge_common(newpage, NULL,
+ GFP_KERNEL, type, mem);
+ css_put(&mem->css);
+ }
+ }
+ return ret;
}
-
/*
- * We know both *page* and *newpage* are now not-on-LRU and PG_locked.
- * And no race with uncharge() routines because page_cgroup for *page*
- * has extra one reference by mem_cgroup_prepare_migration.
+ * At the end of migration, we'll push newpage to LRU and
+ * drop the refcnt which was added at prepare_migration.
*/
-void mem_cgroup_page_migration(struct page *page, struct page *newpage)
+void mem_cgroup_end_migration(struct page *newpage)
{
- struct page_cgroup *pc;
- struct mem_cgroup_per_zone *mz;
- unsigned long flags;
-
- lock_page_cgroup(page);
- pc = page_get_page_cgroup(page);
- if (!pc) {
- unlock_page_cgroup(page);
+ if (mem_cgroup_subsys.disabled)
return;
- }
- mz = page_cgroup_zoneinfo(pc);
- spin_lock_irqsave(&mz->lru_lock, flags);
- __mem_cgroup_remove_list(pc);
- spin_unlock_irqrestore(&mz->lru_lock, flags);
-
- page_assign_page_cgroup(page, NULL);
- unlock_page_cgroup(page);
-
- pc->page = newpage;
- lock_page_cgroup(newpage);
- page_assign_page_cgroup(newpage, pc);
-
- mz = page_cgroup_zoneinfo(pc);
- spin_lock_irqsave(&mz->lru_lock, flags);
- __mem_cgroup_add_list(pc);
- spin_unlock_irqrestore(&mz->lru_lock, flags);
-
- unlock_page_cgroup(newpage);
+ mem_cgroup_uncharge_page(newpage);
}
/*
@@ -790,10 +710,13 @@ static void mem_cgroup_force_empty_list(
while (!list_empty(list)) {
pc = list_entry(list->prev, struct page_cgroup, lru);
page = pc->page;
- get_page(page);
- spin_unlock_irqrestore(&mz->lru_lock, flags);
- mem_cgroup_uncharge_page(page);
- put_page(page);
+ if (PageLRU(page)) {
+ get_page(page);
+ spin_unlock_irqrestore(&mz->lru_lock, flags);
+ mem_cgroup_uncharge_page(page);
+ put_page(page);
+ } else
+ count = 0;
if (--count <= 0) {
count = FORCE_UNCHARGE_BATCH;
cond_resched();
Index: mm-2.6.25-rc5-mm1-k/include/linux/memcontrol.h
===================================================================
--- mm-2.6.25-rc5-mm1-k.orig/include/linux/memcontrol.h
+++ mm-2.6.25-rc5-mm1-k/include/linux/memcontrol.h
@@ -19,6 +19,7 @@
#ifndef _LINUX_MEMCONTROL_H
#define _LINUX_MEMCONTROL_H
+#include <linux/page_cgroup.h>
struct mem_cgroup;
struct page_cgroup;
@@ -30,9 +31,6 @@ struct mm_struct;
extern void mm_init_cgroup(struct mm_struct *mm, struct task_struct *p);
extern void mm_free_cgroup(struct mm_struct *mm);
-#define page_reset_bad_cgroup(page) ((page)->page_cgroup = 0)
-
-extern struct page_cgroup *page_get_page_cgroup(struct page *page);
extern int mem_cgroup_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
@@ -51,9 +49,8 @@ int task_in_mem_cgroup(struct task_struc
#define mm_match_cgroup(mm, cgroup) \
((cgroup) == rcu_dereference((mm)->mem_cgroup))
-extern int mem_cgroup_prepare_migration(struct page *page);
-extern void mem_cgroup_end_migration(struct page *page);
-extern void mem_cgroup_page_migration(struct page *page, struct page *newpage);
+extern int mem_cgroup_prepare_migration(struct page *, struct page *);
+extern void mem_cgroup_end_migration(struct page *);
/*
* For memory reclaim.
@@ -82,14 +79,6 @@ static inline void mm_free_cgroup(struct
{
}
-static inline void page_reset_bad_cgroup(struct page *page)
-{
-}
-
-static inline struct page_cgroup *page_get_page_cgroup(struct page *page)
-{
- return NULL;
-}
static inline int mem_cgroup_charge(struct page *page,
struct mm_struct *mm, gfp_t gfp_mask)
@@ -122,7 +111,8 @@ static inline int task_in_mem_cgroup(str
return 1;
}
-static inline int mem_cgroup_prepare_migration(struct page *page)
+static inline int
+mem_cgroup_prepare_migration(struct page *page , struct page *newpage)
{
return 0;
}
Index: mm-2.6.25-rc5-mm1-k/mm/page_alloc.c
===================================================================
--- mm-2.6.25-rc5-mm1-k.orig/mm/page_alloc.c
+++ mm-2.6.25-rc5-mm1-k/mm/page_alloc.c
@@ -222,17 +222,11 @@ static inline int bad_range(struct zone
static void bad_page(struct page *page)
{
- void *pc = page_get_page_cgroup(page);
-
printk(KERN_EMERG "Bad page state in process '%s'\n" KERN_EMERG
"page:%p flags:0x%0*lx mapping:%p mapcount:%d count:%d\n",
current->comm, page, (int)(2*sizeof(unsigned long)),
(unsigned long)page->flags, page->mapping,
page_mapcount(page), page_count(page));
- if (pc) {
- printk(KERN_EMERG "cgroup:%p\n", pc);
- page_reset_bad_cgroup(page);
- }
printk(KERN_EMERG "Trying to fix it up, but a reboot is needed\n"
KERN_EMERG "Backtrace:\n");
dump_stack();
@@ -478,7 +472,6 @@ static inline int free_pages_check(struc
{
if (unlikely(page_mapcount(page) |
(page->mapping != NULL) |
- (page_get_page_cgroup(page) != NULL) |
(page_count(page) != 0) |
(page->flags & (
1 << PG_lru |
@@ -628,7 +621,6 @@ static int prep_new_page(struct page *pa
{
if (unlikely(page_mapcount(page) |
(page->mapping != NULL) |
- (page_get_page_cgroup(page) != NULL) |
(page_count(page) != 0) |
(page->flags & (
1 << PG_lru |
Index: mm-2.6.25-rc5-mm1-k/include/linux/mm_types.h
===================================================================
--- mm-2.6.25-rc5-mm1-k.orig/include/linux/mm_types.h
+++ mm-2.6.25-rc5-mm1-k/include/linux/mm_types.h
@@ -88,9 +88,6 @@ struct page {
void *virtual; /* Kernel virtual address (NULL if
not kmapped, ie. highmem) */
#endif /* WANT_PAGE_VIRTUAL */
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
- unsigned long page_cgroup;
-#endif
#ifdef CONFIG_PAGE_OWNER
int order;
unsigned int gfp_mask;
Index: mm-2.6.25-rc5-mm1-k/mm/migrate.c
===================================================================
--- mm-2.6.25-rc5-mm1-k.orig/mm/migrate.c
+++ mm-2.6.25-rc5-mm1-k/mm/migrate.c
@@ -358,6 +358,12 @@ static int migrate_page_move_mapping(str
write_unlock_irq(&mapping->tree_lock);
+ /* newpage has already been charged to a valid cgroup by
+ mem_cgroup_prepare_migration(), so current->mm and GFP_ATOMIC
+ will not actually be used here. */
+ mem_cgroup_uncharge_page(page);
+ mem_cgroup_cache_charge(newpage, current->mm, GFP_ATOMIC);
+
return 0;
}
@@ -603,7 +609,6 @@ static int move_to_new_page(struct page
rc = fallback_migrate_page(mapping, newpage, page);
if (!rc) {
- mem_cgroup_page_migration(page, newpage);
remove_migration_ptes(page, newpage);
} else
newpage->mapping = NULL;
@@ -633,6 +638,12 @@ static int unmap_and_move(new_page_t get
/* page was freed from under us. So we are done. */
goto move_newpage;
+ charge = mem_cgroup_prepare_migration(page, newpage);
+ if (charge == -ENOMEM) {
+ rc = -ENOMEM;
+ goto move_newpage;
+ }
+
rc = -EAGAIN;
if (TestSetPageLocked(page)) {
if (!force)
@@ -684,19 +695,14 @@ static int unmap_and_move(new_page_t get
goto rcu_unlock;
}
- charge = mem_cgroup_prepare_migration(page);
/* Establish migration ptes or remove ptes */
try_to_unmap(page, 1);
if (!page_mapped(page))
rc = move_to_new_page(newpage, page);
- if (rc) {
+ if (rc)
remove_migration_ptes(page, page);
- if (charge)
- mem_cgroup_end_migration(page);
- } else if (charge)
- mem_cgroup_end_migration(newpage);
rcu_unlock:
if (rcu_locked)
rcu_read_unlock();
@@ -717,6 +723,8 @@ unlock:
}
move_newpage:
+ if (!charge)
+ mem_cgroup_end_migration(newpage);
/*
* Move the new page to the LRU. If migration was not successful
* then this will free the page.
Index: mm-2.6.25-rc5-mm1-k/mm/page_cgroup.c
===================================================================
--- /dev/null
+++ mm-2.6.25-rc5-mm1-k/mm/page_cgroup.c
@@ -0,0 +1,259 @@
+/*
+ * per-page accounting subsystem infrastructure. - linux/mm/page_cgroup.c
+ *
+ * (C) 2008 FUJITSU, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * page_cgroup is yet another mem_map under the memory resource controller.
+ * It contains information which cannot be stored in the usual mem_map.
+ * This allows us to keep 'struct page' small when a user doesn't activate
+ * memory resource controller.
+ *
+ * We can translate : struct page <-> pfn -> page_cgroup -> struct page.
+ *
+ */
+
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/radix-tree.h>
+#include <linux/memcontrol.h>
+#include <linux/err.h>
+
+static int page_cgroup_order __read_mostly;
+static int page_cgroup_head_size __read_mostly;
+
+#define PCGRP_SHIFT (page_cgroup_order)
+#define PCGRP_SIZE (1 << PCGRP_SHIFT)
+#define PCGRP_MASK (PCGRP_SIZE - 1)
+
+struct page_cgroup_head {
+ struct page_cgroup pc[0];
+};
+
+struct page_cgroup_root {
+ spinlock_t tree_lock;
+ struct radix_tree_root root_node;
+};
+
+/*
+ * Calculate page_cgroup order so that one chunk of page_cgroups fits in two pages.
+ */
+static void calc_page_cgroup_order(void)
+{
+ int order = pageblock_order;
+ unsigned long size = sizeof(struct page_cgroup) << order;
+
+ while (size > PAGE_SIZE * 2) {
+ order -= 1;
+ size = sizeof(struct page_cgroup) << order;
+ }
+
+ page_cgroup_order = order;
+ page_cgroup_head_size = sizeof(struct page_cgroup_head) +
+ (sizeof(struct page_cgroup) << order);
+}
+
+static struct page_cgroup_root __initdata *tmp_root_dir[MAX_NUMNODES];
+static struct page_cgroup_root *root_node[MAX_NUMNODES] __read_mostly;
+
+static void
+init_page_cgroup_head(struct page_cgroup_head *head, unsigned long pfn)
+{
+ struct page *page;
+ struct page_cgroup *pc;
+ int i;
+
+ for (i = 0, page = pfn_to_page(pfn), pc = &head->pc[0];
+ i < PCGRP_SIZE; i++, page++, pc++) {
+ pc->refcnt = 0;
+ pc->page = page;
+ spin_lock_init(&pc->lock);
+ }
+}
+
+
+struct kmem_cache *page_cgroup_cachep;
+
+static struct page_cgroup_head *
+alloc_page_cgroup_head(unsigned long pfn, int nid, gfp_t mask)
+{
+ struct page_cgroup_head *head;
+
+ if (!node_state(nid, N_NORMAL_MEMORY))
+ nid = -1;
+ head = kmem_cache_alloc_node(page_cgroup_cachep, mask, nid);
+ if (head)
+ init_page_cgroup_head(head, pfn);
+
+ return head;
+}
+
+void free_page_cgroup(struct page_cgroup_head *head)
+{
+ kmem_cache_free(page_cgroup_cachep, head);
+}
+
+static struct page_cgroup_root *pcgroup_get_root(struct page *page)
+{
+ int nid;
+
+ VM_BUG_ON(!page);
+
+ nid = page_to_nid(page);
+
+ return root_node[nid];
+}
+
+/**
+ * get_page_cgroup - look up a page_cgroup for a page
+ * @page: the page whose page_cgroup is looked up.
+ *
+ * This just does lookup.
+ */
+struct page_cgroup *get_page_cgroup(struct page *page)
+{
+ struct page_cgroup_head *head;
+ struct page_cgroup_root *root;
+ struct page_cgroup *ret = NULL;
+ unsigned long pfn, idx;
+
+ /*
+ * NULL can be returned before initialization
+ */
+ root = pcgroup_get_root(page);
+ if (unlikely(!root))
+ return ret;
+
+ pfn = page_to_pfn(page);
+ idx = pfn >> PCGRP_SHIFT;
+ /*
+ * We don't need lock here because no one deletes this head.
+ * (Freeing routine will be added later.)
+ */
+ rcu_read_lock();
+ head = radix_tree_lookup(&root->root_node, idx);
+ rcu_read_unlock();
+
+ if (likely(head))
+ ret = &head->pc[pfn & PCGRP_MASK];
+
+ return ret;
+}
+
+/**
+ * get_alloc_page_cgroup - look up or allocate a page_cgroup for a page
+ * @page: the page whose page_cgroup is looked up.
+ * @gfpmask: the gfpmask which will be used for page allocation.
+ *
+ * look up and allocate if not found.
+ */
+
+struct page_cgroup *
+get_alloc_page_cgroup(struct page *page, gfp_t gfpmask)
+{
+ struct page_cgroup_root *root;
+ struct page_cgroup_head *head;
+ struct page_cgroup *pc;
+ unsigned long pfn, idx;
+ int nid;
+ unsigned long base_pfn, flags;
+ int error = 0;
+
+ might_sleep_if(gfpmask & __GFP_WAIT);
+
+retry:
+ pc = get_page_cgroup(page);
+ if (pc)
+ return pc;
+ /*
+ * NULL can be returned before initialization.
+ */
+ root = pcgroup_get_root(page);
+ if (unlikely(!root))
+ return ERR_PTR(-EBUSY);
+
+ pfn = page_to_pfn(page);
+ idx = pfn >> PCGRP_SHIFT;
+ nid = page_to_nid(page);
+ base_pfn = idx << PCGRP_SHIFT;
+
+ gfpmask = gfpmask & ~(__GFP_HIGHMEM | __GFP_MOVABLE);
+
+ head = alloc_page_cgroup_head(base_pfn, nid, gfpmask);
+ if (!head)
+ return ERR_PTR(-ENOMEM);
+
+ pc = &head->pc[pfn & PCGRP_MASK];
+
+ error = radix_tree_preload(gfpmask);
+ if (error)
+ goto out;
+ spin_lock_irqsave(&root->tree_lock, flags);
+ error = radix_tree_insert(&root->root_node, idx, head);
+ spin_unlock_irqrestore(&root->tree_lock, flags);
+ radix_tree_preload_end();
+out:
+ if (error) {
+ free_page_cgroup(head);
+ if (error == -EEXIST)
+ goto retry;
+ pc = ERR_PTR(error);
+ }
+ return pc;
+}
+
+static int __init page_cgroup_init(void)
+{
+ int tmp, nid;
+ struct page_cgroup_root *root;
+
+ calc_page_cgroup_order();
+
+ page_cgroup_cachep = kmem_cache_create("page_cgroup",
+ page_cgroup_head_size, 0,
+ SLAB_PANIC | SLAB_DESTROY_BY_RCU, NULL);
+
+ if (!page_cgroup_cachep) {
+ printk(KERN_ERR "page accounting setup failure\n");
+ return -ENOMEM;
+ }
+
+ for_each_node(nid) {
+ tmp = nid;
+ if (!node_state(nid, N_NORMAL_MEMORY))
+ tmp = -1;
+ root = kmalloc_node(sizeof(struct page_cgroup_root),
+ GFP_KERNEL, tmp);
+ if (!root)
+ goto unroll;
+ INIT_RADIX_TREE(&root->root_node, GFP_ATOMIC);
+ spin_lock_init(&root->tree_lock);
+ tmp_root_dir[nid] = root;
+ }
+ /*
+ * By filling root_node[], the tree becomes visible.
+ * Because we have to finish initializing the tree before
+ * we make it visible, a memory barrier is necessary.
+ */
+ smp_wmb();
+ for_each_node(nid)
+ root_node[nid] = tmp_root_dir[nid];
+
+ printk(KERN_INFO "Page Accounting is activated\n");
+ return 0;
+unroll:
+ for_each_node(nid)
+ kfree(tmp_root_dir[nid]);
+
+ return -ENOMEM;
+}
+late_initcall(page_cgroup_init);
Index: mm-2.6.25-rc5-mm1-k/mm/Makefile
===================================================================
--- mm-2.6.25-rc5-mm1-k.orig/mm/Makefile
+++ mm-2.6.25-rc5-mm1-k/mm/Makefile
@@ -32,5 +32,5 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_SMP) += allocpercpu.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
-obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
+obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
--
* [-mm][PATCH 2/6] boost by per_cpu
2008-04-01 8:28 [-mm][PATCH 0/6] memcg: radix tree page_cgroup v3 KAMEZAWA Hiroyuki
2008-04-01 8:30 ` [-mm][PATCH 1/6] memcg: radix-tree lookup for page_cgroup KAMEZAWA Hiroyuki
@ 2008-04-01 8:31 ` KAMEZAWA Hiroyuki
2008-04-01 8:32 ` [-mm][PATCH 3/6] try_to_shrink page cgroup KAMEZAWA Hiroyuki
` (3 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-04-01 8:31 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, xemul, yamamoto, menage
This patch adds a per-CPU lookup cache for get_page_cgroup().
It works well when nearby pages are accessed consecutively.
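Roughly, a cache hit looks like this (condensed from pcp_lookup() below; not the
exact code):

	idx  = pfn >> PCGRP_SHIFT;		/* which page_cgroup chunk */
	hnum = idx & PCGRP_CACHE_MASK;		/* slot in this CPU's cache */
	if (pcp->ents[hnum].idx == idx && pcp->ents[hnum].base)
		pc = pcp->ents[hnum].base + pfn;	/* base is pre-biased, so +pfn lands on the right entry */

On a hit the radix-tree walk is skipped entirely; on a miss, cache_result()
remembers the chunk for the next lookup.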
Changelog v2 -> v3:
* added prefetch.
Changelog v1 -> v2:
* avoid inlining by adding function to page_cgroup.h
* set to be cacheline-aligned.
* added hashfunc() macro.
* changed what should be remembered in cache.
This version remembers the base address, not the "head".
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
mm/page_cgroup.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 52 insertions(+), 3 deletions(-)
Index: mm-2.6.25-rc5-mm1-k/mm/page_cgroup.c
===================================================================
--- mm-2.6.25-rc5-mm1-k.orig/mm/page_cgroup.c
+++ mm-2.6.25-rc5-mm1-k/mm/page_cgroup.c
@@ -26,6 +26,7 @@
#include <linux/slab.h>
#include <linux/radix-tree.h>
#include <linux/memcontrol.h>
+#include <linux/interrupt.h>
#include <linux/err.h>
static int page_cgroup_order __read_mostly;
@@ -80,6 +81,49 @@ init_page_cgroup_head(struct page_cgroup
}
}
+#define PCGRP_CACHE_SIZE (0x10)
+#define PCGRP_CACHE_MASK (PCGRP_CACHE_SIZE - 1)
+struct page_cgroup_cache {
+ struct {
+ unsigned long idx;
+ struct page_cgroup *base;
+ } ents[PCGRP_CACHE_SIZE];
+};
+DEFINE_PER_CPU(struct page_cgroup_cache, pcpu_pcgroup_cache) ____cacheline_aligned;
+
+#define hashfunc(idx) ((idx) & PCGRP_CACHE_MASK)
+
+static struct page_cgroup *pcp_lookup(unsigned long pfn, unsigned long idx)
+{
+ struct page_cgroup *ret = NULL;
+ struct page_cgroup_cache *pcp;
+ int hnum = hashfunc(idx);
+
+ pcp = &get_cpu_var(pcpu_pcgroup_cache);
+ if (pcp->ents[hnum].idx == idx && pcp->ents[hnum].base) {
+ ret = pcp->ents[hnum].base + pfn;
+ prefetchw(ret);
+ }
+ put_cpu_var(pcpu_pcgroup_cache);
+ return ret;
+}
+
+static void cache_result(unsigned long idx, struct page_cgroup_head *head)
+{
+ struct page_cgroup_cache *pcp;
+ int hnum = hashfunc(idx);
+
+ /*
+ * Because lookups are done under preempt_disable(), don't modify
+ * an entry from interrupt context.
+ */
+ if (in_interrupt())
+ return;
+ pcp = &get_cpu_var(pcpu_pcgroup_cache);
+ pcp->ents[hnum].idx = idx;
+ pcp->ents[hnum].base = &head->pc[0] - (idx << PCGRP_SHIFT);
+ put_cpu_var(pcpu_pcgroup_cache);
+}
struct kmem_cache *page_cgroup_cachep;
@@ -126,6 +170,11 @@ struct page_cgroup *get_page_cgroup(stru
struct page_cgroup *ret = NULL;
unsigned long pfn, idx;
+ pfn = page_to_pfn(page);
+ idx = pfn >> PCGRP_SHIFT;
+ ret = pcp_lookup(pfn, idx);
+ if (ret)
+ return ret;
/*
* NULL can be returned before initialization
*/
@@ -133,8 +182,6 @@ struct page_cgroup *get_page_cgroup(stru
if (unlikely(!root))
return ret;
- pfn = page_to_pfn(page);
- idx = pfn >> PCGRP_SHIFT;
/*
* We don't need lock here because no one deletes this head.
* (Freeing routine will be added later.)
@@ -143,8 +190,10 @@ struct page_cgroup *get_page_cgroup(stru
head = radix_tree_lookup(&root->root_node, idx);
rcu_read_unlock();
- if (likely(head))
+ if (likely(head)) {
+ cache_result(idx, head);
ret = &head->pc[pfn & PCGRP_MASK];
+ }
return ret;
}
--
* [-mm][PATCH 3/6] try_to_shrink page cgroup
2008-04-01 8:28 [-mm][PATCH 0/6] memcg: radix tree page_cgroup v3 KAMEZAWA Hiroyuki
2008-04-01 8:30 ` [-mm][PATCH 1/6] memcg: radix-tree lookup for page_cgroup KAMEZAWA Hiroyuki
2008-04-01 8:31 ` [-mm][PATCH 2/6] boost by per_cpu KAMEZAWA Hiroyuki
@ 2008-04-01 8:32 ` KAMEZAWA Hiroyuki
2008-04-01 8:33 ` [-mm][PATCH 4/6] remove unnecessary page_cgroup_zoneinfo KAMEZAWA Hiroyuki
` (2 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-04-01 8:32 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, xemul, yamamoto, menage
This patch frees page_cgroup chunks when a contiguous chunk of pages is freed.
How this works:
* When the order of a freed page block reaches pcgroup_shrink_order, the
corresponding page_cgroup chunks are freed. Freeing is done via RCU.
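As a rough worked example (my numbers; assuming sizeof(struct page_cgroup) is
about 48 bytes on x86_64 with 4KB pages): calc_page_cgroup_order() settles on
order 7, because 48 << 7 = 6144 still fits in 2 * PAGE_SIZE while 48 << 8 does
not, so one chunk covers 128 pages. pcgroup_shrink_order then becomes 8, so only
frees of an order-8 (1MB) or larger buddy block ever reach __shrink_page_cgroup().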
Changelog v2 -> v3:
* removed redundant printk().
Changelog v1 -> v2:
* shrink_order is automatically calculated.
* added comments.
* added a cpu mask to page_cgroup_head. (This will help at clearing.)
* moved the routine for flushing the percpu idx to after deletion from the radix-tree.
* removed a memory barrier. (I noticed that clearing can be done after
lock/unlock.)
* added a sanity check not to access a page_cgroup for a page whose refcount is 0.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
include/linux/page_cgroup.h | 13 ++++++++
mm/page_alloc.c | 2 +
mm/page_cgroup.c | 70 ++++++++++++++++++++++++++++++++++++++++++--
3 files changed, 83 insertions(+), 2 deletions(-)
Index: mm-2.6.25-rc5-mm1-k/mm/page_cgroup.c
===================================================================
--- mm-2.6.25-rc5-mm1-k.orig/mm/page_cgroup.c
+++ mm-2.6.25-rc5-mm1-k/mm/page_cgroup.c
@@ -31,12 +31,14 @@
static int page_cgroup_order __read_mostly;
static int page_cgroup_head_size __read_mostly;
+int pcgroup_shrink_order __read_mostly;
#define PCGRP_SHIFT (page_cgroup_order)
#define PCGRP_SIZE (1 << PCGRP_SHIFT)
#define PCGRP_MASK (PCGRP_SIZE - 1)
struct page_cgroup_head {
+ cpumask_t mask;
struct page_cgroup pc[0];
};
@@ -61,6 +63,10 @@ static void calc_page_cgroup_order(void)
page_cgroup_order = order;
page_cgroup_head_size = sizeof(struct page_cgroup_head) +
(sizeof(struct page_cgroup) << order);
+ if (order + 1 < MAX_ORDER)
+ pcgroup_shrink_order = order + 1;
+ else
+ pcgroup_shrink_order = MAX_ORDER - 1;
}
static struct page_cgroup_root __initdata *tmp_root_dir[MAX_NUMNODES];
@@ -72,7 +78,7 @@ init_page_cgroup_head(struct page_cgroup
struct page *page;
struct page_cgroup *pc;
int i;
-
+ cpus_clear(head->mask);
for (i = 0, page = pfn_to_page(pfn), pc = &head->pc[0];
i < PCGRP_SIZE; i++, page++, pc++) {
pc->refcnt = 0;
@@ -122,6 +128,7 @@ static void cache_result(unsigned long i
pcp = &get_cpu_var(pcpu_pcgroup_cache);
pcp->ents[hnum].idx = idx;
pcp->ents[hnum].base = &head->pc[0] - (idx << PCGRP_SHIFT);
+ cpu_set(smp_processor_id(), head->mask);
put_cpu_var(pcpu_pcgroup_cache);
}
@@ -152,6 +159,8 @@ static struct page_cgroup_root *pcgroup_
VM_BUG_ON(!page);
+ VM_BUG_ON(!page_count(page));
+
nid = page_to_nid(page);
return root_node[nid];
@@ -259,6 +268,62 @@ out:
}
return pc;
}
+/*
+ * This will be called from deep place in free_pages()
+ * Because this is called by page allocator, we can assume that
+ * 1. zone->lock is held.
+ * 2. no one touches pages in [page...page + (1 << order))
+ * 3. Because of 2, the page_cgroups for [page...page + (1 << order))
+ * are not touched and will not be touched while we hold zone->lock.
+ */
+void __shrink_page_cgroup(struct page *page, int order)
+{
+ struct page_cgroup_root *root;
+ unsigned long pfn = page_to_pfn(page);
+ unsigned long end_pfn;
+ int cpu;
+
+ root = pcgroup_get_root(page);
+ if (in_interrupt() || !root || (order < PCGRP_SHIFT))
+ return;
+
+ pfn = page_to_pfn(page);
+ end_pfn = pfn + (1 << order);
+
+ while (pfn != end_pfn) {
+ if (spin_trylock(&root->tree_lock)) {
+ struct page_cgroup_cache *pcp;
+ struct page_cgroup_head *head = NULL;
+ int idx = pfn >> PCGRP_SHIFT;
+ /*
+ * Because [pfn, end_pfn) are free pages, we can assume
+ * no lookup in this range will occur. So this has no
+ * race. For the radix-tree, we have to take a lock.
+ * Radix-tree nodes are freed by RCU, so they will not call
+ * free_pages() directly under this.
+ */
+ head = radix_tree_delete(&root->root_node, idx);
+ spin_unlock(&root->tree_lock);
+
+ /*
+ * It's guaranteed that no one has access to this pfn,
+ * because there is not, and will not be, any access to the page
+ * or its page_cgroup.
+ */
+ if (head) {
+ int hnum = hashfunc(idx);
+ for_each_cpu_mask(cpu, head->mask) {
+ pcp = &per_cpu(pcpu_pcgroup_cache, cpu);
+ if (pcp->ents[hnum].idx == idx)
+ pcp->ents[hnum].base = NULL;
+ }
+ /* SLAB for head is SLAB_DESTROY_BY_RCU. */
+ free_page_cgroup(head);
+ }
+ }
+ pfn += PCGRP_SIZE;
+ }
+}
static int __init page_cgroup_init(void)
{
Index: mm-2.6.25-rc5-mm1-k/mm/page_alloc.c
===================================================================
--- mm-2.6.25-rc5-mm1-k.orig/mm/page_alloc.c
+++ mm-2.6.25-rc5-mm1-k/mm/page_alloc.c
@@ -45,6 +45,7 @@
#include <linux/fault-inject.h>
#include <linux/page-isolation.h>
#include <linux/memcontrol.h>
+#include <linux/page_cgroup.h>
#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -463,6 +464,7 @@ static inline void __free_one_page(struc
order++;
}
set_page_order(page, order);
+ shrink_page_cgroup(page, order);
list_add(&page->lru,
&zone->free_area[order].free_list[migratetype]);
zone->free_area[order].nr_free++;
Index: mm-2.6.25-rc5-mm1-k/include/linux/page_cgroup.h
===================================================================
--- mm-2.6.25-rc5-mm1-k.orig/include/linux/page_cgroup.h
+++ mm-2.6.25-rc5-mm1-k/include/linux/page_cgroup.h
@@ -39,6 +39,15 @@ extern struct page_cgroup *get_page_cgro
extern struct page_cgroup *
get_alloc_page_cgroup(struct page *page, gfp_t gfpmask);
+extern int pcgroup_shrink_order;
+extern void __shrink_page_cgroup(struct page *page, int order);
+
+static inline void shrink_page_cgroup(struct page *page, int order)
+{
+ if (unlikely(order >= pcgroup_shrink_order))
+ __shrink_page_cgroup(page, order);
+}
+
#else
static inline struct page_cgroup *get_page_cgroup(struct page *page)
@@ -51,5 +60,9 @@ get_alloc_page_cgroup(struct page *page,
{
return NULL;
}
+static inline void
+shrink_page_cgroup(struct page *page, int order)
+{
+}
#endif
#endif
--
* [-mm][PATCH 4/6] remove unnecessary page_cgroup_zoneinfo
2008-04-01 8:28 [-mm][PATCH 0/6] memcg: radix tree page_cgroup v3 KAMEZAWA Hiroyuki
` (2 preceding siblings ...)
2008-04-01 8:32 ` [-mm][PATCH 3/6] try_to_shrink page cgroup KAMEZAWA Hiroyuki
@ 2008-04-01 8:33 ` KAMEZAWA Hiroyuki
2008-04-01 8:34 ` [-mm][PATCH 5/6] remove refcnt use mapcount KAMEZAWA Hiroyuki
2008-04-01 8:35 ` [-mm][PATCH 6/6] mem_cgroup_map/new_charge KAMEZAWA Hiroyuki
5 siblings, 0 replies; 7+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-04-01 8:33 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, xemul, yamamoto, menage
page_cgroup_zoneinfo() is called twice, before and after taking the lock.
This is wasteful; pass the already-computed mz down instead.
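In other words (sketch of the change, read off the hunks below):

	/* before: the list helpers re-derived mz themselves */
	mz = page_cgroup_zoneinfo(pc);		/* caller, to take mz->lru_lock */
	__mem_cgroup_remove_list(pc);		/* called page_cgroup_zoneinfo(pc) again */

	/* after: the caller passes the mz it already has */
	mz = page_cgroup_zoneinfo(pc);
	__mem_cgroup_remove_list(mz, pc);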
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/memcontrol.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
Index: mm-2.6.25-rc5-mm1-k/mm/memcontrol.c
===================================================================
--- mm-2.6.25-rc5-mm1-k.orig/mm/memcontrol.c
+++ mm-2.6.25-rc5-mm1-k/mm/memcontrol.c
@@ -230,10 +230,10 @@ void mm_free_cgroup(struct mm_struct *mm
css_put(&mm->mem_cgroup->css);
}
-static void __mem_cgroup_remove_list(struct page_cgroup *pc)
+static void __mem_cgroup_remove_list(struct mem_cgroup_per_zone *mz,
+ struct page_cgroup *pc)
{
int from = pc->flags & PAGE_CGROUP_FLAG_ACTIVE;
- struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
if (from)
MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_ACTIVE) -= 1;
@@ -244,10 +244,10 @@ static void __mem_cgroup_remove_list(str
list_del_init(&pc->lru);
}
-static void __mem_cgroup_add_list(struct page_cgroup *pc)
+static void __mem_cgroup_add_list(struct mem_cgroup_per_zone *mz,
+ struct page_cgroup *pc)
{
int to = pc->flags & PAGE_CGROUP_FLAG_ACTIVE;
- struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
if (!to) {
MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_INACTIVE) += 1;
@@ -573,7 +573,7 @@ static int mem_cgroup_charge_common(stru
mz = page_cgroup_zoneinfo(pc);
spin_lock(&mz->lru_lock);
- __mem_cgroup_add_list(pc);
+ __mem_cgroup_add_list(mz, pc);
spin_unlock(&mz->lru_lock);
spin_unlock_irqrestore(&pc->lock, flags);
@@ -623,10 +623,10 @@ void mem_cgroup_uncharge_page(struct pag
spin_unlock_irqrestore(&pc->lock, flags);
return;
}
- mem = pc->mem_cgroup;
mz = page_cgroup_zoneinfo(pc);
+ mem = pc->mem_cgroup;
spin_lock(&mz->lru_lock);
- __mem_cgroup_remove_list(pc);
+ __mem_cgroup_remove_list(mz, pc);
spin_unlock(&mz->lru_lock);
pc->flags = 0;
pc->mem_cgroup = 0;
--
* [-mm][PATCH 5/6] remove refcnt use mapcount
2008-04-01 8:28 [-mm][PATCH 0/6] memcg: radix tree page_cgroup v3 KAMEZAWA Hiroyuki
` (3 preceding siblings ...)
2008-04-01 8:33 ` [-mm][PATCH 4/6] remove unnecessary page_cgroup_zoneinfo KAMEZAWA Hiroyuki
@ 2008-04-01 8:34 ` KAMEZAWA Hiroyuki
2008-04-01 8:35 ` [-mm][PATCH 6/6] mem_cgroup_map/new_charge KAMEZAWA Hiroyuki
5 siblings, 0 replies; 7+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-04-01 8:34 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, xemul, yamamoto, menage
This patch removes page_cgroup->refcnt.
Instead of page_cgroup->refcnt, page->mapcount is used.
After this patch, the rules are as follows.
- a page is charged only if its mapcount == 0.
- a page is uncharged only if its mapcount == 0.
- if a page is accounted, page_cgroup->mem_cgroup of the page is not NULL.
For managing page cache, which has no mapcount, PAGE_CGROUP_FLAG_CACHE
is used. (This works as a reference held by the mapping's radix-tree.)
By bringing page->mapcount into the accounting rules:
- page_cgroup->refcnt can be omitted.
- fork() can be faster.
Under my simple tests this works well, but it needs more thorough testing.
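A condensed view of the new lifecycle (my summary of the hunks below, not patch
code):

	anon page:   first map (mapcount 0 -> 1)   -> charged as MAPPED
	             last unmap (mapcount -> 0)    -> mem_cgroup_uncharge_page() drops the charge
	cache page:  charged as CACHE while it sits in the mapping's radix-tree
	             __remove_from_page_cache()    -> mem_cgroup_uncharge_cache_page()
	             (if still mapped at that point, it is re-accounted as MAPPED instead)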
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
include/linux/memcontrol.h | 1
include/linux/page_cgroup.h | 3 --
mm/filemap.c | 6 ++--
mm/memcontrol.c | 60 ++++++++++++++++++++++++++++++++++++--------
mm/page_cgroup.c | 2 -
5 files changed, 56 insertions(+), 16 deletions(-)
Index: mm-2.6.25-rc5-mm1-k/include/linux/page_cgroup.h
===================================================================
--- mm-2.6.25-rc5-mm1-k.orig/include/linux/page_cgroup.h
+++ mm-2.6.25-rc5-mm1-k/include/linux/page_cgroup.h
@@ -13,10 +13,9 @@ struct mem_cgroup;
struct page_cgroup {
spinlock_t lock; /* lock for all members */
- int refcnt; /* reference count */
+ int flags; /* See below */
struct mem_cgroup *mem_cgroup; /* current cgroup subsys */
struct list_head lru; /* for per cgroup LRU */
- int flags; /* See below */
struct page *page; /* the page this accounts for*/
};
Index: mm-2.6.25-rc5-mm1-k/mm/page_cgroup.c
===================================================================
--- mm-2.6.25-rc5-mm1-k.orig/mm/page_cgroup.c
+++ mm-2.6.25-rc5-mm1-k/mm/page_cgroup.c
@@ -81,7 +81,7 @@ init_page_cgroup_head(struct page_cgroup
cpus_clear(head->mask);
for (i = 0, page = pfn_to_page(pfn), pc = &head->pc[0];
i < PCGRP_SIZE; i++, page++, pc++) {
- pc->refcnt = 0;
+ pc->mem_cgroup = NULL;
pc->page = page;
spin_lock_init(&pc->lock);
}
Index: mm-2.6.25-rc5-mm1-k/mm/memcontrol.c
===================================================================
--- mm-2.6.25-rc5-mm1-k.orig/mm/memcontrol.c
+++ mm-2.6.25-rc5-mm1-k/mm/memcontrol.c
@@ -312,7 +312,7 @@ void mem_cgroup_move_lists(struct page *
*/
if (!spin_trylock_irqsave(&pc->lock, flags))
return;
- if (pc->refcnt) {
+ if (pc->mem_cgroup) {
mz = page_cgroup_zoneinfo(pc);
spin_lock(&mz->lru_lock);
__mem_cgroup_move_lists(pc, active);
@@ -486,7 +486,7 @@ static int mem_cgroup_charge_common(stru
/* Before kmalloc initialization, get_page_cgroup can return EBUSY */
if (unlikely(IS_ERR(pc))) {
if (PTR_ERR(pc) == -EBUSY)
- return NULL;
+ return 0;
return PTR_ERR(pc);
}
@@ -494,15 +494,12 @@ static int mem_cgroup_charge_common(stru
/*
* Has the page already been accounted ?
*/
- if (pc->refcnt > 0) {
- pc->refcnt++;
+ if (pc->mem_cgroup) {
spin_unlock_irqrestore(&pc->lock, flags);
goto success;
}
spin_unlock_irqrestore(&pc->lock, flags);
- /* Note: *new* pc's refcnt is still 0 here. */
-
/*
* We always charge the cgroup the mm_struct belongs to.
* The mm_struct's mem_cgroup changes on task migration if the
@@ -552,9 +549,8 @@ static int mem_cgroup_charge_common(stru
*/
spin_lock_irqsave(&pc->lock, flags);
/* Is anyone charged ? */
- if (unlikely(pc->refcnt)) {
+ if (unlikely(pc->mem_cgroup)) {
/* Someone charged this page while we released the lock */
- pc->refcnt++;
spin_unlock_irqrestore(&pc->lock, flags);
res_counter_uncharge(&mem->res, PAGE_SIZE);
css_put(&mem->css);
@@ -567,7 +563,6 @@ static int mem_cgroup_charge_common(stru
pc->flags = PAGE_CGROUP_FLAG_ACTIVE | PAGE_CGROUP_FLAG_CACHE;
else
pc->flags = PAGE_CGROUP_FLAG_ACTIVE;
- pc->refcnt = 1;
pc->mem_cgroup = mem;
mz = page_cgroup_zoneinfo(pc);
@@ -586,6 +581,8 @@ nomem:
int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
{
+ if (page_mapped(page))
+ return 0;
return mem_cgroup_charge_common(page, mm, gfp_mask,
MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
}
@@ -609,6 +606,8 @@ void mem_cgroup_uncharge_page(struct pag
if (mem_cgroup_subsys.disabled)
return;
+ if (page_mapped(page))
+ return;
/*
* Check if our page_cgroup is valid
*/
@@ -619,7 +618,9 @@ void mem_cgroup_uncharge_page(struct pag
struct mem_cgroup_per_zone *mz;
spin_lock_irqsave(&pc->lock, flags);
- if (!pc->refcnt || --pc->refcnt > 0) {
+ if (page_mapped(page) ||
+ pc->flags & PAGE_CGROUP_FLAG_CACHE ||
+ !pc->mem_cgroup) {
spin_unlock_irqrestore(&pc->lock, flags);
return;
}
@@ -637,6 +638,49 @@ void mem_cgroup_uncharge_page(struct pag
}
}
+void mem_cgroup_uncharge_cache_page(struct page *page)
+{
+ struct page_cgroup *pc;
+ struct mem_cgroup *mem;
+ struct mem_cgroup_per_zone *mz;
+ unsigned long flags;
+
+ if (mem_cgroup_subsys.disabled)
+ return;
+
+ pc = get_page_cgroup(page);
+ if (unlikely(!pc))
+ return;
+
+ spin_lock_irqsave(&pc->lock, flags);
+ if (!pc->mem_cgroup)
+ goto unlock_return;
+ mem = pc->mem_cgroup;
+ /*
+ * This page is still alive as a mapped page.
+ * Re-account it as a MAPPED page.
+ */
+ if (page_mapped(page)) {
+ mem_cgroup_charge_statistics(mem, pc->flags, false);
+ pc->flags &= ~PAGE_CGROUP_FLAG_CACHE;
+ mem_cgroup_charge_statistics(mem, pc->flags, true);
+ goto unlock_return;
+ }
+ mz = page_cgroup_zoneinfo(pc);
+ spin_lock(&mz->lru_lock);
+ __mem_cgroup_remove_list(mz, pc);
+ spin_unlock(&mz->lru_lock);
+ pc->flags = 0;
+ pc->mem_cgroup = 0;
+ spin_unlock_irqrestore(&pc->lock, flags);
+ res_counter_uncharge(&mem->res, PAGE_SIZE);
+ css_put(&mem->css);
+ return;
+unlock_return:
+ spin_unlock_irqrestore(&pc->lock, flags);
+ return;
+}
+
/*
* Pre-charge against newpage while moving a page.
* This function is called before taking page locks.
@@ -656,7 +700,7 @@ int mem_cgroup_prepare_migration(struct
if (pc) {
spin_lock_irqsave(&pc->lock, flags);
- if (pc->refcnt) {
+ if (pc->mem_cgroup) {
mem = pc->mem_cgroup;
css_get(&mem->css);
if (pc->flags & PAGE_CGROUP_FLAG_CACHE)
Index: mm-2.6.25-rc5-mm1-k/include/linux/memcontrol.h
===================================================================
--- mm-2.6.25-rc5-mm1-k.orig/include/linux/memcontrol.h
+++ mm-2.6.25-rc5-mm1-k/include/linux/memcontrol.h
@@ -36,6 +36,7 @@ extern int mem_cgroup_charge(struct page
extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
extern void mem_cgroup_uncharge_page(struct page *page);
+extern void mem_cgroup_uncharge_cache_page(struct page *page);
extern void mem_cgroup_move_lists(struct page *page, bool active);
extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
struct list_head *dst,
Index: mm-2.6.25-rc5-mm1-k/mm/filemap.c
===================================================================
--- mm-2.6.25-rc5-mm1-k.orig/mm/filemap.c
+++ mm-2.6.25-rc5-mm1-k/mm/filemap.c
@@ -118,7 +118,7 @@ void __remove_from_page_cache(struct pag
{
struct address_space *mapping = page->mapping;
- mem_cgroup_uncharge_page(page);
+ mem_cgroup_uncharge_cache_page(page);
radix_tree_delete(&mapping->page_tree, page->index);
page->mapping = NULL;
mapping->nrpages--;
@@ -477,12 +477,12 @@ int add_to_page_cache(struct page *page,
mapping->nrpages++;
__inc_zone_page_state(page, NR_FILE_PAGES);
} else
- mem_cgroup_uncharge_page(page);
+ mem_cgroup_uncharge_cache_page(page);
write_unlock_irq(&mapping->tree_lock);
radix_tree_preload_end();
} else
- mem_cgroup_uncharge_page(page);
+ mem_cgroup_uncharge_cache_page(page);
out:
return error;
}
--
* [-mm][PATCH 6/6] mem_cgroup_map/new_charge
2008-04-01 8:28 [-mm][PATCH 0/6] memcg: radix tree page_cgroup v3 KAMEZAWA Hiroyuki
` (4 preceding siblings ...)
2008-04-01 8:34 ` [-mm][PATCH 5/6] remove refcnt use mapcount KAMEZAWA Hiroyuki
@ 2008-04-01 8:35 ` KAMEZAWA Hiroyuki
5 siblings, 0 replies; 7+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-04-01 8:35 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, xemul, yamamoto, menage
This patch adds mem_cgroup_new_charge() and mem_cgroup_map_charge().
After this, all charge functions are divided into:
- mem_cgroup_map_charge() -- for mapping an existing page.
- mem_cgroup_new_charge() -- for a newly allocated anonymous page.
- mem_cgroup_cache_charge() -- for page cache.
A page passed to mem_cgroup_new_charge() is guaranteed not to be in use by
anyone else, so we can avoid unnecessary spinlocks.
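For reference, the call sites touched by this patch end up mapped like this
(read off the hunks below):

	do_anonymous_page(), do_wp_page() (new COW page)   -> mem_cgroup_new_charge()
	do_swap_page(), __do_fault(), insert_page(),
	unuse_pte(), remove_migration_pte()                 -> mem_cgroup_map_charge()
	add_to_page_cache()                                 -> mem_cgroup_cache_charge() (unchanged)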
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
include/linux/memcontrol.h | 5 +
mm/memcontrol.c | 126 ++++++++++++++++++++++++++-------------------
2 files changed, 77 insertions(+), 54 deletions(-)
Index: mm-2.6.25-rc5-mm1-k/mm/memcontrol.c
===================================================================
--- mm-2.6.25-rc5-mm1-k.orig/mm/memcontrol.c
+++ mm-2.6.25-rc5-mm1-k/mm/memcontrol.c
@@ -469,58 +469,14 @@ unsigned long mem_cgroup_isolate_pages(u
* 0 if the charge was successful
* < 0 if the cgroup is over its limit
*/
-static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
- gfp_t gfp_mask, enum charge_type ctype,
- struct mem_cgroup *memcg)
+static int mem_cgroup_charge_core(struct page_cgroup *pc,
+ struct mem_cgroup *mem,
+ gfp_t gfp_mask, enum charge_type ctype)
{
- struct mem_cgroup *mem;
- struct page_cgroup *pc;
unsigned long flags;
unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
struct mem_cgroup_per_zone *mz;
- if (mem_cgroup_subsys.disabled)
- return 0;
-
- pc = get_alloc_page_cgroup(page, gfp_mask);
- /* Before kmalloc initialization, get_page_cgroup can return EBUSY */
- if (unlikely(IS_ERR(pc))) {
- if (PTR_ERR(pc) == -EBUSY)
- return 0;
- return PTR_ERR(pc);
- }
-
- spin_lock_irqsave(&pc->lock, flags);
- /*
- * Has the page already been accounted ?
- */
- if (pc->mem_cgroup) {
- spin_unlock_irqrestore(&pc->lock, flags);
- goto success;
- }
- spin_unlock_irqrestore(&pc->lock, flags);
-
- /*
- * We always charge the cgroup the mm_struct belongs to.
- * The mm_struct's mem_cgroup changes on task migration if the
- * thread group leader migrates. It's possible that mm is not
- * set, if so charge the init_mm (happens for pagecache usage).
- */
- if (memcg) {
- mem = memcg;
- css_get(&mem->css);
- } else {
- if (!mm)
- mm = &init_mm;
- rcu_read_lock();
- mem = rcu_dereference(mm->mem_cgroup);
- /*
- * For every charge from the cgroup, increment reference count
- */
- css_get(&mem->css);
- rcu_read_unlock();
- }
-
while (res_counter_charge(&mem->res, PAGE_SIZE)) {
if (!(gfp_mask & __GFP_WAIT))
goto nomem;
@@ -579,23 +535,83 @@ nomem:
return -ENOMEM;
}
-int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
+int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
+ gfp_t gfp_mask, enum charge_type ctype)
+{
+ struct page_cgroup *pc;
+ struct mem_cgroup *mem;
+ unsigned long flags;
+
+ pc = get_alloc_page_cgroup(page, gfp_mask);
+ if (unlikely(IS_ERR(pc))) {
+ if (PTR_ERR(pc) == -EBUSY)
+ return 0;
+ return PTR_ERR(pc);
+ }
+ spin_lock_irqsave(&pc->lock, flags);
+ if (pc->mem_cgroup) {
+ spin_unlock_irqrestore(&pc->lock, flags);
+ return 0;
+ }
+ spin_unlock_irqrestore(&pc->lock, flags);
+
+ if (!mm)
+ mm = &init_mm;
+ rcu_read_lock();
+ mem = rcu_dereference(mm->mem_cgroup);
+ css_get(&mem->css);
+ rcu_read_unlock();
+
+ return mem_cgroup_charge_core(pc, mem, gfp_mask, ctype);
+}
+
+int mem_cgroup_map_charge(struct page *page, struct mm_struct *mm,
+ gfp_t gfp_mask)
{
+ if (mem_cgroup_subsys.disabled)
+ return 0;
if (page_mapped(page))
return 0;
return mem_cgroup_charge_common(page, mm, gfp_mask,
- MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
+ MEM_CGROUP_CHARGE_TYPE_MAPPED);
+}
+
+int mem_cgroup_new_charge(struct page *page, struct mm_struct *mm,
+ gfp_t gfp_mask)
+{
+ struct page_cgroup *pc;
+ struct mem_cgroup *mem;
+
+ if (mem_cgroup_subsys.disabled)
+ return 0;
+
+ VM_BUG_ON(page_mapped(page));
+
+ pc = get_alloc_page_cgroup(page, gfp_mask);
+ if (unlikely(IS_ERR(pc))) {
+ if (PTR_ERR(pc) == -EBUSY)
+ return 0;
+ return PTR_ERR(pc);
+ }
+ /* mm is *always* valid here. */
+ mem = mm->mem_cgroup;
+ css_get(&mem->css);
+ return mem_cgroup_charge_core(pc, mem, gfp_mask,
+ MEM_CGROUP_CHARGE_TYPE_MAPPED);
}
int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask)
{
+ if (mem_cgroup_subsys.disabled)
+ return 0;
if (!mm)
mm = &init_mm;
return mem_cgroup_charge_common(page, mm, gfp_mask,
- MEM_CGROUP_CHARGE_TYPE_CACHE, NULL);
+ MEM_CGROUP_CHARGE_TYPE_CACHE);
}
+
/*
* Uncharging is always a welcome operation, we never complain, simply
* uncharge.
@@ -710,8 +726,12 @@ int mem_cgroup_prepare_migration(struct
}
spin_unlock_irqrestore(&pc->lock, flags);
if (mem) {
- ret = mem_cgroup_charge_common(newpage, NULL,
- GFP_KERNEL, type, mem);
+ pc = get_alloc_page_cgroup(newpage, GFP_KERNEL);
+ if (!IS_ERR(pc)) {
+ ret = mem_cgroup_charge_core(pc, mem,
+ GFP_KERNEL, type);
+ } else
+ ret = PTR_ERR(pc);
css_put(&mem->css);
}
}
Index: mm-2.6.25-rc5-mm1-k/include/linux/memcontrol.h
===================================================================
--- mm-2.6.25-rc5-mm1-k.orig/include/linux/memcontrol.h
+++ mm-2.6.25-rc5-mm1-k/include/linux/memcontrol.h
@@ -27,14 +27,17 @@ struct page;
struct mm_struct;
#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+extern struct cgroup_subsys mem_cgroup_subsys;
extern void mm_init_cgroup(struct mm_struct *mm, struct task_struct *p);
extern void mm_free_cgroup(struct mm_struct *mm);
-extern int mem_cgroup_charge(struct page *page, struct mm_struct *mm,
+extern int mem_cgroup_map_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
+extern int mem_cgroup_new_charge(struct page *page, struct mm_struct *mm,
+ gfp_t gfp_mask);
extern void mem_cgroup_uncharge_page(struct page *page);
extern void mem_cgroup_uncharge_cache_page(struct page *page);
extern void mem_cgroup_move_lists(struct page *page, bool active);
Index: mm-2.6.25-rc5-mm1-k/mm/migrate.c
===================================================================
--- mm-2.6.25-rc5-mm1-k.orig/mm/migrate.c
+++ mm-2.6.25-rc5-mm1-k/mm/migrate.c
@@ -176,7 +176,7 @@ static void remove_migration_pte(struct
* be reliable, and this charge can actually fail: oh well, we don't
* make the situation any worse by proceeding as if it had succeeded.
*/
- mem_cgroup_charge(new, mm, GFP_ATOMIC);
+ mem_cgroup_map_charge(new, mm, GFP_ATOMIC);
get_page(new);
pte = pte_mkold(mk_pte(new, vma->vm_page_prot));
Index: mm-2.6.25-rc5-mm1-k/mm/swapfile.c
===================================================================
--- mm-2.6.25-rc5-mm1-k.orig/mm/swapfile.c
+++ mm-2.6.25-rc5-mm1-k/mm/swapfile.c
@@ -514,7 +514,7 @@ static int unuse_pte(struct vm_area_stru
pte_t *pte;
int ret = 1;
- if (mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL))
+ if (mem_cgroup_map_charge(page, vma->vm_mm, GFP_KERNEL))
ret = -ENOMEM;
pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
Index: mm-2.6.25-rc5-mm1-k/mm/memory.c
===================================================================
--- mm-2.6.25-rc5-mm1-k.orig/mm/memory.c
+++ mm-2.6.25-rc5-mm1-k/mm/memory.c
@@ -1146,7 +1146,7 @@ static int insert_page(struct mm_struct
pte_t *pte;
spinlock_t *ptl;
- retval = mem_cgroup_charge(page, mm, GFP_KERNEL);
+ retval = mem_cgroup_map_charge(page, mm, GFP_KERNEL);
if (retval)
goto out;
@@ -1649,7 +1649,7 @@ gotten:
cow_user_page(new_page, old_page, address, vma);
__SetPageUptodate(new_page);
- if (mem_cgroup_charge(new_page, mm, GFP_KERNEL))
+ if (mem_cgroup_new_charge(new_page, mm, GFP_KERNEL))
goto oom_free_new;
/*
@@ -2051,7 +2051,7 @@ static int do_swap_page(struct mm_struct
count_vm_event(PGMAJFAULT);
}
- if (mem_cgroup_charge(page, mm, GFP_KERNEL)) {
+ if (mem_cgroup_map_charge(page, mm, GFP_KERNEL)) {
delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
ret = VM_FAULT_OOM;
goto out;
@@ -2135,7 +2135,7 @@ static int do_anonymous_page(struct mm_s
goto oom;
__SetPageUptodate(page);
- if (mem_cgroup_charge(page, mm, GFP_KERNEL))
+ if (mem_cgroup_new_charge(page, mm, GFP_KERNEL))
goto oom_free_page;
entry = mk_pte(page, vma->vm_page_prot);
@@ -2262,7 +2262,7 @@ static int __do_fault(struct mm_struct *
}
- if (mem_cgroup_charge(page, mm, GFP_KERNEL)) {
+ if (mem_cgroup_map_charge(page, mm, GFP_KERNEL)) {
ret = VM_FAULT_OOM;
goto out;
}
--