Date: Tue, 19 May 2009 10:50:28 +0900
From: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Subject: Re: [PATCH] memcg: handle accounting race in swapin-readahead and zap_pte
Message-Id: <20090519105028.8ce4f8da.nishimura@mxp.nes.nec.co.jp>
In-Reply-To: <20090515190027.e7d48d7a.kamezawa.hiroyu@jp.fujitsu.com>
References: <20090515190027.e7d48d7a.kamezawa.hiroyu@jp.fujitsu.com>
To: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "linux-mm@kvack.org" <linux-mm@kvack.org>, balbir@linux.vnet.ibm.com, hugh@veritas.com, hannes@cmpxchg.org, mingo@elte.hu, linux-kernel@vger.kernel.org, akpm@linux-foundation.org, nishimura@mxp.nes.nec.co.jp

On Fri, 15 May 2009 19:00:27 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> Similar to the previous series, but this version is a bit clearer, I think.
> ==
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> When a process exits, zap_pte() is called, and free_swap_and_cache() is
> called to free the swp_entry. But free_swap_and_cache() uses trylock and
> entries may not be freed. (Later, the global LRU will handle this.)
>
>  processA                            | processB
> -------------------------------------+-------------------------------------
>  (free_swap_and_cache())             | (read_swap_cache_async())
>                                      |   swap_duplicate()
>                                      |   __set_page_locked()
>                                      |   add_to_swap_cache()
>  swap_entry_free() == 0              |
>  find_get_page() -> found            |
>  trylock_page() -> fail & return     |
>                                      | lru_cache_add_anon()
>                                      |   doesn't link this page to memcg's
>                                      |   LRU, because of !PageCgroupUsed.
>
> With memcg, the above path is terrible because the unfreed swapcache will
> never be freed until the global LRU runs. This can leak swap entries and
> cause OOM (as Nishimura reported).
>
> To fix this, one easy way is not to permit swapin-readahead, but that
> causes an unpleasant performance penalty whenever swapin-readahead hits.
>
> This patch tries to fix the above race by adding a private LRU, the
> swapin-buffer, which works as follows.
>   1. add the swap-cache page to the swapin-buffer at readahead()
>   2. check the SwapCache in the swapin-buffer again in delayed work.
>   3. finally, move pages in the swapin-buffer to the INACTIVE_ANON list.
>
> This patch uses delayed_work and moves pages from the buffer to the anon
> LRU in numbers proportional to the size of the swapin-buffer.
>
> Changelog:
>  - redesigned again.
>  - The main difference from previous trials is that PG_lru is not set
>    until we confirm the entry. We can avoid races and contention on the
>    zone's LRU.
>  - # of calls to schedule_work() is reduced.
>  - access to zone->lru is batched.
>  - don't handle races in writeback (handled by another patch)
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

This patch seems to work well. It may be a nitpick, but this patch still
has a theoretical race, like the one below.

 free_swap_and_cache()               | __check_swapin_buffer()
-------------------------------------+-------------------------------------
                                     | trylock_page()
                                     | try_to_free_swap()
                                     |   page_swapcount() -> true & return
 swap_info_get()                     |
 swap_entry_free() == 1              |
 find_get_page() -> found            |
 trylock_page() -> fail & return     |
                                     | unlock_page()

I don't think it happens in practice (unlock_page() would be called soon
after try_to_free_swap() returns), and this patch does seem to work well.
I'm not sure whether we should handle this case more strictly or not, but
I think it would be better to add some comments about it at least.
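For example, something along these lines in __check_swapin_buffer() (only a
sketch of the comment I have in mind, untested):

	if (PageUptodate(page) && trylock_page(page)) {
		/*
		 * Racy vs. free_swap_and_cache(): it can drop the last
		 * reference right after try_to_free_swap() has checked
		 * page_swapcount(), and its own trylock_page() then
		 * fails because we hold the page lock here. The stale
		 * swapcache goes to the LRU with its entry still held.
		 * The window is tiny (we unlock immediately below), so
		 * we tolerate it for now.
		 */
		try_to_free_swap(page);
		unlock_page(page);
		return 1;
	}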
And I have a question. If the size of the swap device (or the number of
used swap entries not in the SwapCache) is small enough never to hit
"if (memcg_swapin_buffer.nr > ENOUGH_LARGE_SWAPIN_BUFFER)" in
mem_cgroup_add_swapin_buffer(), aren't the pages in the swapin buffer left
unfreed by swapoff (although their swap entries are freed)?
Wouldn't it be better to call mem_cgroup_drain_swapin_buffer() directly at
the end of swapoff?
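If so, a synchronous flush could look something like this (untested sketch;
the helper name mem_cgroup_flush_swapin_buffer() is made up, and it assumes
the buffer internals stay as in this patch):

	/*
	 * Hypothetical helper in mm/memcontrol.c, called from the end of
	 * sys_swapoff(). mem_cgroup_drain_swapin_buffer() only handles
	 * nr/8 + 1 pages per call, so repeat while progress is made.
	 */
	void mem_cgroup_flush_swapin_buffer(void)
	{
		int prev;

		while (memcg_swapin_buffer.nr) {
			prev = memcg_swapin_buffer.nr;
			mem_cgroup_drain_swapin_buffer(NULL);
			if (memcg_swapin_buffer.nr == prev)
				break;	/* no progress; let the worker retry */
		}
	}

It would also need a declaration in swap.h with a stub for
!CONFIG_CGROUP_MEM_RES_CTLR, like the other helpers.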
To be honest, I prefer your v4 (remembering only stale swap entries), but
I don't oppose this direction strongly.

Thanks,
Daisuke Nishimura.

> ---
>  include/linux/swap.h |    8 +++
>  mm/memcontrol.c      |  120 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  mm/swap_state.c      |   10 +++-
>  3 files changed, 136 insertions(+), 2 deletions(-)
>
> Index: mmotm-2.6.30-May13/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.30-May13.orig/mm/memcontrol.c	2009-05-15 17:44:14.000000000 +0900
> +++ mmotm-2.6.30-May13/mm/memcontrol.c	2009-05-15 18:46:35.000000000 +0900
> @@ -834,6 +834,123 @@
>  	return num;
>  }
>
> +#ifdef CONFIG_SWAP
> +
> +struct swapin_buffer {
> +	spinlock_t lock;
> +	struct list_head list;
> +	int nr;
> +	struct delayed_work work;
> +} memcg_swapin_buffer;
> +
> +/* Used at swapoff */
> +#define ENOUGH_LARGE_SWAPIN_BUFFER	(1024)
> +
> +/* Hide swapin pages from the LRU for a while */
> +
> +static int __check_swapin_buffer(struct page *page)
> +{
> +	/* Fast path (PG_writeback can never be set here.) */
> +	if (!PageSwapCache(page) || page_mapped(page))
> +		return 1;
> +
> +	if (PageUptodate(page) && trylock_page(page)) {
> +		try_to_free_swap(page);
> +		unlock_page(page);
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +static void mem_cgroup_drain_swapin_buffer(struct work_struct *work)
> +{
> +	struct page *page, *tmp;
> +	LIST_HEAD(scan);
> +	int nr, fail;
> +
> +	if (!memcg_swapin_buffer.nr)
> +		return;
> +
> +	/*
> +	 * When the swapin-buffer grows rapidly, swapped-in pages tend to
> +	 * be in use: the page-faulting thread must keep running to cause
> +	 * a large swapin, so swapin-readahead should _hit_ when nr is
> +	 * large. In that case, __check_swapin_buffer() takes the fast
> +	 * path. So make _nr_ proportional to the total size.
> +	 */
> +	nr = memcg_swapin_buffer.nr/8 + 1;
> +
> +	spin_lock(&memcg_swapin_buffer.lock);
> +	while (nr-- && !list_empty(&memcg_swapin_buffer.list)) {
> +		list_move(memcg_swapin_buffer.list.next, &scan);
> +		memcg_swapin_buffer.nr--;
> +	}
> +	spin_unlock(&memcg_swapin_buffer.lock);
> +
> +	fail = 0;
> +	list_for_each_entry_safe(page, tmp, &scan, lru) {
> +		if (__check_swapin_buffer(page)) {
> +			list_del(&page->lru);
> +			lru_cache_add_anon(page);
> +			put_page(page);
> +		} else
> +			fail++;
> +	}
> +	if (!list_empty(&scan)) {
> +		spin_lock(&memcg_swapin_buffer.lock);
> +		list_splice_tail(&scan, &memcg_swapin_buffer.list);
> +		memcg_swapin_buffer.nr += fail;
> +		spin_unlock(&memcg_swapin_buffer.lock);
> +	}
> +
> +	if (memcg_swapin_buffer.nr)
> +		schedule_delayed_work(&memcg_swapin_buffer.work, HZ/10);
> +}
> +
> +static void mem_cgroup_force_drain_swapin_buffer(void)
> +{
> +	int swapin_buffer_thresh;
> +
> +	swapin_buffer_thresh = (num_online_cpus() + 1) * (1 << page_cluster);
> +	if (memcg_swapin_buffer.nr > swapin_buffer_thresh)
> +		mem_cgroup_drain_swapin_buffer(NULL);
> +}
> +
> +void mem_cgroup_lazy_drain_swapin_buffer(void)
> +{
> +	schedule_delayed_work(&memcg_swapin_buffer.work, HZ/10);
> +}
> +
> +void mem_cgroup_add_swapin_buffer(struct page *page)
> +{
> +	get_page(page);
> +	spin_lock(&memcg_swapin_buffer.lock);
> +	list_add_tail(&page->lru, &memcg_swapin_buffer.list);
> +	memcg_swapin_buffer.nr++;
> +	spin_unlock(&memcg_swapin_buffer.lock);
> +	/*
> +	 * Usually, this will not hit. At swapoff, we have to
> +	 * drain entries manually.
> +	 */
> +	if (memcg_swapin_buffer.nr > ENOUGH_LARGE_SWAPIN_BUFFER)
> +		mem_cgroup_drain_swapin_buffer(NULL);
> +}
> +
> +static __init int init_swapin_buffer(void)
> +{
> +	spin_lock_init(&memcg_swapin_buffer.lock);
> +	INIT_LIST_HEAD(&memcg_swapin_buffer.list);
> +	INIT_DELAYED_WORK(&memcg_swapin_buffer.work,
> +			  mem_cgroup_drain_swapin_buffer);
> +	return 0;
> +}
> +late_initcall(init_swapin_buffer);
> +#else
> +static void mem_cgroup_force_drain_swapin_buffer(void)
> +{
> +}
> +#endif /* CONFIG_SWAP */
> +
>  /*
>   * Visit the first child (need not be the first child as per the ordering
>   * of the cgroup list, since we track last_scanned_child) of @mem and use
> @@ -892,6 +1009,8 @@
>  	int ret, total = 0;
>  	int loop = 0;
>
> +	mem_cgroup_force_drain_swapin_buffer();
> +
>  	while (loop < 2) {
>  		victim = mem_cgroup_select_victim(root_mem);
>  		if (victim == root_mem)
> @@ -1560,6 +1679,7 @@
>  }
>
>  #ifdef CONFIG_SWAP
> +
>  /*
>   * called after __delete_from_swap_cache() and drop "page" account.
>   * memcg information is recorded to swap_cgroup of "ent"
> Index: mmotm-2.6.30-May13/include/linux/swap.h
> ===================================================================
> --- mmotm-2.6.30-May13.orig/include/linux/swap.h	2009-05-15 17:44:14.000000000 +0900
> +++ mmotm-2.6.30-May13/include/linux/swap.h	2009-05-15 18:01:43.000000000 +0900
> @@ -336,11 +336,19 @@
>
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
>  extern void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent);
> +extern void mem_cgroup_add_swapin_buffer(struct page *page);
> +extern void mem_cgroup_lazy_drain_swapin_buffer(void);
>  #else
>  static inline void
>  mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
>  {
>  }
> +static inline void mem_cgroup_add_swapin_buffer(struct page *page)
> +{
> +}
> +static inline void mem_cgroup_lazy_drain_swapin_buffer(void)
> +{
> +}
>  #endif
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
>  extern void mem_cgroup_uncharge_swap(swp_entry_t ent);
> Index: mmotm-2.6.30-May13/mm/swap_state.c
> ===================================================================
> --- mmotm-2.6.30-May13.orig/mm/swap_state.c	2009-05-15 17:44:14.000000000 +0900
> +++ mmotm-2.6.30-May13/mm/swap_state.c	2009-05-15 18:01:43.000000000 +0900
> @@ -311,7 +311,10 @@
>  		/*
>  		 * Initiate read into locked page and return.
>  		 */
> -		lru_cache_add_anon(new_page);
> +		if (mem_cgroup_disabled())
> +			lru_cache_add_anon(new_page);
> +		else
> +			mem_cgroup_add_swapin_buffer(new_page);
>  		swap_readpage(NULL, new_page);
>  		return new_page;
>  	}
> @@ -368,6 +371,9 @@
>  			break;
>  		page_cache_release(page);
>  	}
> -	lru_add_drain();	/* Push any new pages onto the LRU now */
> +	if (mem_cgroup_disabled())
> +		lru_add_drain();	/* Push any new pages onto the LRU now */
> +	else
> +		mem_cgroup_lazy_drain_swapin_buffer();
>  	return read_swap_cache_async(entry, gfp_mask, vma, addr);
>  }