* [RFC] memory unplug v5 [1/6] migration by kernel
2007-06-14 6:56 [RFC] memory unplug v5 [0/6] intro KAMEZAWA Hiroyuki
@ 2007-06-14 6:59 ` KAMEZAWA Hiroyuki
2007-06-14 7:01 ` Christoph Lameter
2007-06-14 7:00 ` [RFC] memory unplug v5 [2/6] isolate lru page race fix KAMEZAWA Hiroyuki
` (4 subsequent siblings)
5 siblings, 1 reply; 34+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-06-14 6:59 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, mel, y-goto, clameter, hugh
page migratio by kernel v5.
Changelog V4->V5
- Removed new functions. just add add/remove dummy_vma codes.
- page_lock_anon_vma() is exported.
In usual, migrate_pages(page,,) is called with holoding mm->sem by systemcall.
(mm here is a mm_struct which maps the migration target page.)
This semaphore helps avoiding some race conditions.
But, if we want to migrate a page by some kernel codes, we have to avoid
some races. This patch adds check code for following race condition.
1. A page which is not mapped can be target of migration. Then, we have
to check page_mapped() before calling try_to_unmap().
2. We can't trust page->mapping if page_mapcount() can goes down to 0.
But when we map newpage back to original ptes, we have to access
anon_vma from a page, which page_mapcount() is 0.
This patch adds a special dummy_vma to anon_vma for avoiding
anon_vma is freed while page is unmapped.
Signed-Off-By: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/migrate.c | 21 +++++++++++++++++++--
1 file changed, 19 insertions(+), 2 deletions(-)
Index: devel-2.6.22-rc4-mm2/mm/migrate.c
===================================================================
--- devel-2.6.22-rc4-mm2.orig/mm/migrate.c
+++ devel-2.6.22-rc4-mm2/mm/migrate.c
@@ -602,6 +602,8 @@ static int move_to_new_page(struct page
return rc;
}
+/* By this DUMMY VMA, vma_address() always return -EFAULT */
+#define DUMMY_VMA {.vm_mm = NULL, .vm_start = 0, .vm_end = 0,}
/*
* Obtain the lock on page, remove all ptes and migrate the page
* to the newly allocated page in newpage.
@@ -612,6 +614,8 @@ static int unmap_and_move(new_page_t get
int rc = 0;
int *result = NULL;
struct page *newpage = get_new_page(page, private, &result);
+ struct anon_vma *anon_vma = NULL;
+ struct vm_area_struct dummy = DUMMY_VMA;
if (!newpage)
return -ENOMEM;
@@ -632,17 +636,30 @@ static int unmap_and_move(new_page_t get
goto unlock;
wait_on_page_writeback(page);
}
-
+ /* hold this anon_vma until page migration ends */
+ if (PageAnon(page) && page_mapped(page)) {
+ anon_vma = page_lock_anon_vma(page);
+ if (anon_vma) {
+ dummy.anon_vma = anon_vma;
+ __anon_vma_link(&dummy);
+ page_unlock_anon_vma(anon_vma);
+ }
+ }
/*
* Establish migration ptes or remove ptes
*/
- try_to_unmap(page, 1);
+ if (page_mapped(page))
+ try_to_unmap(page, 1);
+
if (!page_mapped(page))
rc = move_to_new_page(newpage, page);
if (rc)
remove_migration_ptes(page, page);
+ if (anon_vma)
+ anon_vma_unlink(&dummy);
+
unlock:
unlock_page(page);
Index: devel-2.6.22-rc4-mm2/include/linux/rmap.h
===================================================================
--- devel-2.6.22-rc4-mm2.orig/include/linux/rmap.h
+++ devel-2.6.22-rc4-mm2/include/linux/rmap.h
@@ -56,6 +56,9 @@ static inline void anon_vma_unlock(struc
spin_unlock(&anon_vma->lock);
}
+struct anon_vma *page_lock_anon_vma(struct page *page);
+void page_unlock_anon_vma(struct anon_vma *anon_vma);
+
/*
* anon_vma helper functions.
*/
Index: devel-2.6.22-rc4-mm2/mm/rmap.c
===================================================================
--- devel-2.6.22-rc4-mm2.orig/mm/rmap.c
+++ devel-2.6.22-rc4-mm2/mm/rmap.c
@@ -178,7 +178,7 @@ void __init anon_vma_init(void)
* Getting a lock on a stable anon_vma from a page off the LRU is
* tricky: page_lock_anon_vma rely on RCU to guard against the races.
*/
-static struct anon_vma *page_lock_anon_vma(struct page *page)
+struct anon_vma *page_lock_anon_vma(struct page *page)
{
struct anon_vma *anon_vma;
unsigned long anon_mapping;
@@ -198,7 +198,7 @@ out:
return NULL;
}
-static void page_unlock_anon_vma(struct anon_vma *anon_vma)
+void page_unlock_anon_vma(struct anon_vma *anon_vma)
{
spin_unlock(&anon_vma->lock);
rcu_read_unlock();
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 34+ messages in thread* Re: [RFC] memory unplug v5 [1/6] migration by kernel
2007-06-14 6:59 ` [RFC] memory unplug v5 [1/6] migration by kernel KAMEZAWA Hiroyuki
@ 2007-06-14 7:01 ` Christoph Lameter
2007-06-14 7:11 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 34+ messages in thread
From: Christoph Lameter @ 2007-06-14 7:01 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, mel, y-goto, hugh
On Thu, 14 Jun 2007, KAMEZAWA Hiroyuki wrote:
> 1. A page which is not mapped can be target of migration. Then, we have
> to check page_mapped() before calling try_to_unmap().
How can we get an anonymous page that is not mapped?
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC] memory unplug v5 [1/6] migration by kernel
2007-06-14 7:01 ` Christoph Lameter
@ 2007-06-14 7:11 ` KAMEZAWA Hiroyuki
2007-06-14 7:22 ` Christoph Lameter
0 siblings, 1 reply; 34+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-06-14 7:11 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-mm, mel, y-goto, hugh
On Thu, 14 Jun 2007 00:01:51 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:
> On Thu, 14 Jun 2007, KAMEZAWA Hiroyuki wrote:
>
> > 1. A page which is not mapped can be target of migration. Then, we have
> > to check page_mapped() before calling try_to_unmap().
>
> How can we get an anonymous page that is not mapped?
>
In my case, any pages linked to LRU can be target of migration.
"A page which is not mapped" may not be an anon. can be unmapped file caches.
-Kame
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC] memory unplug v5 [1/6] migration by kernel
2007-06-14 7:11 ` KAMEZAWA Hiroyuki
@ 2007-06-14 7:22 ` Christoph Lameter
2007-06-14 7:41 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 34+ messages in thread
From: Christoph Lameter @ 2007-06-14 7:22 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, mel, y-goto, hugh
On Thu, 14 Jun 2007, KAMEZAWA Hiroyuki wrote:
> On Thu, 14 Jun 2007 00:01:51 -0700 (PDT)
> Christoph Lameter <clameter@sgi.com> wrote:
>
> > On Thu, 14 Jun 2007, KAMEZAWA Hiroyuki wrote:
> >
> > > 1. A page which is not mapped can be target of migration. Then, we have
> > > to check page_mapped() before calling try_to_unmap().
> >
> > How can we get an anonymous page that is not mapped?
> >
>
> In my case, any pages linked to LRU can be target of migration.
> "A page which is not mapped" may not be an anon. can be unmapped file caches.
But the code is checking for an anonymous page and then checks if it is
mapped.
If you have a valid anonymous page then it is mapped. How can an unmapped
anonymous page exist? It will be freed immediately (or am I missing a
special case. No need to check that it is mapped. Thus also no need to
call page_lock_anon_vma.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC] memory unplug v5 [1/6] migration by kernel
2007-06-14 7:22 ` Christoph Lameter
@ 2007-06-14 7:41 ` KAMEZAWA Hiroyuki
2007-06-14 7:47 ` Christoph Lameter
0 siblings, 1 reply; 34+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-06-14 7:41 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-mm, mel, y-goto, hugh
On Thu, 14 Jun 2007 00:22:10 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:
> But the code is checking for an anonymous page and then checks if it is
> mapped.
>
Ouch, here ?
+ if (PageAnon(page) && page_mapped(page)) {
+ anon_vma = page_lock_anon_vma(page);
In my understanding:
PageAnon(page) checks (page->mapping & 0x1). And, as you know, page->mapping
is not cleared even if the page is removed from rmap.
See zap_pte_range(), it does
==
page = vm_normal_page(vma, addr, ptent);
if (unlikely(details) && page) {
<snip>
}
ptent = ptep_get_and_clear_full(mm, addr, pte,
tlb->fullmm);
tlb_remove_tlb_entry(tlb, pte, addr);
<snip>
if (PageAnon(page))
anon_rss--;
else {
if (pte_dirty(ptent))
set_page_dirty(page);
if (pte_young(ptent))
SetPageReferenced(page);
file_rss--;
}
page_remove_rmap(page, vma);
tlb_remove_page(tlb, page); <-------------- page is freed here.
==
When a page is freed , the page is not locked. There is no lock.
But... page_lock_anon_vma() check page_mapped() by itself.
My patch should be
==
+ if (PageAnon(page)) {
+ anon_vma = page_lock_anon_vma(page);
==
This is my mistake.
-Kame
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 34+ messages in thread* Re: [RFC] memory unplug v5 [1/6] migration by kernel
2007-06-14 7:41 ` KAMEZAWA Hiroyuki
@ 2007-06-14 7:47 ` Christoph Lameter
2007-06-14 8:29 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 34+ messages in thread
From: Christoph Lameter @ 2007-06-14 7:47 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, mel, y-goto, hugh
On Thu, 14 Jun 2007, KAMEZAWA Hiroyuki wrote:
> In my understanding:
>
> PageAnon(page) checks (page->mapping & 0x1). And, as you know, page->mapping
> is not cleared even if the page is removed from rmap.
But in that case the refcount is zero. We will not migrate the page.
> My patch should be
> ==
> + if (PageAnon(page)) {
> + anon_vma = page_lock_anon_vma(page);
> ==
> This is my mistake.
Do not worry I make lots of mistakes.... We just need to pool our minds
and come up with the right solution. I think this is a critical piece of
code that needs to be right for defrag and for memory unplug.
Why do you lock the page there? Its already locked from sys_move_pages
etc. This will make normal page migration deadlock.
Just get the anonymous vma address from the mapping like in the last
conceptual patch that I sent you.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 34+ messages in thread* Re: [RFC] memory unplug v5 [1/6] migration by kernel
2007-06-14 7:47 ` Christoph Lameter
@ 2007-06-14 8:29 ` KAMEZAWA Hiroyuki
2007-06-14 14:19 ` Christoph Lameter
0 siblings, 1 reply; 34+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-06-14 8:29 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-mm, mel, y-goto, hugh
On Thu, 14 Jun 2007 00:47:46 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:
> On Thu, 14 Jun 2007, KAMEZAWA Hiroyuki wrote:
>
> > In my understanding:
> >
> > PageAnon(page) checks (page->mapping & 0x1). And, as you know, page->mapping
> > is not cleared even if the page is removed from rmap.
>
> But in that case the refcount is zero. We will not migrate the page.
>
yes. why we add dummy_vma to page here is
==
0. page_count(page) check.
1. does try_to_unmap() and page->mapcount goes down to 0. page->count goes down to 1.
2. page->mapping is copied to newpage.
3. remove_migration_ptes is called against newpage->mapping.
==
If page is zapped while 0->1, newpage->mapping can be untrustable value.
My point is that if page->mapcount goes down to 0, we should be careful to
access page->mapping value.
But...during discussion with you, I found anon_vma is now freed by RCU...
Ugh, then, what I have to do is rcu_read_lock() -> rcu_read_unlock() while
migrating anon ptes. If we can rcu read lock here, we don't need dummy_vma.
How about this ?
-Kame
p.s page_lock_anon_vma() locks anon_vma, not page.
==
page migratio by kernel v5.
Changelog V5->V6
- removed dummy_vma and uses rcu_read_lock().
In usual, migrate_pages(page,,) is called with holoding mm->sem by systemcall.
(mm here is a mm_struct which maps the migration target page.)
This semaphore helps avoiding some race conditions.
But, if we want to migrate a page by some kernel codes, we have to avoid
some races. This patch adds check code for following race condition.
1. A page which is not mapped can be target of migration. Then, we have
to check page_mapped() before calling try_to_unmap().
2. anon_vma can be freed while page is unmapped, but page->mapping remains as
it was. We drop page->mapcount to be 0. Then we cannot trust page->mapping.
So, use rcu_read_lock() to prevent anon_vma pointed by page->mapping will
not be freed during migration.
Signed-Off-By: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/migrate.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
Index: devel-2.6.22-rc4-mm2/mm/migrate.c
===================================================================
--- devel-2.6.22-rc4-mm2.orig/mm/migrate.c
+++ devel-2.6.22-rc4-mm2/mm/migrate.c
@@ -612,6 +612,7 @@ static int unmap_and_move(new_page_t get
int rc = 0;
int *result = NULL;
struct page *newpage = get_new_page(page, private, &result);
+ int rcu_locked = 0;
if (!newpage)
return -ENOMEM;
@@ -632,16 +633,24 @@ static int unmap_and_move(new_page_t get
goto unlock;
wait_on_page_writeback(page);
}
-
+ /* anon_vma should not be freed while migration. */
+ if (PageAnon(page)) {
+ rcu_read_lock();
+ rcu_locked = 1;
+ }
/*
* Establish migration ptes or remove ptes
*/
- try_to_unmap(page, 1);
+ if (page_mapped(page))
+ try_to_unmap(page, 1);
+
if (!page_mapped(page))
rc = move_to_new_page(newpage, page);
if (rc)
remove_migration_ptes(page, page);
+ if (rcu_locked)
+ rcu_read_unlock();
unlock:
unlock_page(page);
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 34+ messages in thread* Re: [RFC] memory unplug v5 [1/6] migration by kernel
2007-06-14 8:29 ` KAMEZAWA Hiroyuki
@ 2007-06-14 14:19 ` Christoph Lameter
2007-06-14 16:02 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 34+ messages in thread
From: Christoph Lameter @ 2007-06-14 14:19 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, mel, y-goto, hugh
On Thu, 14 Jun 2007, KAMEZAWA Hiroyuki wrote:
> But...during discussion with you, I found anon_vma is now freed by RCU...
>
> Ugh, then, what I have to do is rcu_read_lock() -> rcu_read_unlock() while
> migrating anon ptes. If we can rcu read lock here, we don't need dummy_vma.
> How about this ?
Hmmmm... Looks good. Maybe take the RCU lock unconditionally? Is there a
problem if we do so? Then the patch becomes very small and it looks
cleaner.
Is there an issue with calling try_to_unmap for an unmapped page? We check
in try_to_unmap if the pte is valid. If it was unmapped then try_to_unmap
will fail anyways.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC] memory unplug v5 [1/6] migration by kernel
2007-06-14 14:19 ` Christoph Lameter
@ 2007-06-14 16:02 ` KAMEZAWA Hiroyuki
2007-06-14 16:12 ` Christoph Lameter
0 siblings, 1 reply; 34+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-06-14 16:02 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-mm, mel, y-goto, hugh
On Thu, 14 Jun 2007 07:19:19 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:
> On Thu, 14 Jun 2007, KAMEZAWA Hiroyuki wrote:
>
> > But...during discussion with you, I found anon_vma is now freed by RCU...
> >
> > Ugh, then, what I have to do is rcu_read_lock() -> rcu_read_unlock() while
> > migrating anon ptes. If we can rcu read lock here, we don't need dummy_vma.
> > How about this ?
>
> Hmmmm... Looks good. Maybe take the RCU lock unconditionally? Is there a
> problem if we do so? Then the patch becomes very small and it looks
> cleaner.
Ok, maybe no problem if "if" is removed.
>
> Is there an issue with calling try_to_unmap for an unmapped page? We check
> in try_to_unmap if the pte is valid. If it was unmapped then try_to_unmap
> will fail anyways.
>
I met following case.
---
CPU 0 CPU 1
do_swap_page()
-> read_swap_cache_async()
-> # alloc new page
# page is added to swapcache
# page is locked here.
# added to LRU <- we find this page because of PG_lru
# start asynchrous read I/O lock_page()
# page is unlocked here we acquire the lock.
-> lock_page()
wait.... unmap_and_move() is called.
try_to_unmap() is called.
PageAnon() returns 0. beacause the page is not
added to rmap yet. page->mapping is NULL, here.
try_to_unmap_file() is called.
try_to_unmap_file() touches NULL pointer.
--
An unmapped swapcache page, which is just added to LRU, may be accessed via migrate_page().
But page->mapping is NULL yet.
Hmm, should I add following check instead of page_mapped() ?
--
if (likely(page->mapping))
try_to_unmap(page,1)
--
Note: Because the page's page_count() is not migratable,(do_swap_page() has one ref.)
this migration will fail with -EAGAIN.
Thanks,
-Kame
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 34+ messages in thread* Re: [RFC] memory unplug v5 [1/6] migration by kernel
2007-06-14 16:02 ` KAMEZAWA Hiroyuki
@ 2007-06-14 16:12 ` Christoph Lameter
2007-06-14 16:15 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 34+ messages in thread
From: Christoph Lameter @ 2007-06-14 16:12 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, mel, y-goto, hugh
On Fri, 15 Jun 2007, KAMEZAWA Hiroyuki wrote:
> > Is there an issue with calling try_to_unmap for an unmapped page? We check
> > in try_to_unmap if the pte is valid. If it was unmapped then try_to_unmap
> > will fail anyways.
> >
> I met following case.
> ---
> CPU 0 CPU 1
>
> do_swap_page()
> -> read_swap_cache_async()
> -> # alloc new page
> # page is added to swapcache
> # page is locked here.
> # added to LRU <- we find this page because of PG_lru
> # start asynchrous read I/O lock_page()
> # page is unlocked here we acquire the lock.
> -> lock_page()
> wait.... unmap_and_move() is called.
> try_to_unmap() is called.
> PageAnon() returns 0. beacause the page is not
> added to rmap yet. page->mapping is NULL, here.
> try_to_unmap_file() is called.
> try_to_unmap_file() touches NULL pointer.
> --
> An unmapped swapcache page, which is just added to LRU, may be accessed via migrate_page().
> But page->mapping is NULL yet.
Yes then lets add a check for page->mapping == NULL there.
if (!page->mapping)
goto unlock;
That will retry the migration on the next pass. Add some concise comment
explaining the situation. This is general bug in page migration.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC] memory unplug v5 [1/6] migration by kernel
2007-06-14 16:12 ` Christoph Lameter
@ 2007-06-14 16:15 ` KAMEZAWA Hiroyuki
2007-06-14 18:04 ` Mel Gorman
0 siblings, 1 reply; 34+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-06-14 16:15 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-mm, mel, y-goto, hugh
On Thu, 14 Jun 2007 09:12:37 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:
> > An unmapped swapcache page, which is just added to LRU, may be accessed via migrate_page().
> > But page->mapping is NULL yet.
>
> Yes then lets add a check for page->mapping == NULL there.
>
> if (!page->mapping)
> goto unlock;
>
> That will retry the migration on the next pass. Add some concise comment
> explaining the situation. This is general bug in page migration.
>
Ok, will do. thank you for your advice.
-Kame
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC] memory unplug v5 [1/6] migration by kernel
2007-06-14 16:15 ` KAMEZAWA Hiroyuki
@ 2007-06-14 18:04 ` Mel Gorman
2007-06-14 22:31 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 34+ messages in thread
From: Mel Gorman @ 2007-06-14 18:04 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: Christoph Lameter, linux-mm, y-goto, hugh
KAMEZAWA Hiroyuki wrote:
> On Thu, 14 Jun 2007 09:12:37 -0700 (PDT)
> Christoph Lameter <clameter@sgi.com> wrote:
>
>>> An unmapped swapcache page, which is just added to LRU, may be accessed via migrate_page().
>>> But page->mapping is NULL yet.
>> Yes then lets add a check for page->mapping == NULL there.
>>
>> if (!page->mapping)
>> goto unlock;
>>
>> That will retry the migration on the next pass. Add some concise comment
>> explaining the situation. This is general bug in page migration.
>>
> Ok, will do. thank you for your advice.
>
I am currently testing what I believe your patches currently look like.
In combination with the isolate lru page fix patch, things are looking
better than they were. Previously I had seen some very bizarre errors
when migrating due to compaction of memory but I'm not seeing them now.
I hadn't been reporting because it was difficult to tell if migration
was at fault or what memory compaction was doing.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC] memory unplug v5 [1/6] migration by kernel
2007-06-14 18:04 ` Mel Gorman
@ 2007-06-14 22:31 ` KAMEZAWA Hiroyuki
2007-06-15 9:43 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 34+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-06-14 22:31 UTC (permalink / raw)
To: Mel Gorman; +Cc: clameter, linux-mm, y-goto, hugh
On Thu, 14 Jun 2007 19:04:16 +0100
Mel Gorman <mel@csn.ul.ie> wrote:
> >> That will retry the migration on the next pass. Add some concise comment
> >> explaining the situation. This is general bug in page migration.
> >>
> > Ok, will do. thank you for your advice.
> >
>
> I am currently testing what I believe your patches currently look like.
> In combination with the isolate lru page fix patch, things are looking
> better than they were. Previously I had seen some very bizarre errors
> when migrating due to compaction of memory but I'm not seeing them now.
> I hadn't been reporting because it was difficult to tell if migration
> was at fault or what memory compaction was doing.
>
Thank you for reporting. I'm encouraged :)
I'll post updated version later.
-Kame
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC] memory unplug v5 [1/6] migration by kernel
2007-06-14 22:31 ` KAMEZAWA Hiroyuki
@ 2007-06-15 9:43 ` KAMEZAWA Hiroyuki
2007-06-15 9:53 ` KAMEZAWA Hiroyuki
2007-06-15 14:41 ` Christoph Lameter
0 siblings, 2 replies; 34+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-06-15 9:43 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: Mel Gorman, clameter, linux-mm, y-goto, hugh
This is updated version.
-Kame
page migration by kernel v6.
Changelog V5->V6
- removed dummy_vma and uses rcu_read_lock().
- removed page_mapped() check and uses !page->mapping check.
In usual, migrate_pages(page,,) is called with holding mm->sem by system call.
(mm here is a mm_struct which maps the migration target page.)
This semaphore helps avoiding some race conditions.
But, if we want to migrate a page by some kernel codes, we have to avoid
some races. This patch adds check code for following race condition.
1. A page which page->mapping==NULL can be target of migration. Then, we have
to check page->mapping before calling try_to_unmap().
2. anon_vma can be freed while page is unmapped, but page->mapping remains as
it was. We drop page->mapcount to be 0. Then we cannot trust page->mapping.
So, use rcu_read_lock() to prevent anon_vma pointed by page->mapping from
being freed during migration.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/migrate.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)
Index: devel-2.6.22-rc4-mm2/mm/migrate.c
===================================================================
--- devel-2.6.22-rc4-mm2.orig/mm/migrate.c
+++ devel-2.6.22-rc4-mm2/mm/migrate.c
@@ -632,16 +632,30 @@ static int unmap_and_move(new_page_t get
goto unlock;
wait_on_page_writeback(page);
}
-
/*
- * Establish migration ptes or remove ptes
+ * This is a corner case handling.
+ * When a new swap-ache is read into, it is linked to LRU
+ * and treated as swapcache but has no rmap yet.
+ * Calling try_to_unmap() against a page->mapping==NULL page is
+ * BUG. So handle it here.
+ */
+ if (!page->mapping)
+ goto unlock;
+ /*
+ * By try_to_unmap(), page->mapcount goes down to 0 here. In this case,
+ * we cannot notice that anon_vma is freed while we migrates a pages
+ * This rcu_read_lock() delays freeing anon_vma pointer until the end
+ * of migration. File cache pages are no problem because of page_lock()
*/
+ rcu_read_lock();
try_to_unmap(page, 1);
+
if (!page_mapped(page))
rc = move_to_new_page(newpage, page);
if (rc)
remove_migration_ptes(page, page);
+ rcu_read_unlock();
unlock:
unlock_page(page);
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC] memory unplug v5 [1/6] migration by kernel
2007-06-15 9:43 ` KAMEZAWA Hiroyuki
@ 2007-06-15 9:53 ` KAMEZAWA Hiroyuki
2007-06-15 14:41 ` Christoph Lameter
1 sibling, 0 replies; 34+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-06-15 9:53 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: Mel Gorman, clameter, linux-mm, y-goto, hugh
Sorry...original comments were removed...
This is fixed version.
-Kame
==
page migration by kernel v6.
Changelog V5->V6
- removed dummy_vma and uses rcu_read_lock().
- removed page_mapped() check and uses !page->mapping check.
In usual, migrate_pages(page,,) is called with holding mm->sem by system call.
(mm here is a mm_struct which maps the migration target page.)
This semaphore helps avoiding some race conditions.
But, if we want to migrate a page by some kernel codes, we have to avoid
some races. This patch adds check code for following race condition.
1. A page which page->mapping==NULL can be target of migration. Then, we have
to check page->mapping before calling try_to_unmap().
2. anon_vma can be freed while page is unmapped, but page->mapping remains as
it was. We drop page->mapcount to be 0. Then we cannot trust page->mapping.
So, use rcu_read_lock() to prevent anon_vma pointed by page->mapping from
being freed during migration.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/migrate.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)
Index: devel-2.6.22-rc4-mm2/mm/migrate.c
===================================================================
--- devel-2.6.22-rc4-mm2.orig/mm/migrate.c
+++ devel-2.6.22-rc4-mm2/mm/migrate.c
@@ -632,16 +632,31 @@ static int unmap_and_move(new_page_t get
goto unlock;
wait_on_page_writeback(page);
}
-
/*
- * Establish migration ptes or remove ptes
+ * This is a corner case handling.
+ * When a new swap-ache is read into, it is linked to LRU
+ * and treated as swapcache but has no rmap yet.
+ * Calling try_to_unmap() against a page->mapping==NULL page is
+ * BUG. So handle it here.
*/
+ if (!page->mapping)
+ goto unlock;
+ /*
+ * By try_to_unmap(), page->mapcount goes down to 0 here. In this case,
+ * we cannot notice that anon_vma is freed while we migrates a pages
+ * This rcu_read_lock() delays freeing anon_vma pointer until the end
+ * of migration. File cache pages are no problem because of page_lock()
+ */
+ rcu_read_lock();
+ /* Establish migration ptes or remove ptes */
try_to_unmap(page, 1);
+
if (!page_mapped(page))
rc = move_to_new_page(newpage, page);
if (rc)
remove_migration_ptes(page, page);
+ rcu_read_unlock();
unlock:
unlock_page(page);
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC] memory unplug v5 [1/6] migration by kernel
2007-06-15 9:43 ` KAMEZAWA Hiroyuki
2007-06-15 9:53 ` KAMEZAWA Hiroyuki
@ 2007-06-15 14:41 ` Christoph Lameter
2007-06-15 15:36 ` KAMEZAWA Hiroyuki
1 sibling, 1 reply; 34+ messages in thread
From: Christoph Lameter @ 2007-06-15 14:41 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: Mel Gorman, linux-mm, y-goto, hugh
On Fri, 15 Jun 2007, KAMEZAWA Hiroyuki wrote:
> /*
> - * Establish migration ptes or remove ptes
> + * This is a corner case handling.
> + * When a new swap-ache is read into, it is linked to LRU
> + * and treated as swapcache but has no rmap yet.
> + * Calling try_to_unmap() against a page->mapping==NULL page is
> + * BUG. So handle it here.
> + */
> + if (!page->mapping)
> + goto unlock;
> + /*
> + * By try_to_unmap(), page->mapcount goes down to 0 here. In this case,
> + * we cannot notice that anon_vma is freed while we migrates a pages
> + * This rcu_read_lock() delays freeing anon_vma pointer until the end
> + * of migration. File cache pages are no problem because of page_lock()
> */
> + rcu_read_lock();
> try_to_unmap(page, 1);
page->mapping needs to be checked after rcu_read_lock. The mapping may be
removed and the anon_vma dropped after you checked page->mapping.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC] memory unplug v5 [1/6] migration by kernel
2007-06-15 14:41 ` Christoph Lameter
@ 2007-06-15 15:36 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 34+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-06-15 15:36 UTC (permalink / raw)
To: Christoph Lameter; +Cc: mel, linux-mm, y-goto, hugh
On Fri, 15 Jun 2007 07:41:42 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:
> On Fri, 15 Jun 2007, KAMEZAWA Hiroyuki wrote:
>
> > /*
> > - * Establish migration ptes or remove ptes
> > + * This is a corner case handling.
> > + * When a new swap-ache is read into, it is linked to LRU
> > + * and treated as swapcache but has no rmap yet.
> > + * Calling try_to_unmap() against a page->mapping==NULL page is
> > + * BUG. So handle it here.
> > + */
> > + if (!page->mapping)
> > + goto unlock;
> > + /*
> > + * By try_to_unmap(), page->mapcount goes down to 0 here. In this case,
> > + * we cannot notice that anon_vma is freed while we migrates a pages
> > + * This rcu_read_lock() delays freeing anon_vma pointer until the end
> > + * of migration. File cache pages are no problem because of page_lock()
> > */
> > + rcu_read_lock();
> > try_to_unmap(page, 1);
>
> page->mapping needs to be checked after rcu_read_lock. The mapping may be
> removed and the anon_vma dropped after you checked page->mapping.
>
page->mapping is not clearred when the kernel removing rmap (it will not be
cleared even if it is freed in my understanding)
...but your point seems reasonable. I'll fix it.
BTW, I'll not able to touch my box until next Friday.
Regards,
-Kame
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 34+ messages in thread
* [RFC] memory unplug v5 [2/6] isolate lru page race fix
2007-06-14 6:56 [RFC] memory unplug v5 [0/6] intro KAMEZAWA Hiroyuki
2007-06-14 6:59 ` [RFC] memory unplug v5 [1/6] migration by kernel KAMEZAWA Hiroyuki
@ 2007-06-14 7:00 ` KAMEZAWA Hiroyuki
2007-06-14 7:01 ` [RFC] memory unplug v5 [3/6] walk memory resources assist function KAMEZAWA Hiroyuki
` (3 subsequent siblings)
5 siblings, 0 replies; 34+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-06-14 7:00 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, mel, y-goto, clameter, hugh
release_pages() in mm/swap.c changes page_count() to be 0
without removing PageLRU flag...
This means isolate_lru_page() can see a page, PageLRU() && page_count(page)==0..
This is BUG. (get_page() will be called against count=0 page.)
Signed-Off-By: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/migrate.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
Index: devel-2.6.22-rc4-mm2/mm/migrate.c
===================================================================
--- devel-2.6.22-rc4-mm2.orig/mm/migrate.c
+++ devel-2.6.22-rc4-mm2/mm/migrate.c
@@ -49,9 +49,8 @@ int isolate_lru_page(struct page *page,
struct zone *zone = page_zone(page);
spin_lock_irq(&zone->lru_lock);
- if (PageLRU(page)) {
+ if (PageLRU(page) && get_page_unless_zero(page)) {
ret = 0;
- get_page(page);
ClearPageLRU(page);
if (PageActive(page))
del_page_from_active_list(zone, page);
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 34+ messages in thread* [RFC] memory unplug v5 [3/6] walk memory resources assist function.
2007-06-14 6:56 [RFC] memory unplug v5 [0/6] intro KAMEZAWA Hiroyuki
2007-06-14 6:59 ` [RFC] memory unplug v5 [1/6] migration by kernel KAMEZAWA Hiroyuki
2007-06-14 7:00 ` [RFC] memory unplug v5 [2/6] isolate lru page race fix KAMEZAWA Hiroyuki
@ 2007-06-14 7:01 ` KAMEZAWA Hiroyuki
2007-06-15 6:05 ` David Rientjes
2007-06-14 7:03 ` [RFC] memory unplug v5 [4/6] page isolation KAMEZAWA Hiroyuki
` (2 subsequent siblings)
5 siblings, 1 reply; 34+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-06-14 7:01 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, mel, y-goto, clameter, hugh
A clean up patch for "scanning memory resource [start, end)" operation.
Now, find_next_system_ram() function is used in memory hotplug, but this
interface is not easy to use and codes are complicated.
This patch adds walk_memory_resouce(start,len,arg,func) function.
The function 'func' is called per valid memory resouce range in [start,pfn).
Signed-Off-By: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
include/linux/ioport.h | 3 --
include/linux/memory_hotplug.h | 9 ++++++++
kernel/resource.c | 26 ++++++++++++++++++++++-
mm/memory_hotplug.c | 45 +++++++++++++++++------------------------
4 files changed, 53 insertions(+), 30 deletions(-)
Index: devel-2.6.22-rc4-mm2/kernel/resource.c
===================================================================
--- devel-2.6.22-rc4-mm2.orig/kernel/resource.c
+++ devel-2.6.22-rc4-mm2/kernel/resource.c
@@ -244,7 +244,7 @@ EXPORT_SYMBOL(release_resource);
* the caller must specify res->start, res->end, res->flags.
* If found, returns 0, res is overwritten, if not found, returns -1.
*/
-int find_next_system_ram(struct resource *res)
+static int find_next_system_ram(struct resource *res)
{
resource_size_t start, end;
struct resource *p;
@@ -277,6 +277,30 @@ int find_next_system_ram(struct resource
res->end = p->end;
return 0;
}
+
+int walk_memory_resource(unsigned long start_pfn, unsigned long nr_pages,
+ void *arg, walk_memory_callback_t func)
+{
+ struct resource res;
+ unsigned long pfn, len;
+ u64 orig_end;
+ int ret;
+ res.start = (u64) start_pfn << PAGE_SHIFT;
+ res.end = ((u64)(start_pfn + nr_pages) << PAGE_SHIFT) - 1;
+ res.flags = IORESOURCE_MEM;
+ orig_end = res.end;
+ while ((res.start < res.end) && (find_next_system_ram(&res) >= 0)) {
+ pfn = (unsigned long)(res.start >> PAGE_SHIFT);
+ len = (unsigned long)(res.end + 1 - res.start) >> PAGE_SHIFT;
+ ret = (*func)(pfn, len, arg);
+ if (ret)
+ break;
+ res.start = res.end + 1;
+ res.end = orig_end;
+ }
+ return ret;
+}
+
#endif
/*
Index: devel-2.6.22-rc4-mm2/include/linux/ioport.h
===================================================================
--- devel-2.6.22-rc4-mm2.orig/include/linux/ioport.h
+++ devel-2.6.22-rc4-mm2/include/linux/ioport.h
@@ -110,9 +110,6 @@ extern int allocate_resource(struct reso
int adjust_resource(struct resource *res, resource_size_t start,
resource_size_t size);
-/* get registered SYSTEM_RAM resources in specified area */
-extern int find_next_system_ram(struct resource *res);
-
/* Convenience shorthand with allocation */
#define request_region(start,n,name) __request_region(&ioport_resource, (start), (n), (name))
#define request_mem_region(start,n,name) __request_region(&iomem_resource, (start), (n), (name))
Index: devel-2.6.22-rc4-mm2/include/linux/memory_hotplug.h
===================================================================
--- devel-2.6.22-rc4-mm2.orig/include/linux/memory_hotplug.h
+++ devel-2.6.22-rc4-mm2/include/linux/memory_hotplug.h
@@ -64,6 +64,15 @@ extern int online_pages(unsigned long, u
extern int __add_pages(struct zone *zone, unsigned long start_pfn,
unsigned long nr_pages);
+/*
+ * Walk thorugh all memory which is registered as resource.
+ * arg is (start_pfn, nr_pages, private_arg_pointer)
+ */
+typedef int (*walk_memory_callback_t)(unsigned long, unsigned long, void *);
+extern int walk_memory_resource(unsigned long start_pfn,
+ unsigned long nr_pages,
+ void *arg, walk_memory_callback_t func);
+
#ifdef CONFIG_NUMA
extern int memory_add_physaddr_to_nid(u64 start);
#else
Index: devel-2.6.22-rc4-mm2/mm/memory_hotplug.c
===================================================================
--- devel-2.6.22-rc4-mm2.orig/mm/memory_hotplug.c
+++ devel-2.6.22-rc4-mm2/mm/memory_hotplug.c
@@ -161,14 +161,27 @@ static void grow_pgdat_span(struct pglis
pgdat->node_start_pfn;
}
-int online_pages(unsigned long pfn, unsigned long nr_pages)
+static int online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
+ void *arg)
{
unsigned long i;
+ unsigned long onlined_pages = *(unsigned long *)arg;
+ struct page *page;
+ if (PageReserved(pfn_to_page(start_pfn)))
+ for (i = 0; i < nr_pages; i++) {
+ page = pfn_to_page(start_pfn + i);
+ online_page(page);
+ onlined_pages++;
+ }
+ *(unsigned long *)arg = onlined_pages;
+ return 0;
+}
+
+
+int online_pages(unsigned long pfn, unsigned long nr_pages)
+{
unsigned long flags;
unsigned long onlined_pages = 0;
- struct resource res;
- u64 section_end;
- unsigned long start_pfn;
struct zone *zone;
int need_zonelists_rebuild = 0;
@@ -191,28 +204,8 @@ int online_pages(unsigned long pfn, unsi
if (!populated_zone(zone))
need_zonelists_rebuild = 1;
- res.start = (u64)pfn << PAGE_SHIFT;
- res.end = res.start + ((u64)nr_pages << PAGE_SHIFT) - 1;
- res.flags = IORESOURCE_MEM; /* we just need system ram */
- section_end = res.end;
-
- while ((res.start < res.end) && (find_next_system_ram(&res) >= 0)) {
- start_pfn = (unsigned long)(res.start >> PAGE_SHIFT);
- nr_pages = (unsigned long)
- ((res.end + 1 - res.start) >> PAGE_SHIFT);
-
- if (PageReserved(pfn_to_page(start_pfn))) {
- /* this region's page is not onlined now */
- for (i = 0; i < nr_pages; i++) {
- struct page *page = pfn_to_page(start_pfn + i);
- online_page(page);
- onlined_pages++;
- }
- }
-
- res.start = res.end + 1;
- res.end = section_end;
- }
+ walk_memory_resource(pfn, nr_pages, &onlined_pages,
+ online_pages_range);
zone->present_pages += onlined_pages;
zone->zone_pgdat->node_present_pages += onlined_pages;
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 34+ messages in thread* Re: [RFC] memory unplug v5 [3/6] walk memory resources assist function.
2007-06-14 7:01 ` [RFC] memory unplug v5 [3/6] walk memory resources assist function KAMEZAWA Hiroyuki
@ 2007-06-15 6:05 ` David Rientjes
2007-06-15 6:11 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 34+ messages in thread
From: David Rientjes @ 2007-06-15 6:05 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, mel, y-goto, clameter, hugh
On Thu, 14 Jun 2007, KAMEZAWA Hiroyuki wrote:
> Index: devel-2.6.22-rc4-mm2/kernel/resource.c
> ===================================================================
> --- devel-2.6.22-rc4-mm2.orig/kernel/resource.c
> +++ devel-2.6.22-rc4-mm2/kernel/resource.c
> @@ -244,7 +244,7 @@ EXPORT_SYMBOL(release_resource);
> * the caller must specify res->start, res->end, res->flags.
> * If found, returns 0, res is overwritten, if not found, returns -1.
> */
> -int find_next_system_ram(struct resource *res)
> +static int find_next_system_ram(struct resource *res)
> {
> resource_size_t start, end;
> struct resource *p;
> @@ -277,6 +277,30 @@ int find_next_system_ram(struct resource
> res->end = p->end;
> return 0;
> }
> +
> +int walk_memory_resource(unsigned long start_pfn, unsigned long nr_pages,
> + void *arg, walk_memory_callback_t func)
> +{
> + struct resource res;
> + unsigned long pfn, len;
> + u64 orig_end;
> + int ret;
> + res.start = (u64) start_pfn << PAGE_SHIFT;
> + res.end = ((u64)(start_pfn + nr_pages) << PAGE_SHIFT) - 1;
> + res.flags = IORESOURCE_MEM;
> + orig_end = res.end;
> + while ((res.start < res.end) && (find_next_system_ram(&res) >= 0)) {
> + pfn = (unsigned long)(res.start >> PAGE_SHIFT);
> + len = (unsigned long)(res.end + 1 - res.start) >> PAGE_SHIFT;
This needs to be
len = (unsigned long)((res.end + 1 - res.start) >> PAGE_SHIFT);
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 34+ messages in thread* Re: [RFC] memory unplug v5 [3/6] walk memory resources assist function.
2007-06-15 6:05 ` David Rientjes
@ 2007-06-15 6:11 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 34+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-06-15 6:11 UTC (permalink / raw)
To: David Rientjes; +Cc: linux-mm, mel, y-goto, clameter, hugh
On Thu, 14 Jun 2007 23:05:22 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:
> > + len = (unsigned long)(res.end + 1 - res.start) >> PAGE_SHIFT;
>
> This needs to be
>
> len = (unsigned long)((res.end + 1 - res.start) >> PAGE_SHIFT);
>
Okay, thank you for review. will fix.
-Kame
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 34+ messages in thread
* [RFC] memory unplug v5 [4/6] page isolation
2007-06-14 6:56 [RFC] memory unplug v5 [0/6] intro KAMEZAWA Hiroyuki
` (2 preceding siblings ...)
2007-06-14 7:01 ` [RFC] memory unplug v5 [3/6] walk memory resources assist function KAMEZAWA Hiroyuki
@ 2007-06-14 7:03 ` KAMEZAWA Hiroyuki
2007-06-15 15:46 ` Dave Hansen
2007-06-14 7:04 ` [RFC] memory unplug v5 [5/6] page unplug KAMEZAWA Hiroyuki
2007-06-14 7:06 ` [RFC] memory unplug v5 [6/6] ia64 interface KAMEZAWA Hiroyuki
5 siblings, 1 reply; 34+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-06-14 7:03 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, mel, y-goto, clameter, hugh
Implement generic chunk-of-pages isolation method by using page grouping ops.
This patch add MIGRATE_ISOLATE to MIGRATE_TYPES. By this
- MIGRATE_TYPES increases.
- bitmap for migratetype is enlarged.
pages of MIGRATE_ISOLATE migratetype will not be allocated even if it is free.
By this, you can isolated *freed* pages from users. How-to-free pages is not
a purpose of this patch. You may use reclaim and migrate codes to free pages.
If start_isolate_page_range(start,end) is called,
- migratetype of the range turns to be MIGRATE_ISOLATE if
its type is MIGRATE_MOVABLE. (*) this check can be updated if other
memory reclaiming works make progress.
- MIGRATE_ISOLATE is not on migratetype fallback list.
- All free pages and will-be-freed pages are isolated.
To check all pages in the range are isolated or not, use test_pages_isolated(),
To cancel isolation, use undo_isolate_page_range().
Changes V4 -> V5
- tried to simplify....
- removed alignment adjustmetns.added alignment check instead.
calles must guarantee it.
- test_page_isolated() is available just for range of pages [start,end) now.
- using pageblock_order instread of MAX_ORDER
There are HOLES_IN_ZONE handling codes...I'm glad if we can remove them..
Signed-Off-By: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-Off-By: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
include/linux/mmzone.h | 3
include/linux/page-isolation.h | 37 ++++++++++
include/linux/pageblock-flags.h | 2
mm/Makefile | 2
mm/page_alloc.c | 44 +++++++++++
mm/page_isolation.c | 148 ++++++++++++++++++++++++++++++++++++++++
6 files changed, 233 insertions(+), 3 deletions(-)
Index: devel-2.6.22-rc4-mm2/include/linux/mmzone.h
===================================================================
--- devel-2.6.22-rc4-mm2.orig/include/linux/mmzone.h
+++ devel-2.6.22-rc4-mm2/include/linux/mmzone.h
@@ -39,7 +39,8 @@ extern int page_group_by_mobility_disabl
#define MIGRATE_RECLAIMABLE 1
#define MIGRATE_MOVABLE 2
#define MIGRATE_RESERVE 3
-#define MIGRATE_TYPES 4
+#define MIGRATE_ISOLATE 4 /* can't allocate from here */
+#define MIGRATE_TYPES 5
#define for_each_migratetype_order(order, type) \
for (order = 0; order < MAX_ORDER; order++) \
Index: devel-2.6.22-rc4-mm2/include/linux/pageblock-flags.h
===================================================================
--- devel-2.6.22-rc4-mm2.orig/include/linux/pageblock-flags.h
+++ devel-2.6.22-rc4-mm2/include/linux/pageblock-flags.h
@@ -31,7 +31,7 @@
/* Bit indices that affect a whole block of pages */
enum pageblock_bits {
- PB_range(PB_migrate, 2), /* 2 bits required for migrate types */
+ PB_range(PB_migrate, 3), /* 3 bits required for migrate types */
NR_PAGEBLOCK_BITS
};
Index: devel-2.6.22-rc4-mm2/mm/page_alloc.c
===================================================================
--- devel-2.6.22-rc4-mm2.orig/mm/page_alloc.c
+++ devel-2.6.22-rc4-mm2/mm/page_alloc.c
@@ -41,6 +41,7 @@
#include <linux/pfn.h>
#include <linux/backing-dev.h>
#include <linux/fault-inject.h>
+#include <linux/page-isolation.h>
#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -4409,3 +4410,46 @@ void set_pageblock_flags_group(struct pa
else
__clear_bit(bitidx + start_bitidx, bitmap);
}
+
+/*
+ * This is designed as sub function...plz see page_isolation.c also.
+ * set/clear page block's type to be ISOLATE.
+ * page allocater never alloc memory from ISOLATE block.
+ */
+
+int set_migratetype_isolate(struct page *page)
+{
+ struct zone *zone;
+ unsigned long flags;
+ int ret = -EBUSY;
+
+ zone = page_zone(page);
+ spin_lock_irqsave(&zone->lock, flags);
+ /*
+ * In future, more migrate types will be able to be isolation target.
+ */
+ if (get_pageblock_migratetype(page) != MIGRATE_MOVABLE)
+ goto out;
+ set_pageblock_migratetype(page, MIGRATE_ISOLATE);
+ move_freepages_block(zone, page, MIGRATE_ISOLATE);
+ ret = 0;
+out:
+ spin_unlock_irqrestore(&zone->lock, flags);
+ if (!ret)
+ drain_all_local_pages();
+ return ret;
+}
+
+void unset_migratetype_isolate(struct page *page)
+{
+ struct zone *zone;
+ unsigned long flags;
+ zone = page_zone(page);
+ spin_lock_irqsave(&zone->lock, flags);
+ if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE)
+ goto out;
+ set_pageblock_migratetype(page, MIGRATE_MOVABLE);
+ move_freepages_block(zone, page, MIGRATE_MOVABLE);
+out:
+ spin_unlock_irqrestore(&zone->lock, flags);
+}
Index: devel-2.6.22-rc4-mm2/mm/page_isolation.c
===================================================================
--- /dev/null
+++ devel-2.6.22-rc4-mm2/mm/page_isolation.c
@@ -0,0 +1,148 @@
+/*
+ * linux/mm/page_isolation.c
+ */
+
+#include <stddef.h>
+#include <linux/mm.h>
+#include <linux/page-isolation.h>
+#include <linux/pageblock-flags.h>
+#include "internal.h"
+
+#ifdef CONFIG_HOLES_IN_ZONE
+static inline struct page *
+__first_valid_page(unsigned long pfn, unsigned long nr_page)
+{
+ int i;
+ struct page *page;
+ for (i = 0; i < nr_page; i++)
+ if (pfn_valid_within(pfn + i))
+ break;
+ if (unlikely(i == nr_pages))
+ return NULL;
+ return pfn_to_page(pfn + i);
+}
+#else
+static inline struct page *
+__first_valid_page(unsigned long pfn, unsigned long nr_page)
+{
+ return pfn_to_page(pfn);
+}
+#endif
+
+
+/*
+ * start_isolate_page_range() -- make page-allocation-type of range of pages
+ * to be MIGRATE_ISOLATE.
+ * @start_pfn: The lower PFN of the range to be isolated.
+ * @end_pfn: The upper PFN of the range to be isolated.
+ *
+ * Making page-allocation-type to be MIGRATE_ISOLATE means free pages in
+ * the range will never be allocated. Any free pages and pages freed in the
+ * future will not be allocated again.
+ *
+ * start_pfn/end_pfn must be aligned to pageblock_order.
+ * Returns 0 on success and -EBUSY if any part of range cannot be isolated.
+ */
+int
+start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn)
+{
+ unsigned long pfn;
+ unsigned long undo_pfn;
+ struct page *page;
+
+ BUG_ON((start_pfn) & (pageblock_nr_pages - 1));
+ BUG_ON((end_pfn) & (pageblock_nr_pages - 1));
+
+ for (pfn = start_pfn;
+ pfn < end_pfn;
+ pfn += pageblock_nr_pages) {
+ page = __first_valid_page(pfn, pageblock_nr_pages);
+ if (page && set_migratetype_isolate(page)) {
+ undo_pfn = pfn;
+ goto undo;
+ }
+ }
+ return 0;
+undo:
+ for (pfn = start_pfn;
+ pfn <= undo_pfn;
+ pfn += pageblock_nr_pages)
+ unset_migratetype_isolate(pfn_to_page(pfn));
+
+ return -EBUSY;
+}
+
+/*
+ * Make isolated pages available again.
+ */
+int
+undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn)
+{
+ unsigned long pfn;
+ struct page *page;
+ BUG_ON((start_pfn) & (pageblock_nr_pages - 1));
+ BUG_ON((end_pfn) & (pageblock_nr_pages - 1));
+ for (pfn = start_pfn;
+ pfn < end_pfn;
+ pfn += pageblock_nr_pages) {
+ page = __first_valid_page(pfn, pageblock_nr_pages);
+ if (!page || get_pageblock_flags(page) != MIGRATE_ISOLATE)
+ continue;
+ unset_migratetype_isolate(page);
+ }
+ return 0;
+}
+/*
+ * Test all pages in the range is free(means isolated) or not.
+ * all pages in [start_pfn...end_pfn) must be in the same zone.
+ * zone->lock must be held before call this.
+ *
+ * Returns 0 if all pages in the range is isolated.
+ */
+static int
+__test_page_isolated_in_pageblock(unsigned long pfn, unsigned long end_pfn)
+{
+ struct page *page;
+
+ while (pfn < end_pfn) {
+ if (!pfn_valid_within(pfn)) {
+ pfn++;
+ continue;
+ }
+ page = pfn_to_page(pfn);
+ if (PageBuddy(page))
+ pfn += 1 << page_order(page);
+ else if (page_count(page) == 0 &&
+ page_private(page) == MIGRATE_ISOLATE)
+ pfn += 1;
+ else
+ break;
+ }
+ if (pfn < end_pfn)
+ return 0;
+ return 1;
+}
+
+int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn)
+{
+ unsigned long pfn;
+ struct page *page;
+
+ pfn = start_pfn;
+ /*
+ * Note: pageblock_nr_page != MAX_ORDER. Then, chunks of free page
+ * is not aligned to pageblock_nr_pages.
+ * Then we just check pagetype fist.
+ */
+ for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
+ page = __first_valid_page(pfn, pageblock_nr_pages);
+ if (page && get_pageblock_flags(page) != MIGRATE_ISOLATE)
+ break;
+ }
+ if (pfn < end_pfn)
+ return -EBUSY;
+ /* Check all pages are free or Marked as ISOLATED */
+ if (__test_page_isolated_in_pageblock(start_pfn, end_pfn))
+ return 0;
+ return -EBUSY;
+}
Index: devel-2.6.22-rc4-mm2/include/linux/page-isolation.h
===================================================================
--- /dev/null
+++ devel-2.6.22-rc4-mm2/include/linux/page-isolation.h
@@ -0,0 +1,37 @@
+#ifndef __LINUX_PAGEISOLATION_H
+#define __LINUX_PAGEISOLATION_H
+
+/*
+ * Changes migrate type in [start_pfn, end_pfn) to be MIGRATE_ISOLATE.
+ * If specified range includes migrate types other than MOVABLE,
+ * this will fail with -EBUSY.
+ *
+ * For isolating all pages in the range finally, the caller have to
+ * free all pages in the range. test_page_isolated() can be used for
+ * test it.
+ */
+extern int
+start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn);
+
+/*
+ * Changes MIGRATE_ISOLATE to MIGRATE_MOVABLE.
+ * target range is [start_pfn, end_pfn)
+ */
+extern int
+undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn);
+
+/*
+ * test all pages in [start_pfn, end_pfn)are isolated or not.
+ */
+extern int
+test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn);
+
+/*
+ * Internal funcs.Changes pageblock's migrate type.
+ * Please use make_pagetype_isolated()/make_pagetype_movable().
+ */
+extern int set_migratetype_isolate(struct page *page);
+extern void unset_migratetype_isolate(struct page *page);
+
+
+#endif
Index: devel-2.6.22-rc4-mm2/mm/Makefile
===================================================================
--- devel-2.6.22-rc4-mm2.orig/mm/Makefile
+++ devel-2.6.22-rc4-mm2/mm/Makefile
@@ -11,7 +11,7 @@ obj-y := bootmem.o filemap.o mempool.o
page_alloc.o page-writeback.o pdflush.o \
readahead.o swap.o truncate.o vmscan.o \
prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
- $(mmu-y)
+ page_isolation.o $(mmu-y)
obj-$(CONFIG_BOUNCE) += bounce.o
obj-$(CONFIG_SWAP) += page_io.o swap_state.o swapfile.o thrash.o
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 34+ messages in thread* Re: [RFC] memory unplug v5 [4/6] page isolation
2007-06-14 7:03 ` [RFC] memory unplug v5 [4/6] page isolation KAMEZAWA Hiroyuki
@ 2007-06-15 15:46 ` Dave Hansen
2007-06-15 16:59 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 34+ messages in thread
From: Dave Hansen @ 2007-06-15 15:46 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, mel, y-goto, clameter, hugh
On Thu, 2007-06-14 at 16:03 +0900, KAMEZAWA Hiroyuki wrote:
> +#ifdef CONFIG_HOLES_IN_ZONE
> +static inline struct page *
> +__first_valid_page(unsigned long pfn, unsigned long nr_page)
> +{
> + int i;
> + struct page *page;
> + for (i = 0; i < nr_page; i++)
> + if (pfn_valid_within(pfn + i))
> + break;
> + if (unlikely(i == nr_pages))
> + return NULL;
> + return pfn_to_page(pfn + i);
> +}
> +#else
> +static inline struct page *
> +__first_valid_page(unsigned long pfn, unsigned long nr_page)
> +{
> + return pfn_to_page(pfn);
> +}
> +#endif
I think this entire #ifdef is unneeded. pfn_valid_within() will be
#defined to 1 if CONFIG_HOLES_IN_ZONE=n, so that function will come out
looking like this:
+__first_valid_page(unsigned long pfn, unsigned long nr_page)
> +{
> + int i;
> + struct page *page;
> + for (i = 0; i < nr_page; i++)
> + if (1)
> + break;
> + if (unlikely(i == nr_pages))
> + return NULL;
> + return pfn_to_page(pfn + i);
> +}
I think the compiler can optimize that. :)
-- Dave
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 34+ messages in thread* Re: [RFC] memory unplug v5 [4/6] page isolation
2007-06-15 15:46 ` Dave Hansen
@ 2007-06-15 16:59 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 34+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-06-15 16:59 UTC (permalink / raw)
To: Dave Hansen; +Cc: linux-mm, mel, y-goto, clameter, hugh
On Fri, 15 Jun 2007 08:46:45 -0700
Dave Hansen <hansendc@us.ibm.com> wrote:
> +__first_valid_page(unsigned long pfn, unsigned long nr_page)
> > +{
> > + int i;
> > + struct page *page;
> > + for (i = 0; i < nr_page; i++)
> > + if (1)
> > + break;
> > + if (unlikely(i == nr_pages))
> > + return NULL;
> > + return pfn_to_page(pfn + i);
> > +}
>
> I think the compiler can optimize that. :)
>
ok, I'll take your advice.
Thank you.
-Kame
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 34+ messages in thread
* [RFC] memory unplug v5 [5/6] page unplug
2007-06-14 6:56 [RFC] memory unplug v5 [0/6] intro KAMEZAWA Hiroyuki
` (3 preceding siblings ...)
2007-06-14 7:03 ` [RFC] memory unplug v5 [4/6] page isolation KAMEZAWA Hiroyuki
@ 2007-06-14 7:04 ` KAMEZAWA Hiroyuki
2007-06-15 6:04 ` David Rientjes
2007-06-15 15:52 ` Dave Hansen
2007-06-14 7:06 ` [RFC] memory unplug v5 [6/6] ia64 interface KAMEZAWA Hiroyuki
5 siblings, 2 replies; 34+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-06-14 7:04 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, mel, y-goto, clameter, hugh
Changes V4->V5
- adjusted to meet changes in lower patch stack.
Logic.
- set all pages in [start,end) as isolated migration-type.
by this, all free pages in the range will be not-for-use.
- Migrate all LRU pages in the range.
- Test all pages in the range's refcnt is zero or not.
Todo:
- allocate migration destination page from better area.
- confirm page_count(page)== 0 && PageReserved(page) page is safe to be freed..
(I don't like this kind of page but..
- Find out pages which cannot be migrated.
- more running tests.
- need more comments.
- Use reclaim for unplugging other memory type area.
Signed-Off-By: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-Off-By: Yasunori Goto <y-goto@jp.fujitsu.com>
---
include/linux/memory_hotplug.h | 5
mm/Kconfig | 5
mm/memory_hotplug.c | 256 +++++++++++++++++++++++++++++++++++++++++
mm/page_alloc.c | 48 +++++++
4 files changed, 313 insertions(+), 1 deletion(-)
Index: devel-2.6.22-rc4-mm2/mm/Kconfig
===================================================================
--- devel-2.6.22-rc4-mm2.orig/mm/Kconfig
+++ devel-2.6.22-rc4-mm2/mm/Kconfig
@@ -126,6 +126,11 @@ config MEMORY_HOTPLUG_SPARSE
def_bool y
depends on SPARSEMEM && MEMORY_HOTPLUG
+config MEMORY_HOTREMOVE
+ bool "Allow for memory hot remove"
+ depends on MEMORY_HOTPLUG
+ depends on MIGRATION
+
# Heavily threaded applications may benefit from splitting the mm-wide
# page_table_lock, so that faults on different parts of the user address
# space can be handled with less contention: split it at this NR_CPUS.
Index: devel-2.6.22-rc4-mm2/mm/memory_hotplug.c
===================================================================
--- devel-2.6.22-rc4-mm2.orig/mm/memory_hotplug.c
+++ devel-2.6.22-rc4-mm2/mm/memory_hotplug.c
@@ -23,6 +23,9 @@
#include <linux/vmalloc.h>
#include <linux/ioport.h>
#include <linux/cpuset.h>
+#include <linux/delay.h>
+#include <linux/migrate.h>
+#include <linux/page-isolation.h>
#include <asm/tlbflush.h>
@@ -301,3 +304,256 @@ error:
return ret;
}
EXPORT_SYMBOL_GPL(add_memory);
+
+#ifdef CONFIG_MEMORY_HOTREMOVE
+/*
+ * Confirm all pages in a range [start, end) is belongs to the same zone.
+ */
+static int test_pages_in_a_zone(unsigned long start_pfn, unsigned long end_pfn)
+{
+ unsigned long pfn;
+ struct zone *zone = NULL;
+ struct page *page;
+ for (pfn = start_pfn;
+ pfn < end_pfn;
+ pfn += MAX_ORDER_NR_PAGES) {
+#ifdef CONFIG_HOLES_IN_ZONE
+ int i;
+ for (i = 0; i < MAX_ORDER_NR_PAGES; i++) {
+ if (pfn_valid_within(pfn + i))
+ break;
+ }
+ if (i == MAX_ORDER_NR_PAGES)
+ continue;
+ page = pfn_to_page(pfn + i);
+#else
+ page = pfn_to_page(pfn);
+#endif
+ if (zone && page_zone(page) != zone)
+ return 0;
+ zone = page_zone(page);
+ }
+ return 1;
+}
+
+/*
+ * Scanning pfn is much easier than scanning lru list.
+ * Scan pfn from start to end and Find LRU page.
+ */
+int scan_lru_pages(unsigned long start, unsigned long end)
+{
+ unsigned long pfn;
+ struct page *page;
+ for (pfn = start; pfn < end; pfn++) {
+ if (pfn_valid(pfn)) {
+ page = pfn_to_page(pfn);
+ if (PageLRU(page))
+ return pfn;
+ }
+ }
+ return 0;
+}
+
+static struct page *
+hotremove_migrate_alloc(struct page *page,
+ unsigned long private,
+ int **x)
+{
+ /* This should be improoooooved!! */
+ return alloc_page(GFP_HIGHUSER_PAGECACHE);
+}
+
+
+#define NR_OFFLINE_AT_ONCE_PAGES (256)
+static int
+do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
+{
+ unsigned long pfn;
+ struct page *page;
+ int move_pages = NR_OFFLINE_AT_ONCE_PAGES;
+ int not_managed = 0;
+ int ret = 0;
+ LIST_HEAD(source);
+
+ for (pfn = start_pfn; pfn < end_pfn && move_pages > 0; pfn++) {
+ if (!pfn_valid(pfn))
+ continue;
+ page = pfn_to_page(pfn);
+ if (!page_count(page))
+ continue;
+ /*
+ * We can skip free pages. And we can only deal with pages on
+ * LRU.
+ */
+ ret = isolate_lru_page(page, &source);
+ if (!ret) { /* Success */
+ move_pages--;
+ } else {
+ /* Becasue we don't have big zone->lock. we should
+ check this again here. */
+ if (page_count(page))
+ not_managed++;
+#ifdef CONFIG_DEBUG_VM
+ printk("Not Migratable page found %lx/%d/%lx\n",
+ pfn, page_count(page), page->flags);
+#endif
+ }
+ }
+ ret = -EBUSY;
+ if (not_managed) {
+ if (!list_empty(&source))
+ putback_lru_pages(&source);
+ goto out;
+ }
+ ret = 0;
+ if (list_empty(&source))
+ goto out;
+ /* this function returns # of failed pages */
+ ret = migrate_pages(&source, hotremove_migrate_alloc, 0);
+
+out:
+ return ret;
+}
+
+/*
+ * remove from free_area[] and mark all as Reserved.
+ */
+static int
+offline_isolated_pages_cb(unsigned long start, unsigned long nr_pages,
+ void *data)
+{
+ __offline_isolated_pages(start, start + nr_pages);
+ return 0;
+}
+
+static void
+offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
+{
+ walk_memory_resource(start_pfn, end_pfn - start_pfn, NULL,
+ offline_isolated_pages_cb);
+}
+
+/*
+ * Check all pages in range, recoreded as memory resource, are isolated.
+ */
+static int
+check_pages_isolated_cb(unsigned long start_pfn, unsigned long nr_pages,
+ void *data)
+{
+ int ret;
+ long offlined = *(long*)data;
+ ret = test_pages_isolated(start_pfn, start_pfn + nr_pages);
+ offlined = nr_pages;
+ if (!ret)
+ *(long*)data += offlined;
+ return ret;
+}
+
+static long
+check_pages_isolated(unsigned long start_pfn, unsigned long end_pfn)
+{
+ long offlined = 0;
+ int ret;
+
+ ret = walk_memory_resource(start_pfn, end_pfn - start_pfn, &offlined,
+ check_pages_isolated_cb);
+ if (ret < 0)
+ offlined = (long)ret;
+ return offlined;
+}
+
+extern void drain_all_local_pages(void);
+
+int offline_pages(unsigned long start_pfn,
+ unsigned long end_pfn, unsigned long timeout)
+{
+ unsigned long pfn, nr_pages, expire;
+ long offlined_pages;
+ int ret, drain, retry_max;
+ struct zone *zone;
+
+ BUG_ON(start_pfn >= end_pfn);
+ /* at least, alignment against pageblock is necessary */
+ if (start_pfn & (pageblock_nr_pages - 1))
+ return -EINVAL;
+ if (end_pfn & (pageblock_nr_pages - 1))
+ return -EINVAL;
+ /* This makes hotplug much easier...and readable.
+ we assume this for now. .*/
+ if (test_pages_in_a_zone(start_pfn, end_pfn))
+ return -EINVAL;
+ /* set above range as isolated */
+ ret = start_isolate_page_range(start_pfn, end_pfn);
+ if (ret)
+ return ret;
+ nr_pages = end_pfn - start_pfn;
+ pfn = start_pfn;
+ expire = jiffies + timeout;
+ drain = 0;
+ retry_max = 5;
+repeat:
+ /* start memory hot removal */
+ ret = -EAGAIN;
+ if (time_after(jiffies, expire))
+ goto failed_removal;
+ ret = -EINTR;
+ if (signal_pending(current))
+ goto failed_removal;
+ ret = 0;
+ if (drain) {
+ lru_add_drain_all();
+ flush_scheduled_work();
+ cond_resched();
+ drain_all_local_pages();
+ }
+
+ pfn = scan_lru_pages(start_pfn, end_pfn);
+ if (pfn) { /* We have page on LRU */
+ ret = do_migrate_range(pfn, end_pfn);
+ if (!ret) {
+ drain = 1;
+ goto repeat;
+ } else {
+ if (ret < 0)
+ if (--retry_max == 0)
+ goto failed_removal;
+ yield();
+ drain = 1;
+ goto repeat;
+ }
+ }
+ /* drain all zone's lru pagevec, this is asyncronous... */
+ lru_add_drain_all();
+ flush_scheduled_work();
+ yield();
+ /* drain pcp pages , this is synchrouns. */
+ drain_all_local_pages();
+ /* check again */
+ offlined_pages = check_pages_isolated(start_pfn, end_pfn);
+ if (offlined_pages < 0) {
+ ret = -EBUSY;
+ goto failed_removal;
+ }
+ printk("Offlined Pages %ld\n",offlined_pages);
+ /* Ok, all of our target is islaoted.
+ We cannot do rollback at this point. */
+ offline_isolated_pages(start_pfn, end_pfn);
+ /* reset pagetype flags */
+ start_isolate_page_range(start_pfn, end_pfn);
+ /* removal success */
+ zone = page_zone(pfn_to_page(start_pfn));
+ zone->present_pages -= offlined_pages;
+ zone->zone_pgdat->node_present_pages -= offlined_pages;
+ totalram_pages -= offlined_pages;
+ num_physpages -= offlined_pages;
+ vm_total_pages = nr_free_pagecache_pages();
+ writeback_set_ratelimit();
+ return 0;
+
+failed_removal:
+ printk("memory offlining %lx to %lx failed\n",start_pfn, end_pfn);
+ /* pushback to free area */
+ undo_isolate_page_range(start_pfn, end_pfn);
+ return ret;
+}
+#endif /* CONFIG_MEMORY_HOTREMOVE */
Index: devel-2.6.22-rc4-mm2/include/linux/memory_hotplug.h
===================================================================
--- devel-2.6.22-rc4-mm2.orig/include/linux/memory_hotplug.h
+++ devel-2.6.22-rc4-mm2/include/linux/memory_hotplug.h
@@ -59,7 +59,10 @@ extern int add_one_highpage(struct page
extern void online_page(struct page *page);
/* VM interface that may be used by firmware interface */
extern int online_pages(unsigned long, unsigned long);
-
+#ifdef CONFIG_MEMORY_HOTREMOVE
+extern int offline_pages(unsigned long, unsigned long, unsigned long);
+extern void __offline_isolated_pages(unsigned long, unsigned long);
+#endif
/* reasonably generic interface to expand the physical pages in a zone */
extern int __add_pages(struct zone *zone, unsigned long start_pfn,
unsigned long nr_pages);
Index: devel-2.6.22-rc4-mm2/mm/page_alloc.c
===================================================================
--- devel-2.6.22-rc4-mm2.orig/mm/page_alloc.c
+++ devel-2.6.22-rc4-mm2/mm/page_alloc.c
@@ -4453,3 +4453,51 @@ void unset_migratetype_isolate(struct pa
out:
spin_unlock_irqrestore(&zone->lock, flags);
}
+
+#ifdef CONFIG_MEMORY_HOTREMOVE
+/*
+ * All pages in the range must be isolated before calling this.
+ */
+void
+__offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
+{
+ struct page *page;
+ struct zone *zone;
+ int order, i;
+ unsigned long pfn;
+ unsigned long flags;
+ /* find the first valid pfn */
+ for (pfn = start_pfn; pfn < end_pfn; pfn++)
+ if (pfn_valid(pfn))
+ break;
+ if (pfn == end_pfn)
+ return;
+ zone = page_zone(pfn_to_page(pfn));
+ spin_lock_irqsave(&zone->lock, flags);
+ printk("do offline \n");
+ pfn = start_pfn;
+ while (pfn < end_pfn) {
+ if (!pfn_valid(pfn)) {
+ pfn++;
+ continue;
+ }
+ page = pfn_to_page(pfn);
+ BUG_ON(page_count(page));
+ BUG_ON(!PageBuddy(page));
+ order = page_order(page);
+#ifdef CONFIG_DEBUG_VM
+ printk("remove from free list %lx %d %lx\n",
+ pfn, 1 << order, end_pfn);
+#endif
+ list_del(&page->lru);
+ rmv_page_order(page);
+ zone->free_area[order].nr_free--;
+ __mod_zone_page_state(zone, NR_FREE_PAGES,
+ - (1UL << order));
+ for (i = 0; i < (1 << order); i++)
+ SetPageReserved((page+i));
+ pfn += (1 << order);
+ }
+ spin_unlock_irqrestore(&zone->lock,flags);
+}
+#endif
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 34+ messages in thread* Re: [RFC] memory unplug v5 [5/6] page unplug
2007-06-14 7:04 ` [RFC] memory unplug v5 [5/6] page unplug KAMEZAWA Hiroyuki
@ 2007-06-15 6:04 ` David Rientjes
2007-06-15 6:12 ` KAMEZAWA Hiroyuki
2007-06-15 14:35 ` Christoph Lameter
2007-06-15 15:52 ` Dave Hansen
1 sibling, 2 replies; 34+ messages in thread
From: David Rientjes @ 2007-06-15 6:04 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, mel, y-goto, clameter, hugh
On Thu, 14 Jun 2007, KAMEZAWA Hiroyuki wrote:
> Index: devel-2.6.22-rc4-mm2/mm/memory_hotplug.c
> ===================================================================
> --- devel-2.6.22-rc4-mm2.orig/mm/memory_hotplug.c
> +++ devel-2.6.22-rc4-mm2/mm/memory_hotplug.c
> @@ -23,6 +23,9 @@
> #include <linux/vmalloc.h>
> #include <linux/ioport.h>
> #include <linux/cpuset.h>
> +#include <linux/delay.h>
> +#include <linux/migrate.h>
> +#include <linux/page-isolation.h>
>
> #include <asm/tlbflush.h>
>
> @@ -301,3 +304,256 @@ error:
> return ret;
> }
> EXPORT_SYMBOL_GPL(add_memory);
> +
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +/*
> + * Confirm all pages in a range [start, end) is belongs to the same zone.
> + */
> +static int test_pages_in_a_zone(unsigned long start_pfn, unsigned long end_pfn)
> +{
> + unsigned long pfn;
> + struct zone *zone = NULL;
> + struct page *page;
> + for (pfn = start_pfn;
> + pfn < end_pfn;
> + pfn += MAX_ORDER_NR_PAGES) {
> +#ifdef CONFIG_HOLES_IN_ZONE
> + int i;
> + for (i = 0; i < MAX_ORDER_NR_PAGES; i++) {
> + if (pfn_valid_within(pfn + i))
> + break;
> + }
> + if (i == MAX_ORDER_NR_PAGES)
> + continue;
> + page = pfn_to_page(pfn + i);
> +#else
> + page = pfn_to_page(pfn);
> +#endif
Please extract this out to inlined functions that are conditional are
CONFIG_HOLES_IN_ZONE.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 34+ messages in thread* Re: [RFC] memory unplug v5 [5/6] page unplug
2007-06-15 6:04 ` David Rientjes
@ 2007-06-15 6:12 ` KAMEZAWA Hiroyuki
2007-06-15 14:35 ` Christoph Lameter
1 sibling, 0 replies; 34+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-06-15 6:12 UTC (permalink / raw)
To: David Rientjes; +Cc: linux-mm, mel, y-goto, clameter, hugh
On Thu, 14 Jun 2007 23:04:50 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:
> > + page = pfn_to_page(pfn);
> > +#endif
>
> Please extract this out to inlined functions that are conditional are
> CONFIG_HOLES_IN_ZONE.
>
Hmm. ok, I"ll do.
thanks.
-Kame
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC] memory unplug v5 [5/6] page unplug
2007-06-15 6:04 ` David Rientjes
2007-06-15 6:12 ` KAMEZAWA Hiroyuki
@ 2007-06-15 14:35 ` Christoph Lameter
2007-06-15 14:40 ` Andy Whitcroft
1 sibling, 1 reply; 34+ messages in thread
From: Christoph Lameter @ 2007-06-15 14:35 UTC (permalink / raw)
To: Andy Whitcroft
Cc: David Rientjes, KAMEZAWA Hiroyuki, linux-mm, mel, y-goto, hugh
On Thu, 14 Jun 2007, David Rientjes wrote:
> > + struct zone *zone = NULL;
> > + struct page *page;
> > + for (pfn = start_pfn;
> > + pfn < end_pfn;
> > + pfn += MAX_ORDER_NR_PAGES) {
> > +#ifdef CONFIG_HOLES_IN_ZONE
> > + int i;
> > + for (i = 0; i < MAX_ORDER_NR_PAGES; i++) {
> > + if (pfn_valid_within(pfn + i))
> > + break;
> > + }
> > + if (i == MAX_ORDER_NR_PAGES)
> > + continue;
> > + page = pfn_to_page(pfn + i);
> > +#else
> > + page = pfn_to_page(pfn);
> > +#endif
>
> Please extract this out to inlined functions that are conditional are
> CONFIG_HOLES_IN_ZONE.
And we need to deal with HOLES_IN_ZONE because the sparsemem virtual
memmap patchset was not merged and therefore we cannot get rid of
VIRTUAL_MEM_MAP.
Andy, any progress? Do you want me to do another patchset?
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 34+ messages in thread* Re: [RFC] memory unplug v5 [5/6] page unplug
2007-06-15 14:35 ` Christoph Lameter
@ 2007-06-15 14:40 ` Andy Whitcroft
0 siblings, 0 replies; 34+ messages in thread
From: Andy Whitcroft @ 2007-06-15 14:40 UTC (permalink / raw)
To: Christoph Lameter
Cc: David Rientjes, KAMEZAWA Hiroyuki, linux-mm, mel, y-goto, hugh
Christoph Lameter wrote:
> On Thu, 14 Jun 2007, David Rientjes wrote:
>
>>> + struct zone *zone = NULL;
>>> + struct page *page;
>>> + for (pfn = start_pfn;
>>> + pfn < end_pfn;
>>> + pfn += MAX_ORDER_NR_PAGES) {
>>> +#ifdef CONFIG_HOLES_IN_ZONE
>>> + int i;
>>> + for (i = 0; i < MAX_ORDER_NR_PAGES; i++) {
>>> + if (pfn_valid_within(pfn + i))
>>> + break;
>>> + }
>>> + if (i == MAX_ORDER_NR_PAGES)
>>> + continue;
>>> + page = pfn_to_page(pfn + i);
>>> +#else
>>> + page = pfn_to_page(pfn);
>>> +#endif
>> Please extract this out to inlined functions that are conditional are
>> CONFIG_HOLES_IN_ZONE.
>
> And we need to deal with HOLES_IN_ZONE because the sparsemem virtual
> memmap patchset was not merged and therefore we cannot get rid of
> VIRTUAL_MEM_MAP.
>
> Andy, any progress? Do you want me to do another patchset?
I believe I've got the latest. I'll sort out the feedback over the
weekend and get a new patchset out, to Andrew.
-apw
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC] memory unplug v5 [5/6] page unplug
2007-06-14 7:04 ` [RFC] memory unplug v5 [5/6] page unplug KAMEZAWA Hiroyuki
2007-06-15 6:04 ` David Rientjes
@ 2007-06-15 15:52 ` Dave Hansen
2007-06-15 17:03 ` KAMEZAWA Hiroyuki
1 sibling, 1 reply; 34+ messages in thread
From: Dave Hansen @ 2007-06-15 15:52 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, mel, y-goto, clameter, hugh
On Thu, 2007-06-14 at 16:04 +0900, KAMEZAWA Hiroyuki wrote:
>
> + if (start_pfn & (pageblock_nr_pages - 1))
> + return -EINVAL;
> + if (end_pfn & (pageblock_nr_pages - 1))
> + return -EINVAL;
After reading these, I'm still not sure I know what a pageblock is
supposed to be. :) Did those come from Mel's patches?
In any case, I think it might be helpful to wrap up some of those
references in functions. I was always looking at the patches trying to
find if "pageblock_nr_pages" was a local variable or not. A function
would surely tell me.
static inline int pfn_is_pageblock_aligned(unsigned long pfn)
{
return pfn & (pageblock_nr_pages - 1)
}
and, then you get
BUG_ON(!pfn_is_pageblock_aligned(start_pfn));
It's pretty obvious what is going on, there.
-- Dave
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 34+ messages in thread* Re: [RFC] memory unplug v5 [5/6] page unplug
2007-06-15 15:52 ` Dave Hansen
@ 2007-06-15 17:03 ` KAMEZAWA Hiroyuki
2007-06-15 21:09 ` Dave Hansen
0 siblings, 1 reply; 34+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-06-15 17:03 UTC (permalink / raw)
To: Dave Hansen; +Cc: linux-mm, mel, y-goto, clameter, hugh
On Fri, 15 Jun 2007 08:52:41 -0700
Dave Hansen <hansendc@us.ibm.com> wrote:
> On Thu, 2007-06-14 at 16:04 +0900, KAMEZAWA Hiroyuki wrote:
> >
> > + if (start_pfn & (pageblock_nr_pages - 1))
> > + return -EINVAL;
> > + if (end_pfn & (pageblock_nr_pages - 1))
> > + return -EINVAL;
>
> After reading these, I'm still not sure I know what a pageblock is
> supposed to be. :) Did those come from Mel's patches?
>
yes.
> In any case, I think it might be helpful to wrap up some of those
> references in functions. I was always looking at the patches trying to
> find if "pageblock_nr_pages" was a local variable or not. A function
> would surely tell me.
>
> static inline int pfn_is_pageblock_aligned(unsigned long pfn)
> {
> return pfn & (pageblock_nr_pages - 1)
> }
>
> and, then you get
>
> BUG_ON(!pfn_is_pageblock_aligned(start_pfn));
>
> It's pretty obvious what is going on, there.
>
Hmm...I'll try that in the next version. But Is there some macro
to do this ? like..
--
#define IS_ALIGNED(val, align) ((val) & (align - 1))
--
-Kame
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 34+ messages in thread* Re: [RFC] memory unplug v5 [5/6] page unplug
2007-06-15 17:03 ` KAMEZAWA Hiroyuki
@ 2007-06-15 21:09 ` Dave Hansen
0 siblings, 0 replies; 34+ messages in thread
From: Dave Hansen @ 2007-06-15 21:09 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, mel, y-goto, clameter, hugh
On Sat, 2007-06-16 at 02:03 +0900, KAMEZAWA Hiroyuki wrote:
>
> Hmm...I'll try that in the next version. But Is there some macro
> to do this ? like..
> --
> #define IS_ALIGNED(val, align) ((val) & (align - 1))
Yep, that's a bit better.
-- Dave
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 34+ messages in thread
* [RFC] memory unplug v5 [6/6] ia64 interface
2007-06-14 6:56 [RFC] memory unplug v5 [0/6] intro KAMEZAWA Hiroyuki
` (4 preceding siblings ...)
2007-06-14 7:04 ` [RFC] memory unplug v5 [5/6] page unplug KAMEZAWA Hiroyuki
@ 2007-06-14 7:06 ` KAMEZAWA Hiroyuki
5 siblings, 0 replies; 34+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-06-14 7:06 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, mel, y-goto, clameter, hugh
IA64 memory unplug interface.
Signed-Off-By: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
arch/ia64/mm/init.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
Index: devel-2.6.22-rc4-mm2/arch/ia64/mm/init.c
===================================================================
--- devel-2.6.22-rc4-mm2.orig/arch/ia64/mm/init.c
+++ devel-2.6.22-rc4-mm2/arch/ia64/mm/init.c
@@ -724,7 +724,17 @@ int arch_add_memory(int nid, u64 start,
int remove_memory(u64 start, u64 size)
{
- return -EINVAL;
+ unsigned long start_pfn, end_pfn;
+ unsigned long timeout = 120 * HZ;
+ int ret;
+ start_pfn = start >> PAGE_SHIFT;
+ end_pfn = start_pfn + (size >> PAGE_SHIFT);
+ ret = offline_pages(start_pfn, end_pfn, timeout);
+ if (ret)
+ goto out;
+ /* we can free mem_map at this point */
+out:
+ return ret;
}
EXPORT_SYMBOL_GPL(remove_memory);
#endif
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 34+ messages in thread