Re: [RFC] memory defragmentation to satisfy high order allocations

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* Re: [RFC] memory defragmentation to satisfy high order allocations
@ 2004-10-11 16:40 linux
  0 siblings, 0 replies; 45+ messages in thread
From: linux @ 2004-10-11 16:40 UTC (permalink / raw)
  To: linux-mm

I just thought I'd mention something that Doug Lea found Very
Important when designing dlmalloc to reduce fragmentation:

- Maintain the free lists in FIFO order.  LIFO has severe problems.
- When a free block is merged into a larger block, it goes back on
  the end of the appropriate list.

If you maintain free blocks as a LIFO stack (something that occurs to
people thinking about cache effects), then you end up with a steady
state where the top few blocks are allocated very rapidly and never get
a chance to merge, while the bottom is made up of blocks whose neighbours
are permanently allocated and never get merged.

What you *want* to do is make small (low-order) allocations from blocks
which will never grow any larger (the ones on the bottom of the stack),
and keep other blocks on the free list until they've been combined with
their neighbours.

FIFO ordering works best for this, giving everything an equal chance to
be merged, and allocating the blocks that have had their chance and not
been merged.

I haven't grovelled through the Linux code to figure out what it does
do exactly, but if you're trying to reduce external fragmentation,
that's a proven technique.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [RFC] memory defragmentation to satisfy high order allocations
@ 2004-10-01 18:22 Marcelo Tosatti
  2004-10-01 20:11 ` Andrew Morton
                   ` (4 more replies)
  0 siblings, 5 replies; 45+ messages in thread
From: Marcelo Tosatti @ 2004-10-01 18:22 UTC (permalink / raw)
  To: linux-mm; +Cc: akpm, Nick Piggin, arjanv, linux-kernel

Hi fellows,

So I've been playing with memory defragmentation for the last couple
of weeks.

The following patch implements a "coalesce_memory()" function 
which takes "zone" and "order" as a parameter. 

It tries to move enough physically nearby pages to form a free area
of "order" size.

It does that by checking whether the page can be moved, allocating a new page, 
unmapping the pte's to it, copying data to new page, remapping the ptes, 
and reinserting the page on the radix/LRU.

Its very uncomplete yet - for one SMP concurrent radix lookups will screw file
page unmapping (swapcache lookup should be safe), and lots of other buggies inside. 
For example it doesnt re establishes pte's once it has unmapped them.

I'm working on those.

But it works fine on UP (for a few minutes :)), and easily creates large 
physically contiguous areas of memory.

With such a thing in place we can build a mechanism for kswapd 
(or a separate kernel thread, if needed) to notice when we are low on 
high order pages, and use the coalescing algorithm instead blindly 
freeing unique pages from LRU in the hope to build large physically 
contiguous memory areas.

Comments appreciated.

Lots of this has been copied from rmap.c/etc.

Yes, the code needs to be cleanup up.

--- page_alloc.c.orig	2004-09-19 16:53:52.000000000 -0300
+++ page_alloc.c	2004-10-01 16:26:21.602387344 -0300
@@ -33,6 +33,8 @@
 #include <linux/cpu.h>
 #include <linux/cpuset.h>
 #include <linux/nodemask.h>
+#include <linux/rmap.h>
+#include <linux/mm_inline.h>
 
 #include <asm/tlbflush.h>
 
@@ -97,7 +99,471 @@
 	page->mapping = NULL;
 }
 
-#ifndef CONFIG_HUGETLB_PAGE
+#define REMAP_FAIL 0
+#define REMAP_SUCCESS 1
+
+
+void page_remove_rmap(struct page *page);
+void page_add_anon_rmap(struct page *page,
+        struct vm_area_struct *vma, unsigned long address);
+struct anon_vma *page_lock_anon_vma(struct page *page);
+inline unsigned long avma_address(struct page *page, struct vm_area_struct *vma);
+
+inline unsigned long
+avma_address(struct page *page, struct vm_area_struct *vma)
+{
+        pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+        unsigned long address;
+
+        address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+
+        if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
+                /* page should be within any vma from prio_tree_next */
+		printk(KERN_ERR "address: %x pgoff:%x vma->start:%x vma->end:%x\n",
+				address, pgoff,vma->vm_start, vma->vm_end );
+                BUG_ON(!PageAnon(page));
+                return -EFAULT;
+        }
+        return address;
+}
+
+
+
+int try_to_remap_file(struct page *page, struct page *newpage, struct vm_area_struct *vma)
+{
+	unsigned long address;
+	struct mm_struct *mm = vma->vm_mm;	
+	pgd_t *pgd;
+        pmd_t *pmd;
+        pte_t *pte;
+        pte_t pteval;
+	int ret;
+
+	printk(KERN_ERR "try_to_remap_file!\n");
+
+	if (!mm->rss) 
+		return REMAP_FAIL;
+
+	address = avma_address(page, vma);
+
+
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd))
+		goto out_unlock;
+
+	pmd = pmd_offset(pgd, address);
+	if (!pmd_present(*pmd))
+		goto out_unlock;
+
+
+        pte = pte_offset_map(pmd, address);
+        if (!pte_present(*pte))
+                goto out_unlock;
+
+        if (page_to_pfn(page) != pte_pfn(*pte))
+		goto out_unlock;
+	
+	if ((vma->vm_flags & (VM_LOCKED|VM_RESERVED))) 
+		goto out_unlock;
+
+	/* Nuke the pte */
+
+        flush_cache_page(vma, address);
+
+       pteval = ptep_clear_flush(vma, address, pte);
+
+	page_remove_rmap(page);
+
+	/* transfer the dirty bit to the new page */
+	if (pte_dirty(pteval))
+		set_page_dirty(newpage);
+
+	pteval = mk_pte(newpage, vma->vm_page_prot);
+
+	set_pte(pte, pteval);
+
+	page_add_file_rmap(newpage);
+
+	return REMAP_SUCCESS;
+
+out_unlock:
+	return REMAP_FAIL;
+}
+
+
+
+
+int try_to_remap_anon(struct page *page, struct page *newpage, struct vm_area_struct *vma)
+{
+	unsigned long address;
+	struct mm_struct *mm = vma->vm_mm;	
+	pgd_t *pgd;
+        pmd_t *pmd;
+        pte_t *pte;
+        pte_t pteval;
+	int ret;
+
+
+	if (!vma)
+		printk(KERN_ERR "!vma\n");
+
+	spin_lock(&mm->page_table_lock);
+
+	address = avma_address(page, vma);
+
+	if (address == -EFAULT) 
+		return REMAP_FAIL;
+
+	if (!mm) 
+		return REMAP_FAIL;
+
+	if (!mm->rss) 
+		return REMAP_FAIL;
+
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd)) 
+		goto out_unlock;
+
+	pmd = pmd_offset(pgd, address);
+	if (!pmd_present(*pmd))
+		goto out_unlock;
+
+        pte = pte_offset_map(pmd, address);
+        if (!pte_present(*pte))
+                goto out_unlock;
+
+        if (page_to_pfn(page) != pte_pfn(*pte))
+		goto out_unlock;
+	
+	if ((vma->vm_flags & (VM_LOCKED|VM_RESERVED)))
+		ret = REMAP_FAIL;
+
+	/* Nuke the pte */
+
+        flush_cache_page(vma, address);
+        pteval = ptep_clear_flush(vma, address, pte);
+
+	page_remove_rmap(page);
+
+	/* transfer the dirty bit to the new page */
+	if (pte_dirty(pteval))
+		set_page_dirty(newpage);
+
+	pteval = mk_pte(newpage, vma->vm_page_prot);
+
+	set_pte(pte, pteval);
+
+	page_add_anon_rmap(newpage, vma, address);
+
+	spin_unlock(&mm->page_table_lock);
+
+	return REMAP_SUCCESS;
+
+out_unlock:
+	spin_unlock(&mm->page_table_lock);
+	return REMAP_FAIL;
+
+}
+
+/* Move LRU pages to other locations, undo the remapping operation 
+* if any of the mapped pte's fails to be remapped.
+* 
+*/
+
+int can_move_page(struct page *page) 
+{
+	int ret;
+	int ptes_unmapped = 0;
+	struct page *newpage;
+
+	if (PageLocked(page))
+		return 0;
+
+	if (PageReserved(page))
+		return 0;
+
+	if (PageWriteback(page))
+		return 0;
+
+	if (page_count(page) == 0)
+		return 1;
+
+	if (PageLRU(page)) {
+		if (PageAnon(page) && page_count(page) == 1 + PageSwapCache(page)) {
+			struct anon_vma *anon_vma;
+			struct vm_area_struct *vma;
+			unsigned long anon_mapping = (unsigned long) page->mapping;
+			unsigned long savedindex;
+			int error;
+			
+			newpage = alloc_pages(GFP_HIGHUSER, 0);
+
+			if (PageSwapCache(page) &&
+				page_count(page) != page_mapcount(page) + 1) {
+					free_page(newpage);
+					goto out;
+			}
+
+			if (!PageAnon(page) || anon_mapping != page->mapping) {
+				free_page (newpage);
+				goto out;
+			}
+
+			page_cache_get(page);
+
+			if (TestSetPageLocked(page)) {
+				free_page(newpage);
+				page_cache_release(page);
+				goto out;
+			}
+
+			if (PageSwapCache(page)) { 
+				write_lock_irq(&swapper_space.tree_lock);
+				/* recheck under swapper address space tree lock */
+				if (!PageSwapCache(page) || page_count(page) != 3) {
+					write_unlock_irq(&swapper_space.tree_lock);
+					free_page(newpage);
+					unlock_page(page);
+					page_cache_release(page);
+				}
+				radix_tree_delete(&swapper_space.page_tree, page->private);
+				savedindex = page->private;
+			}
+
+			anon_vma = page_lock_anon_vma(page);
+
+			if (!anon_vma)  {
+				free_page(newpage);
+				page_cache_release(page);
+				goto out;
+			}
+			
+			list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
+				ret = try_to_remap_anon(page, newpage, vma);
+				if (ret == REMAP_FAIL) {
+					if (PageSwapCache(page))
+						write_unlock_irq(&swapper_space
+							.tree_lock);
+					spin_unlock(&anon_vma->lock);
+					free_page(newpage);
+					unlock_page(page);
+					page_cache_release(page);
+					goto redo_unmaps;
+				}
+				ptes_unmapped++;
+			}
+
+			copy_highpage(newpage, page);
+
+			unlock_page(page);
+
+			page_cache_release(page);
+			page_cache_release(page);
+
+			newpage->private = savedindex;
+	
+			if (PageSwapCache(page)) {
+				error = radix_tree_insert(&swapper_space.page_tree,
+                   						savedindex, newpage);
+
+				//if (error) 
+			}
+
+			
+			spin_unlock(&anon_vma->lock);
+			write_unlock_irq(&swapper_space.tree_lock);
+
+			return 1;
+
+		} else if (!PageAnon(page) && 
+				page_count(page) == 1) {
+			struct vm_area_struct *vma;
+			struct prio_tree_iter iter;
+			struct zone *zone = page_zone(page);
+			struct address_space *mapping = page->mapping;
+			struct page *testpage;
+			int mapped;
+			pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - 
+					PAGE_SHIFT);
+			pgoff_t savedindex = page->index;
+
+			if (!mapping)
+				goto out;
+
+			if (!list_empty(&mapping->i_mmap_nonlinear)) {
+				spin_unlock(&mapping->i_mmap_lock);
+				goto out;
+			}
+
+			if (PagePrivate(page))
+				printk(KERN_ERR "PagePrivate!\n");
+			if (PageWriteback(page)) {
+				printk(KERN_ERR "PageWriteback! quitting\n");
+				goto out;
+			}
+
+			newpage = alloc_pages(GFP_HIGHUSER, 0);
+
+			if (page_count(page) != 1 ||
+			  !PageLRU(page) || PageAnon(page) ||
+			  page->mapping != mapping ||
+			  page->index != savedindex) {
+				free_page(newpage);
+				goto out;
+			}
+
+			page_cache_get(page);
+
+			if (TestSetPageLocked(page)) {
+				page_cache_release(page);
+				printk(KERN_ERR "page locked!!!\n");
+				goto out;
+			}
+
+			// remove radix entry and block page faults for SMP systems
+
+		        spin_lock(&mapping->i_mmap_lock);
+
+			mapped = page_mapcount(page);
+
+			vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, 
+				pgoff, pgoff) 
+			{
+				ret = try_to_remap_file(page, newpage, vma);
+				if (ret == REMAP_FAIL) {
+					unlock_page(page);
+					goto redo_unmaps;
+				}
+				ptes_unmapped++;
+				mapped--;
+				if (!mapped)
+					break;
+					
+			}
+
+			if (TestClearPageLRU(page))
+				del_page_from_lru(zone, page);
+			
+			remove_from_page_cache(page);
+
+			copy_highpage(newpage, page);
+
+			newpage->flags = page->flags;
+
+			unlock_page(page);
+
+			add_to_page_cache_lru(newpage, mapping, savedindex, 
+				GFP_KERNEL);
+
+			page_cache_release(page);
+			page_cache_release(page);
+
+			unlock_page(newpage);
+
+		        spin_unlock(&mapping->i_mmap_lock);
+			return 1;
+		}
+
+	}
+
+
+out:
+	preempt_enable();
+	return 0;
+
+redo_unmaps:
+	free_page(newpage);
+	printk(KERN_ERR "unmap PTE failed!@#$^5! ptes_unmapped:%d\n", ptes_unmapped);
+	return 0;
+}
+
+#define MAX_ORDER_DEC	3	/* maximum order decrease */
+
+int coalesce_memory(unsigned int order, struct zone *zone)
+{
+	unsigned int torder;
+	unsigned int nr_freed_pages = 0, nr_pages = 0;
+	
+	if (order < 1) {
+		printk(KERN_ERR "order <= 2");
+		return -1;
+	}
+
+	preempt_disable();
+
+	for (torder = order - 1; torder > order - MAX_ORDER_DEC; torder--) {
+		struct list_head *entry;
+		struct page *pwalk, *page;
+		int walkcount = 0;
+		struct free_area *area = zone->free_area + torder;
+		nr_pages = (1UL << order) - (1UL << torder); 
+
+		entry = area->free_list.next;
+
+		while (entry != &area->free_list) {
+			int ret;
+			page = list_entry(entry, struct page, lru);
+			entry = entry->next;
+
+			pwalk = page;
+
+			/* Look backwards */
+
+			for (walkcount = 1; walkcount<nr_pages; walkcount++) {
+				pwalk = page-walkcount;
+
+				ret = can_move_page(pwalk);
+				if (ret)
+					nr_freed_pages++;
+				else
+					goto forward;
+
+				if (nr_freed_pages == nr_pages)
+					goto success;
+					
+			}
+
+forward:
+
+			pwalk = page;
+
+			/* Look forward, skipping the page frames from this 
+			  high order page we are looking at */
+
+			for (walkcount = (1UL << torder); walkcount<nr_pages; 
+					walkcount++) {
+				pwalk = page+walkcount;
+
+				ret = can_move_page(pwalk);
+
+				if (ret) 
+					nr_freed_pages++;
+				else
+					goto loopey;
+
+				if (nr_freed_pages == nr_pages)
+					goto success;
+			}
+
+loopey:
+	
+//			goto bailout;
+		}
+	}
+
+bailout:
+	preempt_enable();
+	printk(KERN_ERR "failure nr_pages:%d nr_freed_pages:%d!\n", nr_pages,
+nr_freed_pages);
+return 0;
+
+success:
+printk(KERN_ERR "SUCCESS coalesced %d pages!\n", nr_freed_pages);
+return 1;
+	
+}
+
+#ifndef CONFIG_HUGETL _PAGE
 #define prep_compound_page(page, order) do { } while (0)
 #define destroy_compound_page(page, order) do { } while (0)
 #else

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC] memory defragmentation to satisfy high order allocations
  2004-10-01 18:22 Marcelo Tosatti
@ 2004-10-01 20:11 ` Andrew Morton
  2004-10-01 19:04   ` Marcelo Tosatti
  2004-10-02  2:30 ` Nick Piggin
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 45+ messages in thread
From: Andrew Morton @ 2004-10-01 20:11 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-mm, piggin, arjanv, linux-kernel

Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
>
> The following patch implements a "coalesce_memory()" function 
> which takes "zone" and "order" as a parameter. 
> 
> It tries to move enough physically nearby pages to form a free area
> of "order" size.
> 
> It does that by checking whether the page can be moved, allocating a new page, 
> unmapping the pte's to it, copying data to new page, remapping the ptes, 
> and reinserting the page on the radix/LRU.

Presumably this duplicates some of the memory hot-remove patches.

Apparently Dave Hansen has working and sane-looking hot remove code
which is in a close-to-submittable state.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC] memory defragmentation to satisfy high order allocations
  2004-10-01 20:11 ` Andrew Morton
@ 2004-10-01 19:04   ` Marcelo Tosatti
  2004-10-01 21:00     ` Andrew Morton
  2004-10-01 21:57     ` Dave Hansen
  0 siblings, 2 replies; 45+ messages in thread
From: Marcelo Tosatti @ 2004-10-01 19:04 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, piggin, arjanv, linux-kernel, haveblue

On Fri, Oct 01, 2004 at 01:11:47PM -0700, Andrew Morton wrote:
> Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
> >
> > The following patch implements a "coalesce_memory()" function 
> > which takes "zone" and "order" as a parameter. 
> > 
> > It tries to move enough physically nearby pages to form a free area
> > of "order" size.
> > 
> > It does that by checking whether the page can be moved, allocating a new page, 
> > unmapping the pte's to it, copying data to new page, remapping the ptes, 
> > and reinserting the page on the radix/LRU.
> 
> Presumably this duplicates some of the memory hot-remove patches.

As far as I have researched, the memory moving/remapping code 
on the hot remove patches dont work correctly. Please correct me.

And what I've seen (from the Fujitsu guys) was quite ugly IMHO.

> Apparently Dave Hansen has working and sane-looking hot remove code
> which is in a close-to-submittable state.

Dave?
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC] memory defragmentation to satisfy high order allocations
  2004-10-01 19:04   ` Marcelo Tosatti
@ 2004-10-01 21:00     ` Andrew Morton
  2004-10-01 21:57     ` Dave Hansen
  1 sibling, 0 replies; 45+ messages in thread
From: Andrew Morton @ 2004-10-01 21:00 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-mm, piggin, arjanv, linux-kernel, haveblue

Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
>
> As far as I have researched, the memory moving/remapping code 
> on the hot remove patches dont work correctly. Please correct me.
> 
> And what I've seen (from the Fujitsu guys) was quite ugly IMHO.

That's a totally different patch.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC] memory defragmentation to satisfy high order allocations
  2004-10-01 19:04   ` Marcelo Tosatti
  2004-10-01 21:00     ` Andrew Morton
@ 2004-10-01 21:57     ` Dave Hansen
  2004-10-01 23:42       ` Marcelo Tosatti
  1 sibling, 1 reply; 45+ messages in thread
From: Dave Hansen @ 2004-10-01 21:57 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Andrew Morton, linux-mm, piggin, Arjan van de Ven,
	Linux Kernel Mailing List

On Fri, 2004-10-01 at 12:04, Marcelo Tosatti wrote:
> On Fri, Oct 01, 2004 at 01:11:47PM -0700, Andrew Morton wrote:
> > Presumably this duplicates some of the memory hot-remove patches.
> 
> As far as I have researched, the memory moving/remapping code 
> on the hot remove patches dont work correctly. Please correct me.

I definitely see some commonality, but Marcelo's approach has handling
for the different kinds of pages broken out much more nicely.  Can't
tell yet if this produces extra code, or is just plain better.  

We worked pretty hard to try and copy as little code as possible.  Was
there any reason that there was so much stuff copied out of rmap.c? 
Just for proof-of-concept?

Here's one of the recent patch sets that we're working on:

http://sprucegoose.sr71.net/patches/2.6.9-rc2-mm4-mhp-test2/

In that directory, the K* patches hijack some of the swap code (but
require memory pressure to work last time I tried), and the p000*
patches (by Hirokazu Takahashi) actively migrate pages around.  Both
approaches work, but the K* one is smaller and less intrusive, while the
p000* one is much more complete.  They may end up being able to coexist
in the end.  

> And what I've seen (from the Fujitsu guys) was quite ugly IMHO.

I don't work for Fujitsu :)  Please take a look at the patches in the
above directory and see what you think.  I'm sure you have some very
good stuff in your patch, but I need to take a closer look.

I'm just about to head out of town for the weekend, but I'll take a much
more detailed look on Monday.  

-- 
Dave Hansen
haveblue@us.ibm.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC] memory defragmentation to satisfy high order allocations
  2004-10-01 21:57     ` Dave Hansen
@ 2004-10-01 23:42       ` Marcelo Tosatti
  2004-10-02  1:17         ` Andrew Morton
  2004-10-02  9:30         ` Hirokazu Takahashi
  0 siblings, 2 replies; 45+ messages in thread
From: Marcelo Tosatti @ 2004-10-01 23:42 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, linux-mm, piggin, Arjan van de Ven,
	Linux Kernel Mailing List

On Fri, Oct 01, 2004 at 02:57:03PM -0700, Dave Hansen wrote:
> On Fri, 2004-10-01 at 12:04, Marcelo Tosatti wrote:
> > On Fri, Oct 01, 2004 at 01:11:47PM -0700, Andrew Morton wrote:
> > > Presumably this duplicates some of the memory hot-remove patches.
> > 
> > As far as I have researched, the memory moving/remapping code 
> > on the hot remove patches dont work correctly. Please correct me.
> 
> I definitely see some commonality, but Marcelo's approach has handling
> for the different kinds of pages broken out much more nicely.  Can't
> tell yet if this produces extra code, or is just plain better.  
> 
> We worked pretty hard to try and copy as little code as possible.  Was
> there any reason that there was so much stuff copied out of rmap.c? 
> Just for proof-of-concept?

Just proof of concept really, to have an equivalent of "try_to_unmap()" - 
which you call from the migrate page code. 

Just that "try_to_remap_{file,anon}" do the pte clearing + remapping in
one function.

> Here's one of the recent patch sets that we're working on:
> 
> http://sprucegoose.sr71.net/patches/2.6.9-rc2-mm4-mhp-test2/
> 
> In that directory, the K* patches hijack some of the swap code (but
> require memory pressure to work last time I tried), and the p000*
> patches (by Hirokazu Takahashi) actively migrate pages around.  Both
> approaches work, but the K* one is smaller and less intrusive, while the
> p000* one is much more complete.  They may end up being able to coexist
> in the end.  

The page migration code (p000*) looks nice - quite complete indeed (nice error
handling, etc) but somewhat specific to the migration procedure, which is more 
critical (cannot fail so easily as) then the remapping for high-order allocations.

For example this in migrate_page_common


+		switch (ret) {
+		case 0:
+		case -ENOENT:
+			copy_highpage(newpage, page);
+			return ret;
+		case -EBUSY:
+			return ret;
+		case -EAGAIN:
+			writeback_and_free_buffers(page);
+			unlock_page(page);
+			msleep(10);
+			timeout -= 10;
+			lock_page(page);
+			continue;

Which retries undefinately to migrate the page

For the "defragmentation" operation we want to do an "easy" try - ie if we
can't remap giveup.

I feel we should try to "untie" the code which checks for remapping availability / 
does the remapping from the page migration - so to be able to share the most 
code between it and other users of the same functionality. 

Curiosity: How did you guys test the migration operation? Several threads on 
several processors operating on the memory, etc? 

> I don't work for Fujitsu :)  Please take a look at the patches in the
> above directory and see what you think.  I'm sure you have some very
> good stuff in your patch, but I need to take a closer look.
> 
> I'm just about to head out of town for the weekend, but I'll take a much
> more detailed look on Monday.  

Cool. I'll take a closer look at the relevant parts of memory hotplug patches 
this weekend, hopefully. See if I can help with testing of these patches too.

Andrew, what are your thoughts wrt merging this to mainline?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC] memory defragmentation to satisfy high order allocations
  2004-10-01 23:42       ` Marcelo Tosatti
@ 2004-10-02  1:17         ` Andrew Morton
  2004-10-02  9:30         ` Hirokazu Takahashi
  1 sibling, 0 replies; 45+ messages in thread
From: Andrew Morton @ 2004-10-02  1:17 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: haveblue, linux-mm, piggin, arjanv, linux-kernel

Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
>
> > Here's one of the recent patch sets that we're working on:
>  > 
>  > http://sprucegoose.sr71.net/patches/2.6.9-rc2-mm4-mhp-test2/
>  > 
> ...
>  Andrew, what are your thoughts wrt merging this to mainline?

It's the first I've seen of it.  I guess I'd be looking for testing results
as well as the outcome of discussions/review with the ia64 guys whose
hardware is not quite as cooperative as that on the ppc64 machines.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC] memory defragmentation to satisfy high order allocations
  2004-10-01 23:42       ` Marcelo Tosatti
  2004-10-02  1:17         ` Andrew Morton
@ 2004-10-02  9:30         ` Hirokazu Takahashi
  2004-10-02 18:33           ` Marcelo Tosatti
  1 sibling, 1 reply; 45+ messages in thread
From: Hirokazu Takahashi @ 2004-10-02  9:30 UTC (permalink / raw)
  To: marcelo.tosatti; +Cc: haveblue, akpm, linux-mm, piggin, arjanv, linux-kernel

Hello, Marcelo.

Generic memory defragmentation will be very nice for me to implement
hugetlbpage migration, as allocating a new hugetlbpage is a hard job.

> For the "defragmentation" operation we want to do an "easy" try - ie if we
> can't remap giveup.
> 
> I feel we should try to "untie" the code which checks for remapping availability / 
> does the remapping from the page migration - so to be able to share the most 
> code between it and other users of the same functionality. 

I think it's possible to introduce non-wait mode to the migration code,
as you may expect. Shall I implement it?

> Curiosity: How did you guys test the migration operation? Several threads on 
> several processors operating on the memory, etc? 

I always test it with the zone hotplug emulation patch, which Mr.Iwamoto
has made. I usually run following jobs concurrently while zones are added
and removed repeatedly on a SMP machine.
      - making linux kernel
      - copying file trees.
      - overwriting file trees.
      - removing file trees
      - some pages are swapped out automatically:)

And Mr.Iwamoto has some small programs to check any kind of page
can be migrated. The programs repeat one of following actions:
    - read/write files .
    - use MAP_SHARED and MAP_PRIVATE mmap()'s and read/write there.
    - use Direct I/O.
    - use AIO.
    - fork to have COW pages.
    - use shmem.
    - use sendfile.

> Cool. I'll take a closer look at the relevant parts of memory hotplug patches 
> this weekend, hopefully. See if I can help with testing of these patches too.

Any comments are very welcome.

Thank you,
Hirokazu Takahashi.






--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC] memory defragmentation to satisfy high order allocations
  2004-10-02  9:30         ` Hirokazu Takahashi
@ 2004-10-02 18:33           ` Marcelo Tosatti
  2004-10-03  4:13             ` Hirokazu Takahashi
  2004-10-04  4:09             ` IWAMOTO Toshihiro
  0 siblings, 2 replies; 45+ messages in thread
From: Marcelo Tosatti @ 2004-10-02 18:33 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: haveblue, akpm, linux-mm, piggin, arjanv, linux-kernel

On Sat, Oct 02, 2004 at 06:30:15PM +0900, Hirokazu Takahashi wrote:
> Hello, Marcelo.
> 
> Generic memory defragmentation will be very nice for me to implement
> hugetlbpage migration, as allocating a new hugetlbpage is a hard job.
> 
> > For the "defragmentation" operation we want to do an "easy" try - ie if we
> > can't remap giveup.
> > 
> > I feel we should try to "untie" the code which checks for remapping availability / 
> > does the remapping from the page migration - so to be able to share the most 
> > code between it and other users of the same functionality. 
> 
> I think it's possible to introduce non-wait mode to the migration code,
> as you may expect. Shall I implement it?
> 
> > Curiosity: How did you guys test the migration operation? Several threads on 
> > several processors operating on the memory, etc? 
> 
> I always test it with the zone hotplug emulation patch, which Mr.Iwamoto
> has made. I usually run following jobs concurrently while zones are added
> and removed repeatedly on a SMP machine.
>       - making linux kernel
>       - copying file trees.
>       - overwriting file trees.
>       - removing file trees
>       - some pages are swapped out automatically:)
> 
> And Mr.Iwamoto has some small programs to check any kind of page
> can be migrated. The programs repeat one of following actions:
>     - read/write files .
>     - use MAP_SHARED and MAP_PRIVATE mmap()'s and read/write there.
>     - use Direct I/O.
>     - use AIO.
>     - fork to have COW pages.
>     - use shmem.
>     - use sendfile.
> 
> > Cool. I'll take a closer look at the relevant parts of memory hotplug patches 
> > this weekend, hopefully. See if I can help with testing of these patches too.
> 
> Any comments are very welcome.

I have a few comments about the code:

1) 
I'm pretty sure you should transfer the radix tree tag at radix_tree_replace().
If for example you transfer a dirty tagged page to another zone, an mpage_writepages()
will miss it (because it uses pagevec_lookup_tag(PAGECACHE_DIRTY_TAG)). 

Should be quite trivial to do (save tags before deleting and set to new entry, 
all in radix_tree_replace).

My implementation also contained the same bug.

2) 
At migrate_onepage you add anonymous pages which aren't swap allocated
to the swap cache
+       /*
+        * Put the page in a radix tree if it isn't in the tree yet.
+        */
+#ifdef CONFIG_SWAP
+       if (PageAnon(page) && !PageSwapCache(page))
+               if (!add_to_swap(page, GFP_KERNEL)) {
+                       unlock_page(page);
+                       return ERR_PTR(-ENOSPC);
+               }
+#endif /* CONFIG_SWAP */

Why's that? You can copy anonymous pages without adding them to swap (thats
what the patch I posted does).

3) At migrate_page_common you assume additional page references 
(page_migratable returning -EAGAIN) means the code should try to writeout 
the page.

Is that assumption always valid?

In theory there is no need to writeout pages when migrating them to 
other zones - they will be copied and the dirty information retained (either
in the PageDirty bit or radix tree tag). 

I just noticed you do that on further patches (migrate_page_buffer), but AFAICS 
the writeout remains. Why arent you using migrate_page_buffer yet?

I think the final aim should be to remove the need for "pageout()" 
completly.

4) 
About implementing a nonblocking version of it. The easier way, it
seems to me, is to pass a "block" argument to generic_migrate_page() and
use that.

Questions: are there any documents on the memory hotplug userspace tools? 
Where can I find them?

Are Iwamoto's test programs available?

In general the code looks nice to me! I'll jump in and help with 
testing.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC] memory defragmentation to satisfy high order allocations
  2004-10-02 18:33           ` Marcelo Tosatti
@ 2004-10-03  4:13             ` Hirokazu Takahashi
  2004-10-03 14:07               ` Marcelo Tosatti
  2004-10-04  2:22               ` Dave Hansen
  2004-10-04  4:09             ` IWAMOTO Toshihiro
  1 sibling, 2 replies; 45+ messages in thread
From: Hirokazu Takahashi @ 2004-10-03  4:13 UTC (permalink / raw)
  To: marcelo.tosatti
  Cc: iwamoto, haveblue, akpm, linux-mm, piggin, arjanv, linux-kernel

Hi,

> > > Cool. I'll take a closer look at the relevant parts of memory hotplug patches 
> > > this weekend, hopefully. See if I can help with testing of these patches too.
> > 
> > Any comments are very welcome.
> 
> 
> I have a few comments about the code:
> 
> 1) 
> I'm pretty sure you should transfer the radix tree tag at radix_tree_replace().
> If for example you transfer a dirty tagged page to another zone, an mpage_writepages()
> will miss it (because it uses pagevec_lookup_tag(PAGECACHE_DIRTY_TAG)). 
> 
> Should be quite trivial to do (save tags before deleting and set to new entry, 
> all in radix_tree_replace).
> 
> My implementation also contained the same bug.

Yes, it's one of the issues to do. The tag should be transferred in
radix_tree_replace() as you pointed out. The current implementation
sets the tag in set_page_dirty(newpage).

> 2) 
> At migrate_onepage you add anonymous pages which aren't swap allocated
> to the swap cache
> +       /*
> +        * Put the page in a radix tree if it isn't in the tree yet.
> +        */
> +#ifdef CONFIG_SWAP
> +       if (PageAnon(page) && !PageSwapCache(page))
> +               if (!add_to_swap(page, GFP_KERNEL)) {
> +                       unlock_page(page);
> +                       return ERR_PTR(-ENOSPC);
> +               }
> +#endif /* CONFIG_SWAP */
> 
> Why's that? You can copy anonymous pages without adding them to swap (thats
> what the patch I posted does).

The reason is to guarantee that any anonymous page can be migrated anytime.
I want to block newly occurred accesses to the page during the migration
because it can't be migrated if there remain some references on it by
system calls, direct I/O and page faults.

Your approach will work fine on most of anonymous pages, which aren't
heavily accessed. I think it will be enough for memory defragmentation.

> 3) At migrate_page_common you assume additional page references 
> (page_migratable returning -EAGAIN) means the code should try to writeout 
> the page.
> 
> Is that assumption always valid?

-EAGAIN means that the page may require to be written back or
just to wait for a while since the page is just referred by system call 
or pagefault handler.

> In theory there is no need to writeout pages when migrating them to 
> other zones - they will be copied and the dirty information retained (either
> in the PageDirty bit or radix tree tag). 
> 
> I just noticed you do that on further patches (migrate_page_buffer), but AFAICS 
> the writeout remains. Why arent you using migrate_page_buffer yet?

I've designed migrate_page_buffer() for this purpose.
At this moment ext2 only uses this yet.

> I think the final aim should be to remove the need for "pageout()" 
> completly.

Yes!

> 4) 
> About implementing a nonblocking version of it. The easier way, it
> seems to me, is to pass a "block" argument to generic_migrate_page() and
> use that.

Yes.

> Questions: are there any documents on the memory hotplug userspace tools? 
> Where can I find them?

IBM guys and Fujitsu guys are designing user interface independently.
IBM team is implementing memory section hotplug while Fujitsu team
try to implement NUMA node hotplug. But both of the designs use
regular hot-plug mechanism, which kicks /sbin/hotplug script to control
devices via sysfs.

Dave, would you explain about it?

> Are Iwamoto's test programs available?

Ok, I'll notice him to post them.

> In general the code looks nice to me! I'll jump in and help with 
> testing.

I appreciate your offer. I'm very happy with that.

Thank you,
Hirokazu Takahasih.



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC] memory defragmentation to satisfy high order allocations
  2004-10-03  4:13             ` Hirokazu Takahashi
@ 2004-10-03 14:07               ` Marcelo Tosatti
  2004-10-03 18:35                 ` Hirokazu Takahashi
  2004-10-04  3:24                 ` IWAMOTO Toshihiro
  2004-10-04  2:22               ` Dave Hansen
  1 sibling, 2 replies; 45+ messages in thread
From: Marcelo Tosatti @ 2004-10-03 14:07 UTC (permalink / raw)
  To: Hirokazu Takahashi
  Cc: iwamoto, haveblue, akpm, linux-mm, piggin, arjanv, linux-kernel

On Sun, Oct 03, 2004 at 01:13:38PM +0900, Hirokazu Takahashi wrote:
> Hi,
> 
> > > > Cool. I'll take a closer look at the relevant parts of memory hotplug patches 
> > > > this weekend, hopefully. See if I can help with testing of these patches too.
> > > 
> > > Any comments are very welcome.
> > 
> > 
> > I have a few comments about the code:
> > 
> > 1) 
> > I'm pretty sure you should transfer the radix tree tag at radix_tree_replace().
> > If for example you transfer a dirty tagged page to another zone, an mpage_writepages()
> > will miss it (because it uses pagevec_lookup_tag(PAGECACHE_DIRTY_TAG)). 
> > 
> > Should be quite trivial to do (save tags before deleting and set to new entry, 
> > all in radix_tree_replace).
> > 
> > My implementation also contained the same bug.
> 
> Yes, it's one of the issues to do. The tag should be transferred in
> radix_tree_replace() as you pointed out. The current implementation
> sets the tag in set_page_dirty(newpage).

Oh I missed that, right.

But yes, anyway, the tag should be transferred at radix_tree_replace (earlier)
or pagevec_lookup_tag() can miss those pages.

> > 2) 
> > At migrate_onepage you add anonymous pages which aren't swap allocated
> > to the swap cache
> > +       /*
> > +        * Put the page in a radix tree if it isn't in the tree yet.
> > +        */
> > +#ifdef CONFIG_SWAP
> > +       if (PageAnon(page) && !PageSwapCache(page))
> > +               if (!add_to_swap(page, GFP_KERNEL)) {
> > +                       unlock_page(page);
> > +                       return ERR_PTR(-ENOSPC);
> > +               }
> > +#endif /* CONFIG_SWAP */
> > 
> > Why's that? You can copy anonymous pages without adding them to swap (thats
> > what the patch I posted does).
> 
> The reason is to guarantee that any anonymous page can be migrated anytime.
> I want to block newly occurred accesses to the page during the migration
> because it can't be migrated if there remain some references on it by
> system calls, direct I/O and page faults.

It would be nice if we could block pte faults in a way such to not need
adding each anonymous page to swap. It can be too costly if you have a lot memory
and it makes the whole operation dependable on swap size (if you dont have enough
swap, you're dead).

Maybe hold mm->page_table_lock (might be too costly in terms of CPU time, but since
migration is not a common operation anyway), or create a semaphore? 

> Your approach will work fine on most of anonymous pages, which aren't
> heavily accessed. I think it will be enough for memory defragmentation.

Yes...

> > 3) At migrate_page_common you assume additional page references 
> > (page_migratable returning -EAGAIN) means the code should try to writeout 
> > the page.
> > 
> > Is that assumption always valid?
> 
> -EAGAIN means that the page may require to be written back 

But why is it needed to writeout pages? We shouldnt need to. At least
from what I can understand.


> or
> just to wait for a while since the page is just referred by system call 
> or pagefault handler.

I'm not sure if making that assumption is always valid.

Kernel code can have an additional count on the page meaning "this page is pinned, 
dont move it". At least that should be valid.

Any piece of code which holds a reference on a page for a long 
time is going to be a pain for the algorithm right?

> > In theory there is no need to writeout pages when migrating them to 
> > other zones - they will be copied and the dirty information retained (either
> > in the PageDirty bit or radix tree tag). 
> > 
> > I just noticed you do that on further patches (migrate_page_buffer), but AFAICS 
> > the writeout remains. Why arent you using migrate_page_buffer yet?
> 
> I've designed migrate_page_buffer() for this purpose.
> At this moment ext2 only uses this yet.

Ah ok I haven't looked at those patches.

> > I think the final aim should be to remove the need for "pageout()" 
> > completly.
> 
> Yes!
> 
> > 4) 
> > About implementing a nonblocking version of it. The easier way, it
> > seems to me, is to pass a "block" argument to generic_migrate_page() and
> > use that.
> 
> Yes.

OK. I'll try to implement it this week (plus the radix_tree_replace 
tag thingie).

> > Questions: are there any documents on the memory hotplug userspace tools? 
> > Where can I find them?
> 
> IBM guys and Fujitsu guys are designing user interface independently.
> IBM team is implementing memory section hotplug while Fujitsu team
> try to implement NUMA node hotplug. But both of the designs use
> regular hot-plug mechanism, which kicks /sbin/hotplug script to control
> devices via sysfs.
> 
> Dave, would you explain about it?

Please :)

> > Are Iwamoto's test programs available?
> 
> Ok, I'll notice him to post them.
> 
> > In general the code looks nice to me! I'll jump in and help with 
> > testing.
> 
> I appreciate your offer. I'm very happy with that.

Me too! :)
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC] memory defragmentation to satisfy high order allocations
  2004-10-03 14:07               ` Marcelo Tosatti
@ 2004-10-03 18:35                 ` Hirokazu Takahashi
  2004-10-03 19:21                   ` Trond Myklebust
  2004-10-04 17:24                   ` Marcelo Tosatti
  2004-10-04  3:24                 ` IWAMOTO Toshihiro
  1 sibling, 2 replies; 45+ messages in thread
From: Hirokazu Takahashi @ 2004-10-03 18:35 UTC (permalink / raw)
  To: marcelo.tosatti
  Cc: iwamoto, haveblue, akpm, linux-mm, piggin, arjanv, linux-kernel

Hi, Marcelo

> > > 2) 
> > > At migrate_onepage you add anonymous pages which aren't swap allocated
> > > to the swap cache
> > > +       /*
> > > +        * Put the page in a radix tree if it isn't in the tree yet.
> > > +        */
> > > +#ifdef CONFIG_SWAP
> > > +       if (PageAnon(page) && !PageSwapCache(page))
> > > +               if (!add_to_swap(page, GFP_KERNEL)) {
> > > +                       unlock_page(page);
> > > +                       return ERR_PTR(-ENOSPC);
> > > +               }
> > > +#endif /* CONFIG_SWAP */
> > > 
> > > Why's that? You can copy anonymous pages without adding them to swap (thats
> > > what the patch I posted does).
> > 
> > The reason is to guarantee that any anonymous page can be migrated anytime.
> > I want to block newly occurred accesses to the page during the migration
> > because it can't be migrated if there remain some references on it by
> > system calls, direct I/O and page faults.
> 
> It would be nice if we could block pte faults in a way such to not need
> adding each anonymous page to swap. It can be too costly if you have a lot memory
> and it makes the whole operation dependable on swap size (if you dont have enough
> swap, you're dead).
> 
> Maybe hold mm->page_table_lock (might be too costly in terms of CPU time, but since
> migration is not a common operation anyway), or create a semaphore? 

I think the problem of the holding mm->page_table_lock approach is
that it doesn't allow the migration code blocked. the semaphore
approach would be better.

I have another idea that each anonymous page can detach its swap entry
after its migration. It can be done by remove_exclusive_swap_page()
if the page is remapped to the same spaces forcibly by
touch_unmapped_address() I made.

> > Your approach will work fine on most of anonymous pages, which aren't
> > heavily accessed. I think it will be enough for memory defragmentation.
> 
> Yes...
> 
> > > 3) At migrate_page_common you assume additional page references 
> > > (page_migratable returning -EAGAIN) means the code should try to writeout 
> > > the page.
> > > 
> > > Is that assumption always valid?
> > 
> > -EAGAIN means that the page may require to be written back 
> 
> But why is it needed to writeout pages? We shouldnt need to. At least
> from what I can understand.

The migration code allows each filesystem to implement its own
migration code or just use migrate_page_buffer() or
migrate_page_common(). 

migrate_page_common() is a default function if filesystem doesn't
implement anything. The function is the most generic and it tries
to writeback pages only if they are dirty and have buffers.

> > or
> > just to wait for a while since the page is just referred by system call 
> > or pagefault handler.
> 
> I'm not sure if making that assumption is always valid.
> 
> Kernel code can have an additional count on the page meaning "this page is pinned, 
> dont move it". At least that should be valid.

Yes, I know. I have checked all of the code.

AIO event buffers are pinned, therefore the memory-hotplug team plans
to make pages for the event buffers assigned to non-hotpluggable
memory regions.

And pages in sendfile() might be pinned for a while in case of network
problems. I think there may be some workarounds. The easiest way
is just waiting its timeout, and another way is changing the mode
of sendfile() to copy pages in advance. 

Pages for NFS also might be pinned with network problems.
One of the ideas is to restrict NFS to allocate pages from
specific memory region, sot that all memory except the region
can be hot-removed. And it's possible to implementing whole
migrate_page method, which may handled stuck pages.

If the migration code is used for memory defragmentation, pinned pages
must be avoided. I think it can be done with the non-blocking mode.

> Any piece of code which holds a reference on a page for a long 
> time is going to be a pain for the algorithm right?
> 

> > > 4) 
> > > About implementing a nonblocking version of it. The easier way, it
> > > seems to me, is to pass a "block" argument to generic_migrate_page() and
> > > use that.
> > 
> > Yes.
> 
> OK. I'll try to implement it this week (plus the radix_tree_replace 
> tag thingie).

Thank you for that.

Hirokazu Takahashi.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC] memory defragmentation to satisfy high order allocations
  2004-10-03 18:35                 ` Hirokazu Takahashi
@ 2004-10-03 19:21                   ` Trond Myklebust
  2004-10-03 20:03                     ` Hirokazu Takahashi
  2004-10-04 17:24                   ` Marcelo Tosatti
  1 sibling, 1 reply; 45+ messages in thread
From: Trond Myklebust @ 2004-10-03 19:21 UTC (permalink / raw)
  To: Hirokazu Takahashi
  Cc: Marcelo Tosatti, iwamoto, haveblue, Andrew Morton, linux-mm,
	piggin, arjanv, linux-kernel

Pa su , 03/10/2004 klokka 20:35, skreiv Hirokazu Takahashi:

> Pages for NFS also might be pinned with network problems.
> One of the ideas is to restrict NFS to allocate pages from
> specific memory region, sot that all memory except the region
> can be hot-removed. And it's possible to implementing whole
> migrate_page method, which may handled stuck pages.

Why do you want to special-case this?

The above is a generic condition: any filesystem can suffer from the
equivalent problem of a failure or slow response in the underlying
device. Making an NFS-specific hack is just counter-productive to
solving the generic problem.

Cheers,
  Trond

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC] memory defragmentation to satisfy high order allocations
  2004-10-03 19:21                   ` Trond Myklebust
@ 2004-10-03 20:03                     ` Hirokazu Takahashi
  2004-10-03 20:44                       ` Trond Myklebust
  0 siblings, 1 reply; 45+ messages in thread
From: Hirokazu Takahashi @ 2004-10-03 20:03 UTC (permalink / raw)
  To: trond.myklebust
  Cc: marcelo.tosatti, iwamoto, haveblue, akpm, linux-mm, piggin,
	arjanv, linux-kernel

Hello,

> > Pages for NFS also might be pinned with network problems.
> > One of the ideas is to restrict NFS to allocate pages from
> > specific memory region, sot that all memory except the region
> > can be hot-removed. And it's possible to implementing whole
> > migrate_page method, which may handled stuck pages.
> 
> Why do you want to special-case this?
>
> The above is a generic condition: any filesystem can suffer from the
> equivalent problem of a failure or slow response in the underlying
> device. Making an NFS-specific hack is just counter-productive to
> solving the generic problem.

However, while network is down network/cluster filesystems might not
release pages forever unlike in the case of block devices, which may
timeout or returns a error in case of failure.

Each filesystem can control what the migration code does.
If it doesn't have anything to help memory migration, it's possible
to wait for the network coming up before starting memory migration,
or give up it if the network happen to be down. That's no problem.

Thank you,
Hirokazu Takahashi.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC] memory defragmentation to satisfy high order allocations
  2004-10-03 20:03                     ` Hirokazu Takahashi
@ 2004-10-03 20:44                       ` Trond Myklebust
  2004-10-04 13:02                         ` Hirokazu Takahashi
  0 siblings, 1 reply; 45+ messages in thread
From: Trond Myklebust @ 2004-10-03 20:44 UTC (permalink / raw)
  To: Hirokazu Takahashi
  Cc: Marcelo Tosatti, iwamoto, haveblue, Andrew Morton, linux-mm,
	piggin, arjanv, linux-kernel

Pa su , 03/10/2004 klokka 22:03, skreiv Hirokazu Takahashi:

> However, while network is down network/cluster filesystems might not
> release pages forever unlike in the case of block devices, which may
> timeout or returns a error in case of failure.

Where is the difference? As far as the VM is concerned, it is a latency
problem. The fact of whether or not it is a permanent hang, a hang with
a long timeout, or just a slow device is irrelevant because the VM
doesn't actually know about these devices.

> Each filesystem can control what the migration code does.
> If it doesn't have anything to help memory migration, it's possible
> to wait for the network coming up before starting memory migration,
> or give up it if the network happen to be down. That's no problem.

Wrong. It *is* a problem: Filesystems aren't required to know anything
about the particulars of the underlying block/network/... device timeout
semantics either.

Think, for instance about EXT2. Where in the current code do you see
that it is required to detect that it is running on top of something
like the NBD device? Where does it figure out what the latencies of this
device is?

AFAICS, most filesystems in linux/fs/* have no knowledge whatsoever
about the underlying block/network/... devices and their timeout values.
Basing your decision about whether or not you need to manage high
latency situations just by inspecting the filesystem type is therefore
not going to give very reliable results.

Cheers,
  Trond

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC] memory defragmentation to satisfy high order allocations
  2004-10-03 20:44                       ` Trond Myklebust
@ 2004-10-04 13:02                         ` Hirokazu Takahashi
  0 siblings, 0 replies; 45+ messages in thread
From: Hirokazu Takahashi @ 2004-10-04 13:02 UTC (permalink / raw)
  To: trond.myklebust
  Cc: marcelo.tosatti, iwamoto, haveblue, akpm, linux-mm, piggin,
	arjanv, linux-kernel

Hello,

Yes, I know what you're talking about.
The current kernel doesn't have any features about it.

So that I've been wondering if there might be any good solution
to help memory hot-removal. It would be nice if there were support
from filesystems and block devices.

> > However, while network is down network/cluster filesystems might not
> > release pages forever unlike in the case of block devices, which may
> > timeout or returns a error in case of failure.
> 
> Where is the difference? As far as the VM is concerned, it is a latency
> problem. The fact of whether or not it is a permanent hang, a hang with
> a long timeout, or just a slow device is irrelevant because the VM
> doesn't actually know about these devices.
> 
> > Each filesystem can control what the migration code does.
> > If it doesn't have anything to help memory migration, it's possible
> > to wait for the network coming up before starting memory migration,
> > or give up it if the network happen to be down. That's no problem.
> 
> Wrong. It *is* a problem: Filesystems aren't required to know anything
> about the particulars of the underlying block/network/... device timeout
> semantics either.
> 
> Think, for instance about EXT2. Where in the current code do you see
> that it is required to detect that it is running on top of something
> like the NBD device? Where does it figure out what the latencies of this
> device is?
> 
> AFAICS, most filesystems in linux/fs/* have no knowledge whatsoever
> about the underlying block/network/... devices and their timeout values.
> Basing your decision about whether or not you need to manage high
> latency situations just by inspecting the filesystem type is therefore
> not going to give very reliable results.

Thank you,
Hirokazu Takahashi.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC] memory defragmentation to satisfy high order allocations
  2004-10-03 18:35                 ` Hirokazu Takahashi
  2004-10-03 19:21                   ` Trond Myklebust
@ 2004-10-04 17:24                   ` Marcelo Tosatti
  2004-10-05  2:53                     ` Hirokazu Takahashi
  1 sibling, 1 reply; 45+ messages in thread
From: Marcelo Tosatti @ 2004-10-04 17:24 UTC (permalink / raw)
  To: Hirokazu Takahashi
  Cc: iwamoto, haveblue, akpm, linux-mm, piggin, arjanv, linux-kernel

On Mon, Oct 04, 2004 at 03:35:59AM +0900, Hirokazu Takahashi wrote:
> Hi, Marcelo
> 
> > > > 2) 
> > > > At migrate_onepage you add anonymous pages which aren't swap allocated
> > > > to the swap cache
> > > > +       /*
> > > > +        * Put the page in a radix tree if it isn't in the tree yet.
> > > > +        */
> > > > +#ifdef CONFIG_SWAP
> > > > +       if (PageAnon(page) && !PageSwapCache(page))
> > > > +               if (!add_to_swap(page, GFP_KERNEL)) {
> > > > +                       unlock_page(page);
> > > > +                       return ERR_PTR(-ENOSPC);
> > > > +               }
> > > > +#endif /* CONFIG_SWAP */
> > > > 
> > > > Why's that? You can copy anonymous pages without adding them to swap (thats
> > > > what the patch I posted does).
> > > 
> > > The reason is to guarantee that any anonymous page can be migrated anytime.
> > > I want to block newly occurred accesses to the page during the migration
> > > because it can't be migrated if there remain some references on it by
> > > system calls, direct I/O and page faults.
> > 
> > It would be nice if we could block pte faults in a way such to not need
> > adding each anonymous page to swap. It can be too costly if you have a lot memory
> > and it makes the whole operation dependable on swap size (if you dont have enough
> > swap, you're dead).
> > 
> > Maybe hold mm->page_table_lock (might be too costly in terms of CPU time, but since
> > migration is not a common operation anyway), or create a semaphore? 
> 
> I think the problem of the holding mm->page_table_lock approach is
> that it doesn't allow the migration code blocked. the semaphore
> approach would be better.

OK, I think the problem is that can be more than one thread with different address 
spaces (ie different "current->mm", after fork) accessing the page.

Adding a waitqueue to "anon_vma" structure (to be slept at do_swap_page time, 
and awake after copy-page-and-flags/unlock), can be do the job I think.

> I have another idea that each anonymous page can detach its swap entry
> after its migration. 

Yes thats a nice idea.

> It can be done by remove_exclusive_swap_page()
> if the page is remapped to the same spaces forcibly by
> touch_unmapped_address() I made.

touch_unmapped_address() ?

> > > Your approach will work fine on most of anonymous pages, which aren't
> > > heavily accessed. I think it will be enough for memory defragmentation.
> > 
> > Yes...
> > 
> > > > 3) At migrate_page_common you assume additional page references 
> > > > (page_migratable returning -EAGAIN) means the code should try to writeout 
> > > > the page.
> > > > 
> > > > Is that assumption always valid?
> > > 
> > > -EAGAIN means that the page may require to be written back 
> > 
> > But why is it needed to writeout pages? We shouldnt need to. At least
> > from what I can understand.
> 
> The migration code allows each filesystem to implement its own
> migration code or just use migrate_page_buffer() or
> migrate_page_common(). 
> 
> migrate_page_common() is a default function if filesystem doesn't
> implement anything. The function is the most generic and it tries
> to writeback pages only if they are dirty and have buffers.

The thing is: What is the point of writing out pages?

We're just trying to migrate pages to another zone. 

If its under writeout, wait, if its dirty, just move it to the other
zone.

Can you enlight me?

> > > or
> > > just to wait for a while since the page is just referred by system call 
> > > or pagefault handler.
> > 
> > I'm not sure if making that assumption is always valid.
> > 
> > Kernel code can have an additional count on the page meaning "this page is pinned, 
> > dont move it". At least that should be valid.
> 
> Yes, I know. I have checked all of the code.
> 
> AIO event buffers are pinned, therefore the memory-hotplug team plans
> to make pages for the event buffers assigned to non-hotpluggable
> memory regions.
> 
> And pages in sendfile() might be pinned for a while in case of network
> problems. I think there may be some workarounds. The easiest way
> is just waiting its timeout, and another way is changing the mode
> of sendfile() to copy pages in advance. 
> 
> Pages for NFS also might be pinned with network problems.
> One of the ideas is to restrict NFS to allocate pages from
> specific memory region, sot that all memory except the region
> can be hot-removed. And it's possible to implementing whole
> migrate_page method, which may handled stuck pages.
> 
> If the migration code is used for memory defragmentation, pinned pages
> must be avoided. I think it can be done with the non-blocking mode.

Right.

> > Any piece of code which holds a reference on a page for a long 
> > time is going to be a pain for the algorithm right?
> > 
> 
> > > > 4) 
> > > > About implementing a nonblocking version of it. The easier way, it
> > > > seems to me, is to pass a "block" argument to generic_migrate_page() and
> > > > use that.
> > > 
> > > Yes.
> > 
> > OK. I'll try to implement it this week (plus the radix_tree_replace 
> > tag thingie).
> 
> Thank you for that.

Any news about Iwamoto's test programs? :)
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC] memory defragmentation to satisfy high order allocations
  2004-10-04 17:24                   ` Marcelo Tosatti
@ 2004-10-05  2:53                     ` Hirokazu Takahashi
  2004-10-07 12:06                       ` Marcelo Tosatti
  0 siblings, 1 reply; 45+ messages in thread
From: Hirokazu Takahashi @ 2004-10-05  2:53 UTC (permalink / raw)
  To: marcelo.tosatti
  Cc: iwamoto, haveblue, akpm, linux-mm, piggin, arjanv, linux-kernel

Hi, Marcelo

> > > > > Why's that? You can copy anonymous pages without adding them to swap (thats
> > > > > what the patch I posted does).
> > > > 
> > > > The reason is to guarantee that any anonymous page can be migrated anytime.
> > > > I want to block newly occurred accesses to the page during the migration
> > > > because it can't be migrated if there remain some references on it by
> > > > system calls, direct I/O and page faults.
> > > 
> > > It would be nice if we could block pte faults in a way such to not need
> > > adding each anonymous page to swap. It can be too costly if you have a lot memory
> > > and it makes the whole operation dependable on swap size (if you dont have enough
> > > swap, you're dead).
> > > 
> > > Maybe hold mm->page_table_lock (might be too costly in terms of CPU time, but since
> > > migration is not a common operation anyway), or create a semaphore? 
> > 
> > I think the problem of the holding mm->page_table_lock approach is
> > that it doesn't allow the migration code blocked. the semaphore
> > approach would be better.
> 
> OK, I think the problem is that can be more than one thread with different address 
> spaces (ie different "current->mm", after fork) accessing the page.

Yes. 

> Adding a waitqueue to "anon_vma" structure (to be slept at do_swap_page time, 
> and awake after copy-page-and-flags/unlock), can be do the job I think.
> 
> > I have another idea that each anonymous page can detach its swap entry
> > after its migration. 
> 
> Yes thats a nice idea.

I'll do it in a week.

> > It can be done by remove_exclusive_swap_page()
> > if the page is remapped to the same spaces forcibly by
> > touch_unmapped_address() I made.
> 
> touch_unmapped_address() ?

I've made two functions.

record_unmapped_address() holds mm and address where a target page
has been mapped. touch_unmapped_address() remaps the page again.

> > > > Your approach will work fine on most of anonymous pages, which aren't
> > > > heavily accessed. I think it will be enough for memory defragmentation.
> > > 
> > > Yes...
> > > 
> > > > > 3) At migrate_page_common you assume additional page references 
> > > > > (page_migratable returning -EAGAIN) means the code should try to writeout 
> > > > > the page.
> > > > > 
> > > > > Is that assumption always valid?
> > > > 
> > > > -EAGAIN means that the page may require to be written back 
> > > 
> > > But why is it needed to writeout pages? We shouldnt need to. At least
> > > from what I can understand.
> > 
> > The migration code allows each filesystem to implement its own
> > migration code or just use migrate_page_buffer() or
> > migrate_page_common(). 
> > 
> > migrate_page_common() is a default function if filesystem doesn't
> > implement anything. The function is the most generic and it tries
> > to writeback pages only if they are dirty and have buffers.
> 
> The thing is: What is the point of writing out pages?

It was the easiest way to handle pages with buffers when Iwamoto
and I started to implement it. We thought it was slow but it would
work for all kinds of filesystems.

> We're just trying to migrate pages to another zone. 
> 
> If its under writeout, wait, if its dirty, just move it to the other
> zone.
> 
> Can you enlight me?

Yes, I also realize that.
migrate_page_buffer() will do this, but I'm not certain it will work
for all kinds of filesystems. I guess there might be some exceptions.
We may need a special operation to handle pages on a filesystem,
which has releasepage method.


Thanks,
Hirokazu Takahashi.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC] memory defragmentation to satisfy high order allocations
  2004-10-05  2:53                     ` Hirokazu Takahashi
@ 2004-10-07 12:06                       ` Marcelo Tosatti
  2004-10-08  7:00                         ` Hirokazu Takahashi
  0 siblings, 1 reply; 45+ messages in thread
From: Marcelo Tosatti @ 2004-10-07 12:06 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: iwamoto, haveblue, linux-mm

On Tue, Oct 05, 2004 at 11:53:47AM +0900, Hirokazu Takahashi wrote:

> It was the easiest way to handle pages with buffers when Iwamoto
> and I started to implement it. We thought it was slow but it would
> work for all kinds of filesystems.
> 
> > We're just trying to migrate pages to another zone. 
> > 
> > If its under writeout, wait, if its dirty, just move it to the other
> > zone.
> > 
> > Can you enlight me?
> 
> Yes, I also realize that.
> migrate_page_buffer() will do this, but I'm not certain it will work
> for all kinds of filesystems. I guess there might be some exceptions.
> We may need a special operation to handle pages on a filesystem,
> which has releasepage method.

It seems there is typo in the current version of the patch:

int try_to_migrate_pages(struct list_head *page_list)
{
...
        current->flags |= PF_KSWAPD;    /*  It's fake */
        list_for_each_entry_safe(page, page2, page_list, lru) {
                /*
                 * Start writeback I/O if it's a dirty page with buffers
                 * and it doesn't have migrate_page method.
                 */
                if (PageDirty(page) && PagePrivate(page)) {
                        if (!TestSetPageLocked(page)) {
                                mapping = page_mapping(page);
                                if (!mapping || mapping->a_ops->migrate_page ||
						^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                                    pageout(page, mapping) != PAGE_SUCCESS) {
                                        unlock_page(page);
                                }
                        }
                }


Shouldnt that be "!mapping->a_ops->migrate_page"?

That is, if we can't migrate the page, try to write it out?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC] memory defragmentation to satisfy high order allocations
  2004-10-07 12:06                       ` Marcelo Tosatti
@ 2004-10-08  7:00                         ` Hirokazu Takahashi
  2004-10-08 10:00                           ` Marcelo Tosatti
  0 siblings, 1 reply; 45+ messages in thread
From: Hirokazu Takahashi @ 2004-10-08  7:00 UTC (permalink / raw)
  To: marcelo.tosatti; +Cc: iwamoto, haveblue, linux-mm

Hi, Marcelo.

> It seems there is typo in the current version of the patch:
> 
> int try_to_migrate_pages(struct list_head *page_list)
> {
> ...
>         current->flags |= PF_KSWAPD;    /*  It's fake */
>         list_for_each_entry_safe(page, page2, page_list, lru) {
>                 /*
>                  * Start writeback I/O if it's a dirty page with buffers
>                  * and it doesn't have migrate_page method.
>                  */
>                 if (PageDirty(page) && PagePrivate(page)) {
>                         if (!TestSetPageLocked(page)) {
>                                 mapping = page_mapping(page);
>                                 if (!mapping || mapping->a_ops->migrate_page ||
a> 						^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>                                     pageout(page, mapping) != PAGE_SUCCESS) {
>                                         unlock_page(page);
>                                 }
>                         }
>                 }
> 
> 
> Shouldnt that be "!mapping->a_ops->migrate_page"?

"mapping->a_ops->migrate_page" is correct.

This code is just for optimization. If mapping->a_ops->migrate_page
isn't implemented, migrate_page_common() is used to migrate the page.
migrate_page_common() will try to write it back if it's dirty, so that
it would be better to start writeback I/O for target pages in advance
without waiting the I/O completions.

As you may know the migration code will work fine without this code.

> That is, if we can't migrate the page, try to write it out?

Thank you,
Hirokazu Takahashi.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC] memory defragmentation to satisfy high order allocations
  2004-10-08  7:00                         ` Hirokazu Takahashi
@ 2004-10-08 10:00                           ` Marcelo Tosatti
  2004-10-08 12:23                             ` Hirokazu Takahashi
  0 siblings, 1 reply; 45+ messages in thread
From: Marcelo Tosatti @ 2004-10-08 10:00 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: iwamoto, haveblue, linux-mm

On Fri, Oct 08, 2004 at 04:00:28PM +0900, Hirokazu Takahashi wrote:
> Hi, Marcelo.
> 
> > It seems there is typo in the current version of the patch:
> > 
> > int try_to_migrate_pages(struct list_head *page_list)
> > {
> > ...
> >         current->flags |= PF_KSWAPD;    /*  It's fake */
> >         list_for_each_entry_safe(page, page2, page_list, lru) {
> >                 /*
> >                  * Start writeback I/O if it's a dirty page with buffers
> >                  * and it doesn't have migrate_page method.
> >                  */
> >                 if (PageDirty(page) && PagePrivate(page)) {
> >                         if (!TestSetPageLocked(page)) {
> >                                 mapping = page_mapping(page);
> >                                 if (!mapping || mapping->a_ops->migrate_page ||
> a> 						^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >                                     pageout(page, mapping) != PAGE_SUCCESS) {
> >                                         unlock_page(page);
> >                                 }
> >                         }
> >                 }
> > 
> > 
> > Shouldnt that be "!mapping->a_ops->migrate_page"?
> 
> "mapping->a_ops->migrate_page" is correct.
> 
> This code is just for optimization. If mapping->a_ops->migrate_page
> isn't implemented, migrate_page_common() is used to migrate the page.
> migrate_page_common() will try to write it back if it's dirty, so that
> it would be better to start writeback I/O for target pages in advance
> without waiting the I/O completions.

Right. But if migrate_page _is_ implemented, we also start writeback! 

Shouldnt that be "if we dont have migrate_page(), start writeback, since 
in this case well use migrate_page_common() anyway."

It seems the logic is inverted, or maybe I'm wrong.

> As you may know the migration code will work fine without this code.

OK!

> > That is, if we can't migrate the page, try to write it out?

I just didnt understand the logic very well, maybe I should just 
go reread the code.

Thanks!
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC] memory defragmentation to satisfy high order allocations
  2004-10-08 10:00                           ` Marcelo Tosatti
@ 2004-10-08 12:23                             ` Hirokazu Takahashi
  2004-10-08 12:41                               ` Marcelo Tosatti
  0 siblings, 1 reply; 45+ messages in thread
From: Hirokazu Takahashi @ 2004-10-08 12:23 UTC (permalink / raw)
  To: marcelo.tosatti; +Cc: iwamoto, haveblue, linux-mm

Hi, Marcelo.

> On Fri, Oct 08, 2004 at 04:00:28PM +0900, Hirokazu Takahashi wrote:
> > Hi, Marcelo.
> > 
> > > It seems there is typo in the current version of the patch:
> > > 
> > > int try_to_migrate_pages(struct list_head *page_list)
> > > {
> > > ...
> > >         current->flags |= PF_KSWAPD;    /*  It's fake */
> > >         list_for_each_entry_safe(page, page2, page_list, lru) {
> > >                 /*
> > >                  * Start writeback I/O if it's a dirty page with buffers
> > >                  * and it doesn't have migrate_page method.
> > >                  */
> > >                 if (PageDirty(page) && PagePrivate(page)) {
> > >                         if (!TestSetPageLocked(page)) {
> > >                                 mapping = page_mapping(page);
> > >                                 if (!mapping || mapping->a_ops->migrate_page ||
> > > 						^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > >                                     pageout(page, mapping) != PAGE_SUCCESS) {
> > >                                         unlock_page(page);
> > >                                 }
> > >                         }
> > >                 }
> > > 
> > > 
> > > Shouldnt that be "!mapping->a_ops->migrate_page"?
> > 
> > "mapping->a_ops->migrate_page" is correct.
> > 
> > This code is just for optimization. If mapping->a_ops->migrate_page
> > isn't implemented, migrate_page_common() is used to migrate the page.
> > migrate_page_common() will try to write it back if it's dirty, so that
> > it would be better to start writeback I/O for target pages in advance
> > without waiting the I/O completions.
> 
> Right. But if migrate_page _is_ implemented, we also start writeback! 
> 
> Shouldnt that be "if we dont have migrate_page(), start writeback, since 
> in this case well use migrate_page_common() anyway."
> 
> It seems the logic is inverted, or maybe I'm wrong.

It's my understanding that the previous code is equivalent to
the following code. Am I missing something?

if (PageDirty(page) && PagePrivate(page)) {
        if (!TestSetPageLocked(page)) {
                mapping = page_mapping(page);
                if (!mapping) {
                        unlock_page(page);
		} else if (mapping->a_ops->migrate_page) {
                        unlock_page(page);
		} else if (pageout(page, mapping) != PAGE_SUCCESS) {
                        unlock_page(page);
		}
	}
}


> > As you may know the migration code will work fine without this code.
> 
> OK!
> 
> > > That is, if we can't migrate the page, try to write it out?
> 
> I just didnt understand the logic very well, maybe I should just 
> go reread the code.
> 
> Thanks!
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC] memory defragmentation to satisfy high order allocations
  2004-10-08 12:23                             ` Hirokazu Takahashi
@ 2004-10-08 12:41                               ` Marcelo Tosatti
  2004-10-08 16:52                                 ` Hirokazu Takahashi
  0 siblings, 1 reply; 45+ messages in thread
From: Marcelo Tosatti @ 2004-10-08 12:41 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: iwamoto, haveblue, linux-mm

On Fri, Oct 08, 2004 at 09:23:19PM +0900, Hirokazu Takahashi wrote:
> Hi, Marcelo.
> 
> > On Fri, Oct 08, 2004 at 04:00:28PM +0900, Hirokazu Takahashi wrote:
> > > Hi, Marcelo.
> > > 
> > > > It seems there is typo in the current version of the patch:
> > > > 
> > > > int try_to_migrate_pages(struct list_head *page_list)
> > > > {
> > > > ...
> > > >         current->flags |= PF_KSWAPD;    /*  It's fake */
> > > >         list_for_each_entry_safe(page, page2, page_list, lru) {
> > > >                 /*
> > > >                  * Start writeback I/O if it's a dirty page with buffers
> > > >                  * and it doesn't have migrate_page method.
> > > >                  */
> > > >                 if (PageDirty(page) && PagePrivate(page)) {
> > > >                         if (!TestSetPageLocked(page)) {
> > > >                                 mapping = page_mapping(page);
> > > >                                 if (!mapping || mapping->a_ops->migrate_page ||
> > > > 						^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > >                                     pageout(page, mapping) != PAGE_SUCCESS) {
> > > >                                         unlock_page(page);
> > > >                                 }
> > > >                         }
> > > >                 }
> > > > 
> > > > 
> > > > Shouldnt that be "!mapping->a_ops->migrate_page"?
> > > 
> > > "mapping->a_ops->migrate_page" is correct.
> > > 
> > > This code is just for optimization. If mapping->a_ops->migrate_page
> > > isn't implemented, migrate_page_common() is used to migrate the page.
> > > migrate_page_common() will try to write it back if it's dirty, so that
> > > it would be better to start writeback I/O for target pages in advance
> > > without waiting the I/O completions.
> > 
> > Right. But if migrate_page _is_ implemented, we also start writeback! 
> > 
> > Shouldnt that be "if we dont have migrate_page(), start writeback, since 
> > in this case well use migrate_page_common() anyway."
> > 
> > It seems the logic is inverted, or maybe I'm wrong.
> 
> It's my understanding that the previous code is equivalent to
> the following code. Am I missing something?
> 
> if (PageDirty(page) && PagePrivate(page)) {
>         if (!TestSetPageLocked(page)) {
>                 mapping = page_mapping(page);
>                 if (!mapping) {
>                         unlock_page(page);
> 		} else if (mapping->a_ops->migrate_page) {
>                         unlock_page(page);
> 		} else if (pageout(page, mapping) != PAGE_SUCCESS) {
>                         unlock_page(page);
> 		}
> 	}
> }

OK I'm wrong.

> > > As you may know the migration code will work fine without this code.
> > 
> > OK!
> > 
> > > > That is, if we can't migrate the page, try to write it out?
> > 
> > I just didnt understand the logic very well, maybe I should just 
> > go reread the code.
> > 
> > Thanks!

I'm thinking about how to implement a nonblocking version of generic_migrate_page().

For this purpose its really bad to allocate swap space to anonymous pages, well
need to figure out someother way of blocking the users via pagetablefault.

Like a "virtual" swap space but without allocating swap map space. 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC] memory defragmentation to satisfy high order allocations
  2004-10-08 12:41                               ` Marcelo Tosatti
@ 2004-10-08 16:52                                 ` Hirokazu Takahashi
  2004-10-08 15:36                                   ` Marcelo Tosatti
  0 siblings, 1 reply; 45+ messages in thread
From: Hirokazu Takahashi @ 2004-10-08 16:52 UTC (permalink / raw)
  To: marcelo.tosatti; +Cc: iwamoto, haveblue, linux-mm

Hi, Marcelo.

> > > > > That is, if we can't migrate the page, try to write it out?
> > > 
> > > I just didnt understand the logic very well, maybe I should just 
> > > go reread the code.
> > > 
> > > Thanks!
> 
> I'm thinking about how to implement a nonblocking version of generic_migrate_page().
> 
> For this purpose its really bad to allocate swap space to anonymous pages, well
> need to figure out someother way of blocking the users via pagetablefault.
> 
> Like a "virtual" swap space but without allocating swap map space. 

I've also ever thought to implement such a device.
It would be nice if you can design it simple.

Mr.Iwamoto thought otherwise and posted another opinion on the lhms
list, though. I felt it also has a point.

iwamoto> I don't think requiring swap is a big deal.  If you don't have a
iwamoto> dedicated swap device, which case I think unusual, you can swapon a
iwamoto> regular file.

Thanks,
Hirokazu Takahashi.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC] memory defragmentation to satisfy high order allocations
  2004-10-08 16:52                                 ` Hirokazu Takahashi
@ 2004-10-08 15:36                                   ` Marcelo Tosatti
  2004-10-12 10:56                                     ` IWAMOTO Toshihiro
  0 siblings, 1 reply; 45+ messages in thread
From: Marcelo Tosatti @ 2004-10-08 15:36 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: iwamoto, haveblue, linux-mm

On Sat, Oct 09, 2004 at 01:52:39AM +0900, Hirokazu Takahashi wrote:
> Hi, Marcelo.
> 
> > > > > > That is, if we can't migrate the page, try to write it out?
> > > > 
> > > > I just didnt understand the logic very well, maybe I should just 
> > > > go reread the code.
> > > > 
> > > > Thanks!
> > 
> > I'm thinking about how to implement a nonblocking version of generic_migrate_page().
> > 
> > For this purpose its really bad to allocate swap space to anonymous pages, well
> > need to figure out someother way of blocking the users via pagetablefault.
> > 
> > Like a "virtual" swap space but without allocating swap map space. 
> 
> I've also ever thought to implement such a device.
> It would be nice if you can design it simple.
> 
> Mr.Iwamoto thought otherwise and posted another opinion on the lhms
> list, though. I felt it also has a point.
> 
> iwamoto> I don't think requiring swap is a big deal.  If you don't have a
> iwamoto> dedicated swap device, which case I think unusual, you can swapon a
> iwamoto> regular file.

Sure its not a big deal, but nicer if it doesnt require swap.

For memory defragmentation it is a big deal.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC] memory defragmentation to satisfy high order allocations
  2004-10-08 15:36                                   ` Marcelo Tosatti
@ 2004-10-12 10:56                                     ` IWAMOTO Toshihiro
  2004-10-12 10:35                                       ` Marcelo Tosatti
  2004-10-12 14:26                                       ` Martin J. Bligh
  0 siblings, 2 replies; 45+ messages in thread
From: IWAMOTO Toshihiro @ 2004-10-12 10:56 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Hirokazu Takahashi, iwamoto, haveblue, linux-mm

At Fri, 8 Oct 2004 12:36:46 -0300,
Marcelo Tosatti wrote:
> 
> On Sat, Oct 09, 2004 at 01:52:39AM +0900, Hirokazu Takahashi wrote:
> > Hi, Marcelo.
> > 
> > > > > > > That is, if we can't migrate the page, try to write it out?
> > > > > 
> > > > > I just didnt understand the logic very well, maybe I should just 
> > > > > go reread the code.
> > > > > 
> > > > > Thanks!
> > > 
> > > I'm thinking about how to implement a nonblocking version of generic_migrate_page().
> > > 
> > > For this purpose its really bad to allocate swap space to anonymous pages, well
> > > need to figure out someother way of blocking the users via pagetablefault.
> > > 
> > > Like a "virtual" swap space but without allocating swap map space. 
> > 
> > I've also ever thought to implement such a device.
> > It would be nice if you can design it simple.
> > 
> > Mr.Iwamoto thought otherwise and posted another opinion on the lhms
> > list, though. I felt it also has a point.
> > 
> > iwamoto> I don't think requiring swap is a big deal.  If you don't have a
> > iwamoto> dedicated swap device, which case I think unusual, you can swapon a
> > iwamoto> regular file.
> 
> Sure its not a big deal, but nicer if it doesnt require swap.

> For memory defragmentation it is a big deal.

Why?  IMO, it isn't very rewarding to tune memory
migration/defragmentation performance as they involve memory copy
anyway.

Or, do you want memory defragmentation everywhere, including embedded
systems?

--
IWAMOTO Toshihiro
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC] memory defragmentation to satisfy high order allocations
  2004-10-12 10:56                                     ` IWAMOTO Toshihiro
@ 2004-10-12 10:35                                       ` Marcelo Tosatti
  2004-10-12 17:55                                         ` Hirokazu Takahashi
  2004-10-12 14:26                                       ` Martin J. Bligh
  1 sibling, 1 reply; 45+ messages in thread
From: Marcelo Tosatti @ 2004-10-12 10:35 UTC (permalink / raw)
  To: IWAMOTO Toshihiro; +Cc: Hirokazu Takahashi, haveblue, linux-mm

On Tue, Oct 12, 2004 at 07:56:57PM +0900, IWAMOTO Toshihiro wrote:
> At Fri, 8 Oct 2004 12:36:46 -0300,
> Marcelo Tosatti wrote:
> > 
> > On Sat, Oct 09, 2004 at 01:52:39AM +0900, Hirokazu Takahashi wrote:
> > > Hi, Marcelo.
> > > 
> > > > > > > > That is, if we can't migrate the page, try to write it out?
> > > > > > 
> > > > > > I just didnt understand the logic very well, maybe I should just 
> > > > > > go reread the code.
> > > > > > 
> > > > > > Thanks!
> > > > 
> > > > I'm thinking about how to implement a nonblocking version of generic_migrate_page().
> > > > 
> > > > For this purpose its really bad to allocate swap space to anonymous pages, well
> > > > need to figure out someother way of blocking the users via pagetablefault.
> > > > 
> > > > Like a "virtual" swap space but without allocating swap map space. 
> > > 
> > > I've also ever thought to implement such a device.
> > > It would be nice if you can design it simple.
> > > 
> > > Mr.Iwamoto thought otherwise and posted another opinion on the lhms
> > > list, though. I felt it also has a point.
> > > 
> > > iwamoto> I don't think requiring swap is a big deal.  If you don't have a
> > > iwamoto> dedicated swap device, which case I think unusual, you can swapon a
> > > iwamoto> regular file.
> > 
> > Sure its not a big deal, but nicer if it doesnt require swap.
> 
> > For memory defragmentation it is a big deal.
> 
> Why?  IMO, it isn't very rewarding to tune memory
> migration/defragmentation performance as they involve memory copy
> anyway.

Hi Iwamoto,

Oh yes, they already involve memory copy, but then if they use swap
its worse!

> Or, do you want memory defragmentation everywhere, including embedded
> systems?

Yes I want defragmentation everywhere!

The thing is grabbing swap pages for memory migration is 
not a very optimal operation.

First, we interfere with swap allocation patterns (if true
swap is going on at the moment, we screw up is performance). 

We are probably also not going to use that swap space, 
there is no point in allocating it.

And finally if we run out of swap space, we are dead. 

Sure, it works fine with all this restrictions, but it would
be an advantage if we didnt had such swap usage overhead for memory
migration.

I'm writting a "migration cache" - its basically a swapcache without
backing store, instead we use idr (lib/idr.c) to allocate the offsets.

It will be much faster and not interfere with swap space.

I'll use one bit of "swap type" to identify such "migration pte's".

I'll test it with memory migration operation first then with 
memory defragmentation.

Hope it works fine.


struct idr migration_idr;
struct address_space migration_space = {
        .page_tree      = RADIX_TREE_INIT(GFP_ATOMIC),
        .tree_lock      = RW_LOCK_UNLOCKED,
        .a_ops          = NULL,
        .flags          = GFP_HIGHUSER,
        .i_mmap_nonlinear = LIST_HEAD_INIT(migration_space.i_mmap_nonlinear),
        .backing_dev_info = NULL,
};

int init_migration_cache(void) 
{
	idr_init(&migration_idr);

	printk(KERN_INFO "Initializating migration cache!\n");

}

__initcall(init_migration_cache);

struct page *lookup_migration_cache(int id) { 
	return find_get_page(&migration_space, id);
}

int remove_from_migration_cache(struct page *page, int id)
{
	write_lock_irq(&migration_space.tree_lock);
        idr_remove(&migration_idr, id);
	radix_tree_delete(&migration_space.page_tree, id);
	write_unlock_irq(&migration_space.tree_lock);
}

int add_to_migration_cache(struct page *page) 
{
	int error, offset;
	int gfp_mask = GFP_KERNEL;

	BUG_ON(PageSwapCache(page));
	BUG_ON(PagePrivate(page));

        if (idr_pre_get(&migration_idr, GFP_ATOMIC) == 0)
                return -ENOMEM;

	error = radix_tree_preload(gfp_mask);

	if (!error) {
		write_lock_irq(&migration_space.tree_lock);
	        error = idr_get_new(&migration_idr, NULL, &offset);

		error = radix_tree_insert(&migration_space.page_tree, offset,
							page);

		if (!error) {
			page_cache_get(page);
			SetPageLocked(page);
			page->private = offset;
			page->mapping = &migration_space;
		}
		write_unlock_irq(&migration_cache.tree_lock);
                radix_tree_preload_end();

	}

	return error;
}
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC] memory defragmentation to satisfy high order allocations
  2004-10-12 10:35                                       ` Marcelo Tosatti
@ 2004-10-12 17:55                                         ` Hirokazu Takahashi
  0 siblings, 0 replies; 45+ messages in thread
From: Hirokazu Takahashi @ 2004-10-12 17:55 UTC (permalink / raw)
  To: marcelo.tosatti; +Cc: iwamoto, haveblue, linux-mm

Hi,

> It will be much faster and not interfere with swap space.
> 
> I'll use one bit of "swap type" to identify such "migration pte's".
>
> I'll test it with memory migration operation first then with 
> memory defragmentation.
> 
> Hope it works fine.

IMHO, if one "swap type" is reserved for it, a target page can be
inserted in swapper_space directly. I guess it may be possible to
reuse the swapper_space and some of the existing codes, which should
be moved out of #ifdef CONFIG_SWAP though.

> struct idr migration_idr;
> struct address_space migration_space = {
>         .page_tree      = RADIX_TREE_INIT(GFP_ATOMIC),
>         .tree_lock      = RW_LOCK_UNLOCKED,
>         .a_ops          = NULL,
>         .flags          = GFP_HIGHUSER,
>         .i_mmap_nonlinear = LIST_HEAD_INIT(migration_space.i_mmap_nonlinear),
>         .backing_dev_info = NULL,
> };
> 
> int init_migration_cache(void) 
> {
> 	idr_init(&migration_idr);
> 
> 	printk(KERN_INFO "Initializating migration cache!\n");
> 
> }
> 
> __initcall(init_migration_cache);
> 
> struct page *lookup_migration_cache(int id) { 
> 	return find_get_page(&migration_space, id);
> }
> 
> int remove_from_migration_cache(struct page *page, int id)
> {
> 	write_lock_irq(&migration_space.tree_lock);
>         idr_remove(&migration_idr, id);
> 	radix_tree_delete(&migration_space.page_tree, id);
> 	write_unlock_irq(&migration_space.tree_lock);
> }
> 
> int add_to_migration_cache(struct page *page) 
> {
> 	int error, offset;
> 	int gfp_mask = GFP_KERNEL;
> 
> 	BUG_ON(PageSwapCache(page));
> 	BUG_ON(PagePrivate(page));
> 
>         if (idr_pre_get(&migration_idr, GFP_ATOMIC) == 0)
>                 return -ENOMEM;

I guess GFP_KERNEL is enough.

> 	error = radix_tree_preload(gfp_mask);
> 
> 	if (!error) {
> 		write_lock_irq(&migration_space.tree_lock);
> 	        error = idr_get_new(&migration_idr, NULL, &offset);
> 
> 		error = radix_tree_insert(&migration_space.page_tree, offset,
> 							page);
> 
> 		if (!error) {
> 			page_cache_get(page);
> 			SetPageLocked(page);
> 			page->private = offset;
> 			page->mapping = &migration_space;
> 		}
> 		write_unlock_irq(&migration_cache.tree_lock);
>                 radix_tree_preload_end();
> 
> 	}
> 
> 	return error;
> }
> 
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC] memory defragmentation to satisfy high order allocations
  2004-10-12 10:56                                     ` IWAMOTO Toshihiro
  2004-10-12 10:35                                       ` Marcelo Tosatti
@ 2004-10-12 14:26                                       ` Martin J. Bligh
  2004-10-12 12:17                                         ` Marcelo Tosatti
  2004-10-12 15:01                                         ` Dave Hansen
  1 sibling, 2 replies; 45+ messages in thread
From: Martin J. Bligh @ 2004-10-12 14:26 UTC (permalink / raw)
  To: IWAMOTO Toshihiro, Marcelo Tosatti; +Cc: Hirokazu Takahashi, haveblue, linux-mm

>> > iwamoto> I don't think requiring swap is a big deal.  If you don't have a
>> > iwamoto> dedicated swap device, which case I think unusual, you can swapon a
>> > iwamoto> regular file.
>> 
>> Sure its not a big deal, but nicer if it doesnt require swap.
> 
>> For memory defragmentation it is a big deal.
> 
> Why?  IMO, it isn't very rewarding to tune memory
> migration/defragmentation performance as they involve memory copy
> anyway.
> 
> Or, do you want memory defragmentation everywhere, including embedded
> systems?

Lots of systems nowadays don't have swap configured, not just embedded.
What do we gain from making defrag slower and harder to use, by forcing
it to use swap? Isn't pushing it into the swapcache sufficient?

M.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC] memory defragmentation to satisfy high order allocations
  2004-10-12 14:26                                       ` Martin J. Bligh
@ 2004-10-12 12:17                                         ` Marcelo Tosatti
  2004-10-12 15:01                                         ` Dave Hansen
  1 sibling, 0 replies; 45+ messages in thread
From: Marcelo Tosatti @ 2004-10-12 12:17 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: IWAMOTO Toshihiro, Hirokazu Takahashi, haveblue, linux-mm

On Tue, Oct 12, 2004 at 07:26:32AM -0700, Martin J. Bligh wrote:
> >> > iwamoto> I don't think requiring swap is a big deal.  If you don't have a
> >> > iwamoto> dedicated swap device, which case I think unusual, you can swapon a
> >> > iwamoto> regular file.
> >> 
> >> Sure its not a big deal, but nicer if it doesnt require swap.
> > 
> >> For memory defragmentation it is a big deal.
> > 
> > Why?  IMO, it isn't very rewarding to tune memory
> > migration/defragmentation performance as they involve memory copy
> > anyway.
> > 
> > Or, do you want memory defragmentation everywhere, including embedded
> > systems?
> 
> Lots of systems nowadays don't have swap configured, not just embedded.
> What do we gain from making defrag slower and harder to use, by forcing
> it to use swap? Isn't pushing it into the swapcache sufficient?

Hi Martin,

Yes pushing it to swapcache is sufficient - but doing so requires swap
map space (the "index" for swapcache pages is retrieved from swap map space 
position).

As I posted in the other message I'm working on a idr-based cache (migration cache)
which should solve things.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC] memory defragmentation to satisfy high order allocations
  2004-10-12 14:26                                       ` Martin J. Bligh
  2004-10-12 12:17                                         ` Marcelo Tosatti
@ 2004-10-12 15:01                                         ` Dave Hansen
  1 sibling, 0 replies; 45+ messages in thread
From: Dave Hansen @ 2004-10-12 15:01 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: IWAMOTO Toshihiro, Marcelo Tosatti, Hirokazu Takahashi, linux-mm

On Tue, 2004-10-12 at 07:26, Martin J. Bligh wrote:
> Lots of systems nowadays don't have swap configured, not just embedded.
> What do we gain from making defrag slower and harder to use, by forcing
> it to use swap? Isn't pushing it into the swapcache sufficient?

For now, with no swap space configured and CONFIG_SWAP=y, no pages will
even make it into the swap cache.  It'll take more code on top of what
we have to get that to work.  So, we're sticking with the smallest
amount of code that we can for now.  We'll fix that up later.  

-- Dave

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC] memory defragmentation to satisfy high order allocations
  2004-10-03 14:07               ` Marcelo Tosatti
  2004-10-03 18:35                 ` Hirokazu Takahashi
@ 2004-10-04  3:24                 ` IWAMOTO Toshihiro
  1 sibling, 0 replies; 45+ messages in thread
From: IWAMOTO Toshihiro @ 2004-10-04  3:24 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Hirokazu Takahashi, iwamoto, haveblue, akpm, linux-mm, piggin,
	arjanv, linux-kernel

At Sun, 3 Oct 2004 11:07:23 -0300,
Marcelo Tosatti wrote:
> 
> On Sun, Oct 03, 2004 at 01:13:38PM +0900, Hirokazu Takahashi wrote:
> > > 2) 
> > > At migrate_onepage you add anonymous pages which aren't swap allocated
> > > to the swap cache
> > > +       /*
> > > +        * Put the page in a radix tree if it isn't in the tree yet.
> > > +        */
> > > +#ifdef CONFIG_SWAP
> > > +       if (PageAnon(page) && !PageSwapCache(page))
> > > +               if (!add_to_swap(page, GFP_KERNEL)) {
> > > +                       unlock_page(page);
> > > +                       return ERR_PTR(-ENOSPC);
> > > +               }
> > > +#endif /* CONFIG_SWAP */
> > > 
> > > Why's that? You can copy anonymous pages without adding them to swap (thats
> > > what the patch I posted does).
> > 
> > The reason is to guarantee that any anonymous page can be migrated anytime.
> > I want to block newly occurred accesses to the page during the migration
> > because it can't be migrated if there remain some references on it by
> > system calls, direct I/O and page faults.
> 
> It would be nice if we could block pte faults in a way such to not need
> adding each anonymous page to swap. It can be too costly if you have a lot memory
> and it makes the whole operation dependable on swap size (if you dont have enough
> swap, you're dead).
> 
> Maybe hold mm->page_table_lock (might be too costly in terms of CPU time, but since
> migration is not a common operation anyway), or create a semaphore? 

I chose the swap cache based implementation in order to minimize
slowdown of the normal code path.  (I thought there's zero code
addition on the normal pagefault path when I designed this, but it's
no longer true...)

If we can agree on adding a new lock, there might be a better
implementation.

--
IWAMOTO Toshihiro
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC] memory defragmentation to satisfy high order allocations
  2004-10-03  4:13             ` Hirokazu Takahashi
  2004-10-03 14:07               ` Marcelo Tosatti
@ 2004-10-04  2:22               ` Dave Hansen
  1 sibling, 0 replies; 45+ messages in thread
From: Dave Hansen @ 2004-10-04  2:22 UTC (permalink / raw)
  To: Hirokazu Takahashi
  Cc: marcelo.tosatti, IWAMOTO Toshihiro, Andrew Morton, linux-mm,
	piggin, Arjan van de Ven, Linux Kernel Mailing List

On Sat, 2004-10-02 at 21:13, Hirokazu Takahashi wrote:
> > Questions: are there any documents on the memory hotplug userspace tools? 
> > Where can I find them?
> 
> IBM guys and Fujitsu guys are designing user interface independently.
> IBM team is implementing memory section hotplug while Fujitsu team
> try to implement NUMA node hotplug. But both of the designs use
> regular hot-plug mechanism, which kicks /sbin/hotplug script to control
> devices via sysfs.
> 
> Dave, would you explain about it?

First of all, we're still on the first set of these APIs.  So, either
we're really, really smart (unlikely) or we have a few revisions an
rewrites to go before everybody is happy.

ls /sys/devices/system/memory/ gives you each memory area, with
arbitrary numbers like this:
memory0
memory1
memory2
memory8953

We haven't decided whether to make each of those represent a constant
sized area, or let them be variable.  In any case, there will either be
a range inside of each or a global block size something like here:

	/sys/devices/system/memory/block_size

Each memory device would have a directory like this:

# ls /sys/devices/system/memory/memory8953/
node -> ../../node/node4 (for the NUMA case)
state
phys_start_addr

To take a memory section offline, you 

	echo offline > /sys/devices/system/memory/memory8953/state

For now, that takes the section offline by allocating all of its pages
and migrating the test.  It also removes the sysfs node, triggering a
/sbin/hotplug event for the device removal.  We might makes this 2
different states in the future (offline and removal).  This could also
potentially be triggered by hardware alone.

For now, you can also add memory, but it's hackish and will certainly
change:

	echo 0x8000000 > /sys/devices/system/memory/probe

will add SECTION_SIZE amount of memory at 2GB.  Yes, SECTION_SIZE is
hard-coded, but this is only for testing.  We'll eventually take ranges
and maybe NUMA information into there somehow.  Why can't the hardware
just do this?  It's a long story :)

-- 
Dave Hansen
haveblue@us.ibm.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC] memory defragmentation to satisfy high order allocations
  2004-10-02 18:33           ` Marcelo Tosatti
  2004-10-03  4:13             ` Hirokazu Takahashi
@ 2004-10-04  4:09             ` IWAMOTO Toshihiro
  2004-10-04 17:29               ` Marcelo Tosatti
  1 sibling, 1 reply; 45+ messages in thread
From: IWAMOTO Toshihiro @ 2004-10-04  4:09 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Hirokazu Takahashi, haveblue, akpm, linux-mm, piggin, arjanv,
	linux-kernel

At Sat, 2 Oct 2004 15:33:49 -0300,
Marcelo Tosatti wrote:
> 
> On Sat, Oct 02, 2004 at 06:30:15PM +0900, Hirokazu Takahashi wrote:
> 3) At migrate_page_common you assume additional page references 
> (page_migratable returning -EAGAIN) means the code should try to writeout 
> the page.
> 
> Is that assumption always valid?
> 
> In theory there is no need to writeout pages when migrating them to 
> other zones - they will be copied and the dirty information retained (either
> in the PageDirty bit or radix tree tag). 

It's true only when page->private is NULL.  Otherwise writeback is
necessary to free buffer_head.

> I just noticed you do that on further patches (migrate_page_buffer), but AFAICS 
> the writeout remains. Why arent you using migrate_page_buffer yet?
> 
> I think the final aim should be to remove the need for "pageout()" 
> completly.

Are you going to implement migrate_page_buffer for every file system?
I don't think it's worthwhile.

> Questions: are there any documents on the memory hotplug userspace tools? 
> Where can I find them?
> 
> Are Iwamoto's test programs available?

I've put them at the following URL, but I doubt they are useful for
you; there are no documentation for them.

http://people.valinux.co.jp/~iwamoto/mh/tests/

--
IWAMOTO Toshihiro
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC] memory defragmentation to satisfy high order allocations
  2004-10-04  4:09             ` IWAMOTO Toshihiro
@ 2004-10-04 17:29               ` Marcelo Tosatti
  0 siblings, 0 replies; 45+ messages in thread
From: Marcelo Tosatti @ 2004-10-04 17:29 UTC (permalink / raw)
  To: IWAMOTO Toshihiro
  Cc: Hirokazu Takahashi, haveblue, akpm, linux-mm, piggin, arjanv,
	linux-kernel

On Mon, Oct 04, 2004 at 01:09:10PM +0900, IWAMOTO Toshihiro wrote:
> At Sat, 2 Oct 2004 15:33:49 -0300,
> Marcelo Tosatti wrote:
> > 
> > On Sat, Oct 02, 2004 at 06:30:15PM +0900, Hirokazu Takahashi wrote:
> > 3) At migrate_page_common you assume additional page references 
> > (page_migratable returning -EAGAIN) means the code should try to writeout 
> > the page.
> > 
> > Is that assumption always valid?
> > 
> > In theory there is no need to writeout pages when migrating them to 
> > other zones - they will be copied and the dirty information retained (either
> > in the PageDirty bit or radix tree tag). 
> 
> It's true only when page->private is NULL.  Otherwise writeback is
> necessary to free buffer_head.

You can move the buffer_head's also cant you? Adjusting bh->b_page etc.

Thats what migrate_page_buffer does, no?

Writting pages which contain buffer_head's on memory migration
is really, very bad. 

Imagine gigabytes of pages with buffer_head's. 

> > I just noticed you do that on further patches (migrate_page_buffer), but AFAICS 
> > the writeout remains. Why arent you using migrate_page_buffer yet?
> > 
> > I think the final aim should be to remove the need for "pageout()" 
> > completly.
> 
> Are you going to implement migrate_page_buffer for every file system?
> I don't think it's worthwhile.
> 
> > Questions: are there any documents on the memory hotplug userspace tools? 
> > Where can I find them?
> > 
> > Are Iwamoto's test programs available?
> 
> I've put them at the following URL, but I doubt they are useful for
> you; there are no documentation for them.
> 
> http://people.valinux.co.jp/~iwamoto/mh/tests/

I'll take a look thanks.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC] memory defragmentation to satisfy high order allocations
  2004-10-01 18:22 Marcelo Tosatti
  2004-10-01 20:11 ` Andrew Morton
@ 2004-10-02  2:30 ` Nick Piggin
  2004-10-02  3:08   ` Marcelo Tosatti
  2004-10-02  2:41 ` Nick Piggin
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 45+ messages in thread
From: Nick Piggin @ 2004-10-02  2:30 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-mm, akpm, arjanv, linux-kernel


Marcelo Tosatti wrote:

>
>With such a thing in place we can build a mechanism for kswapd 
>(or a separate kernel thread, if needed) to notice when we are low on 
>high order pages, and use the coalescing algorithm instead blindly 
>freeing unique pages from LRU in the hope to build large physically 
>contiguous memory areas.
>
>Comments appreciated.
>
>

Hi Marcelo,
Seems like a good idea... even with regular dumb kswapd "merging",
you may easily get stuck for example on systems without swap...

Anyway, I'd like to get those beat kswapd patches in first. Then
your mechanism just becomes something like:

    if order-0 pages are low {
        try to free memory
    }
    else if order-1 or higher pages are low {
         try to coalesce_memory
         if that fails, try to free memory
    }

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC] memory defragmentation to satisfy high order allocations
  2004-10-02  2:30 ` Nick Piggin
@ 2004-10-02  3:08   ` Marcelo Tosatti
  2004-10-04  8:15     ` Nick Piggin
  0 siblings, 1 reply; 45+ messages in thread
From: Marcelo Tosatti @ 2004-10-02  3:08 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-mm, akpm, arjanv, linux-kernel

On Sat, Oct 02, 2004 at 12:30:01PM +1000, Nick Piggin wrote:
> 
> 
> Marcelo Tosatti wrote:
> 
> >
> >With such a thing in place we can build a mechanism for kswapd 
> >(or a separate kernel thread, if needed) to notice when we are low on 
> >high order pages, and use the coalescing algorithm instead blindly 
> >freeing unique pages from LRU in the hope to build large physically 
> >contiguous memory areas.
> >
> >Comments appreciated.
> >
> >
> 
> Hi Marcelo,
> Seems like a good idea... even with regular dumb kswapd "merging",
> you may easily get stuck for example on systems without swap...
> 
> Anyway, I'd like to get those beat kswapd patches in first. Then
> your mechanism just becomes something like:
> 
>    if order-0 pages are low {
>        try to free memory
>    }
>    else if order-1 or higher pages are low {
>         try to coalesce_memory
>         if that fails, try to free memory
>    }

Hi Nick!

I understand that kswapd is broken, and it needs to go into the page reclaim path 
to free pages when we are out of high order pages (what your 
"beat kswapd" patches do and fix high-order failures by doing so), but 
Linus's argument against it seems to be that "it potentially frees too much pages" 
causing harm to the system. He also says this has been tried in the past, 
with not nice results.

And that is why its has not been merged into mainline.

Is my interpretation correct?

But right, kswapd needs to get fixed to honour high order
pages.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC] memory defragmentation to satisfy high order allocations
  2004-10-02  3:08   ` Marcelo Tosatti
@ 2004-10-04  8:15     ` Nick Piggin
  0 siblings, 0 replies; 45+ messages in thread
From: Nick Piggin @ 2004-10-04  8:15 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-mm, akpm, arjanv, linux-kernel


Marcelo Tosatti wrote:

>On Sat, Oct 02, 2004 at 12:30:01PM +1000, Nick Piggin wrote:
>
>>
>>Marcelo Tosatti wrote:
>>
>>
>>>With such a thing in place we can build a mechanism for kswapd 
>>>(or a separate kernel thread, if needed) to notice when we are low on 
>>>high order pages, and use the coalescing algorithm instead blindly 
>>>freeing unique pages from LRU in the hope to build large physically 
>>>contiguous memory areas.
>>>
>>>Comments appreciated.
>>>
>>>
>>>
>>Hi Marcelo,
>>Seems like a good idea... even with regular dumb kswapd "merging",
>>you may easily get stuck for example on systems without swap...
>>
>>Anyway, I'd like to get those beat kswapd patches in first. Then
>>your mechanism just becomes something like:
>>
>>   if order-0 pages are low {
>>       try to free memory
>>   }
>>   else if order-1 or higher pages are low {
>>        try to coalesce_memory
>>        if that fails, try to free memory
>>   }
>>
>
>Hi Nick!
>
>

Sorry, I'd been away for the weekend which is why I didn't get a
chance to reply to you.

>I understand that kswapd is broken, and it needs to go into the page reclaim path 
>to free pages when we are out of high order pages (what your 
>"beat kswapd" patches do and fix high-order failures by doing so), but 
>Linus's argument against it seems to be that "it potentially frees too much pages" 
>causing harm to the system. He also says this has been tried in the past, 
>with not nice results.
>
>

Not quite. I think a (the) big thing with my patch is that it will
check order-0...n watermarks when an order-n allocation is made.

So if there is no order >2 allocations happening, it won't attempt
to keep higher order memory available (until someone attempts an
allocation).

Basically, it gets kswapd doing the work when it would otherwise
have to be done in direct reclaim, *OR* otherwise indefinitely fail
if the allocations aren't blockable.

>And that is why its has not been merged into mainline.
>
>Is my interpretation correct?
>
>But right, kswapd needs to get fixed to honour high order
>pages.
>
>

Well Linus was silent on the issue after I answered his concerns.
I mailed him privately and he basically said that it seems sane,
and he is waiting for patches. Of course, by that stage it was
fairly late into 2.6.9, and the current behaviour isn't a regression,
so I'm shooting for 2.6.10.

Your defragmentor should sit very nicely on top of it, of course.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC] memory defragmentation to satisfy high order allocations
  2004-10-01 18:22 Marcelo Tosatti
  2004-10-01 20:11 ` Andrew Morton
  2004-10-02  2:30 ` Nick Piggin
@ 2004-10-02  2:41 ` Nick Piggin
  2004-10-02  3:50   ` Hirokazu Takahashi
  2004-10-02 16:06   ` Marcelo Tosatti
  2004-10-04  2:38 ` Hiroyuki KAMEZAWA
  2004-10-04  6:58 ` Hiroyuki KAMEZAWA
  4 siblings, 2 replies; 45+ messages in thread
From: Nick Piggin @ 2004-10-02  2:41 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-mm, akpm, arjanv, linux-kernel


Marcelo Tosatti wrote:

>
>For example it doesnt re establishes pte's once it has unmapped them.
>
>

Another thing - I don't know if I'd bother re-establishing ptes....
I'd say just leave it to happen lazily at fault time.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC] memory defragmentation to satisfy high order allocations
  2004-10-02  2:41 ` Nick Piggin
@ 2004-10-02  3:50   ` Hirokazu Takahashi
  2004-10-02 16:06   ` Marcelo Tosatti
  1 sibling, 0 replies; 45+ messages in thread
From: Hirokazu Takahashi @ 2004-10-02  3:50 UTC (permalink / raw)
  To: piggin; +Cc: marcelo.tosatti, linux-mm, akpm, arjanv, linux-kernel

Hello,

> >For example it doesnt re establishes pte's once it has unmapped them.
> >
> >
> 
> Another thing - I don't know if I'd bother re-establishing ptes....
> I'd say just leave it to happen lazily at fault time.

I think the reason is that his current implementation doesn't assign
a swap entry to an anonymous page to move.


Thank you,
Hirokazu Takahashi.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC] memory defragmentation to satisfy high order allocations
  2004-10-02  2:41 ` Nick Piggin
  2004-10-02  3:50   ` Hirokazu Takahashi
@ 2004-10-02 16:06   ` Marcelo Tosatti
  1 sibling, 0 replies; 45+ messages in thread
From: Marcelo Tosatti @ 2004-10-02 16:06 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-mm, akpm, arjanv, linux-kernel

On Sat, Oct 02, 2004 at 12:41:14PM +1000, Nick Piggin wrote:
> 
> 
> Marcelo Tosatti wrote:
> 
> >
> >For example it doesnt re establishes pte's once it has unmapped them.
> >
> >
> 
> Another thing - I don't know if I'd bother re-establishing ptes....
> I'd say just leave it to happen lazily at fault time.

Indeed it should work lazily.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC] memory defragmentation to satisfy high order allocations
  2004-10-01 18:22 Marcelo Tosatti
                   ` (2 preceding siblings ...)
  2004-10-02  2:41 ` Nick Piggin
@ 2004-10-04  2:38 ` Hiroyuki KAMEZAWA
  2004-10-04 17:32   ` Marcelo Tosatti
  2004-10-04  6:58 ` Hiroyuki KAMEZAWA
  4 siblings, 1 reply; 45+ messages in thread
From: Hiroyuki KAMEZAWA @ 2004-10-04  2:38 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-mm, akpm, Nick Piggin, arjanv, linux-kernel

how about inserting this if-sentense ?

-- Kame

Marcelo Tosatti wrote:
> +int coalesce_memory(unsigned int order, struct zone *zone)
> +{
<snip>

> +		while (entry != &area->free_list) {
> +			int ret;
> +			page = list_entry(entry, struct page, lru);
> +			entry = entry->next;
> +

   +              if (((page_to_pfn(page) - zone->zone_start_pfn) & (1 << toorder)) {

> +			pwalk = page;
> +
> +			/* Look backwards */
> +
> +			for (walkcount = 1; walkcount<nr_pages; walkcount++) {
                         ..................
> +			}
> +
   +               } else {
> +forward:
> +
> +			pwalk = page;
> +
> +			/* Look forward, skipping the page frames from this 
> +			  high order page we are looking at */
> +
> +			for (walkcount = (1UL << torder); walkcount<nr_pages; 
> +					walkcount++) {
> +				pwalk = page+walkcount;
> +
> +				ret = can_move_page(pwalk);
> +
> +				if (ret) 
> +					nr_freed_pages++;
> +				else
> +					goto loopey;
> +
> +				if (nr_freed_pages == nr_pages)
> +					goto success;
> +			}
> +
   +                }

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC] memory defragmentation to satisfy high order allocations
  2004-10-04  2:38 ` Hiroyuki KAMEZAWA
@ 2004-10-04 17:32   ` Marcelo Tosatti
  0 siblings, 0 replies; 45+ messages in thread
From: Marcelo Tosatti @ 2004-10-04 17:32 UTC (permalink / raw)
  To: Hiroyuki KAMEZAWA; +Cc: linux-mm, akpm, Nick Piggin, arjanv, linux-kernel

Yeap that is a nice optimization, thanks Hiroyuki.

On Mon, Oct 04, 2004 at 11:38:32AM +0900, Hiroyuki KAMEZAWA wrote:
> 
> how about inserting this if-sentense ?
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC] memory defragmentation to satisfy high order allocations
  2004-10-01 18:22 Marcelo Tosatti
                   ` (3 preceding siblings ...)
  2004-10-04  2:38 ` Hiroyuki KAMEZAWA
@ 2004-10-04  6:58 ` Hiroyuki KAMEZAWA
  4 siblings, 0 replies; 45+ messages in thread
From: Hiroyuki KAMEZAWA @ 2004-10-04  6:58 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-mm, akpm, Nick Piggin, arjanv, linux-kernel

Hi,

Marcelo Tosatti wrote:

> +int can_move_page(struct page *page) 
> +{
   <snip>
> +	if (page_count(page) == 0)
> +		return 1;

I think there are 3 cases when page_count(page) == 0.

1. a page is free and in the buddy allocator.
2. a page is free and in per-cpu-pages list.
3. a page is in pagevec .

I think only case 1 pages meet your requirements.

I used PG_private flag for distinguishing case 1 from 2 and 3
in my no-bitmap buddy allocator posted before.
I added PG_private flag to a page which is in buddy allocator's free_list.

Regards

-- Kame
<kamezawa.hiroyu@jp.fujitsu.com>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2004-10-12 17:55 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-10-11 16:40 [RFC] memory defragmentation to satisfy high order allocations linux
  -- strict thread matches above, loose matches on Subject: below --
2004-10-01 18:22 Marcelo Tosatti
2004-10-01 20:11 ` Andrew Morton
2004-10-01 19:04   ` Marcelo Tosatti
2004-10-01 21:00     ` Andrew Morton
2004-10-01 21:57     ` Dave Hansen
2004-10-01 23:42       ` Marcelo Tosatti
2004-10-02  1:17         ` Andrew Morton
2004-10-02  9:30         ` Hirokazu Takahashi
2004-10-02 18:33           ` Marcelo Tosatti
2004-10-03  4:13             ` Hirokazu Takahashi
2004-10-03 14:07               ` Marcelo Tosatti
2004-10-03 18:35                 ` Hirokazu Takahashi
2004-10-03 19:21                   ` Trond Myklebust
2004-10-03 20:03                     ` Hirokazu Takahashi
2004-10-03 20:44                       ` Trond Myklebust
2004-10-04 13:02                         ` Hirokazu Takahashi
2004-10-04 17:24                   ` Marcelo Tosatti
2004-10-05  2:53                     ` Hirokazu Takahashi
2004-10-07 12:06                       ` Marcelo Tosatti
2004-10-08  7:00                         ` Hirokazu Takahashi
2004-10-08 10:00                           ` Marcelo Tosatti
2004-10-08 12:23                             ` Hirokazu Takahashi
2004-10-08 12:41                               ` Marcelo Tosatti
2004-10-08 16:52                                 ` Hirokazu Takahashi
2004-10-08 15:36                                   ` Marcelo Tosatti
2004-10-12 10:56                                     ` IWAMOTO Toshihiro
2004-10-12 10:35                                       ` Marcelo Tosatti
2004-10-12 17:55                                         ` Hirokazu Takahashi
2004-10-12 14:26                                       ` Martin J. Bligh
2004-10-12 12:17                                         ` Marcelo Tosatti
2004-10-12 15:01                                         ` Dave Hansen
2004-10-04  3:24                 ` IWAMOTO Toshihiro
2004-10-04  2:22               ` Dave Hansen
2004-10-04  4:09             ` IWAMOTO Toshihiro
2004-10-04 17:29               ` Marcelo Tosatti
2004-10-02  2:30 ` Nick Piggin
2004-10-02  3:08   ` Marcelo Tosatti
2004-10-04  8:15     ` Nick Piggin
2004-10-02  2:41 ` Nick Piggin
2004-10-02  3:50   ` Hirokazu Takahashi
2004-10-02 16:06   ` Marcelo Tosatti
2004-10-04  2:38 ` Hiroyuki KAMEZAWA
2004-10-04 17:32   ` Marcelo Tosatti
2004-10-04  6:58 ` Hiroyuki KAMEZAWA

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox