large page patch

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* large page patch
@ 2002-08-02  0:37 Andrew Morton
  2002-08-02  0:43 ` David S. Miller
                   ` (4 more replies)
  0 siblings, 5 replies; 30+ messages in thread
From: Andrew Morton @ 2002-08-02  0:37 UTC (permalink / raw)
  To: lkml, linux-mm; +Cc: Seth, Rohit, Saxena, Sunil, Mallick, Asit K

This is a large-page support patch from Rohit Seth, forwarded
with his permission (thanks!).



> Attached is the large_page support for IA-32.  For most part there are no
> changes over IA-64 patch.  System calls and their semantics remain the
> same. Though there are still some little parts of code that are arch
> specfic (like for IA-64 there is seperate region for large_pages whereas
> on IA-32 it is the same linear address space etc.) I will appreciate if
> you all could provide your input and any issues that you think we need to
> resolve.
> 
> Attached is the large_page patch including the following support: 1-
> Private and Shared Anonymous large pages(This is the earlier patch +
> Anonymous share Large_page support).  Private Anonymous large_pages stay
> with the particular process and vm segments corresponding to these get
> VM_DONTCOPY attribute.  Shared Anonymous pages get shared by children.
> (Children share the same physical large_pages with parent.)  Allocation
> and deallocation of this is done using the following two system calls:
> 
>    sys_get_large_pages (unsigned long addr, unsigned long len, int prot, int flags)
>         where prot could be PROT_READ, PROT_WRITE, PROT_EXEC and flags
>         is MAP_PRIVATE or MAP_SHARE
>    sys_free_large_pages(unsigned long addr)
> 
> 2- Shared Large Pages across different processes.  Allocation and
> deallocation of large_pages that a process can share and unshare across
> different procecess is using follwoign two systm calls:
> 
>    sys_share_large_pages(int key, unsigned long addr, unsigned long len, int prot, int flag)
> 
> where key is the system wide unique identifier that processes use to share
> pages.  This should be non-zero positive number. prot is identical as in
> above cases.  flag could be set to IPC_CREAT so that if the segment
> corresponding to key is not already there then it is created (Else -ENOENT
> is returned if there is no existing segment).
> 
>    sys_unshare_large_pages(unsigned long addr)
> 
> is used to unshare the large_pages from process's address space.  The
> large_pages are put on lpage_freelist only when the last user has sent the
> request for unsharing it (kind of SHM_DEST attribute).
> 
> Most of the support needed for above two cases (Anonymous and Sharing
> across processes) is quite similar in kernel except for binding of
> large_pages to key and temporary inode structure.
> 
> 3) Currently the large_page memory is dynamically configurable through
> /proc/sys/kernel/numlargepages User can specify the number (negative
> meaning shrink) that the number of large_page pages need to change.  For
> e.g. a value of -2 will reduce the number of large_page pages currently
> configured in system by 2.  Note that this change will depend on the
> availability of free large_pages. If none is available then the value
> remains same.  (Any cleaner suggestions?)

Some observations which have been made thus far:

- Minimal impact on the VM and MM layers

- Delegates most of it to the arch layer

- Generic code is not tied to pagetables so (for example) PPC could
  implement the system calls with BAT registers

- The change to MAX_ORDER is unneeded

- swapping of large pages and making them pagecache-coherent is
  unpopular.

- may be better to implement the shm API with fd's, not keys.

- an ia64 implementation is available


diff -Naru linux.org/arch/i386/config.in linux.lp/arch/i386/config.in
--- linux.org/arch/i386/config.in	Mon Feb 25 11:37:52 2002
+++ linux.lp/arch/i386/config.in	Tue Jul  2 17:49:15 2002
@@ -184,6 +184,8 @@
 
 bool 'Math emulation' CONFIG_MATH_EMULATION
 bool 'MTRR (Memory Type Range Register) support' CONFIG_MTRR
+bool 'IA-32 Large Page Support (if available on processor)' CONFIG_LARGE_PAGE
+
 bool 'Symmetric multi-processing support' CONFIG_SMP
 if [ "$CONFIG_SMP" != "y" ]; then
    bool 'Local APIC support on uniprocessors' CONFIG_X86_UP_APIC
@@ -205,7 +207,6 @@
 
 mainmenu_option next_comment
 comment 'General setup'
-
 bool 'Networking support' CONFIG_NET
 
 # Visual Workstation support is utterly broken.
diff -Naru linux.org/arch/i386/kernel/entry.S linux.lp/arch/i386/kernel/entry.S
--- linux.org/arch/i386/kernel/entry.S	Mon Feb 25 11:37:53 2002
+++ linux.lp/arch/i386/kernel/entry.S	Tue Jul  2 15:12:23 2002
@@ -634,6 +634,10 @@
 	.long SYMBOL_NAME(sys_ni_syscall)	/* 235 reserved for removexattr */
 	.long SYMBOL_NAME(sys_ni_syscall)	/* reserved for lremovexattr */
 	.long SYMBOL_NAME(sys_ni_syscall)	/* reserved for fremovexattr */
+	.long SYMBOL_NAME(sys_get_large_pages)	/* Get large_page pages */
+	.long SYMBOL_NAME(sys_free_large_pages)	/* Free large_page pages */
+	.long SYMBOL_NAME(sys_share_large_pages)/* Share large_page pages */
+	.long SYMBOL_NAME(sys_unshare_large_pages)/* UnShare large_page pages */
 
 	.rept NR_syscalls-(.-sys_call_table)/4
 		.long SYMBOL_NAME(sys_ni_syscall)
diff -Naru linux.org/arch/i386/kernel/sys_i386.c linux.lp/arch/i386/kernel/sys_i386.c
--- linux.org/arch/i386/kernel/sys_i386.c	Mon Mar 19 12:35:09 2001
+++ linux.lp/arch/i386/kernel/sys_i386.c	Wed Jul  3 14:28:16 2002
@@ -254,3 +254,126 @@
 	return -ERESTARTNOHAND;
 }
 
+#ifdef CONFIG_LARGE_PAGE
+#define LPAGE_ALIGN(x)	(((unsigned long)x + (LPAGE_SIZE -1)) & LPAGE_MASK)
+extern long	sys_munmap(unsigned long, size_t);
+
+/* get_addr function gets the currently unused virtaul range in 
+ * current process's address space.  It returns the LARGE_PAGE_SIZE 
+ * aligned address (in cases of success).  Other kernel generic 
+ * routines only could gurantee that allocated address is PAGE_SIZSE aligned.
+ */
+unsigned long
+get_addr(unsigned long addr, unsigned long len)
+{
+	struct vm_area_struct	*vma;
+	if (addr) {
+		addr = LPAGE_ALIGN(addr);
+		vma = find_vma(current->mm, addr);
+		if (((TASK_SIZE - len) >= addr) &&
+		      (!vma || addr + len <= vma->vm_start))
+			goto found_addr;
+	}
+	addr = LPAGE_ALIGN(TASK_UNMAPPED_BASE);
+	 for (vma = find_vma(current->mm, addr); ; vma = vma->vm_next) {
+		 if (TASK_SIZE - len < addr)
+			 return -ENOMEM;
+		 if (!vma || ((addr + len) < vma->vm_start))
+			 goto found_addr;
+		 addr = vma->vm_end;
+	}
+found_addr:
+	addr = LPAGE_ALIGN(addr);
+	return addr;
+}
+
+asmlinkage unsigned long
+sys_get_large_pages(unsigned long addr, unsigned long len, int prot, int flags)
+{
+	extern int make_lpages_present(unsigned long, unsigned long, int);
+	int	temp;
+
+	if (!(cpu_has_pse))
+		return -EINVAL;
+	if (len & (LPAGE_SIZE - 1)) 
+		return -EINVAL;
+	addr = get_addr(addr, len);
+	if (addr  ==  -ENOMEM)
+		return addr;
+	temp = MAP_SHARED | MAP_ANONYMOUS |MAP_FIXED;
+	addr = do_mmap_pgoff(NULL, addr, len, prot, temp, 0);
+	printk("Returned addr %x\n", addr);
+	if (!(addr & (LPAGE_SIZE -1))) {
+		 if (make_lpages_present(addr, (addr+len), flags) < 0) {
+			 addr = sys_munmap(addr, len);
+			 return -ENOMEM;
+		}
+	}
+	return addr;
+}
+
+asmlinkage unsigned long 
+sys_share_large_pages(int key, unsigned long addr, unsigned long len, int prot, int flag)
+{
+	unsigned long	raddr;
+	int	retval;
+	extern int set_lp_shm_seg(int, unsigned long *, unsigned long, int, int);
+	if (!(cpu_has_pse))
+		return -EINVAL;
+	if (key <= 0) 
+		return -EINVAL;
+	if (len & (LPAGE_SIZE - 1)) 
+		return -EINVAL;
+	raddr = get_addr(addr, len);
+	if (raddr == -ENOMEM)
+		return raddr;
+	retval = set_lp_shm_seg(key, &raddr, len, prot, flag);
+	if (retval < 0) 
+		return (unsigned long) retval;
+	return raddr;
+}
+
+asmlinkage int
+sys_free_large_pages(unsigned long addr)
+{
+	struct vm_area_struct	*vma;
+	extern int unmap_large_pages(struct vm_area_struct *);
+
+	vma = find_vma(current->mm, addr);
+	if ((!vma) || (!(vma->vm_flags & VM_LARGEPAGE)) || 
+		(vma->vm_start!=addr)) 
+		return -EINVAL;
+	return unmap_large_pages(vma);
+}
+
+asmlinkage int
+sys_unshare_large_pages(unsigned long addr)
+{
+	return sys_free_large_pages(addr);
+}
+
+#else
+asmlinkage unsigned long
+sys_get_large_pages(unsigned long addr, size_t len, int prot, int flags)
+{
+	return -ENOSYS;
+}
+
+asmlinkage unsigned long 
+sys_share_large_apges(int key, unsigned long addr, size_t len, int prot, int flag)
+{
+	return -ENOSYS;
+}
+
+asmlinkage int
+sys_free_large_apges(unsigned long addr)
+{
+	return -ENOSYS;
+}
+
+asmlinkage int
+sys_unshare_large_pages(unsigned long addr)
+{
+	return -ENOSYS;
+}
+#endif
diff -Naru linux.org/arch/i386/mm/Makefile linux.lp/arch/i386/mm/Makefile
--- linux.org/arch/i386/mm/Makefile	Fri Dec 29 14:07:20 2000
+++ linux.lp/arch/i386/mm/Makefile	Tue Jul  2 16:55:53 2002
@@ -10,5 +10,6 @@
 O_TARGET := mm.o
 
 obj-y	 := init.o fault.o ioremap.o extable.o
+obj-$(CONFIG_LARGE_PAGE) += lpage.o
 
 include $(TOPDIR)/Rules.make
diff -Naru linux.org/arch/i386/mm/init.c linux.lp/arch/i386/mm/init.c
--- linux.org/arch/i386/mm/init.c	Fri Dec 21 09:41:53 2001
+++ linux.lp/arch/i386/mm/init.c	Tue Jul  2 18:39:13 2002
@@ -447,6 +447,12 @@
 	return 0;
 }
 	
+#ifdef CONFIG_LARGE_PAGE
+long	lpagemem = 0;
+int	lp_max;
+long	lpzone_pages;
+extern struct	list_head lpage_freelist;
+#endif
 void __init mem_init(void)
 {
 	extern int ppro_with_ram_bug(void);
@@ -532,6 +538,32 @@
 	zap_low_mappings();
 #endif
 
+#ifdef CONFIG_LARGE_PAGE
+	{
+	long	i;
+	long	j;
+	struct page	*page, *map;
+	
+	/*For now reserve quarter for large_pages.*/
+	lpzone_pages = (max_low_pfn >> ((LPAGE_SHIFT - PAGE_SHIFT) + 2)) ; 
+		/*Will make this kernel command line. */
+	INIT_LIST_HEAD(&lpage_freelist);
+	for (i=0; i<lpzone_pages; i++) {
+		page = alloc_pages(GFP_ATOMIC, LARGE_PAGE_ORDER);
+		if (page == NULL)
+			break;
+		map = page;
+		for (j=0; j<(LPAGE_SIZE/PAGE_SIZE); j++) {
+			SetPageReserved(map);
+			map++;
+		}
+		list_add(&page->list, &lpage_freelist);
+	}
+	printk("Total Large_page memory pages allocated %ld\n", i);
+	lpzone_pages = lpagemem = i;
+	lp_max = i;
+	}
+#endif
 }
 
 /* Put this after the callers, so that it cannot be inlined */
diff -Naru linux.org/arch/i386/mm/lpage.c linux.lp/arch/i386/mm/lpage.c
--- linux.org/arch/i386/mm/lpage.c	Wed Dec 31 16:00:00 1969
+++ linux.lp/arch/i386/mm/lpage.c	Wed Jul  3 16:09:59 2002
@@ -0,0 +1,475 @@
+/*
+ * IA-32 Large Page Support for Kernel.
+ *
+ * Copyright (C) 2002, Rohit Seth <rohit.seth@intel.com>
+ */
+
+
+#include <linux/config.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/devfs_fs_kernel.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/file.h>
+#include <linux/swap.h>
+#include <linux/pagemap.h>
+#include <linux/string.h>
+#include <linux/locks.h>
+#include <linux/smp_lock.h>
+#include <linux/slab.h>
+
+#include <asm/uaccess.h>
+#include <asm/mman.h>
+
+static struct vm_operations_struct	lp_vm_ops;
+struct list_head 			lpage_freelist;
+spinlock_t				lpage_lock = SPIN_LOCK_UNLOCKED;
+extern	long 				lpagemem;
+
+#define MAX_ID 	32
+struct lpkey {
+	struct inode *in;
+	int			key;
+} lpk[MAX_ID];
+
+static struct inode *
+find_key_inode(int key)
+{
+	int				i;
+
+	for (i=0; i<MAX_ID; i++) {
+		if (lpk[i].key == key) 
+			return (lpk[i].in);
+	}
+	return NULL;
+}
+static struct page *
+alloc_large_page(void)
+{
+	struct list_head	*curr, *head;
+	struct page			*page;
+
+	spin_lock(&lpage_lock);
+
+	head = &lpage_freelist;
+	curr = head->next;
+
+	if (curr == head)  {
+		spin_unlock(&lpage_lock);
+		return NULL;
+	}
+	page = list_entry(curr, struct page, list);
+	list_del(curr);
+	lpagemem--;
+	spin_unlock(&lpage_lock);
+	set_page_count(page, 1);
+	memset(page_address(page), 0, LPAGE_SIZE);
+	return page;
+}
+
+static void
+free_large_page(struct page *page)
+{
+	if ((page->mapping != NULL) && (page_count(page) == 2)) {
+		struct inode *inode = page->mapping->host;
+		int 	i;
+
+		lru_cache_del(page);
+		remove_inode_page(page);
+		set_page_count(page, 1);
+		if ((inode->i_size -= LPAGE_SIZE) == 0) {
+			for (i=0;i<MAX_ID;i++)
+				if (lpk[i].key == inode->i_ino) {
+					lpk[i].key = 0;
+					break;
+			}
+			kfree(inode);
+		}
+	}
+	if (put_page_testzero(page)) {
+		spin_lock(&lpage_lock);
+		list_add(&page->list, &lpage_freelist);
+		lpagemem++;
+		spin_unlock(&lpage_lock);
+	}
+}
+
+static pte_t *
+lp_pte_alloc(struct mm_struct *mm, unsigned long addr)
+{
+	pgd_t			*pgd;
+	pmd_t			*pmd = NULL;
+
+	pgd = pgd_offset(mm, addr);
+	pmd = pmd_alloc(mm, pgd, addr);
+	return (pte_t *)pmd;
+}
+
+static pte_t *
+lp_pte_offset(struct mm_struct *mm, unsigned long addr)
+{
+	pgd_t			*pgd;
+	pmd_t			*pmd = NULL;
+	
+	pgd =pgd_offset(mm, addr);
+	pmd = pmd_offset(pgd, addr);
+	return (pte_t *)pmd;
+}
+
+#define mk_pte_large(entry) {entry.pte_low |= (_PAGE_PRESENT | _PAGE_PSE);}
+	
+static void
+set_lp_pte(struct mm_struct *mm, struct vm_area_struct *vma, struct page *page, pte_t *page_table, int write_access)
+{
+	pte_t           entry;
+
+	mm->rss += (LPAGE_SIZE/PAGE_SIZE);
+	if (write_access) {
+		entry = pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
+	} else
+		entry = pte_wrprotect(mk_pte(page, vma->vm_page_prot));
+	entry = pte_mkyoung(entry);
+	mk_pte_large(entry);
+	set_pte(page_table, entry);
+	printk("VIRTUAL_ADDRESS_OF_LPAGE IS %p\n", page->virtual);
+	return;
+}
+
+static int
+anon_get_lpage(struct mm_struct *mm, struct vm_area_struct *vma, int write_access, pte_t *page_table)
+{
+	struct	page *page;
+
+	page = alloc_large_page();
+	if (page == NULL) 
+		return -1;
+	set_lp_pte(mm, vma, page, page_table, write_access);
+	return 1;
+}
+
+int
+make_lpages_present(unsigned long addr, unsigned long end, int flags)
+{
+	int write;
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct * vma;
+	pte_t	*pte;
+
+	vma = find_vma(mm, addr);
+	if (!vma)
+		goto out_error1;
+
+	write = (vma->vm_flags & VM_WRITE) != 0;
+	if ((vma->vm_end - vma->vm_start) & (LPAGE_SIZE-1))
+		goto out_error1;
+	spin_lock(&mm->page_table_lock);
+	do {    
+		pte = lp_pte_alloc(mm, addr);
+		if ((pte) && (pte_none(*pte))) {
+			if (anon_get_lpage(mm, vma, 
+				write ? VM_WRITE : VM_READ, pte) == -1)
+				goto out_error;
+		} else
+			goto out_error;
+		addr += LPAGE_SIZE;
+	} while (addr < end); 
+	spin_unlock(&mm->page_table_lock);
+	vma->vm_flags |= (VM_LARGEPAGE | VM_RESERVED);
+	if (flags & MAP_PRIVATE )
+		vma->vm_flags |= VM_DONTCOPY;
+	vma->vm_ops = &lp_vm_ops;
+	return 0;
+out_error: /*Error case, remove the partial lp_resources. */
+	if (addr > vma->vm_start) { 
+	   	vma->vm_end = addr ;
+	   	zap_lp_resources(vma);
+	   	vma->vm_end = end;
+	}
+	spin_unlock(&mm->page_table_lock);
+out_error1:
+	return -1;
+}
+
+int
+copy_lpage_range(struct mm_struct *dst, struct mm_struct *src, struct vm_area_struct *vma)
+{
+	pte_t *src_pte, *dst_pte, entry;
+	struct page 	*ptepage;
+	unsigned long addr = vma->vm_start;
+	unsigned long end = vma->vm_end;
+
+		while (addr < end) {
+			dst_pte = lp_pte_alloc(dst, addr);
+			if (!dst_pte)
+				goto nomem;
+			src_pte = lp_pte_offset(src, addr);
+			entry = *src_pte;
+			ptepage = pte_page(entry);
+			get_page(ptepage);
+			set_pte(dst_pte, entry);
+			dst->rss += (LPAGE_SIZE/PAGE_SIZE);
+			addr += LPAGE_SIZE; 
+		}
+    return 0;
+
+nomem:
+    return -ENOMEM;
+}
+int
+follow_large_page(struct mm_struct *mm, struct vm_area_struct *vma, struct page **pages, struct vm_area_struct **vmas, unsigned long *st, int *length, int i)
+{
+	pte_t			*ptep, pte;
+	unsigned long	start = *st;
+	unsigned long	pstart;
+	int				len = *length;
+	struct page		*page;
+
+	do {
+		pstart = start;
+		ptep = lp_pte_offset(mm, start);
+		pte = *ptep;
+
+back1:
+		page = pte_page(pte);
+		if (pages) {
+			page += ((start & ~LPAGE_MASK) >> PAGE_SHIFT);
+			pages[i] = page;
+			page_cache_get(page);
+		}
+		if (vmas)
+			vmas[i] = vma;
+		i++;
+		len--;
+		start += PAGE_SIZE;
+		if (((start & LPAGE_MASK) == pstart) && len && (start < vma->vm_end))
+			goto back1;
+	} while (len && start < vma->vm_end);
+	*length = len;
+	*st = start;
+	return i;
+}
+
+static void
+zap_lp_resources(struct vm_area_struct *mpnt)
+{
+	struct mm_struct *mm = mpnt->vm_mm;
+	unsigned long 	len, addr, end;
+	pte_t			*ptep;
+	struct page		*page;
+
+	addr = mpnt->vm_start;
+	end = mpnt->vm_end;
+	len = end - addr;
+	do {
+		ptep = lp_pte_offset(mm, addr);
+		page = pte_page(*ptep);
+		pte_clear(ptep);
+		free_large_page(page);
+		addr += LPAGE_SIZE;
+	} while (addr < end);
+	mm->rss -= (len >> PAGE_SHIFT);
+}
+
+static void
+unlink_vma(struct vm_area_struct *mpnt)
+{
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct	*vma;
+
+	vma = mm->mmap;
+	if (vma == mpnt) {
+		mm->mmap = vma->vm_next;
+	}
+	else {
+		while (vma->vm_next != mpnt) {
+			vma = vma->vm_next;
+		}
+		vma->vm_next = mpnt->vm_next;
+	}
+	rb_erase(&mpnt->vm_rb, &mm->mm_rb);
+	mm->mmap_cache = NULL;
+	mm->map_count--;
+}
+
+int
+unmap_large_pages(struct vm_area_struct *mpnt)
+{
+	struct mm_struct *mm = current->mm;
+
+	unlink_vma(mpnt);
+	spin_lock(&mm->page_table_lock);
+	zap_lp_resources(mpnt);
+	spin_unlock(&mm->page_table_lock);
+	kmem_cache_free(vm_area_cachep, mpnt);
+	return 1;
+}
+
+static struct inode *
+set_new_inode(unsigned long len, int prot, int flag, int key)
+{
+	struct inode	*inode;
+	int	i;
+
+	for (i=0; i<MAX_ID; i++) {
+		if (lpk[i].key == 0)
+			break;
+	}
+	if (i == MAX_ID)
+		return NULL;
+	inode = kmalloc(sizeof(struct inode), GFP_ATOMIC);
+	if (inode == NULL)
+		return NULL;
+	
+	memset(inode, 0, sizeof(struct inode));
+	INIT_LIST_HEAD(&inode->i_hash);
+	inode->i_mapping = &inode->i_data;
+	inode->i_mapping->host = inode;
+	INIT_LIST_HEAD(&inode->i_data.clean_pages);
+	INIT_LIST_HEAD(&inode->i_data.dirty_pages);
+	INIT_LIST_HEAD(&inode->i_data.locked_pages);
+	spin_lock_init(&inode->i_data.i_shared_lock);
+	inode->i_ino = (unsigned long)key;
+
+	lpk[i].key = key;
+	lpk[i].in = inode;
+	inode->i_uid = current->fsuid;
+	inode->i_gid = current->fsgid;
+	inode->i_mode = prot;
+	inode->i_size = len;
+	return inode;
+}
+
+static int
+check_size_prot(struct inode *inode, unsigned long len, int prot, int flag)
+{
+	if (inode->i_uid != current->fsuid)
+		return -1;
+	if (inode->i_gid != current->fsgid)
+		return -1;
+	if (inode->i_mode != prot)
+		return -1;
+	if (inode->i_size != len)
+		return -1;
+	return 0;
+}
+
+int
+set_lp_shm_seg(int key, unsigned long *raddr, unsigned long len, int prot, int flag)
+{
+	struct	mm_struct		*mm = current->mm;
+	struct	vm_area_struct	*vma;
+	struct	inode			*inode;
+	struct	address_space	*mapping;
+	struct	page			*page;
+	unsigned long 			addr = *raddr;
+	int		idx;
+	int 	retval = -ENOMEM;
+
+	if (len & (LPAGE_SIZE -1))
+		return -EINVAL;
+
+	inode = find_key_inode(key);
+	if (inode == NULL) {
+		if (!(flag & IPC_CREAT))
+			return -ENOENT;
+		inode = set_new_inode(len, prot, flag, key);
+		if (inode == NULL) 
+			return -ENOMEM;
+	}
+	else
+		if (check_size_prot(inode, len, prot, flag) < 0)
+			return -EINVAL;
+	mapping = inode->i_mapping;
+
+	addr = do_mmap_pgoff(NULL, addr, len, (unsigned long)prot, 
+			MAP_FIXED|MAP_PRIVATE | MAP_ANONYMOUS, 0);
+	if (IS_ERR((void *)addr)) 
+		return -ENOMEM; 
+
+	vma = find_vma(mm, addr);
+	if (!vma)
+		return -EINVAL;
+	
+	*raddr = addr;
+	spin_lock(&mm->page_table_lock);
+	do {
+		pte_t * pte = lp_pte_alloc(mm, addr);
+		if ((pte) && (pte_none(*pte))) {
+			idx = (addr - vma->vm_start) >> LPAGE_SHIFT;
+			page = find_get_page(mapping, idx);
+			if (page == NULL) {
+				page = alloc_large_page();	
+				if (page == NULL) 
+					goto out;	
+				add_to_page_cache(page, mapping, idx);
+			}
+			set_lp_pte(mm, vma, page, pte, (vma->vm_flags & VM_WRITE));
+		} else 
+			goto out;
+		addr += LPAGE_SIZE;
+	} while (addr < vma->vm_end); 
+	retval = 0;
+	vma->vm_flags |= (VM_LARGEPAGE | VM_RESERVED);
+	vma->vm_ops = &lp_vm_ops;
+	spin_unlock(&mm->page_table_lock);
+	return retval;
+out:
+	if (addr > vma->vm_start) {
+		raddr = vma->vm_end;
+		vma->vm_end = addr;
+		zap_lp_resources(vma);
+		vma->vm_end = raddr;
+	}
+	spin_unlock(&mm->page_table_lock);
+	return retval;
+}
+
+int
+change_large_page_mem_size(int count)
+{
+	int j;
+	struct page     *page, *map;
+	extern long        lpzone_pages;
+	extern struct list_head lpage_freelist;
+
+	if (count == 0)
+		return (int)lpzone_pages;
+	if (count > 0) {/*Increase the mem size. */
+		while (count--) {
+			page = alloc_pages(GFP_ATOMIC, LARGE_PAGE_ORDER);
+			if (page == NULL)
+				break;
+			map = page;
+			for (j=0; j<(LPAGE_SIZE/PAGE_SIZE); j++) {
+				SetPageReserved(map);
+				map++;
+			}
+			spin_lock(&lpage_lock);
+			list_add(&page->list, &lpage_freelist);
+			lpagemem++;
+			lpzone_pages++;
+			spin_unlock(&lpage_lock);
+		}
+		return (int)lpzone_pages;
+	}
+	/*Shrink the memory size. */
+	while (count++) {
+		page = alloc_large_page();
+		if (page == NULL)
+			break;
+		spin_lock(&lpage_lock);
+		lpzone_pages--;
+		spin_unlock(&lpage_lock);
+		map = page;
+		for (j=0; j<(LPAGE_SIZE/PAGE_SIZE); j++) {
+			ClearPageReserved(map);
+			map++;
+		}
+		__free_pages(page, LARGE_PAGE_ORDER);
+	}
+	return (int)lpzone_pages;
+}
+static struct vm_operations_struct	lp_vm_ops = {
+	close: zap_lp_resources,
+};
diff -Naru linux.org/fs/proc/array.c linux.lp/fs/proc/array.c
--- linux.org/fs/proc/array.c	Thu Oct 11 09:00:01 2001
+++ linux.lp/fs/proc/array.c	Wed Jul  3 16:59:09 2002
@@ -486,6 +486,17 @@
 			pgd_t *pgd = pgd_offset(mm, vma->vm_start);
 			int pages = 0, shared = 0, dirty = 0, total = 0;
 
+			if (is_vm_large_page(vma)) {
+				int num_pages = ((vma->vm_end - vma->vm_start)/PAGE_SIZE);
+				resident += num_pages;
+				if ((vma->vm_flags & VM_DONTCOPY))
+					share += num_pages;
+				if (vma->vm_flags & VM_WRITE)
+					dt += num_pages;
+				drs += num_pages;
+				vma = vma->vm_next;
+				continue;
+			}
 			statm_pgd_range(pgd, vma->vm_start, vma->vm_end, &pages, &shared, &dirty, &total);
 			resident += pages;
 			share += shared;
diff -Naru linux.org/fs/proc/proc_misc.c linux.lp/fs/proc/proc_misc.c
--- linux.org/fs/proc/proc_misc.c	Tue Nov 20 21:29:09 2001
+++ linux.lp/fs/proc/proc_misc.c	Wed Jul  3 10:48:21 2002
@@ -151,6 +151,14 @@
 		B(i.sharedram), B(i.bufferram),
 		B(pg_size), B(i.totalswap),
 		B(i.totalswap-i.freeswap), B(i.freeswap));
+#ifdef CONFIG_LARGE_PAGE
+	{
+		extern  unsigned long lpagemem, lpzone_pages;
+		len += sprintf(page+len,"Total # of LargePages: %8lu\t\tAvailable: %8lu\n"
+		"LargePageSize: %8lu(0x%xKB)\n",
+		lpzone_pages, lpagemem, LPAGE_SIZE, (LPAGE_SIZE/1024));
+	}
+#endif
 	/*
 	 * Tagged format, for easy grepping and expansion.
 	 * The above will go away eventually, once the tools
diff -Naru linux.org/include/asm-i386/page.h linux.lp/include/asm-i386/page.h
--- linux.org/include/asm-i386/page.h	Mon Feb 25 11:38:12 2002
+++ linux.lp/include/asm-i386/page.h	Wed Jul  3 10:49:54 2002
@@ -41,14 +41,22 @@
 typedef struct { unsigned long long pmd; } pmd_t;
 typedef struct { unsigned long long pgd; } pgd_t;
 #define pte_val(x)	((x).pte_low | ((unsigned long long)(x).pte_high << 32))
+#define	LPAGE_SHIFT	21
 #else
 typedef struct { unsigned long pte_low; } pte_t;
 typedef struct { unsigned long pmd; } pmd_t;
 typedef struct { unsigned long pgd; } pgd_t;
 #define pte_val(x)	((x).pte_low)
+#define	LPAGE_SHIFT	22
 #endif
 #define PTE_MASK	PAGE_MASK
 
+#ifdef CONFIG_LARGE_PAGE
+#define LPAGE_SIZE 	((1UL) << LPAGE_SHIFT)
+#define	LPAGE_MASK	(~(LPAGE_SIZE - 1))
+#define LARGE_PAGE_ORDER	(LPAGE_SHIFT - PAGE_SHIFT)
+#endif
+
 typedef struct { unsigned long pgprot; } pgprot_t;
 
 #define pmd_val(x)	((x).pmd)
diff -Naru linux.org/include/linux/mm.h linux.lp/include/linux/mm.h
--- linux.org/include/linux/mm.h	Fri Dec 21 09:42:03 2001
+++ linux.lp/include/linux/mm.h	Wed Jul  3 10:49:54 2002
@@ -103,6 +103,7 @@
 #define VM_DONTEXPAND	0x00040000	/* Cannot expand with mremap() */
 #define VM_RESERVED	0x00080000	/* Don't unmap it from swap_out */
 
+#define VM_LARGEPAGE	0x00400000	/* Large_Page mapping. */
 #define VM_STACK_FLAGS	0x00000177
 
 #define VM_READHINTMASK			(VM_SEQ_READ | VM_RAND_READ)
@@ -425,6 +426,16 @@
 int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, unsigned long start,
 		int len, int write, int force, struct page **pages, struct vm_area_struct **vmas);
 
+#ifdef CONFIG_LARGE_PAGE
+#define is_vm_large_page(vma) (vma->vm_flags & VM_LARGEPAGE)
+extern int	copy_large_page(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
+extern int	follow_large_page(struct mm_struct *, struct vm_area_struct *, struct page **, struct vm_area_struct **, unsigned long *, int *, int);
+#else
+#define	is_vm_large_page(vma)	(0)
+#define	follow_large_page(mm, vma, pages, vmas, &start, &len, i) (0)
+#define	copy_large_page(dst, src, vma) (0)
+#endif 
+
 /*
  * On a two-level page table, this ends up being trivial. Thus the
  * inlining and the symmetry break with pte_alloc() that does all
diff -Naru linux.org/include/linux/mmzone.h linux.lp/include/linux/mmzone.h
--- linux.org/include/linux/mmzone.h	Thu Nov 22 11:46:19 2001
+++ linux.lp/include/linux/mmzone.h	Wed Jul  3 10:49:54 2002
@@ -13,7 +13,7 @@
  */
 
 #ifndef CONFIG_FORCE_MAX_ZONEORDER
-#define MAX_ORDER 10
+#define MAX_ORDER 15
 #else
 #define MAX_ORDER CONFIG_FORCE_MAX_ZONEORDER
 #endif
diff -Naru linux.org/include/linux/sysctl.h linux.lp/include/linux/sysctl.h
--- linux.org/include/linux/sysctl.h	Mon Nov 26 05:29:17 2001
+++ linux.lp/include/linux/sysctl.h	Wed Jul  3 10:49:54 2002
@@ -124,6 +124,7 @@
 	KERN_CORE_USES_PID=52,		/* int: use core or core.%pid */
 	KERN_TAINTED=53,	/* int: various kernel tainted flags */
 	KERN_CADPID=54,		/* int: PID of the process to notify on CAD */
+	KERN_LARGE_PAGE_MEM=55,	/* Number of large_page pages configured */
 };
 
 
diff -Naru linux.org/kernel/sysctl.c linux.lp/kernel/sysctl.c
--- linux.org/kernel/sysctl.c	Fri Dec 21 09:42:04 2001
+++ linux.lp/kernel/sysctl.c	Tue Jul  2 14:07:28 2002
@@ -96,6 +96,10 @@
 extern int acct_parm[];
 #endif
 
+#ifdef CONFIG_LARGE_PAGE
+extern int	lp_max;
+extern int	change_large_page_mem_size(int );
+#endif
 extern int pgt_cache_water[];
 
 static int parse_table(int *, int, void *, size_t *, void *, size_t,
@@ -256,6 +260,10 @@
 	{KERN_S390_USER_DEBUG_LOGGING,"userprocess_debug",
 	 &sysctl_userprocess_debug,sizeof(int),0644,NULL,&proc_dointvec},
 #endif
+#ifdef CONFIG_LARGE_PAGE
+	{KERN_LARGE_PAGE_MEM, "numlargepages", &lp_max, sizeof(int), 0644, NULL,
+	&proc_dointvec},
+#endif
 	{0}
 };
 
@@ -866,6 +874,10 @@
 				val = -val;
 			buffer += len;
 			left -= len;
+#if CONFIG_LARGE_PAGE
+			if (i == &lp_max)
+				val = change_large_page_mem_size(val);
+#endif
 			switch(op) {
 			case OP_SET:	*i = val; break;
 			case OP_AND:	*i &= val; break;
diff -Naru linux.org/mm/memory.c linux.lp/mm/memory.c
--- linux.org/mm/memory.c	Mon Feb 25 11:38:13 2002
+++ linux.lp/mm/memory.c	Wed Jul  3 16:14:01 2002
@@ -179,6 +179,9 @@
 	unsigned long end = vma->vm_end;
 	unsigned long cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
 
+	if (is_vm_large_page(vma) )
+		return copy_lpage_range(dst, src, vma);
+
 	src_pgd = pgd_offset(src, address)-1;
 	dst_pgd = pgd_offset(dst, address)-1;
 
@@ -471,6 +474,10 @@
 		if ( !vma || (pages && vma->vm_flags & VM_IO) || !(flags & vma->vm_flags) )
 			return i ? : -EFAULT;
 
+		if (is_vm_large_page(vma)) {
+			i += follow_large_page(mm, vma, pages, vmas, &start, &len, i);
+			continue;
+		}
 		spin_lock(&mm->page_table_lock);
 		do {
 			struct page *map;
@@ -1360,6 +1367,8 @@
 {
 	pgd_t *pgd;
 	pmd_t *pmd;
+	if (is_vm_large_page(vma) )
+		return -1;
 
 	current->state = TASK_RUNNING;
 	pgd = pgd_offset(mm, address);
diff -Naru linux.org/mm/mmap.c linux.lp/mm/mmap.c
--- linux.org/mm/mmap.c	Mon Feb 25 11:38:14 2002
+++ linux.lp/mm/mmap.c	Tue Jul  2 14:15:50 2002
@@ -917,6 +917,9 @@
 	if (mpnt->vm_start >= addr+len)
 		return 0;
 
+	if (is_vm_large_page(mpnt)) /*Large pages can not be unmapped like this. */
+		return -EINVAL;
+
 	/* If we'll make "hole", check the vm areas limit */
 	if ((mpnt->vm_start < addr && mpnt->vm_end > addr+len)
 	    && mm->map_count >= MAX_MAP_COUNT)
diff -Naru linux.org/mm/mprotect.c linux.lp/mm/mprotect.c
--- linux.org/mm/mprotect.c	Mon Sep 17 15:30:23 2001
+++ linux.lp/mm/mprotect.c	Tue Jul  2 14:18:13 2002
@@ -287,6 +287,8 @@
 	error = -EFAULT;
 	if (!vma || vma->vm_start > start)
 		goto out;
+	if (is_vm_large_page(vma))
+		return -EINVAL; /* Cann't change protections on large_page mappings. */
 
 	for (nstart = start ; ; ) {
 		unsigned int newflags;
diff -Naru linux.org/mm/mremap.c linux.lp/mm/mremap.c
--- linux.org/mm/mremap.c	Thu Sep 20 20:31:26 2001
+++ linux.lp/mm/mremap.c	Tue Jul  2 14:20:05 2002
@@ -267,6 +267,10 @@
 	vma = find_vma(current->mm, addr);
 	if (!vma || vma->vm_start > addr)
 		goto out;
+	if (is_vm_large_page(vma)) {
+		ret = -EINVAL; /* Cann't remap large_page mappings. */
+		goto out;
+	}
 	/* We can't remap across vm area boundaries */
 	if (old_len > vma->vm_end - addr)
 		goto out;
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: large page patch
  2002-08-02  0:37 large page patch Andrew Morton
@ 2002-08-02  0:43 ` David S. Miller
  2002-08-02  1:26   ` Andrew Morton
  2002-08-02  1:55   ` Rik van Riel
  2002-08-02  1:09 ` Martin J. Bligh
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 30+ messages in thread
From: David S. Miller @ 2002-08-02  0:43 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, rohit.seth, sunil.saxena, asit.k.mallick

   Some observations which have been made thus far:

   - Minimal impact on the VM and MM layers

Well the downside of this is that it means it isn't transparent
to userspace.  For example, specfp2000 results aren't going to
improve after installing these changes.  Some of the other large
page implementations would.

   - The change to MAX_ORDER is unneeded

This is probably done to increase the likelyhood that 4MB page orders
are available.  If we collapse 4MB pages deeper, they are less likely
to be broken up because smaller orders would be selected first.

Maybe it doesn't make a difference....

   - swapping of large pages and making them pagecache-coherent is
     unpopular.

Swapping them is easy, any time you hit a large PTE you unlarge it.
This is what some of other large page implementations do.  Basically
the implementation is that set_pte() breaks apart large ptes when
necessary.

I agree on the pagecache side.

Actually to be honest the other implementations seemed less
intrusive and easier to add support for.  The downside is that
handling of weird cases like x86 using pmd's for 4MB pages
was not complete last time I checked.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: large page patch
  2002-08-02  0:43 ` David S. Miller
@ 2002-08-02  1:26   ` Andrew Morton
  2002-08-02  1:19     ` David S. Miller
  2002-08-02  1:55   ` Rik van Riel
  1 sibling, 1 reply; 30+ messages in thread
From: Andrew Morton @ 2002-08-02  1:26 UTC (permalink / raw)
  To: David S. Miller
  Cc: linux-kernel, linux-mm, rohit.seth, sunil.saxena, asit.k.mallick

"David S. Miller" wrote:
> 
>    From: Andrew Morton <akpm@zip.com.au>
>    Date: Thu, 01 Aug 2002 17:37:46 -0700
> 
>    Some observations which have been made thus far:
> 
>    - Minimal impact on the VM and MM layers
> 
> Well the downside of this is that it means it isn't transparent
> to userspace.  For example, specfp2000 results aren't going to
> improve after installing these changes.  Some of the other large
> page implementations would.
> 
>    - The change to MAX_ORDER is unneeded
> 
> This is probably done to increase the likelyhood that 4MB page orders
> are available.  If we collapse 4MB pages deeper, they are less likely
> to be broken up because smaller orders would be selected first.

This is leakage from ia64, which supports up to 256k pages.

> Maybe it doesn't make a difference....
> 
>    - swapping of large pages and making them pagecache-coherent is
>      unpopular.
> 
> Swapping them is easy, any time you hit a large PTE you unlarge it.
> This is what some of other large page implementations do.  Basically
> the implementation is that set_pte() breaks apart large ptes when
> necessary.

As far as mm/*.c is concerned, there is no pte.  It's just a vma
which is marked "don't touch"  These pages aren't on the LRU, nothing
knows about them.

Apparently a page-table based representation could not be used by PPC.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: large page patch
  2002-08-02  1:26   ` Andrew Morton
@ 2002-08-02  1:19     ` David S. Miller
  0 siblings, 0 replies; 30+ messages in thread
From: David S. Miller @ 2002-08-02  1:19 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, rohit.seth, sunil.saxena, asit.k.mallick

   "David S. Miller" wrote:
   > This is probably done to increase the likelyhood that 4MB page orders
   > are available.  If we collapse 4MB pages deeper, they are less likely
   > to be broken up because smaller orders would be selected first.
   
   This is leakage from ia64, which supports up to 256k pages.

Ummm, 4MB > 256K and even with a 4K PAGE_SIZE MAX_ORDER coalesces
up to 4MB already :-)
   
   Apparently a page-table based representation could not be used by PPC.
   
The page-table is just an abstraction, there is no reason dummy
"large" ptes could not be used which are just ignored by the HW TLB
reload code.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: large page patch
  2002-08-02  0:43 ` David S. Miller
  2002-08-02  1:26   ` Andrew Morton
@ 2002-08-02  1:55   ` Rik van Riel
  2002-08-02  1:50     ` David S. Miller
  2002-08-02  2:29     ` Gerrit Huizenga
  1 sibling, 2 replies; 30+ messages in thread
From: Rik van Riel @ 2002-08-02  1:55 UTC (permalink / raw)
  To: David S. Miller
  Cc: Andrew Morton, linux-kernel, linux-mm, rohit.seth, sunil.saxena,
	asit.k.mallick, gh

On Thu, 1 Aug 2002, David S. Miller wrote:
>    From: Andrew Morton <akpm@zip.com.au>

>    - Minimal impact on the VM and MM layers
>
> Well the downside of this is that it means it isn't transparent
> to userspace.  For example, specfp2000 results aren't going to
> improve after installing these changes.  Some of the other large
> page implementations would.

It also means we can't automatically switch to large pages for
SHM segments, which is the number one area where we need large
pages...

We should also take into account that the main application that
needs large pages for its SHM segments is Oracle, which we don't
have the source code for so we can't recompile it to use the new
syscalls introduced by this patch ...

IMHO we shouldn't blindly decide for (or against!) this patch
but also carefully look at the large page patch from RHAS (which
got added to -aa recently) and the large page patch which IBM
is working on.

kind regards,

Rik
-- 
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/		http://distro.conectiva.com/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: large page patch
  2002-08-02  1:55   ` Rik van Riel
@ 2002-08-02  1:50     ` David S. Miller
  2002-08-02  2:29     ` Gerrit Huizenga
  1 sibling, 0 replies; 30+ messages in thread
From: David S. Miller @ 2002-08-02  1:50 UTC (permalink / raw)
  To: riel
  Cc: akpm, linux-kernel, linux-mm, rohit.seth, sunil.saxena,
	asit.k.mallick, gh

   
   IMHO we shouldn't blindly decide for (or against!) this patch
   but also carefully look at the large page patch from RHAS (which
   got added to -aa recently) and the large page patch which IBM
   is working on.

And the one from Naohiko Shimizu which is my personal favorite
because sparc64 support is there :)

http://shimizu-lab.dt.u-tokai.ac.jp/lsp.html
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: large page patch
  2002-08-02  1:55   ` Rik van Riel
  2002-08-02  1:50     ` David S. Miller
@ 2002-08-02  2:29     ` Gerrit Huizenga
  2002-08-02  2:23       ` David S. Miller
  2002-08-02  5:24       ` David Mosberger
  1 sibling, 2 replies; 30+ messages in thread
From: Gerrit Huizenga @ 2002-08-02  2:29 UTC (permalink / raw)
  To: Rik van Riel
  Cc: David S. Miller, Andrew Morton, linux-kernel, linux-mm,
	rohit.seth, sunil.saxena, asit.k.mallick

In message <Pine.LNX.4.44L.0208012246390.23404-100000@imladris.surriel.com>, > 
: Rik van Riel writes:
> On Thu, 1 Aug 2002, David S. Miller wrote:
> >    From: Andrew Morton <akpm@zip.com.au>
> 
> >    - Minimal impact on the VM and MM layers
> >
> > Well the downside of this is that it means it isn't transparent
> > to userspace.  For example, specfp2000 results aren't going to
> > improve after installing these changes.  Some of the other large
> > page implementations would.
> 
> We should also take into account that the main application that
> needs large pages for its SHM segments is Oracle, which we don't
> have the source code for so we can't recompile it to use the new
> syscalls introduced by this patch ...

There are quite a few other applications that can benefit from large
page support.  IBM Watson Research published JVM and some scientific
workload results using large pages which showed substantial benefits.
Also, we believe DB2, Domino, other memory piggish apps (e.g. think
scientific) would benefit equally on many architectures.

It would sure be nice if the interface wasn't some kludgey back door
but more integrated with things like mmap() or shm*(), with semantics
and behaviors that were roughly more predictable.  Other than that,
no comments as yet on the patch internals...

gerrit 
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: large page patch
  2002-08-02  2:29     ` Gerrit Huizenga
@ 2002-08-02  2:23       ` David S. Miller
  2002-08-02  2:53         ` Gerrit Huizenga
  2002-08-02  5:24       ` David Mosberger
  1 sibling, 1 reply; 30+ messages in thread
From: David S. Miller @ 2002-08-02  2:23 UTC (permalink / raw)
  To: gh
  Cc: riel, akpm, linux-kernel, linux-mm, rohit.seth, sunil.saxena,
	asit.k.mallick

   
   other memory piggish apps (e.g. think scientific) would benefit

There are example "benchmark'ish" example on Naohiko Shimizu's
large page project page.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: large page patch
  2002-08-02  2:23       ` David S. Miller
@ 2002-08-02  2:53         ` Gerrit Huizenga
  0 siblings, 0 replies; 30+ messages in thread
From: Gerrit Huizenga @ 2002-08-02  2:53 UTC (permalink / raw)
  To: David S. Miller
  Cc: riel, akpm, linux-kernel, linux-mm, rohit.seth, sunil.saxena,
	asit.k.mallick

In message <20020801.192304.85746519.davem@redhat.com>, > : "David S. Miller" w
rites:
>    From: Gerrit Huizenga <gh@us.ibm.com>
>    Date: Thu, 01 Aug 2002 19:29:52 -0700
>    
>    other memory piggish apps (e.g. think scientific) would benefit
> 
> There are example "benchmark'ish" example on Naohiko Shimizu's
> large page project page.

No 2.5 code, though.  :(  But there *is* 2.2.16 code!

gerrit
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: large page patch
  2002-08-02  2:29     ` Gerrit Huizenga
  2002-08-02  2:23       ` David S. Miller
@ 2002-08-02  5:24       ` David Mosberger
  2002-08-02  5:20         ` David S. Miller
  1 sibling, 1 reply; 30+ messages in thread
From: David Mosberger @ 2002-08-02  5:24 UTC (permalink / raw)
  To: Gerrit Huizenga
  Cc: Rik van Riel, David S. Miller, Andrew Morton, linux-kernel,
	linux-mm, rohit.seth, sunil.saxena, asit.k.mallick

>>>>> On Thu, 01 Aug 2002 19:29:52 -0700, Gerrit Huizenga <gh@us.ibm.com> said:

  Gerrit> It would sure be nice if the interface wasn't some kludgey
  Gerrit> back door but more integrated with things like mmap() or
  Gerrit> shm*(), with semantics and behaviors that were roughly more
  Gerrit> predictable.  Other than that, no comments as yet on the
  Gerrit> patch internals...

In my opinion the proposed large-page patch addresses a relatively
pressing need for databases (primarily).  Longer term, I'd hope that
it can be replaced by a transparent superpage scheme.  But the
existing patch can also serve as a nice benchmark for transparent
schemes (and frankly, since it doesn't have to do anything smart
behind the scenes, it's likely that the existing patch, where
applicable, will always do slightly better than a transparent one).

In any case, the big issue of physical memory fragmentation can be
experimented with indepent what the user-level interface looks like.
So the existing patch is useful in that sense as well.

	--david
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: large page patch
  2002-08-02  5:24       ` David Mosberger
@ 2002-08-02  5:20         ` David S. Miller
  2002-08-02  6:26           ` David Mosberger
  0 siblings, 1 reply; 30+ messages in thread
From: David S. Miller @ 2002-08-02  5:20 UTC (permalink / raw)
  To: davidm, davidm
  Cc: gh, riel, akpm, linux-kernel, linux-mm, rohit.seth, sunil.saxena,
	asit.k.mallick

   
   In my opinion the proposed large-page patch addresses a relatively
   pressing need for databases (primarily).

Databases want large pages with IPC_SHM, how can this special
syscal hack address that?

It's great for experimentation, but give up syscall slots for
this?
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: large page patch
  2002-08-02  5:20         ` David S. Miller
@ 2002-08-02  6:26           ` David Mosberger
  2002-08-02  6:33             ` Martin J. Bligh
  2002-08-02  8:20             ` David S. Miller
  0 siblings, 2 replies; 30+ messages in thread
From: David Mosberger @ 2002-08-02  6:26 UTC (permalink / raw)
  To: David S. Miller
  Cc: davidm, davidm, gh, riel, akpm, linux-kernel, linux-mm,
	rohit.seth, sunil.saxena, asit.k.mallick

>>>>> On Thu, 01 Aug 2002 22:20:53 -0700 (PDT), "David S. Miller" <davem@redhat.com> said:

  DaveM>    From: David Mosberger <davidm@napali.hpl.hp.com> Date:
  DaveM> Thu, 1 Aug 2002 22:24:05 -0700

  DaveM>    In my opinion the proposed large-page patch addresses a
  DaveM> relatively pressing need for databases (primarily).

  DaveM> Databases want large pages with IPC_SHM, how can this
  DaveM> special syscal hack address that?

I believe the interface is OK in that regard.  AFAIK, Oracle is happy
with it.

  DaveM> It's great for experimentation, but give up syscall slots
  DaveM> for this?

I'm a bit concerned about this, too.  My preference would have been to
use the regular mmap() and shmat() syscalls with some
augmentation/hint as to what the preferred page size is (Simon
Winwood's OLS 2002 paper talks about some options here).  I like this
because hints could be useful even with a transparent superpage
scheme.

The original Intel patch did use more of a hint-like approach (the
hint was a simple binary flag though: give me regular pages or give me
large pages), but Linus preferred a separate syscall interface, so the
Intel folks switched over to doing that.

	--david
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: large page patch
  2002-08-02  6:26           ` David Mosberger
@ 2002-08-02  6:33             ` Martin J. Bligh
  2002-08-02  6:44               ` David Mosberger
  2002-08-02  7:08               ` Andrew Morton
  2002-08-02  8:20             ` David S. Miller
  1 sibling, 2 replies; 30+ messages in thread
From: Martin J. Bligh @ 2002-08-02  6:33 UTC (permalink / raw)
  To: davidm, David S. Miller
  Cc: gh, riel, akpm, linux-kernel, linux-mm, rohit.seth, sunil.saxena,
	asit.k.mallick

>   DaveM>    In my opinion the proposed large-page patch addresses a
>   DaveM> relatively pressing need for databases (primarily).
> 
>   DaveM> Databases want large pages with IPC_SHM, how can this
>   DaveM> special syscal hack address that?
> 
> I believe the interface is OK in that regard.  AFAIK, Oracle is happy
> with it.

Is Oracle now the world's only database? I think not.
 
>   DaveM> It's great for experimentation, but give up syscall slots
>   DaveM> for this?
> 
> I'm a bit concerned about this, too.  My preference would have been to
> use the regular mmap() and shmat() syscalls with some
> augmentation/hint as to what the preferred page size is 

I think that's what most users would prefer, and I don't think it
adds a vast amount of kernel complexity. Linus doesn't seem to
be dead set against the shmem modifications at least ... so that's
half way there ;-)

M.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: large page patch
  2002-08-02  6:33             ` Martin J. Bligh
@ 2002-08-02  6:44               ` David Mosberger
  2002-08-02 10:00                 ` Marcin Dalecki
  2002-08-02  7:08               ` Andrew Morton
  1 sibling, 1 reply; 30+ messages in thread
From: David Mosberger @ 2002-08-02  6:44 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: davidm, David S. Miller, riel, akpm, linux-kernel, linux-mm,
	rohit.seth, sunil.saxena, asit.k.mallick

>>>>> On Thu, 01 Aug 2002 23:33:26 -0700, "Martin J. Bligh" <Martin.Bligh@us.ibm.com> said:

  DaveM> In my opinion the proposed large-page patch addresses a
  DaveM> relatively pressing need for databases (primarily).
  >>
  DaveM> Databases want large pages with IPC_SHM, how can this special
  DaveM> syscal hack address that?

  >>  I believe the interface is OK in that regard.  AFAIK, Oracle is
  >> happy with it.

  Martin> Is Oracle now the world's only database? I think not.

I didn't say such a thing.  I just don't know what other db vendors/authors
think of the proposed interface.  I'm sure their feedback would be welcome.

	--david
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: large page patch
  2002-08-02  6:44               ` David Mosberger
@ 2002-08-02 10:00                 ` Marcin Dalecki
  0 siblings, 0 replies; 30+ messages in thread
From: Marcin Dalecki @ 2002-08-02 10:00 UTC (permalink / raw)
  To: davidm; +Cc: Martin J. Bligh, David S. Miller

Uz.ytkownik David Mosberger napisa?:
>>>>>>On Thu, 01 Aug 2002 23:33:26 -0700, "Martin J. Bligh" <Martin.Bligh@us.ibm.com> said:
>>>>>
> 
>   DaveM> In my opinion the proposed large-page patch addresses a
>   DaveM> relatively pressing need for databases (primarily).
>   >>
>   DaveM> Databases want large pages with IPC_SHM, how can this special
>   DaveM> syscal hack address that?
> 
>   >>  I believe the interface is OK in that regard.  AFAIK, Oracle is
>   >> happy with it.
> 
>   Martin> Is Oracle now the world's only database? I think not.
> 
> I didn't say such a thing.  I just don't know what other db vendors/authors
> think of the proposed interface.  I'm sure their feedback would be welcome.

You better don't ask DB people and in esp. the Oracle people
about opinnions on interface design. Unless you wan't something
fscking ugly internally looking like FORTRAN/COBOL coding.
They will always scrap portability/usability use undocumented behaviour 
and so on in the case they can presumably increase theyr pet benchmark
values.
One of the reasons Solaris is *feeling* so slow is that they asked
Oracle people too frequent about oppinions apparently. In esp. they did 
forgett that there are other uses then DB servers ;-).

PS. I just got too much in touch with Oracle to not hate it...

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: large page patch
  2002-08-02  6:33             ` Martin J. Bligh
  2002-08-02  6:44               ` David Mosberger
@ 2002-08-02  7:08               ` Andrew Morton
  2002-08-02  7:15                 ` William Lee Irwin III
  1 sibling, 1 reply; 30+ messages in thread
From: Andrew Morton @ 2002-08-02  7:08 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: davidm, David S. Miller, riel, linux-kernel, linux-mm,
	rohit.seth, sunil.saxena, asit.k.mallick

"Martin J. Bligh" wrote:
> 
> >   DaveM>    In my opinion the proposed large-page patch addresses a
> >   DaveM> relatively pressing need for databases (primarily).
> >
> >   DaveM> Databases want large pages with IPC_SHM, how can this
> >   DaveM> special syscal hack address that?
> >
> > I believe the interface is OK in that regard.  AFAIK, Oracle is happy
> > with it.
> 
> Is Oracle now the world's only database? I think not.

Is a draft of Simon's patch available against 2.5?

-
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: large page patch
  2002-08-02  7:08               ` Andrew Morton
@ 2002-08-02  7:15                 ` William Lee Irwin III
  0 siblings, 0 replies; 30+ messages in thread
From: William Lee Irwin III @ 2002-08-02  7:15 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Martin J. Bligh, davidm, David S. Miller

At some point in the past, Dave Miller wrote:
DaveM>    In my opinion the proposed large-page patch addresses a
DaveM> relatively pressing need for databases (primarily).
DaveM> Databases want large pages with IPC_SHM, how can this
DaveM> special syscal hack address that?

At some point in the past, David Mosberger wrote:
I believe the interface is OK in that regard.  AFAIK, Oracle is happy
with it.

"Martin J. Bligh" wrote:
>> Is Oracle now the world's only database? I think not.

On Fri, Aug 02, 2002 at 12:08:50AM -0700, Andrew Morton wrote:
> Is a draft of Simon's patch available against 2.5?

Unless I can turn blood into wine, walk on water, and produce a working
2.5 version of the thing in < 6 hours (not that I'm not trying), this
will probably have to wait until Hubertus remeterializes tomorrow
morning (EDT) and further porting is done. I'll be up early.

Cheers,
Bill
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: large page patch
  2002-08-02  6:26           ` David Mosberger
  2002-08-02  6:33             ` Martin J. Bligh
@ 2002-08-02  8:20             ` David S. Miller
  2002-08-02  9:05               ` Ryan Cumming
  2002-08-02 15:27               ` David Mosberger
  1 sibling, 2 replies; 30+ messages in thread
From: David S. Miller @ 2002-08-02  8:20 UTC (permalink / raw)
  To: davidm, davidm
  Cc: gh, riel, akpm, linux-kernel, linux-mm, rohit.seth, sunil.saxena,
	asit.k.mallick

   I'm a bit concerned about this, too.  My preference would have been to
   use the regular mmap() and shmat() syscalls with some
   augmentation/hint as to what the preferred page size is (Simon
   Winwood's OLS 2002 paper talks about some options here).  I like this
   because hints could be useful even with a transparent superpage
   scheme.

A "hint" to use superpages?  That's absurd.

Any time you are able to translate N pages instead of 1 page with 1
TLB entry it's always preferable.

I also don't buy the swapping complexity bit.  The fact is, SHM and
anonymous pages are easy.  Just stay away from the page cache and it
is pretty simple to just make the normal VM do this.

If set_pte sees a large page, you simply undo the large ptes in that
group and the complexity ends right there.  This means the only maker
of large pages is the bit that creates the large mappings and has all
of the ideal conditions up front.  Any time anything happens to that
pte you undo the large'ness of it so that you get normal PAGE_SIZE
ptes back.

Using superpages for anonymous+SHM pages is really the only area I
still think Linux's MM can offer inferior performance compared to what
the hardware is actually capable of.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: large page patch
  2002-08-02  8:20             ` David S. Miller
@ 2002-08-02  9:05               ` Ryan Cumming
  2002-08-02  9:06                 ` David S. Miller
  2002-08-02 12:52                 ` Rik van Riel
  2002-08-02 15:27               ` David Mosberger
  1 sibling, 2 replies; 30+ messages in thread
From: Ryan Cumming @ 2002-08-02  9:05 UTC (permalink / raw)
  To: David S. Miller, davidm, davidm
  Cc: gh, riel, akpm, linux-kernel, linux-mm, rohit.seth, sunil.saxena,
	asit.k.mallick

On August 2, 2002 01:20, David S. Miller wrote:
>    From: David Mosberger <davidm@napali.hpl.hp.com>
>    Date: Thu, 1 Aug 2002 23:26:07 -0700
>
>    I'm a bit concerned about this, too.  My preference would have been to
>    use the regular mmap() and shmat() syscalls with some
>    augmentation/hint as to what the preferred page size is (Simon
>    Winwood's OLS 2002 paper talks about some options here).  I like this
>    because hints could be useful even with a transparent superpage
>    scheme.
>
> A "hint" to use superpages?  That's absurd.

What about applications that want fine-grained page aging? 4MB is a tad on the 
course side for most desktop applications.

-Ryan
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: large page patch
  2002-08-02  9:05               ` Ryan Cumming
@ 2002-08-02  9:06                 ` David S. Miller
  2002-08-02 12:52                 ` Rik van Riel
  1 sibling, 0 replies; 30+ messages in thread
From: David S. Miller @ 2002-08-02  9:06 UTC (permalink / raw)
  To: ryan
  Cc: davidm, davidm, gh, riel, akpm, linux-kernel, linux-mm,
	rohit.seth, sunil.saxena, asit.k.mallick

   
   What about applications that want fine-grained page aging? 4MB is a
   tad on the course side for most desktop applications.

Once vmscan sees the page and tries to liberate it, then it
will be unlarge'd and thus you'll get fine-grained page aging.

That's the beauty of my implementation suggestion.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: large page patch
  2002-08-02  9:05               ` Ryan Cumming
  2002-08-02  9:06                 ` David S. Miller
@ 2002-08-02 12:52                 ` Rik van Riel
  1 sibling, 0 replies; 30+ messages in thread
From: Rik van Riel @ 2002-08-02 12:52 UTC (permalink / raw)
  To: Ryan Cumming
  Cc: David S. Miller, davidm, davidm, gh, akpm, linux-kernel,
	linux-mm, rohit.seth, sunil.saxena, asit.k.mallick

On Fri, 2 Aug 2002, Ryan Cumming wrote:
> On August 2, 2002 01:20, David S. Miller wrote:

> > A "hint" to use superpages?  That's absurd.
>
> What about applications that want fine-grained page aging? 4MB is a tad
> on the course side for most desktop applications.

Of course we wouldn't want to use superpages for VMAs smaller
than, say, 4 of these superpages.

That would fix this problem automagically.

regards,

Rik
-- 
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/		http://distro.conectiva.com/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: large page patch
  2002-08-02  8:20             ` David S. Miller
  2002-08-02  9:05               ` Ryan Cumming
@ 2002-08-02 15:27               ` David Mosberger
  1 sibling, 0 replies; 30+ messages in thread
From: David Mosberger @ 2002-08-02 15:27 UTC (permalink / raw)
  To: David S. Miller
  Cc: davidm, davidm, gh, riel, akpm, linux-kernel, linux-mm,
	rohit.seth, sunil.saxena, asit.k.mallick

>>>>> On Fri, 02 Aug 2002 01:20:40 -0700 (PDT), "David S. Miller" <davem@redhat.com> said:

  Dave.M> A "hint" to use superpages?  That's absurd.

  Dave.M> Any time you are able to translate N pages instead of 1 page
  Dave.M> with 1 TLB entry it's always preferable.

Yeah, right.  So you think a 256MB page-size is optimal for all apps?

What you're missing is how you *get* to the point where you can map N
pages with a single TLB entry.  For that to happen, you need to
allocate physically contiguous and properly aligned memory (at least
given the hw that's common today).  Doing has certain costs, no matter
what your approach is.

	--david
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: large page patch
  2002-08-02  0:37 large page patch Andrew Morton
  2002-08-02  0:43 ` David S. Miller
@ 2002-08-02  1:09 ` Martin J. Bligh
  2002-08-02  1:36 ` Andrew Morton
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 30+ messages in thread
From: Martin J. Bligh @ 2002-08-02  1:09 UTC (permalink / raw)
  To: Andrew Morton, lkml, linux-mm; +Cc: Seth, Rohit, Saxena, Sunil, Mallick, Asit K

> - The change to MAX_ORDER is unneeded

It's not only unneeded, it's detrimental. Not only will we spend more
time merging stuff up and down to no effect, it also makes the 
config_nonlinear stuff harder (or we have to #ifdef it, which just causes
more unnecessary differentiation). Please don't do that little bit ....

M.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: large page patch
  2002-08-02  0:37 large page patch Andrew Morton
  2002-08-02  0:43 ` David S. Miller
  2002-08-02  1:09 ` Martin J. Bligh
@ 2002-08-02  1:36 ` Andrew Morton
  2002-08-02  4:31   ` Daniel Phillips
  2002-08-02  3:47 ` William Lee Irwin III
  2002-08-02 23:40 ` Chris Wedgwood
  4 siblings, 1 reply; 30+ messages in thread
From: Andrew Morton @ 2002-08-02  1:36 UTC (permalink / raw)
  To: lkml, linux-mm, Seth, Rohit, Saxena, Sunil, Mallick, Asit K

Merged up to 2.5.30.  It compiles.

http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.30/lpp.patch
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: large page patch
  2002-08-02  1:36 ` Andrew Morton
@ 2002-08-02  4:31   ` Daniel Phillips
  2002-08-02  4:47     ` Andrew Morton
  0 siblings, 1 reply; 30+ messages in thread
From: Daniel Phillips @ 2002-08-02  4:31 UTC (permalink / raw)
  To: Andrew Morton, lkml, linux-mm, Seth, Rohit, Saxena, Sunil,
	Mallick, Asit K

On Friday 02 August 2002 03:36, Andrew Morton wrote:
> Merged up to 2.5.30.  It compiles.
> 
> http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.30/lpp.patch

What was the original against?

-- 
Daniel
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: large page patch
  2002-08-02  4:31   ` Daniel Phillips
@ 2002-08-02  4:47     ` Andrew Morton
  0 siblings, 0 replies; 30+ messages in thread
From: Andrew Morton @ 2002-08-02  4:47 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: lkml, linux-mm, Seth, Rohit, Saxena, Sunil, Mallick, Asit K

Daniel Phillips wrote:
> 
> On Friday 02 August 2002 03:36, Andrew Morton wrote:
> > Merged up to 2.5.30.  It compiles.
> >
> > http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.30/lpp.patch
> 
> What was the original against?

2.4.18
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: large page patch
  2002-08-02  0:37 large page patch Andrew Morton
                   ` (2 preceding siblings ...)
  2002-08-02  1:36 ` Andrew Morton
@ 2002-08-02  3:47 ` William Lee Irwin III
  2002-08-02 23:40 ` Chris Wedgwood
  4 siblings, 0 replies; 30+ messages in thread
From: William Lee Irwin III @ 2002-08-02  3:47 UTC (permalink / raw)
  To: Andrew Morton; +Cc: lkml, linux-mm, Seth, Rohit, Saxena, Sunil, Mallick, Asit K

On Thu, Aug 01, 2002 at 05:37:46PM -0700, Andrew Morton wrote:
> This is a large-page support patch from Rohit Seth, forwarded
> with his permission (thanks!).

Overall, the code looks very clean.

(1) So there are now 4 of these things. How do they compare to each
	other? Where are the comparison benchmarks? How do their
	features compare? Which one(s) do users want?

(2) The allocation policies for pagetables mapping the things may as
	well do some kind of lookup, sharing, and cacheing; it's likely
	a significant number of the users of the shm segment will be
	mapping them more or less the same way given database usage
	patterns. It's not a significant amount of space, but kernels
	should be frugal about space, and with many tasks as is typical
	of databases, the savings may well add up to a small but
	respectable chunk of ZONE_NORMAL.

(3) As long as the interface is explicit, it might as well drop flags
	into shm and mmap. There isn't even C library support for these
	things as they are... time to int $0x80 again so I can test.

(4) Requiring app awareness of page alignment looks like an irritating
	porting issue, which doesn't sound as trivial as it would
	otherwise be in already extremely cramped 32-bit virtual
	address spaces.

(5) What's in it for the average user? It's doubtful GNOME will be
	registering memory blocks with these syscalls anytime soon.
	Granted, the opportunities for reducing TLB load this way
	are small on desktop systems, but it doesn't feel quite
	right to just throw mappings of magic physical memory into the
	hands of a few enlightened apps on machines with memory to burn
	and leave all others in the cold.
	By several accounts "scalability" is defined as "performing as
	well on large machines as it does on small ones" ... but this
	seems to be a method of circumventing the kernel's own memory
	management as opposed to a method of improving it in all cases.

(6) As far as reconfiguring, I'm slightly concerned about the robustness
	of change_large_page_mem_size() in terms of how likely it is to
	succeed. Some on-demand defragmentation looks like it should be
	implemented to make it more reliable (now possible thanks to
	rmap). In general, the sysctl seems to lack some adaptivity.
	Granting root privileges to the workload vs. perpetual
	monitoring to find the ideal pool size sounds like a headache.

(7) I'm a little worried by the following:

zone(0): 4096 pages.
zone(1): 225280 pages.
BUG: wrong zone alignment, it will crash
zone(2): 3964928 pages.


My machine doesn't seem to care, but others' might.


Cheers,
Bill
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: large page patch
  2002-08-02  0:37 large page patch Andrew Morton
                   ` (3 preceding siblings ...)
  2002-08-02  3:47 ` William Lee Irwin III
@ 2002-08-02 23:40 ` Chris Wedgwood
  4 siblings, 0 replies; 30+ messages in thread
From: Chris Wedgwood @ 2002-08-02 23:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: lkml, linux-mm, Seth, Rohit, Saxena, Sunil, Mallick, Asit K

On Thu, Aug 01, 2002 at 05:37:46PM -0700, Andrew Morton wrote:

    diff -Naru linux.org/arch/i386/kernel/entry.S linux.lp/arch/i386/kernel/entry.S
    --- linux.org/arch/i386/kernel/entry.S	Mon Feb 25 11:37:53 2002
    +++ linux.lp/arch/i386/kernel/entry.S	Tue Jul  2 15:12:23 2002
    @@ -634,6 +634,10 @@
     	.long SYMBOL_NAME(sys_ni_syscall)	/* 235 reserved for removexattr */
     	.long SYMBOL_NAME(sys_ni_syscall)	/* reserved for lremovexattr */
     	.long SYMBOL_NAME(sys_ni_syscall)	/* reserved for fremovexattr */
    +	.long SYMBOL_NAME(sys_get_large_pages)	/* Get large_page pages */
    +	.long SYMBOL_NAME(sys_free_large_pages)	/* Free large_page pages */
    +	.long SYMBOL_NAME(sys_share_large_pages)/* Share large_page pages */
    +	.long SYMBOL_NAME(sys_unshare_large_pages)/* UnShare large_page pages */


Must large pages be allocated this way?

At some point I would like to see code that mmap's large amounts of
data (over 1GB) and have it take advantage of this once the kernel is
potentially extended to deal with mapping of large and/or variable
sized pages backed to disk.

Also, some scientific applications will malloc(3) gobs of ram, again
in excess of 1GB, is it unreasonable to expect that the kernel will
notice large allocations and try to provide large pages sbrk in
invoked with suitable high values?



  --cw
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: large page patch
@ 2002-08-02  1:34 Seth, Rohit
  0 siblings, 0 replies; 30+ messages in thread
From: Seth, Rohit @ 2002-08-02  1:34 UTC (permalink / raw)
  To: 'David S. Miller', akpm
  Cc: linux-kernel, linux-mm, Seth, Rohit, Saxena, Sunil, Mallick, Asit K

There is typo in Andrew's mail.  It is not 256K, but it is 256MB.

-----Original Message-----
From: David S. Miller [mailto:davem@redhat.com]
Sent: Thursday, August 01, 2002 6:20 PM
To: akpm@zip.com.au
Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
rohit.seth@intel.com; sunil.saxena@intel.com; asit.k.mallick@intel.com
Subject: Re: large page patch


   From: Andrew Morton <akpm@zip.com.au>
   Date: Thu, 01 Aug 2002 18:26:40 -0700

   "David S. Miller" wrote:
   > This is probably done to increase the likelyhood that 4MB page orders
   > are available.  If we collapse 4MB pages deeper, they are less likely
   > to be broken up because smaller orders would be selected first.
   
   This is leakage from ia64, which supports up to 256k pages.

Ummm, 4MB > 256K and even with a 4K PAGE_SIZE MAX_ORDER coalesces
up to 4MB already :-)
   
   Apparently a page-table based representation could not be used by PPC.
   
The page-table is just an abstraction, there is no reason dummy
"large" ptes could not be used which are just ignored by the HW TLB
reload code.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: large page patch
@ 2002-08-02 19:31 Seth, Rohit
  0 siblings, 0 replies; 30+ messages in thread
From: Seth, Rohit @ 2002-08-02 19:31 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: gh, riel, akpm, Seth, Rohit, Saxena, Sunil, Mallick, Asit K,
	David S. Miller, 'davidm@hpl.hp.com'

We agree that there are few different ways to get this support implemented
in base kernel.  Also, the extent to which this support needs to go is also
debatable (like whether the large_pages could be made swapable etc.)  Just
to give little history, we also started with prototyping changes in kernel
that would get the large page support transparent to end user (as we wanted
to see the benefit of large apps like databases, spec benchmark and HPC
applications using different page sizes on IA-64).  And under some
conditions automagically user start using large pages for shm and private
anonymous pages.  But we would call this at best a kludge because there are
quite a number of conditions in these execution paths that one has to do
differently for large_pages.  For example,
make_pages_present/handle_mm_fault for anonymous or shmem type of pages need
to be modified to embed the knowledge of different page size in generic
kernel. Also, there are places where semantics of changes may not completely
match.  For example, doing a shm_lock/unlock on these segments were not
exactly doing the expected.  All those extra changes add cost in the normal
execution path (severity could differ from app to app). 

So, we needed to treat the large pages as a special case and want to make
sure that the application that will be using the large pages understand that
these pages are special (avoid transperent usage model until the large pages
are treated the same way as normal pages). This led to cleaner solution
(input for which also came from Linus himself).  The new APIs enable the
kernel to contain the changes to be architecture specific and limited to
very few kernel changes.  And above all it looks so much portable. Fact is,
the initial implementation was done for IA-64 and porting to x86 took couple
of hours. One of the other key advantage is that this design does not tie
the supported large_page size(s) to any specific size in the generic mm
code.  It supports all the underlying architecture supported page sizes
quite independent of generic code.  And architecture dependent code could
support multiple large_page sizes in the same kernel.

We presented our work to Oracle and they were acceptable to the new APIs
(not saying Oracle is the only DB in world that one has to worry about, but
it clearly indicates that the move from shm apis to this new APIs is easy.
Obviously the input from other big app vendors will be highly appreciated.).

Sceintific apps people who have the sources should also like this approach,
as there changes will be even more trivial (changes to malloc).  And above
all, for those people who really want to get this extra buck transparently,
the changes could be done to user land libraries to selectively map to these
new APIs.  LD_PRELOAD could be another way to do.  Ofcourse, there will be
changes that need to be done in user land.  But they are self contained
changes.  And one of the key point is that application knows what it is
demanding/getting form kernel.

Now to the point where the large_pages themselves could be made swapable. In
our opinion (and this may not be this API dependent), it is not a good idea
to look at these pages as swapable candidates.  Most of the big apps who are
going to use this feature will use them for the data that they really need
available all the time (prefereably in RAM if not on caches :-)).  And the
sysadm could easily configure the amount of large mem pool as per the needs
for a specific environment.

To the point where the whole kernel starts supporting (as David Mosberger
refered) superpages where support is built in kernel to basically treat
superpages as just another size the whole kernel supports will be great too.
But those need quite a lot of exhaustive changes in kernel layers as weill
as lot of tuning.....may be a little further away in future.

thanks,
asit & rohit
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2002-08-02 23:40 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-08-02  0:37 large page patch Andrew Morton
2002-08-02  0:43 ` David S. Miller
2002-08-02  1:26   ` Andrew Morton
2002-08-02  1:19     ` David S. Miller
2002-08-02  1:55   ` Rik van Riel
2002-08-02  1:50     ` David S. Miller
2002-08-02  2:29     ` Gerrit Huizenga
2002-08-02  2:23       ` David S. Miller
2002-08-02  2:53         ` Gerrit Huizenga
2002-08-02  5:24       ` David Mosberger
2002-08-02  5:20         ` David S. Miller
2002-08-02  6:26           ` David Mosberger
2002-08-02  6:33             ` Martin J. Bligh
2002-08-02  6:44               ` David Mosberger
2002-08-02 10:00                 ` Marcin Dalecki
2002-08-02  7:08               ` Andrew Morton
2002-08-02  7:15                 ` William Lee Irwin III
2002-08-02  8:20             ` David S. Miller
2002-08-02  9:05               ` Ryan Cumming
2002-08-02  9:06                 ` David S. Miller
2002-08-02 12:52                 ` Rik van Riel
2002-08-02 15:27               ` David Mosberger
2002-08-02  1:09 ` Martin J. Bligh
2002-08-02  1:36 ` Andrew Morton
2002-08-02  4:31   ` Daniel Phillips
2002-08-02  4:47     ` Andrew Morton
2002-08-02  3:47 ` William Lee Irwin III
2002-08-02 23:40 ` Chris Wedgwood
2002-08-02  1:34 Seth, Rohit
2002-08-02 19:31 Seth, Rohit

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox