* [PATCH][0/8] (Targeting 2.6.17) Posix memory locking and balanced mlock-LRU semantic
@ 2006-03-20 13:35 Stone Wang
  2006-03-20 13:41 ` Arjan van de Ven
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Stone Wang @ 2006-03-20 13:35 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 4044 bytes --]

Both a friend of mine (who is working on a DBMS derived from
PostgreSQL) and I have encountered unexpected OOMs with mlock/mlockall.

After careful code reading and testing, I found that the OOMs occur
because the VM's LRU algorithm treats mlocked pages as Active/Inactive,
even though mlocked pages can never be reclaimed.

Mlocking many pages easily causes an imbalance between the LRU and slab:
the VM tends to reclaim from the Active/Inactive lists, most of which are
mlocked pages, so an OOM may be triggered even though there are in fact
enough reclaimable pages in slab.
(Setting a large "vfs_cache_pressure" may help avoid the OOM in this
 situation, but I think it is better to do things right than to depend
 on the "vfs_cache_pressure" tunable.)

We think treating mlocked pages as Active/Inactive is the wrong semantic.
Mlocked pages should not be counted by the page-reclaim algorithm, since
in fact they are never affected by page reclaim.

The following patch tries to fix this, with some additions.

The patch brings Linux the following:
1. Posix mlock/munlock/mlockall/munlockall.
   Brings mlock/munlock/mlockall/munlockall up to the Posix definition:
   transaction-like, just as described in the mlock(2)/munlock(2)/
   mlockall(2)/munlockall(2) manpages. Users of the mlock family of
   system calls thus always have a clear map of their mlocked areas
   (a small userspace sketch follows this list).
2. More consistent LRU semantics in memory management.
   Mlocked pages are placed on a separate LRU list: the wired list.
   These pages take no part in the LRU algorithms, since they can never
   be swapped out until munlocked.
3. Reports the wired (mlocked) page count through /proc/meminfo.
   One line is added to /proc/meminfo, "Wired:     N kB", so Linux
   system administrators and programmers get a clearer map of physical
   memory usage.
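
To illustrate the transaction-like semantic of item 1, here is a minimal
userspace sketch (mine, not part of the patch; the buffer size is
arbitrary). Either the whole range ends up mlocked, or mlock() fails and
the kernel rolls back, leaving no page of the range wired:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 64UL << 20;		/* 64 MB, arbitrary */
	void *buf = malloc(len);

	if (!buf)
		return 1;
	memset(buf, 0, len);			/* touch every page */

	if (mlock(buf, len) != 0) {
		/* Transaction-like: on failure nothing in
		 * [buf, buf + len) remains locked, so the process
		 * always has a clear map of its mlocked areas. */
		perror("mlock");
		free(buf);
		return 1;
	}
	/* ... work on memory guaranteed to stay resident ... */
	munlock(buf, len);			/* unlocks the whole range */
	free(buf);
	return 0;
}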


Test of the patch:

Test environment:
     RHEL4.
     Total physical memory size: 256 MB, no swap.
     One ext3 directory ("/mnt/test") with about 256 thousand small
     files (each 2 kB in size).

Step 1. Run a task that mlocks 220 MB (see the sketch below).
Step 2. Run: "find /mnt/test -size 100"
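
The Step 1 task was essentially equivalent to the following minimal
sketch (the exact test program is not reproduced here; it assumes
sufficient privilege to mlock 220 MB, e.g. root/CAP_IPC_LOCK):

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 220UL << 20;	/* 220 MB */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED || mlock(buf, len) != 0) {
		perror("mmap/mlock");
		return 1;
	}
	memset(buf, 0, len);	/* touch every page (mlock has already
				 * faulted them in; this is just explicit) */
	pause();		/* hold the wired memory while "find" runs */
	return 0;
}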


Case A. Standard kernel.org kernel 2.6.15

Linux soon ran into OOM; memory info at OOM time:

[root@Linux ~]# cat /proc/meminfo
MemTotal:       254248 kB
MemFree:          3144 kB
Buffers:           124 kB
Cached:           1584 kB
SwapCached:          0 kB
Active:         229308 kB
Inactive:          596 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:       254248 kB
LowFree:          3144 kB
SwapTotal:           0 kB
SwapFree:            0 kB
Dirty:               0 kB
Writeback:           0 kB
Mapped:         228556 kB
Slab:            20076 kB
CommitLimit:    127124 kB
Committed_AS:   238424 kB
PageTables:        584 kB
VmallocTotal:   770040 kB
VmallocUsed:       180 kB
VmallocChunk:   769844 kB
HugePages_Total:     0
HugePages_Free:      0
Hugepagesize:     4096 kB


Case B. Patched 2.6.15

No OOM occurred.

[root@Linux ~]# cat /proc/meminfo
MemTotal:       254344 kB
MemFree:          3508 kB
Buffers:          6352 kB
Cached:           2684 kB
SwapCached:          0 kB
Active:           7140 kB
Inactive:         4732 kB
Wired:          225284 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:       254344 kB
LowFree:          3508 kB
SwapTotal:           0 kB
SwapFree:            0 kB
Dirty:              72 kB
Writeback:           0 kB
Mapped:         229208 kB
Slab:            12552 kB
CommitLimit:    127172 kB
Committed_AS:   238168 kB
PageTables:        572 kB
VmallocTotal:   770040 kB
VmallocUsed:       180 kB
VmallocChunk:   769844 kB
HugePages_Total:     0
HugePages_Free:      0
Hugepagesize:     4096 kB


Many thanks to Mel Gorman for his book "Understanding the Linux Virtual
Memory Manager". Thanks also to two other great Linux kernel books: ULK3
and LDD3.

FreeBSD's VM implementation enlightened me; thanks to the FreeBSD developers.

The attachment is the full patch; the following mails are the pieces it
splits into.

Shaoping Wang

[-- Attachment #2: patch-2.6.15-memlock --]
[-- Type: application/octet-stream, Size: 49453 bytes --]

diff -urN --exclude-from=./exclude.files linux-2.6.15/arch/cris/arch-v32/drivers/cryptocop.c /home/backup/linux-2.6.15-release/arch/cris/arch-v32/drivers/cryptocop.c
--- linux-2.6.15/arch/cris/arch-v32/drivers/cryptocop.c	2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/arch/cris/arch-v32/drivers/cryptocop.c	2006-03-06 08:38:48.000000000 -0500
@@ -2724,6 +2724,7 @@
 			     noinpages,
 			     0,  /* read access only for in data */
 			     0, /* no force */
+			     0, /* do not set wire */
 			     inpages,
 			     NULL);
 
@@ -2741,6 +2742,7 @@
 				     nooutpages,
 				     1, /* write access for out data */
 				     0, /* no force */
+				     0, /* do not set wire*/
 				     outpages,
 				     NULL);
 		up_read(&current->mm->mmap_sem);
diff -urN --exclude-from=./exclude.files linux-2.6.15/Documentation/vm/hugetlbpage.txt /home/backup/linux-2.6.15-release/Documentation/vm/hugetlbpage.txt
--- linux-2.6.15/Documentation/vm/hugetlbpage.txt	2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/Documentation/vm/hugetlbpage.txt	2006-03-06 06:30:06.000000000 -0500
@@ -59,7 +59,7 @@
 
 This command will try to configure 20 hugepages in the system.  The success
 or failure of allocation depends on the amount of physically contiguous
-memory that is preset in system at this time.  System administrators may want
+memory that is present in system at this time.  System administrators may want
 to put this command in one of the local rc init file.  This will enable the
 kernel to request huge pages early in the boot process (when the possibility
 of getting physical contiguous pages is still very high).
diff -urN --exclude-from=./exclude.files linux-2.6.15/Documentation/vm/locking /home/backup/linux-2.6.15-release/Documentation/vm/locking
--- linux-2.6.15/Documentation/vm/locking	2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/Documentation/vm/locking	2006-03-07 03:43:44.000000000 -0500
@@ -37,7 +37,7 @@
 4. The exception to this rule is expand_stack, which just
    takes the read lock and the page_table_lock, this is ok
    because it doesn't really modify fields anybody relies on.
-5. You must be able to guarantee that while holding page_table_lock
+5. You must be able to guarantee that while holding mmap_sem
    or page_table_lock of mm A, you will not try to get either lock
    for mm B.
 
diff -urN --exclude-from=./exclude.files linux-2.6.15/drivers/infiniband/core/uverbs_mem.c /home/backup/linux-2.6.15-release/drivers/infiniband/core/uverbs_mem.c
--- linux-2.6.15/drivers/infiniband/core/uverbs_mem.c	2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/drivers/infiniband/core/uverbs_mem.c	2006-03-06 08:40:06.000000000 -0500
@@ -110,7 +110,7 @@
 		ret = get_user_pages(current, current->mm, cur_base,
 				     min_t(int, npages,
 					   PAGE_SIZE / sizeof (struct page *)),
-				     1, !write, page_list, NULL);
+				     1, !write, 0, page_list, NULL);
 
 		if (ret < 0)
 			goto out;
diff -urN --exclude-from=./exclude.files linux-2.6.15/drivers/infiniband/hw/mthca/mthca_memfree.c /home/backup/linux-2.6.15-release/drivers/infiniband/hw/mthca/mthca_memfree.c
--- linux-2.6.15/drivers/infiniband/hw/mthca/mthca_memfree.c	2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/drivers/infiniband/hw/mthca/mthca_memfree.c	2006-03-06 08:41:10.000000000 -0500
@@ -396,7 +396,7 @@
 		goto out;
 	}
 
-	ret = get_user_pages(current, current->mm, uaddr & PAGE_MASK, 1, 1, 0,
+	ret = get_user_pages(current, current->mm, uaddr & PAGE_MASK, 1, 1, 0, 0,
 			     &db_tab->page[i].mem.page, NULL);
 	if (ret < 0)
 		goto out;
diff -urN --exclude-from=./exclude.files linux-2.6.15/drivers/media/video/video-buf.c /home/backup/linux-2.6.15-release/drivers/media/video/video-buf.c
--- linux-2.6.15/drivers/media/video/video-buf.c	2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/drivers/media/video/video-buf.c	2006-03-06 08:41:54.000000000 -0500
@@ -149,7 +149,7 @@
 	down_read(&current->mm->mmap_sem);
 	err = get_user_pages(current,current->mm,
 			     data & PAGE_MASK, dma->nr_pages,
-			     rw == READ, 1, /* force */
+			     rw == READ, 1, 0, /* force,do not set wire */
 			     dma->pages, NULL);
 	up_read(&current->mm->mmap_sem);
 	if (err != dma->nr_pages) {
diff -urN --exclude-from=./exclude.files linux-2.6.15/drivers/scsi/sg.c /home/backup/linux-2.6.15-release/drivers/scsi/sg.c
--- linux-2.6.15/drivers/scsi/sg.c	2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/drivers/scsi/sg.c	2006-03-06 08:09:43.000000000 -0500
@@ -1815,6 +1815,7 @@
 		nr_pages,
 		rw == READ,
 		0, /* don't force */
+		0,
 		pages,
 		NULL);
 	up_read(&current->mm->mmap_sem);
diff -urN --exclude-from=./exclude.files linux-2.6.15/drivers/scsi/st.c /home/backup/linux-2.6.15-release/drivers/scsi/st.c
--- linux-2.6.15/drivers/scsi/st.c	2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/drivers/scsi/st.c	2006-03-06 07:57:47.000000000 -0500
@@ -4453,6 +4453,7 @@
 		nr_pages,
 		rw == READ,
 		0, /* don't force */
+		0,
 		pages,
 		NULL);
 	up_read(&current->mm->mmap_sem);
diff -urN --exclude-from=./exclude.files linux-2.6.15/drivers/video/pvr2fb.c /home/backup/linux-2.6.15-release/drivers/video/pvr2fb.c
--- linux-2.6.15/drivers/video/pvr2fb.c	2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/drivers/video/pvr2fb.c	2006-03-06 08:42:53.000000000 -0500
@@ -690,7 +690,7 @@
 	
 	down_read(&current->mm->mmap_sem);
 	ret = get_user_pages(current, current->mm, (unsigned long)buf,
-			     nr_pages, WRITE, 0, pages, NULL);
+			     nr_pages, WRITE, 0, 0, pages, NULL);
 	up_read(&current->mm->mmap_sem);
 
 	if (ret < nr_pages) {
diff -urN --exclude-from=./exclude.files linux-2.6.15/fs/aio.c /home/backup/linux-2.6.15-release/fs/aio.c
--- linux-2.6.15/fs/aio.c	2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/fs/aio.c	2006-03-06 06:30:06.000000000 -0500
@@ -146,7 +146,7 @@
 	dprintk("mmap address: 0x%08lx\n", info->mmap_base);
 	info->nr_pages = get_user_pages(current, ctx->mm,
 					info->mmap_base, nr_pages, 
-					1, 0, info->ring_pages, NULL);
+					1, 0, 0, info->ring_pages, NULL);
 	up_write(&ctx->mm->mmap_sem);
 
 	if (unlikely(info->nr_pages != nr_pages)) {
diff -urN --exclude-from=./exclude.files linux-2.6.15/fs/binfmt_elf.c /home/backup/linux-2.6.15-release/fs/binfmt_elf.c
--- linux-2.6.15/fs/binfmt_elf.c	2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/fs/binfmt_elf.c	2006-03-06 06:30:06.000000000 -0500
@@ -1600,7 +1600,7 @@
 			struct page* page;
 			struct vm_area_struct *vma;
 
-			if (get_user_pages(current, current->mm, addr, 1, 0, 1,
+			if (get_user_pages(current, current->mm, addr, 1, 0, 1, 0,
 						&page, &vma) <= 0) {
 				DUMP_SEEK (file->f_pos + PAGE_SIZE);
 			} else {
diff -urN --exclude-from=./exclude.files linux-2.6.15/fs/bio.c /home/backup/linux-2.6.15-release/fs/bio.c
--- linux-2.6.15/fs/bio.c	2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/fs/bio.c	2006-03-06 06:30:06.000000000 -0500
@@ -608,7 +608,7 @@
 		down_read(&current->mm->mmap_sem);
 		ret = get_user_pages(current, current->mm, uaddr,
 				     local_nr_pages,
-				     write_to_vm, 0, &pages[cur_page], NULL);
+				     write_to_vm, 0, 0, &pages[cur_page], NULL);
 		up_read(&current->mm->mmap_sem);
 
 		if (ret < local_nr_pages)
diff -urN --exclude-from=./exclude.files linux-2.6.15/fs/direct-io.c /home/backup/linux-2.6.15-release/fs/direct-io.c
--- linux-2.6.15/fs/direct-io.c	2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/fs/direct-io.c	2006-03-06 06:30:06.000000000 -0500
@@ -157,6 +157,7 @@
 		nr_pages,			/* How many pages? */
 		dio->rw == READ,		/* Write to memory? */
 		0,				/* force (?) */
+		0,
 		&dio->pages[0],
 		NULL);				/* vmas */
 	up_read(&current->mm->mmap_sem);
diff -urN --exclude-from=./exclude.files linux-2.6.15/fs/fuse/dev.c /home/backup/linux-2.6.15-release/fs/fuse/dev.c
--- linux-2.6.15/fs/fuse/dev.c	2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/fs/fuse/dev.c	2006-03-06 06:30:07.000000000 -0500
@@ -462,7 +462,7 @@
 		cs->nr_segs --;
 	}
 	down_read(&current->mm->mmap_sem);
-	err = get_user_pages(current, current->mm, cs->addr, 1, cs->write, 0,
+	err = get_user_pages(current, current->mm, cs->addr, 1, cs->write, 0, 0,
 			     &cs->pg, NULL);
 	up_read(&current->mm->mmap_sem);
 	if (err < 0)
diff -urN --exclude-from=./exclude.files linux-2.6.15/fs/fuse/file.c /home/backup/linux-2.6.15-release/fs/fuse/file.c
--- linux-2.6.15/fs/fuse/file.c	2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/fs/fuse/file.c	2006-03-06 06:30:07.000000000 -0500
@@ -457,7 +457,7 @@
 	npages = min(npages, FUSE_MAX_PAGES_PER_REQ);
 	down_read(&current->mm->mmap_sem);
 	npages = get_user_pages(current, current->mm, user_addr, npages, write,
-				0, req->pages, NULL);
+				0, 0, req->pages, NULL);
 	up_read(&current->mm->mmap_sem);
 	if (npages < 0)
 		return npages;
diff -urN --exclude-from=./exclude.files linux-2.6.15/fs/nfs/direct.c /home/backup/linux-2.6.15-release/fs/nfs/direct.c
--- linux-2.6.15/fs/nfs/direct.c	2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/fs/nfs/direct.c	2006-03-06 06:30:07.000000000 -0500
@@ -104,7 +104,7 @@
 	if (*pages) {
 		down_read(&current->mm->mmap_sem);
 		result = get_user_pages(current, current->mm, user_addr,
-					page_count, (rw == READ), 0,
+					page_count, (rw == READ), 0, 0,
 					*pages, NULL);
 		up_read(&current->mm->mmap_sem);
 	}
diff -urN --exclude-from=./exclude.files linux-2.6.15/fs/proc/proc_misc.c /home/backup/linux-2.6.15-release/fs/proc/proc_misc.c
--- linux-2.6.15/fs/proc/proc_misc.c	2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/fs/proc/proc_misc.c	2006-03-06 06:44:50.000000000 -0500
@@ -123,6 +123,7 @@
 	struct page_state ps;
 	unsigned long inactive;
 	unsigned long active;
+	unsigned long wired;
 	unsigned long free;
 	unsigned long committed;
 	unsigned long allowed;
@@ -130,7 +131,7 @@
 	long cached;
 
 	get_page_state(&ps);
-	get_zone_counts(&active, &inactive, &free);
+	get_zone_counts(&active, &inactive, &wired, &free);
 
 /*
  * display in kilobytes.
@@ -159,6 +160,7 @@
 		"SwapCached:   %8lu kB\n"
 		"Active:       %8lu kB\n"
 		"Inactive:     %8lu kB\n"
+		"Wired:        %8lu kB\n"
 		"HighTotal:    %8lu kB\n"
 		"HighFree:     %8lu kB\n"
 		"LowTotal:     %8lu kB\n"
@@ -182,6 +184,7 @@
 		K(total_swapcache_pages),
 		K(active),
 		K(inactive),
+		K(wired),
 		K(i.totalhigh),
 		K(i.freehigh),
 		K(i.totalram-i.totalhigh),
diff -urN --exclude-from=./exclude.files linux-2.6.15/include/linux/mm.h /home/backup/linux-2.6.15-release/include/linux/mm.h
--- linux-2.6.15/include/linux/mm.h	2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/include/linux/mm.h	2006-03-07 01:49:12.000000000 -0500
@@ -59,6 +59,9 @@
 	unsigned long vm_start;		/* Our start address within vm_mm. */
 	unsigned long vm_end;		/* The first byte after our end address
 					   within vm_mm. */
+	int vm_wire_change;			/* The VM_LOCKED bit of vm_flags was just changed;
+								 * used to support rollback in the sys_mlock
+								 * family of system calls. */
 
 	/* linked list of VM areas per task, sorted by address */
 	struct vm_area_struct *vm_next;
@@ -218,6 +221,10 @@
 	unsigned long flags;		/* Atomic flags, some possibly
 					 * updated asynchronously */
 	atomic_t _count;		/* Usage count, see below. */
+	unsigned short wired_count; /* Count of wirings of the page.
+					 * If nonzero, the page is SetPageWired
+					 * and placed on the zone's wired list.
+					 */
 	atomic_t _mapcount;		/* Count of ptes mapped in mms,
 					 * to show when page is mapped
 					 * & limit reverse map searches.
@@ -699,12 +706,13 @@
 	return __handle_mm_fault(mm, vma, address, write_access) & (~VM_FAULT_WRITE);
 }
 
-extern int make_pages_present(unsigned long addr, unsigned long end);
+extern int make_pages_wired(unsigned long addr, unsigned long end);
+void make_pages_unwired(struct mm_struct *mm, unsigned long addr, unsigned long end);
 extern int access_process_vm(struct task_struct *tsk, unsigned long addr, void *buf, int len, int write);
 void install_arg_page(struct vm_area_struct *, struct page *, unsigned long);
 
 int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, unsigned long start,
-		int len, int write, int force, struct page **pages, struct vm_area_struct **vmas);
+		int len, int write, int force, int wire, struct page **pages, struct vm_area_struct **vmas);
 void print_bad_pte(struct vm_area_struct *, pte_t, unsigned long);
 
 int __set_page_dirty_buffers(struct page *page);
diff -urN --exclude-from=./exclude.files linux-2.6.15/include/linux/mm_inline.h /home/backup/linux-2.6.15-release/include/linux/mm_inline.h
--- linux-2.6.15/include/linux/mm_inline.h	2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/include/linux/mm_inline.h	2006-03-07 01:56:10.000000000 -0500
@@ -1,3 +1,9 @@
+/*
+ * There are three per-zone LRU lists:
+ *			Active: pages accessed more frequently.
+ *			Inactive: pages accessed less frequently.
+ *			Wired: pages mlocked by some task.
+ */
 
 static inline void
 add_page_to_active_list(struct zone *zone, struct page *page)
@@ -14,6 +20,13 @@
 }
 
 static inline void
+add_page_to_wired_list(struct zone *zone, struct page *page)
+{
+	list_add(&page->lru, &zone->wired_list);
+	zone->nr_wired++;
+}
+
+static inline void
 del_page_from_active_list(struct zone *zone, struct page *page)
 {
 	list_del(&page->lru);
@@ -28,10 +41,20 @@
 }
 
 static inline void
+del_page_from_wired_list(struct zone *zone, struct page *page)
+{
+	list_del(&page->lru);
+	zone->nr_wired--;
+}
+
+static inline void
 del_page_from_lru(struct zone *zone, struct page *page)
 {
 	list_del(&page->lru);
-	if (PageActive(page)) {
+	if(PageWired(page)){
+		ClearPageWired(page);
+		zone->nr_wired--;
+	} else if (PageActive(page)) {
 		ClearPageActive(page);
 		zone->nr_active--;
 	} else {
diff -urN --exclude-from=./exclude.files linux-2.6.15/include/linux/mmzone.h /home/backup/linux-2.6.15-release/include/linux/mmzone.h
--- linux-2.6.15/include/linux/mmzone.h	2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/include/linux/mmzone.h	2006-03-07 01:58:26.000000000 -0500
@@ -143,10 +143,12 @@
 	spinlock_t		lru_lock;	
 	struct list_head	active_list;
 	struct list_head	inactive_list;
+	struct list_head	wired_list; /* Pages wired. */
 	unsigned long		nr_scan_active;
 	unsigned long		nr_scan_inactive;
 	unsigned long		nr_active;
 	unsigned long		nr_inactive;
+	unsigned long		nr_wired;
 	unsigned long		pages_scanned;	   /* since last reclaim */
 	int			all_unreclaimable; /* All pages pinned */
 
@@ -315,9 +317,9 @@
 extern struct pglist_data *pgdat_list;
 
 void __get_zone_counts(unsigned long *active, unsigned long *inactive,
-			unsigned long *free, struct pglist_data *pgdat);
+		unsigned long *wired, unsigned long *free, struct pglist_data *pgdat);
 void get_zone_counts(unsigned long *active, unsigned long *inactive,
-			unsigned long *free);
+		unsigned long *wired, unsigned long *free);
 void build_all_zonelists(void);
 void wakeup_kswapd(struct zone *zone, int order);
 int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
diff -urN --exclude-from=./exclude.files linux-2.6.15/include/linux/page-flags.h /home/backup/linux-2.6.15-release/include/linux/page-flags.h
--- linux-2.6.15/include/linux/page-flags.h	2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/include/linux/page-flags.h	2006-03-06 06:30:07.000000000 -0500
@@ -76,6 +76,8 @@
 #define PG_nosave_free		18	/* Free, should not be written */
 #define PG_uncached		19	/* Page has been mapped as uncached */
 
+#define PG_wired		20  /* Page is on Wired list */
+
 /*
  * Global page accounting.  One instance per CPU.  Only unsigned longs are
  * allowed.
@@ -198,7 +200,14 @@
 #define __ClearPageDirty(page)	__clear_bit(PG_dirty, &(page)->flags)
 #define TestClearPageDirty(page) test_and_clear_bit(PG_dirty, &(page)->flags)
 
+#define SetPageWired(page)	set_bit(PG_wired, &(page)->flags)
+#define ClearPageWired(page) clear_bit(PG_wired,&(page)->flags)
+#define PageWired(page)		test_bit(PG_wired, &(page)->flags)
+#define TestSetPageWired(page)	test_and_set_bit(PG_wired, &(page)->flags)
+#define TestClearPageWired(page)	test_and_clear_bit(PG_wired, &(page)->flags)
+
 #define SetPageLRU(page)	set_bit(PG_lru, &(page)->flags)
+#define ClearPageLRU(page)	clear_bit(PG_lru, &(page)->flags)
 #define PageLRU(page)		test_bit(PG_lru, &(page)->flags)
 #define TestSetPageLRU(page)	test_and_set_bit(PG_lru, &(page)->flags)
 #define TestClearPageLRU(page)	test_and_clear_bit(PG_lru, &(page)->flags)
diff -urN --exclude-from=./exclude.files linux-2.6.15/include/linux/rmap.h /home/backup/linux-2.6.15-release/include/linux/rmap.h
--- linux-2.6.15/include/linux/rmap.h	2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/include/linux/rmap.h	2006-03-06 06:30:07.000000000 -0500
@@ -71,8 +71,8 @@
  * rmap interfaces called when adding or removing pte of page
  */
 void page_add_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
-void page_add_file_rmap(struct page *);
-void page_remove_rmap(struct page *);
+void page_add_file_rmap(struct page *, struct vm_area_struct *);
+void page_remove_rmap(struct page *, struct vm_area_struct *);
 
 /**
  * page_dup_rmap - duplicate pte mapping to a page
diff -urN --exclude-from=./exclude.files linux-2.6.15/include/linux/swap.h /home/backup/linux-2.6.15-release/include/linux/swap.h
--- linux-2.6.15/include/linux/swap.h	2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/include/linux/swap.h	2006-03-06 06:30:07.000000000 -0500
@@ -165,6 +165,8 @@
 extern void FASTCALL(lru_cache_add(struct page *));
 extern void FASTCALL(lru_cache_add_active(struct page *));
 extern void FASTCALL(activate_page(struct page *));
+extern void FASTCALL(wire_page(struct page *));
+extern void FASTCALL(unwire_page(struct page *));
 extern void FASTCALL(mark_page_accessed(struct page *));
 extern void lru_add_drain(void);
 extern int rotate_reclaimable_page(struct page *page);
diff -urN --exclude-from=./exclude.files linux-2.6.15/kernel/futex.c /home/backup/linux-2.6.15-release/kernel/futex.c
--- linux-2.6.15/kernel/futex.c	2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/kernel/futex.c	2006-03-06 06:30:07.000000000 -0500
@@ -201,7 +201,7 @@
 	 * from swap.  But that's a lot of code to duplicate here
 	 * for a rare case, so we simply fetch the page.
 	 */
-	err = get_user_pages(current, mm, uaddr, 1, 0, 0, &page, NULL);
+	err = get_user_pages(current, mm, uaddr, 1, 0, 0, 0, &page, NULL);
 	if (err >= 0) {
 		key->shared.pgoff =
 			page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
diff -urN --exclude-from=./exclude.files linux-2.6.15/kernel/ptrace.c /home/backup/linux-2.6.15-release/kernel/ptrace.c
--- linux-2.6.15/kernel/ptrace.c	2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/kernel/ptrace.c	2006-03-06 06:30:07.000000000 -0500
@@ -228,7 +228,7 @@
 		void *maddr;
 
 		ret = get_user_pages(tsk, mm, addr, 1,
-				write, 1, &page, &vma);
+				write, 1, 0, &page, &vma);
 		if (ret <= 0)
 			break;
 
diff -urN --exclude-from=./exclude.files linux-2.6.15/mm/filemap_xip.c /home/backup/linux-2.6.15-release/mm/filemap_xip.c
--- linux-2.6.15/mm/filemap_xip.c	2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/mm/filemap_xip.c	2006-03-06 06:30:07.000000000 -0500
@@ -189,7 +189,7 @@
 			/* Nuke the page table entry. */
 			flush_cache_page(vma, address, pte_pfn(*pte));
 			pteval = ptep_clear_flush(vma, address, pte);
-			page_remove_rmap(page);
+			page_remove_rmap(page, vma);
 			dec_mm_counter(mm, file_rss);
 			BUG_ON(pte_dirty(pteval));
 			pte_unmap_unlock(pte, ptl);
diff -urN --exclude-from=./exclude.files linux-2.6.15/mm/fremap.c /home/backup/linux-2.6.15-release/mm/fremap.c
--- linux-2.6.15/mm/fremap.c	2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/mm/fremap.c	2006-03-06 06:30:07.000000000 -0500
@@ -33,7 +33,7 @@
 		if (page) {
 			if (pte_dirty(pte))
 				set_page_dirty(page);
-			page_remove_rmap(page);
+			page_remove_rmap(page, vma);
 			page_cache_release(page);
 		}
 	} else {
@@ -80,7 +80,7 @@
 
 	flush_icache_page(vma, page);
 	set_pte_at(mm, addr, pte, mk_pte(page, prot));
-	page_add_file_rmap(page);
+	page_add_file_rmap(page, vma);
 	pte_val = *pte;
 	update_mmu_cache(vma, addr, pte_val);
 	err = 0;
@@ -203,6 +203,8 @@
 			spin_unlock(&mapping->i_mmap_lock);
 		}
 
+		if(vma->vm_flags & VM_LOCKED)
+			flags &= ~MAP_NONBLOCK;
 		err = vma->vm_ops->populate(vma, start, size,
 					    vma->vm_page_prot,
 					    pgoff, flags & MAP_NONBLOCK);
diff -urN --exclude-from=./exclude.files linux-2.6.15/mm/memory.c /home/backup/linux-2.6.15-release/mm/memory.c
--- linux-2.6.15/mm/memory.c	2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/mm/memory.c	2006-03-07 11:14:59.000000000 -0500
@@ -656,7 +656,7 @@
 					mark_page_accessed(page);
 				file_rss--;
 			}
-			page_remove_rmap(page);
+			page_remove_rmap(page, vma);
 			tlb_remove_page(tlb, page);
 			continue;
 		}
@@ -950,8 +950,30 @@
 	return page;
 }
 
+void make_pages_unwired(struct mm_struct *mm,
+					unsigned long start,unsigned long end)
+{
+	struct vm_area_struct *vma;
+	struct page *page;
+	unsigned int foll_flags;
+
+	foll_flags =0;
+
+	vma=find_vma(mm,start);
+	if(!vma)
+		BUG();
+	if(is_vm_hugetlb_page(vma))
+		return;
+	
+	for(; start<end ; start+=PAGE_SIZE) {
+		page=follow_page(vma,start,foll_flags);
+		if(page)
+			unwire_page(page);
+	}
+}
+
 int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
-		unsigned long start, int len, int write, int force,
+		unsigned long start, int len, int write,int force, int wire,
 		struct page **pages, struct vm_area_struct **vmas)
 {
 	int i;
@@ -973,6 +995,7 @@
 		if (!vma && in_gate_area(tsk, start)) {
 			unsigned long pg = start & PAGE_MASK;
 			struct vm_area_struct *gate_vma = get_gate_vma(tsk);
+			struct page *page;	
 			pgd_t *pgd;
 			pud_t *pud;
 			pmd_t *pmd;
@@ -994,6 +1017,7 @@
 				pte_unmap(pte);
 				return i ? : -EFAULT;
 			}
+			page = vm_normal_page(gate_vma, start, *pte);
 			if (pages) {
 				struct page *page = vm_normal_page(gate_vma, start, *pte);
 				pages[i] = page;
@@ -1003,9 +1027,12 @@
 			pte_unmap(pte);
 			if (vmas)
 				vmas[i] = gate_vma;
+			if(wire)
+				wire_page(page);
 			i++;
 			start += PAGE_SIZE;
 			len--;
+
 			continue;
 		}
 
@@ -1013,6 +1040,7 @@
 				|| !(vm_flags & vma->vm_flags))
 			return i ? : -EFAULT;
 
+		/* We don't account wired HugeTLB pages. */
 		if (is_vm_hugetlb_page(vma)) {
 			i = follow_hugetlb_page(mm, vma, pages, vmas,
 						&start, &len, i);
@@ -1067,17 +1095,20 @@
 			}
 			if (vmas)
 				vmas[i] = vma;
+			if(wire)
+				wire_page(page);
 			i++;
 			start += PAGE_SIZE;
 			len--;
 		} while (len && start < vma->vm_end);
 	} while (len);
+
 	return i;
 }
 EXPORT_SYMBOL(get_user_pages);
 
-static int zeromap_pte_range(struct mm_struct *mm, pmd_t *pmd,
-			unsigned long addr, unsigned long end, pgprot_t prot)
+static int zeromap_pte_range(struct mm_struct *mm, struct vm_area_struct *vma,
+			pmd_t *pmd, unsigned long addr, unsigned long end, pgprot_t prot)
 {
 	pte_t *pte;
 	spinlock_t *ptl;
@@ -1089,7 +1120,7 @@
 		struct page *page = ZERO_PAGE(addr);
 		pte_t zero_pte = pte_wrprotect(mk_pte(page, prot));
 		page_cache_get(page);
-		page_add_file_rmap(page);
+		page_add_file_rmap(page,vma);
 		inc_mm_counter(mm, file_rss);
 		BUG_ON(!pte_none(*pte));
 		set_pte_at(mm, addr, pte, zero_pte);
@@ -1098,8 +1129,8 @@
 	return 0;
 }
 
-static inline int zeromap_pmd_range(struct mm_struct *mm, pud_t *pud,
-			unsigned long addr, unsigned long end, pgprot_t prot)
+static inline int zeromap_pmd_range(struct mm_struct *mm, struct vm_area_struct *vma,
+			 pud_t *pud, unsigned long addr, unsigned long end, pgprot_t prot)
 {
 	pmd_t *pmd;
 	unsigned long next;
@@ -1109,14 +1140,14 @@
 		return -ENOMEM;
 	do {
 		next = pmd_addr_end(addr, end);
-		if (zeromap_pte_range(mm, pmd, addr, next, prot))
+		if (zeromap_pte_range(mm, vma, pmd, addr, next, prot))
 			return -ENOMEM;
 	} while (pmd++, addr = next, addr != end);
 	return 0;
 }
 
-static inline int zeromap_pud_range(struct mm_struct *mm, pgd_t *pgd,
-			unsigned long addr, unsigned long end, pgprot_t prot)
+static inline int zeromap_pud_range(struct mm_struct *mm, struct vm_area_struct *vma,
+			pgd_t *pgd, unsigned long addr, unsigned long end, pgprot_t prot)
 {
 	pud_t *pud;
 	unsigned long next;
@@ -1126,7 +1157,7 @@
 		return -ENOMEM;
 	do {
 		next = pud_addr_end(addr, end);
-		if (zeromap_pmd_range(mm, pud, addr, next, prot))
+		if (zeromap_pmd_range(mm, vma, pud, addr, next, prot))
 			return -ENOMEM;
 	} while (pud++, addr = next, addr != end);
 	return 0;
@@ -1146,7 +1177,7 @@
 	flush_cache_range(vma, addr, end);
 	do {
 		next = pgd_addr_end(addr, end);
-		err = zeromap_pud_range(mm, pgd, addr, next, prot);
+		err = zeromap_pud_range(mm, vma, pgd, addr, next, prot);
 		if (err)
 			break;
 	} while (pgd++, addr = next, addr != end);
@@ -1172,7 +1203,8 @@
  * old drivers should use this, and they needed to mark their
  * pages reserved for the old functions anyway.
  */
-static int insert_page(struct mm_struct *mm, unsigned long addr, struct page *page, pgprot_t prot)
+static int insert_page(struct mm_struct *mm, struct vm_area_struct *vma, 
+			unsigned long addr, struct page *page, pgprot_t prot)
 {
 	int retval;
 	pte_t *pte;
@@ -1193,7 +1225,7 @@
 	/* Ok, finally just insert the thing.. */
 	get_page(page);
 	inc_mm_counter(mm, file_rss);
-	page_add_file_rmap(page);
+	page_add_file_rmap(page,vma);
 	set_pte_at(mm, addr, pte, mk_pte(page, prot));
 
 	retval = 0;
@@ -1229,7 +1261,7 @@
 	if (!page_count(page))
 		return -EINVAL;
 	vma->vm_flags |= VM_INSERTPAGE;
-	return insert_page(vma->vm_mm, addr, page, vma->vm_page_prot);
+	return insert_page(vma->vm_mm, vma, addr, page, vma->vm_page_prot);
 }
 EXPORT_SYMBOL(vm_insert_page);
 
@@ -1484,7 +1516,7 @@
 	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
 	if (likely(pte_same(*page_table, orig_pte))) {
 		if (old_page) {
-			page_remove_rmap(old_page);
+			page_remove_rmap(old_page, vma);
 			if (!PageAnon(old_page)) {
 				dec_mm_counter(mm, file_rss);
 				inc_mm_counter(mm, anon_rss);
@@ -1967,7 +1999,7 @@
 		if (!pte_none(*page_table))
 			goto release;
 		inc_mm_counter(mm, file_rss);
-		page_add_file_rmap(page);
+		page_add_file_rmap(page, vma);
 	}
 
 	set_pte_at(mm, address, page_table, entry);
@@ -2089,7 +2121,7 @@
 			page_add_anon_rmap(new_page, vma, address);
 		} else {
 			inc_mm_counter(mm, file_rss);
-			page_add_file_rmap(new_page);
+			page_add_file_rmap(new_page, vma);
 		}
 	} else {
 		/* One of our sibling threads was faster, back out. */
@@ -2306,10 +2338,13 @@
 }
 #endif /* __PAGETABLE_PMD_FOLDED */
 
-int make_pages_present(unsigned long addr, unsigned long end)
+int make_pages_wired(unsigned long addr, unsigned long end)
 {
 	int ret, len, write;
+	struct page *page;
 	struct vm_area_struct * vma;
+	struct mm_struct *mm=current->mm;
+	int wire_change;
 
 	vma = find_vma(current->mm, addr);
 	if (!vma)
@@ -2320,13 +2355,26 @@
 	if (end > vma->vm_end)
 		BUG();
 	len = (end+PAGE_SIZE-1)/PAGE_SIZE-addr/PAGE_SIZE;
-	ret = get_user_pages(current, current->mm, addr,
-			len, write, 0, NULL, NULL);
-	if (ret < 0)
-		return ret;
-	return ret == len ? 0 : -1;
+	wire_change = vma->vm_wire_change;
+	vma->vm_wire_change = 1;
+	ret = get_user_pages(current, mm, addr,
+			len, write, 1, 1, NULL, NULL); /* write,set_wire */
+	vma->vm_wire_change = wire_change;
+	if(ret < len) {
+	    for(; addr< end ; addr += PAGE_SIZE) {
+        	page=follow_page(vma,addr,0);
+            if(page)
+				unwire_page(page);
+			else
+				BUG();
+   		}
+		return -1;
+	} 
+	else
+		return 0;
 }
 
+
 /* 
  * Map a vmalloc()-space virtual address to the physical page.
  */
diff -urN --exclude-from=./exclude.files linux-2.6.15/mm/mempolicy.c /home/backup/linux-2.6.15-release/mm/mempolicy.c
--- linux-2.6.15/mm/mempolicy.c	2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/mm/mempolicy.c	2006-03-06 06:30:08.000000000 -0500
@@ -440,7 +440,7 @@
 	struct page *p;
 	int err;
 
-	err = get_user_pages(current, mm, addr & PAGE_MASK, 1, 0, 0, &p, NULL);
+	err = get_user_pages(current, mm, addr & PAGE_MASK, 1, 0, 0, 0, &p, NULL);
 	if (err >= 0) {
 		err = page_to_nid(p);
 		put_page(p);
diff -urN --exclude-from=./exclude.files linux-2.6.15/mm/mlock.c /home/backup/linux-2.6.15-release/mm/mlock.c
--- linux-2.6.15/mm/mlock.c	2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/mm/mlock.c	2006-03-07 10:50:52.000000000 -0500
@@ -3,6 +3,7 @@
  *
  *  (C) Copyright 1995 Linus Torvalds
  *  (C) Copyright 2002 Christoph Hellwig
+ *  (C) Copyright 2006 Shaoping Wang
  */
 
 #include <linux/mman.h>
@@ -10,72 +11,119 @@
 #include <linux/mempolicy.h>
 #include <linux/syscalls.h>
 
-
-static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
-	unsigned long start, unsigned long end, unsigned int newflags)
+static int do_mlock(unsigned long start, size_t len,unsigned int jump_hole)
 {
-	struct mm_struct * mm = vma->vm_mm;
-	pgoff_t pgoff;
-	int pages;
-	int ret = 0;
+	unsigned long  end=0,vmoff=0;
+	unsigned long  pages=0;
+	struct mm_struct *mm=current->mm;
+	struct vm_area_struct * vma, *prev, **pprev,*next;
+	int ret=0;
 
-	if (newflags == vma->vm_flags) {
-		*prev = vma;
-		goto out;
-	}
+	len = PAGE_ALIGN(len);
+	end = start + len;
+	if (end < start)
+		return -EINVAL;
+	if (end == start)
+		return 0;
 
-	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
-	*prev = vma_merge(mm, *prev, start, end, newflags, vma->anon_vma,
-			  vma->vm_file, pgoff, vma_policy(vma));
-	if (*prev) {
-		vma = *prev;
-		goto success;
+	vma = find_vma_prev(current->mm, start, &prev);
+	if (!vma || vma->vm_start > start)
+		return -ENOMEM;
+    else if (vma->vm_start < start){
+			prev=vma;
+		    ret = split_vma(mm, prev, start, 0);
+			if(!ret)
+				vma=prev->vm_next;
+			else {
+				return ret;
+			}
 	}
 
-	*prev = vma;
-
-	if (start != vma->vm_start) {
-		ret = split_vma(mm, vma, start, 1);
-		if (ret)
+	while(vma->vm_start < end){
+		vmoff =vma->vm_end; /* Record where we have proceeded. */
+		if (vma->vm_end > end){
+    	   	ret = split_vma(mm, vma, end, 0);
+   			if (ret) 
+		   		goto out;
+		}
+		if(vma->vm_flags & VM_LOCKED)
+			goto next;
+		vma->vm_flags |= VM_LOCKED;
+		vma->vm_wire_change =1;
+		pages += (vma->vm_end-vma->vm_start) >> PAGE_SHIFT;
+
+    	if (!(vma->vm_flags & VM_IO)) {
+   			ret = make_pages_wired(vma->vm_start, vma->vm_end);
+			if(ret<0){ 
+				vma->vm_flags &= ~VM_LOCKED;
+				vma->vm_wire_change =0;
+				goto out;
+			}
+		}
+next:
+		if(vma->vm_end ==end)
+			break;
+		prev =vma;
+		vma =vma->vm_next;
+		
+		/* If called from do_mlockall, 
+		 * we may jump over holes. 
+         */
+		if(jump_hole){ 
+			if(vma)
+				continue;
+			else
+				goto out;
+		}
+		else if (!vma || vma->vm_start != prev->vm_end){
+			ret = -ENOMEM;
 			goto out;
+		}
 	}
 
-	if (end != vma->vm_end) {
-		ret = split_vma(mm, vma, end, 0);
-		if (ret)
-			goto out;
-	}
+out:
+	pprev =&prev;
+	vma = find_vma_prev(mm, start, pprev);
 
-success:
-	/*
-	 * vm_flags is protected by the mmap_sem held in write mode.
-	 * It's okay if try_to_unmap_one unmaps a page just after we
-	 * set VM_LOCKED, make_pages_present below will bring it back.
-	 */
-	vma->vm_flags = newflags;
-
-	/*
-	 * Keep track of amount of locked VM.
-	 */
-	pages = (end - start) >> PAGE_SHIFT;
-	if (newflags & VM_LOCKED) {
-		pages = -pages;
-		if (!(newflags & VM_IO))
-			ret = make_pages_present(start, end);
+	/* If an error happened, roll back.
+	 * Whether we succeeded or not, try to merge the vmas.
+     */
+	while( vma && vma->vm_end <= vmoff ){
+		if(vma->vm_wire_change) {
+			if(ret){
+				make_pages_unwired(mm, vma->vm_start, vma->vm_end);
+				vma->vm_flags &= ~VM_LOCKED;
+			}
+			vma->vm_wire_change =0;
+		}
+		next=vma->vm_next;
+		if(next && next->vm_wire_change) {
+			if(ret){
+				make_pages_unwired(mm, next->vm_start, next->vm_end);
+				next->vm_flags &= ~VM_LOCKED;
+			}
+			next->vm_wire_change=0;
+		}
+		*pprev=vma_merge(mm, *pprev, vma->vm_start, vma->vm_end, vma->vm_flags,
+					vma->anon_vma,vma->vm_file, vma->vm_pgoff, vma_policy(vma));
+		if(*pprev)
+			vma =*pprev;
+		vma =vma->vm_next;
 	}
 
-	vma->vm_mm->locked_vm -= pages;
-out:
-	if (ret == -ENOMEM)
-		ret = -EAGAIN;
+	if(!ret)
+		mm->locked_vm += pages;
 	return ret;
 }
 
-static int do_mlock(unsigned long start, size_t len, int on)
+
+static int do_munlock(unsigned long start, size_t len, unsigned int jump_hole)
 {
-	unsigned long nstart, end, tmp;
-	struct vm_area_struct * vma, * prev;
-	int error;
+	unsigned long  end=0,vmoff=0;
+	unsigned long  pages=0;
+	struct mm_struct *mm=current->mm;
+	struct vm_area_struct * vma, *prev, **pprev, *next;
+	int ret=0;
 
 	len = PAGE_ALIGN(len);
 	end = start + len;
@@ -86,38 +134,81 @@
 	vma = find_vma_prev(current->mm, start, &prev);
 	if (!vma || vma->vm_start > start)
 		return -ENOMEM;
+    else if (vma->vm_start < start){
+		prev=vma;
+	    ret = split_vma(mm, prev, start, 0);
+		if(!ret)
+			vma=prev->vm_next;
+		else 
+			return ret;
+	}
 
-	if (start > vma->vm_start)
-		prev = vma;
+	while(vma->vm_start < end){
+		vmoff =vma->vm_end;
+    	if (vma->vm_end > end){
+    	   	ret = split_vma(mm, vma, end, 0);
+   	 		if (ret) 
+	   			goto out;
+		}
 
-	for (nstart = start ; ; ) {
-		unsigned int newflags;
+		if(!(vma->vm_flags & VM_LOCKED))
+			goto next;
 
-		/* Here we know that  vma->vm_start <= nstart < vma->vm_end. */
+		vma->vm_wire_change=1;
+		pages += (vma->vm_end -vma->vm_start) >>PAGE_SHIFT;
 
-		newflags = vma->vm_flags | VM_LOCKED;
-		if (!on)
-			newflags &= ~VM_LOCKED;
-
-		tmp = vma->vm_end;
-		if (tmp > end)
-			tmp = end;
-		error = mlock_fixup(vma, &prev, nstart, tmp, newflags);
-		if (error)
-			break;
-		nstart = tmp;
-		if (nstart < prev->vm_end)
-			nstart = prev->vm_end;
-		if (nstart >= end)
+next:
+		if(vma->vm_end ==end)
 			break;
+		prev =vma;
+		vma =vma->vm_next;
 
-		vma = prev->vm_next;
-		if (!vma || vma->vm_start != nstart) {
-			error = -ENOMEM;
-			break;
+		/* If called from munlockall,
+		 * we may jump over holes.
+		 */
+		if(jump_hole){
+			if(!vma)
+				goto out;
+			else
+				continue;
+		}
+		else if (!vma || (vma->vm_start != prev->vm_end) ){
+			ret= -ENOMEM;
+			goto out;
 		}
 	}
-	return error;
+
+out:
+	pprev =&prev;
+	vma = find_vma_prev(current->mm, start, pprev);
+
+	while( vma && vma->vm_end <= vmoff ){
+			if(!ret && vma->vm_wire_change){
+	    		if (!(vma->vm_flags & VM_IO))
+					make_pages_unwired(mm, vma->vm_start, vma->vm_end);
+				vma->vm_flags &=~VM_LOCKED;
+			}
+			vma->vm_wire_change =0;
+			next = vma->vm_next;
+			if(next){ 
+				if(!ret && next->vm_wire_change){
+		    		if (!(next->vm_flags & VM_IO))
+						make_pages_unwired(mm, next->vm_start,next->vm_end);
+					next->vm_flags &=~VM_LOCKED;
+				}
+				next->vm_wire_change =0;
+			}
+		*pprev =vma_merge(mm, *pprev, vma->vm_start, vma->vm_end, vma->vm_flags,
+		vma->anon_vma,vma->vm_file, vma->vm_pgoff, vma_policy(vma));
+		if(*pprev)
+			vma =*pprev;
+		vma =vma->vm_next;
+	}
+
+	if(!ret)
+		mm->locked_vm -= pages;
+	
+	return ret;
 }
 
 asmlinkage long sys_mlock(unsigned long start, size_t len)
@@ -141,7 +232,7 @@
 
 	/* check against resource limits */
 	if ((locked <= lock_limit) || capable(CAP_IPC_LOCK))
-		error = do_mlock(start, len, 1);
+		error = do_mlock(start, len, 0);
 	up_write(&current->mm->mmap_sem);
 	return error;
 }
@@ -153,33 +244,41 @@
 	down_write(&current->mm->mmap_sem);
 	len = PAGE_ALIGN(len + (start & ~PAGE_MASK));
 	start &= PAGE_MASK;
-	ret = do_mlock(start, len, 0);
+	ret = do_munlock(start, len,0);
 	up_write(&current->mm->mmap_sem);
 	return ret;
 }
 
 static int do_mlockall(int flags)
 {
-	struct vm_area_struct * vma, * prev = NULL;
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct * vma;
 	unsigned int def_flags = 0;
+	unsigned long start;
+	int ret = 0;
 
 	if (flags & MCL_FUTURE)
 		def_flags = VM_LOCKED;
-	current->mm->def_flags = def_flags;
+	mm->def_flags = def_flags;
 	if (flags == MCL_FUTURE)
 		goto out;
+	vma=mm->mmap;
+	start = vma->vm_start;
+	ret=do_mlock(start,TASK_SIZE,1);
+out:
+	return ret;
+}
 
-	for (vma = current->mm->mmap; vma ; vma = prev->vm_next) {
-		unsigned int newflags;
-
-		newflags = vma->vm_flags | VM_LOCKED;
-		if (!(flags & MCL_CURRENT))
-			newflags &= ~VM_LOCKED;
+static int do_munlockall(void)
+{
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct * vma;
+	unsigned long start;
+
+	vma=mm->mmap;
+	start = vma->vm_start;
+	do_munlock(start,TASK_SIZE,1);
 
-		/* Ignore errors */
-		mlock_fixup(vma, &prev, vma->vm_start, vma->vm_end, newflags);
-	}
-out:
 	return 0;
 }
 
@@ -214,7 +313,7 @@
 	int ret;
 
 	down_write(&current->mm->mmap_sem);
-	ret = do_mlockall(0);
+	ret = do_munlockall();
 	up_write(&current->mm->mmap_sem);
 	return ret;
 }
diff -urN --exclude-from=./exclude.files linux-2.6.15/mm/mmap.c /home/backup/linux-2.6.15-release/mm/mmap.c
--- linux-2.6.15/mm/mmap.c	2006-02-17 05:24:09.000000000 -0500
+++ /home/backup/linux-2.6.15-release/mm/mmap.c	2006-03-06 06:30:08.000000000 -0500
@@ -1119,7 +1119,7 @@
 	vm_stat_account(mm, vm_flags, file, len >> PAGE_SHIFT);
 	if (vm_flags & VM_LOCKED) {
 		mm->locked_vm += len >> PAGE_SHIFT;
-		make_pages_present(addr, addr + len);
+		make_pages_wired(addr, addr + len);
 	}
 	if (flags & MAP_POPULATE) {
 		up_write(&mm->mmap_sem);
@@ -1551,7 +1551,7 @@
 	if (!prev || expand_stack(prev, addr))
 		return NULL;
 	if (prev->vm_flags & VM_LOCKED) {
-		make_pages_present(addr, prev->vm_end);
+		make_pages_wired(addr, prev->vm_end);
 	}
 	return prev;
 }
@@ -1614,7 +1614,7 @@
 	if (expand_stack(vma, addr))
 		return NULL;
 	if (vma->vm_flags & VM_LOCKED) {
-		make_pages_present(addr, start);
+		make_pages_wired(addr, start);
 	}
 	return vma;
 }
@@ -1921,7 +1921,7 @@
 	mm->total_vm += len >> PAGE_SHIFT;
 	if (flags & VM_LOCKED) {
 		mm->locked_vm += len >> PAGE_SHIFT;
-		make_pages_present(addr, addr + len);
+		make_pages_wired(addr, addr + len);
 	}
 	return addr;
 }
diff -urN --exclude-from=./exclude.files linux-2.6.15/mm/mremap.c /home/backup/linux-2.6.15-release/mm/mremap.c
--- linux-2.6.15/mm/mremap.c	2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/mm/mremap.c	2006-03-06 06:30:08.000000000 -0500
@@ -230,7 +230,7 @@
 	if (vm_flags & VM_LOCKED) {
 		mm->locked_vm += new_len >> PAGE_SHIFT;
 		if (new_len > old_len)
-			make_pages_present(new_addr + old_len,
+			make_pages_wired(new_addr + old_len,
 					   new_addr + new_len);
 	}
 
@@ -367,7 +367,7 @@
 			vm_stat_account(mm, vma->vm_flags, vma->vm_file, pages);
 			if (vma->vm_flags & VM_LOCKED) {
 				mm->locked_vm += pages;
-				make_pages_present(addr + old_len,
+				make_pages_wired(addr + old_len,
 						   addr + new_len);
 			}
 			ret = addr;
diff -urN --exclude-from=./exclude.files linux-2.6.15/mm/nommu.c /home/backup/linux-2.6.15-release/mm/nommu.c
--- linux-2.6.15/mm/nommu.c	2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/mm/nommu.c	2006-03-06 06:30:08.000000000 -0500
@@ -124,7 +124,7 @@
  * The nommu dodgy version :-)
  */
 int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
-	unsigned long start, int len, int write, int force,
+	unsigned long start, int len, int write, int force, int wire,
 	struct page **pages, struct vm_area_struct **vmas)
 {
 	int i;
diff -urN --exclude-from=./exclude.files linux-2.6.15/mm/page_alloc.c /home/backup/linux-2.6.15-release/mm/page_alloc.c
--- linux-2.6.15/mm/page_alloc.c	2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/mm/page_alloc.c	2006-03-06 06:30:08.000000000 -0500
@@ -348,7 +348,8 @@
 			1 << PG_slab	|
 			1 << PG_swapcache |
 			1 << PG_writeback |
-			1 << PG_reserved )))
+			1 << PG_reserved |
+			1 << PG_wired )))
 		bad_page(function, page);
 	if (PageDirty(page))
 		__ClearPageDirty(page);
@@ -481,6 +482,7 @@
 			1 << PG_private	|
 			1 << PG_locked	|
 			1 << PG_active	|
+			1 << PG_wired   |
 			1 << PG_dirty	|
 			1 << PG_reclaim	|
 			1 << PG_slab    |
@@ -1252,35 +1254,39 @@
 EXPORT_SYMBOL(__mod_page_state);
 
 void __get_zone_counts(unsigned long *active, unsigned long *inactive,
-			unsigned long *free, struct pglist_data *pgdat)
+			unsigned long *wired,unsigned long *free, struct pglist_data *pgdat)
 {
 	struct zone *zones = pgdat->node_zones;
 	int i;
 
 	*active = 0;
 	*inactive = 0;
+	*wired = 0;
 	*free = 0;
 	for (i = 0; i < MAX_NR_ZONES; i++) {
 		*active += zones[i].nr_active;
 		*inactive += zones[i].nr_inactive;
+		*wired += zones[i].nr_wired;
 		*free += zones[i].free_pages;
 	}
 }
 
 void get_zone_counts(unsigned long *active,
-		unsigned long *inactive, unsigned long *free)
+		unsigned long *inactive, unsigned long *wired, unsigned long *free)
 {
 	struct pglist_data *pgdat;
 
 	*active = 0;
 	*inactive = 0;
+	*wired = 0;
 	*free = 0;
 	for_each_pgdat(pgdat) {
-		unsigned long l, m, n;
-		__get_zone_counts(&l, &m, &n, pgdat);
+		unsigned long l, m, n, o;
+		__get_zone_counts(&l, &m, &n, &o, pgdat);
 		*active += l;
 		*inactive += m;
-		*free += n;
+		*wired += n;
+		*free += o;
 	}
 }
 
@@ -1328,6 +1334,7 @@
 	int cpu, temperature;
 	unsigned long active;
 	unsigned long inactive;
+	unsigned long wired;
 	unsigned long free;
 	struct zone *zone;
 
@@ -1358,16 +1365,17 @@
 	}
 
 	get_page_state(&ps);
-	get_zone_counts(&active, &inactive, &free);
+	get_zone_counts(&active, &inactive, &wired, &free);
 
 	printk("Free pages: %11ukB (%ukB HighMem)\n",
 		K(nr_free_pages()),
 		K(nr_free_highpages()));
 
-	printk("Active:%lu inactive:%lu dirty:%lu writeback:%lu "
+	printk("Active:%lu inactive:%lu wired:%lu dirty:%lu writeback:%lu "
 		"unstable:%lu free:%u slab:%lu mapped:%lu pagetables:%lu\n",
 		active,
 		inactive,
+		wired,
 		ps.nr_dirty,
 		ps.nr_writeback,
 		ps.nr_unstable,
@@ -1387,6 +1395,7 @@
 			" high:%lukB"
 			" active:%lukB"
 			" inactive:%lukB"
+			" wired:%lukB"
 			" present:%lukB"
 			" pages_scanned:%lu"
 			" all_unreclaimable? %s"
@@ -1398,6 +1407,7 @@
 			K(zone->pages_high),
 			K(zone->nr_active),
 			K(zone->nr_inactive),
+			K(zone->nr_wired),
 			K(zone->present_pages),
 			zone->pages_scanned,
 			(zone->all_unreclaimable ? "yes" : "no")
@@ -2009,10 +2019,12 @@
 		zone_pcp_init(zone);
 		INIT_LIST_HEAD(&zone->active_list);
 		INIT_LIST_HEAD(&zone->inactive_list);
+		INIT_LIST_HEAD(&zone->wired_list);
 		zone->nr_scan_active = 0;
 		zone->nr_scan_inactive = 0;
 		zone->nr_active = 0;
 		zone->nr_inactive = 0;
+		zone->nr_wired = 0;
 		atomic_set(&zone->reclaim_in_progress, 0);
 		if (!size)
 			continue;
@@ -2161,6 +2173,7 @@
 			   "\n        high     %lu"
 			   "\n        active   %lu"
 			   "\n        inactive %lu"
+			   "\n        wired    %lu"
 			   "\n        scanned  %lu (a: %lu i: %lu)"
 			   "\n        spanned  %lu"
 			   "\n        present  %lu",
@@ -2170,6 +2183,7 @@
 			   zone->pages_high,
 			   zone->nr_active,
 			   zone->nr_inactive,
+			   zone->nr_wired,
 			   zone->pages_scanned,
 			   zone->nr_scan_active, zone->nr_scan_inactive,
 			   zone->spanned_pages,
diff -urN --exclude-from=./exclude.files linux-2.6.15/mm/readahead.c /home/backup/linux-2.6.15-release/mm/readahead.c
--- linux-2.6.15/mm/readahead.c	2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/mm/readahead.c	2006-03-06 06:30:08.000000000 -0500
@@ -564,8 +564,9 @@
 {
 	unsigned long active;
 	unsigned long inactive;
+	unsigned long wired;
 	unsigned long free;
 
-	__get_zone_counts(&active, &inactive, &free, NODE_DATA(numa_node_id()));
+	__get_zone_counts(&active, &inactive, &wired, &free, NODE_DATA(numa_node_id()));
 	return min(nr, (inactive + free) / 2);
 }
diff -urN --exclude-from=./exclude.files linux-2.6.15/mm/rmap.c /home/backup/linux-2.6.15-release/mm/rmap.c
--- linux-2.6.15/mm/rmap.c	2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/mm/rmap.c	2006-03-07 06:17:57.000000000 -0500
@@ -449,6 +449,8 @@
 		struct anon_vma *anon_vma = vma->anon_vma;
 
 		BUG_ON(!anon_vma);
+		if((vma->vm_flags & VM_LOCKED) && !vma->vm_wire_change)
+	        wire_page(page);
 		anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
 		page->mapping = (struct address_space *) anon_vma;
 
@@ -465,11 +467,13 @@
  *
  * The caller needs to hold the pte lock.
  */
-void page_add_file_rmap(struct page *page)
+void page_add_file_rmap(struct page *page, struct vm_area_struct *vma)
 {
 	BUG_ON(PageAnon(page));
 	BUG_ON(!pfn_valid(page_to_pfn(page)));
 
+	if((vma->vm_flags & VM_LOCKED) && !vma->vm_wire_change)
+		wire_page(page);
 	if (atomic_inc_and_test(&page->_mapcount))
 		inc_page_state(nr_mapped);
 }
@@ -480,8 +484,11 @@
  *
  * The caller needs to hold the pte lock.
  */
-void page_remove_rmap(struct page *page)
+void page_remove_rmap(struct page *page, struct vm_area_struct *vma)
 {
+	if(PageWired(page) && (vma->vm_flags&VM_LOCKED))
+		unwire_page(page);
+
 	if (atomic_add_negative(-1, &page->_mapcount)) {
 		BUG_ON(page_mapcount(page) < 0);
 		/*
@@ -562,7 +569,7 @@
 	} else
 		dec_mm_counter(mm, file_rss);
 
-	page_remove_rmap(page);
+	page_remove_rmap(page, vma);
 	page_cache_release(page);
 
 out_unmap:
@@ -652,7 +659,7 @@
 		if (pte_dirty(pteval))
 			set_page_dirty(page);
 
-		page_remove_rmap(page);
+		page_remove_rmap(page, vma);
 		page_cache_release(page);
 		dec_mm_counter(mm, file_rss);
 		(*mapcount)--;
@@ -712,8 +719,10 @@
 
 	list_for_each_entry(vma, &mapping->i_mmap_nonlinear,
 						shared.vm_set.list) {
-		if (vma->vm_flags & VM_LOCKED)
-			continue;
+
+		/* If VM_LOCKED is set, the page will be moved to the wired list. */
+		if (vma->vm_flags & VM_LOCKED)
+			continue;
 		cursor = (unsigned long) vma->vm_private_data;
 		if (cursor > max_nl_cursor)
 			max_nl_cursor = cursor;
diff -urN --exclude-from=./exclude.files linux-2.6.15/mm/swap.c /home/backup/linux-2.6.15-release/mm/swap.c
--- linux-2.6.15/mm/swap.c	2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/mm/swap.c	2006-03-07 11:45:37.000000000 -0500
@@ -110,6 +110,44 @@
 	spin_unlock_irq(&zone->lru_lock);
 }
 
+/* Wire the page; if the page is on an LRU list,
+ * try to move it to the wired list.
+ */
+void fastcall wire_page(struct page *page)
+{
+	struct zone *zone = page_zone(page);
+
+	spin_lock_irq(&zone->lru_lock);
+	page->wired_count ++;
+	if(!PageWired(page)){
+		if(PageLRU(page)){
+			del_page_from_lru(zone, page);
+			add_page_to_wired_list(zone,page);
+			SetPageWired(page);
+		}
+	}
+	spin_unlock_irq(&zone->lru_lock);
+}
+
+/* Unwire the page.
+ * If it is no longer wired by any process, try to move it to the active list.
+ */
+void fastcall unwire_page(struct page *page)
+{
+	struct zone *zone = page_zone(page);
+
+	spin_lock_irq(&zone->lru_lock);
+	page->wired_count --;
+	if(!page->wired_count){
+		if(PageLRU(page) && TestClearPageWired(page)){
+			del_page_from_wired_list(zone,page);
+			add_page_to_active_list(zone,page);
+			SetPageActive(page);
+		}
+	}
+	spin_unlock_irq(&zone->lru_lock);
+}
+
 /*
  * Mark a page as having seen activity.
  *
@@ -119,11 +157,13 @@
  */
 void fastcall mark_page_accessed(struct page *page)
 {
-	if (!PageActive(page) && PageReferenced(page) && PageLRU(page)) {
-		activate_page(page);
-		ClearPageReferenced(page);
-	} else if (!PageReferenced(page)) {
-		SetPageReferenced(page);
+	if(!PageWired(page)) {
+		if (!PageActive(page) && PageReferenced(page) && PageLRU(page)) {
+			activate_page(page);
+			ClearPageReferenced(page);
+		} else if (!PageReferenced(page)) {
+			SetPageReferenced(page);
+		}
 	}
 }
 
@@ -178,13 +218,15 @@
 	struct zone *zone = page_zone(page);
 
 	spin_lock_irqsave(&zone->lru_lock, flags);
-	if (TestClearPageLRU(page))
-		del_page_from_lru(zone, page);
-	if (page_count(page) != 0)
-		page = NULL;
+	if(!PageWired(page)) {
+		if (TestClearPageLRU(page))
+			del_page_from_lru(zone, page);
+		if (page_count(page) != 0)
+			page = NULL;
+		if (page)
+			free_hot_page(page);
+	}
 	spin_unlock_irqrestore(&zone->lru_lock, flags);
-	if (page)
-		free_hot_page(page);
 }
 
 EXPORT_SYMBOL(__page_cache_release);
@@ -214,7 +256,8 @@
 
 		if (!put_page_testzero(page))
 			continue;
-
+		if(PageWired(page))
+			continue;
 		pagezone = page_zone(page);
 		if (pagezone != zone) {
 			if (zone)
@@ -301,7 +344,12 @@
 		}
 		if (TestSetPageLRU(page))
 			BUG();
-		add_page_to_inactive_list(zone, page);
+		if(!page->wired_count)
+			add_page_to_inactive_list(zone, page);
+		else {
+			SetPageWired(page);
+			add_page_to_wired_list(zone,page);
+		}
 	}
 	if (zone)
 		spin_unlock_irq(&zone->lru_lock);
@@ -330,7 +378,12 @@
 			BUG();
 		if (TestSetPageActive(page))
 			BUG();
-		add_page_to_active_list(zone, page);
+		if(!page->wired_count)
+			add_page_to_active_list(zone, page);
+		else{
+			SetPageWired(page);
+			add_page_to_wired_list(zone,page);
+		}
 	}
 	if (zone)
 		spin_unlock_irq(&zone->lru_lock);

Thread overview: 14+ messages
2006-03-20 13:35 [PATCH][0/8] (Targeting 2.6.17) Posix memory locking and balanced mlock-LRU semantic Stone Wang
2006-03-20 13:41 ` Arjan van de Ven
2006-03-20 23:52   ` Nate Diller
2006-03-21  7:10     ` Arjan van de Ven
2006-03-21 12:24     ` Nick Piggin
2006-03-20 17:27 ` Christoph Lameter
2006-03-21  5:23   ` Stone Wang
2006-03-21 15:20   ` Stone Wang
2006-03-24  4:45   ` Rik van Riel
2006-03-21 12:20 ` Nick Piggin
2006-03-24 15:05   ` Stone Wang
2006-03-24 16:57     ` Nick Piggin
2006-03-24 14:36 ` Andi Kleen
2006-03-24 14:54   ` Stone Wang
