* [PATCH][0/8] (Targeting 2.6.17) Posix memory locking and balanced mlock-LRU semantic
@ 2006-03-20 13:35 Stone Wang
2006-03-20 13:41 ` Arjan van de Ven
` (3 more replies)
0 siblings, 4 replies; 14+ messages in thread
From: Stone Wang @ 2006-03-20 13:35 UTC (permalink / raw)
To: akpm; +Cc: linux-kernel, linux-mm
[-- Attachment #1: Type: text/plain, Size: 4044 bytes --]
One of my friends (who is working on a DBMS derived from
PostgreSQL) and I have both run into unexpected OOMs with mlock/mlockall.
After careful code reading and tests, I found that the cause of the
OOMs is the VM's LRU algorithm treating mlocked pages as Active/Inactive,
even though mlocked pages can never be reclaimed.
Mlocking many pages therefore easily unbalances the LRU lists against slab:
the VM tends to reclaim from the Active/Inactive lists, most of which are
mlocked, so an OOM may be triggered while there are in fact plenty of
reclaimable pages in slab.
(Setting a large "vfs_cache_pressure" may help avoid the OOM in this
situation, but I think it is better to do things right than to depend
on the "vfs_cache_pressure" tunable.)
We think that treating mlocked pages as Active/Inactive is the wrong
semantic. Mlocked pages should not be counted by the page-reclaim
algorithm, since in fact they are never affected by page reclaim.
The following patch tries to fix this, with some additions.
The patch gives Linux:
1. Posix mlock/munlock/mlockall/munlockall.
Bring mlock/munlock/mlockall/munlockall in line with the Posix definition:
transaction-like, just as described in the mlock(2)/munlock(2)/mlockall(2)/
munlockall(2) manpages. Thus users of the mlock system call family always
have a clear map of their mlocked areas. (A userspace sketch of this
all-or-nothing behavior follows the list.)
2. More consistent LRU semantics in memory management.
Mlocked pages are placed on a separate LRU list: the Wired list. These
pages take no part in the LRU algorithms, since they can never be swapped
out until munlocked.
3. Report the Wired (mlocked) page count through /proc/meminfo.
One line is added to /proc/meminfo, "Wired: N kB", so that Linux system
administrators and programmers have a clearer map of physical memory usage.
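To illustrate the all-or-nothing behavior of point 1, here is a minimal
userspace sketch (my own illustration, not part of the patch; the buffer
size is arbitrary):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 64UL << 20;	/* 64 MB, illustrative size */
	void *buf = malloc(len);

	if (!buf)
		return 1;
	memset(buf, 0, len);		/* touch the pages so they exist */

	/*
	 * With transaction-like semantics, a failing mlock() leaves
	 * nothing locked: either the whole range is wired or none of
	 * it is, so no cleanup munlock() is needed on the error path.
	 */
	if (mlock(buf, len) != 0) {
		perror("mlock");
		free(buf);
		return 1;
	}

	/* ... work on the locked memory ... */

	munlock(buf, len);
	free(buf);
	return 0;
}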
Test of the patch:
Test environment:
RHEL4.
Total physical memory size: 256 MB, no swap.
One ext3 directory ("/mnt/test") with about 256 thousand small
files (each 2 kB in size).
Step 1. Run a task mlocking 220 MB (a minimal sketch of such a task
is shown below, after the steps).
Step 2. Run: "find /mnt/test -size 100"
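The mlocking task of Step 1 was essentially the following (a minimal
reconstruction rather than the exact test program; it assumes it is run
as root, or with RLIMIT_MEMLOCK raised, so that locking 220 MB is
permitted):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define LOCK_SIZE (220UL << 20)	/* 220 MB, as in Step 1 */

int main(void)
{
	void *buf = malloc(LOCK_SIZE);

	if (!buf)
		return 1;
	memset(buf, 0, LOCK_SIZE);	/* fault every page in */

	if (mlock(buf, LOCK_SIZE) != 0) {
		perror("mlock");
		return 1;
	}
	printf("locked %lu MB, holding...\n", LOCK_SIZE >> 20);
	pause();	/* keep the pages wired while the find runs */
	return 0;
}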
Case A. Standard kernel.org kernel 2.6.15
Linux soon ran into OOM; memory info at OOM time:
[root@Linux ~]# cat /proc/meminfo
MemTotal: 254248 kB
MemFree: 3144 kB
Buffers: 124 kB
Cached: 1584 kB
SwapCached: 0 kB
Active: 229308 kB
Inactive: 596 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 254248 kB
LowFree: 3144 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 0 kB
Writeback: 0 kB
Mapped: 228556 kB
Slab: 20076 kB
CommitLimit: 127124 kB
Committed_AS: 238424 kB
PageTables: 584 kB
VmallocTotal: 770040 kB
VmallocUsed: 180 kB
VmallocChunk: 769844 kB
HugePages_Total: 0
HugePages_Free: 0
Hugepagesize: 4096 kB
Case B. Patched 2.6.15
No OOM happened.
[root@Linux ~]# cat /proc/meminfo
MemTotal: 254344 kB
MemFree: 3508 kB
Buffers: 6352 kB
Cached: 2684 kB
SwapCached: 0 kB
Active: 7140 kB
Inactive: 4732 kB
Wired: 225284 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 254344 kB
LowFree: 3508 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 72 kB
Writeback: 0 kB
Mapped: 229208 kB
Slab: 12552 kB
CommitLimit: 127172 kB
Committed_AS: 238168 kB
PageTables: 572 kB
VmallocTotal: 770040 kB
VmallocUsed: 180 kB
VmallocChunk: 769844 kB
HugePages_Total: 0
HugePages_Free: 0
Hugepagesize: 4096 kB
Many thanks to Mel Gorman for his book "Understanding the Linux Virtual
Memory Manager". Thanks also to two other great Linux kernel books: ULK3
and LDD3.
FreeBSD's VM implementation enlightened me; thanks to the FreeBSD folks.
Attached is the full patch; the following mails are its split-up parts.
Shaoping Wang
[-- Attachment #2: patch-2.6.15-memlock --]
[-- Type: application/octet-stream, Size: 49453 bytes --]
diff -urN --exclude-from=./exclude.files linux-2.6.15/arch/cris/arch-v32/drivers/cryptocop.c /home/backup/linux-2.6.15-release/arch/cris/arch-v32/drivers/cryptocop.c
--- linux-2.6.15/arch/cris/arch-v32/drivers/cryptocop.c 2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/arch/cris/arch-v32/drivers/cryptocop.c 2006-03-06 08:38:48.000000000 -0500
@@ -2724,6 +2724,7 @@
noinpages,
0, /* read access only for in data */
0, /* no force */
+ 0, /* do not set wire */
inpages,
NULL);
@@ -2741,6 +2742,7 @@
nooutpages,
1, /* write access for out data */
0, /* no force */
+ 0, /* do not set wire*/
outpages,
NULL);
up_read(&current->mm->mmap_sem);
diff -urN --exclude-from=./exclude.files linux-2.6.15/Documentation/vm/hugetlbpage.txt /home/backup/linux-2.6.15-release/Documentation/vm/hugetlbpage.txt
--- linux-2.6.15/Documentation/vm/hugetlbpage.txt 2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/Documentation/vm/hugetlbpage.txt 2006-03-06 06:30:06.000000000 -0500
@@ -59,7 +59,7 @@
This command will try to configure 20 hugepages in the system. The success
or failure of allocation depends on the amount of physically contiguous
-memory that is preset in system at this time. System administrators may want
+memory that is present in system at this time. System administrators may want
to put this command in one of the local rc init file. This will enable the
kernel to request huge pages early in the boot process (when the possibility
of getting physical contiguous pages is still very high).
diff -urN --exclude-from=./exclude.files linux-2.6.15/Documentation/vm/locking /home/backup/linux-2.6.15-release/Documentation/vm/locking
--- linux-2.6.15/Documentation/vm/locking 2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/Documentation/vm/locking 2006-03-07 03:43:44.000000000 -0500
@@ -37,7 +37,7 @@
4. The exception to this rule is expand_stack, which just
takes the read lock and the page_table_lock, this is ok
because it doesn't really modify fields anybody relies on.
-5. You must be able to guarantee that while holding page_table_lock
+5. You must be able to guarantee that while holding mmap_sem
or page_table_lock of mm A, you will not try to get either lock
for mm B.
diff -urN --exclude-from=./exclude.files linux-2.6.15/drivers/infiniband/core/uverbs_mem.c /home/backup/linux-2.6.15-release/drivers/infiniband/core/uverbs_mem.c
--- linux-2.6.15/drivers/infiniband/core/uverbs_mem.c 2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/drivers/infiniband/core/uverbs_mem.c 2006-03-06 08:40:06.000000000 -0500
@@ -110,7 +110,7 @@
ret = get_user_pages(current, current->mm, cur_base,
min_t(int, npages,
PAGE_SIZE / sizeof (struct page *)),
- 1, !write, page_list, NULL);
+ 1, !write, 0, page_list, NULL);
if (ret < 0)
goto out;
diff -urN --exclude-from=./exclude.files linux-2.6.15/drivers/infiniband/hw/mthca/mthca_memfree.c /home/backup/linux-2.6.15-release/drivers/infiniband/hw/mthca/mthca_memfree.c
--- linux-2.6.15/drivers/infiniband/hw/mthca/mthca_memfree.c 2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/drivers/infiniband/hw/mthca/mthca_memfree.c 2006-03-06 08:41:10.000000000 -0500
@@ -396,7 +396,7 @@
goto out;
}
- ret = get_user_pages(current, current->mm, uaddr & PAGE_MASK, 1, 1, 0,
+ ret = get_user_pages(current, current->mm, uaddr & PAGE_MASK, 1, 1, 0, 0,
&db_tab->page[i].mem.page, NULL);
if (ret < 0)
goto out;
diff -urN --exclude-from=./exclude.files linux-2.6.15/drivers/media/video/video-buf.c /home/backup/linux-2.6.15-release/drivers/media/video/video-buf.c
--- linux-2.6.15/drivers/media/video/video-buf.c 2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/drivers/media/video/video-buf.c 2006-03-06 08:41:54.000000000 -0500
@@ -149,7 +149,7 @@
down_read(&current->mm->mmap_sem);
err = get_user_pages(current,current->mm,
data & PAGE_MASK, dma->nr_pages,
- rw == READ, 1, /* force */
+ rw == READ, 1, 0, /* force,do not set wire */
dma->pages, NULL);
up_read(&current->mm->mmap_sem);
if (err != dma->nr_pages) {
diff -urN --exclude-from=./exclude.files linux-2.6.15/drivers/scsi/sg.c /home/backup/linux-2.6.15-release/drivers/scsi/sg.c
--- linux-2.6.15/drivers/scsi/sg.c 2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/drivers/scsi/sg.c 2006-03-06 08:09:43.000000000 -0500
@@ -1815,6 +1815,7 @@
nr_pages,
rw == READ,
0, /* don't force */
+ 0,
pages,
NULL);
up_read(&current->mm->mmap_sem);
diff -urN --exclude-from=./exclude.files linux-2.6.15/drivers/scsi/st.c /home/backup/linux-2.6.15-release/drivers/scsi/st.c
--- linux-2.6.15/drivers/scsi/st.c 2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/drivers/scsi/st.c 2006-03-06 07:57:47.000000000 -0500
@@ -4453,6 +4453,7 @@
nr_pages,
rw == READ,
0, /* don't force */
+ 0,
pages,
NULL);
up_read(&current->mm->mmap_sem);
diff -urN --exclude-from=./exclude.files linux-2.6.15/drivers/video/pvr2fb.c /home/backup/linux-2.6.15-release/drivers/video/pvr2fb.c
--- linux-2.6.15/drivers/video/pvr2fb.c 2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/drivers/video/pvr2fb.c 2006-03-06 08:42:53.000000000 -0500
@@ -690,7 +690,7 @@
down_read(&current->mm->mmap_sem);
ret = get_user_pages(current, current->mm, (unsigned long)buf,
- nr_pages, WRITE, 0, pages, NULL);
+ nr_pages, WRITE, 0, 0, pages, NULL);
up_read(&current->mm->mmap_sem);
if (ret < nr_pages) {
diff -urN --exclude-from=./exclude.files linux-2.6.15/fs/aio.c /home/backup/linux-2.6.15-release/fs/aio.c
--- linux-2.6.15/fs/aio.c 2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/fs/aio.c 2006-03-06 06:30:06.000000000 -0500
@@ -146,7 +146,7 @@
dprintk("mmap address: 0x%08lx\n", info->mmap_base);
info->nr_pages = get_user_pages(current, ctx->mm,
info->mmap_base, nr_pages,
- 1, 0, info->ring_pages, NULL);
+ 1, 0, 0, info->ring_pages, NULL);
up_write(&ctx->mm->mmap_sem);
if (unlikely(info->nr_pages != nr_pages)) {
diff -urN --exclude-from=./exclude.files linux-2.6.15/fs/binfmt_elf.c /home/backup/linux-2.6.15-release/fs/binfmt_elf.c
--- linux-2.6.15/fs/binfmt_elf.c 2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/fs/binfmt_elf.c 2006-03-06 06:30:06.000000000 -0500
@@ -1600,7 +1600,7 @@
struct page* page;
struct vm_area_struct *vma;
- if (get_user_pages(current, current->mm, addr, 1, 0, 1,
+ if (get_user_pages(current, current->mm, addr, 1, 0, 1, 0,
&page, &vma) <= 0) {
DUMP_SEEK (file->f_pos + PAGE_SIZE);
} else {
diff -urN --exclude-from=./exclude.files linux-2.6.15/fs/bio.c /home/backup/linux-2.6.15-release/fs/bio.c
--- linux-2.6.15/fs/bio.c 2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/fs/bio.c 2006-03-06 06:30:06.000000000 -0500
@@ -608,7 +608,7 @@
down_read(&current->mm->mmap_sem);
ret = get_user_pages(current, current->mm, uaddr,
local_nr_pages,
- write_to_vm, 0, &pages[cur_page], NULL);
+ write_to_vm, 0, 0, &pages[cur_page], NULL);
up_read(&current->mm->mmap_sem);
if (ret < local_nr_pages)
diff -urN --exclude-from=./exclude.files linux-2.6.15/fs/direct-io.c /home/backup/linux-2.6.15-release/fs/direct-io.c
--- linux-2.6.15/fs/direct-io.c 2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/fs/direct-io.c 2006-03-06 06:30:06.000000000 -0500
@@ -157,6 +157,7 @@
nr_pages, /* How many pages? */
dio->rw == READ, /* Write to memory? */
0, /* force (?) */
+ 0,
&dio->pages[0],
NULL); /* vmas */
up_read(&current->mm->mmap_sem);
diff -urN --exclude-from=./exclude.files linux-2.6.15/fs/fuse/dev.c /home/backup/linux-2.6.15-release/fs/fuse/dev.c
--- linux-2.6.15/fs/fuse/dev.c 2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/fs/fuse/dev.c 2006-03-06 06:30:07.000000000 -0500
@@ -462,7 +462,7 @@
cs->nr_segs --;
}
down_read(&current->mm->mmap_sem);
- err = get_user_pages(current, current->mm, cs->addr, 1, cs->write, 0,
+ err = get_user_pages(current, current->mm, cs->addr, 1, cs->write, 0, 0,
&cs->pg, NULL);
up_read(&current->mm->mmap_sem);
if (err < 0)
diff -urN --exclude-from=./exclude.files linux-2.6.15/fs/fuse/file.c /home/backup/linux-2.6.15-release/fs/fuse/file.c
--- linux-2.6.15/fs/fuse/file.c 2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/fs/fuse/file.c 2006-03-06 06:30:07.000000000 -0500
@@ -457,7 +457,7 @@
npages = min(npages, FUSE_MAX_PAGES_PER_REQ);
down_read(&current->mm->mmap_sem);
npages = get_user_pages(current, current->mm, user_addr, npages, write,
- 0, req->pages, NULL);
+ 0, 0, req->pages, NULL);
up_read(&current->mm->mmap_sem);
if (npages < 0)
return npages;
diff -urN --exclude-from=./exclude.files linux-2.6.15/fs/nfs/direct.c /home/backup/linux-2.6.15-release/fs/nfs/direct.c
--- linux-2.6.15/fs/nfs/direct.c 2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/fs/nfs/direct.c 2006-03-06 06:30:07.000000000 -0500
@@ -104,7 +104,7 @@
if (*pages) {
down_read(&current->mm->mmap_sem);
result = get_user_pages(current, current->mm, user_addr,
- page_count, (rw == READ), 0,
+ page_count, (rw == READ), 0, 0,
*pages, NULL);
up_read(&current->mm->mmap_sem);
}
diff -urN --exclude-from=./exclude.files linux-2.6.15/fs/proc/proc_misc.c /home/backup/linux-2.6.15-release/fs/proc/proc_misc.c
--- linux-2.6.15/fs/proc/proc_misc.c 2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/fs/proc/proc_misc.c 2006-03-06 06:44:50.000000000 -0500
@@ -123,6 +123,7 @@
struct page_state ps;
unsigned long inactive;
unsigned long active;
+ unsigned long wired;
unsigned long free;
unsigned long committed;
unsigned long allowed;
@@ -130,7 +131,7 @@
long cached;
get_page_state(&ps);
- get_zone_counts(&active, &inactive, &free);
+ get_zone_counts(&active, &inactive, &wired, &free);
/*
* display in kilobytes.
@@ -159,6 +160,7 @@
"SwapCached: %8lu kB\n"
"Active: %8lu kB\n"
"Inactive: %8lu kB\n"
+ "Wired: %8lu kB\n"
"HighTotal: %8lu kB\n"
"HighFree: %8lu kB\n"
"LowTotal: %8lu kB\n"
@@ -182,6 +184,7 @@
K(total_swapcache_pages),
K(active),
K(inactive),
+ K(wired),
K(i.totalhigh),
K(i.freehigh),
K(i.totalram-i.totalhigh),
diff -urN --exclude-from=./exclude.files linux-2.6.15/include/linux/mm.h /home/backup/linux-2.6.15-release/include/linux/mm.h
--- linux-2.6.15/include/linux/mm.h 2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/include/linux/mm.h 2006-03-07 01:49:12.000000000 -0500
@@ -59,6 +59,9 @@
unsigned long vm_start; /* Our start address within vm_mm. */
unsigned long vm_end; /* The first byte after our end address
within vm_mm. */
+ int vm_wire_change; /* VM_LOCKED bit of vm_flags was just changed.
+ * For rollback support of sys_mlock series system calls.
+ */
/* linked list of VM areas per task, sorted by address */
struct vm_area_struct *vm_next;
@@ -218,6 +221,10 @@
unsigned long flags; /* Atomic flags, some possibly
* updated asynchronously */
atomic_t _count; /* Usage count, see below. */
+ unsigned short wired_count; /* Count of wirings of the page.
+ * If not zero,the page would be SetPageWired,
+ * and put on Wired list of the zone.
+ */
atomic_t _mapcount; /* Count of ptes mapped in mms,
* to show when page is mapped
* & limit reverse map searches.
@@ -699,12 +706,13 @@
return __handle_mm_fault(mm, vma, address, write_access) & (~VM_FAULT_WRITE);
}
-extern int make_pages_present(unsigned long addr, unsigned long end);
+extern int make_pages_wired(unsigned long addr, unsigned long end);
+void make_pages_unwired(struct mm_struct *mm, unsigned long addr, unsigned long end);
extern int access_process_vm(struct task_struct *tsk, unsigned long addr, void *buf, int len, int write);
void install_arg_page(struct vm_area_struct *, struct page *, unsigned long);
int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, unsigned long start,
- int len, int write, int force, struct page **pages, struct vm_area_struct **vmas);
+ int len, int write, int force, int wire, struct page **pages, struct vm_area_struct **vmas);
void print_bad_pte(struct vm_area_struct *, pte_t, unsigned long);
int __set_page_dirty_buffers(struct page *page);
diff -urN --exclude-from=./exclude.files linux-2.6.15/include/linux/mm_inline.h /home/backup/linux-2.6.15-release/include/linux/mm_inline.h
--- linux-2.6.15/include/linux/mm_inline.h 2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/include/linux/mm_inline.h 2006-03-07 01:56:10.000000000 -0500
@@ -1,3 +1,9 @@
+/*
+ * There are 3 per-zone lists in LRU:
+ * Active: pages which were accessed more frequently.
+ * Inactive: pages accessed less frequently.
+ * Wired: pages mlocked by some tasks.
+ */
static inline void
add_page_to_active_list(struct zone *zone, struct page *page)
@@ -14,6 +20,13 @@
}
static inline void
+add_page_to_wired_list(struct zone *zone, struct page *page)
+{
+ list_add(&page->lru, &zone->wired_list);
+ zone->nr_wired++;
+}
+
+static inline void
del_page_from_active_list(struct zone *zone, struct page *page)
{
list_del(&page->lru);
@@ -28,10 +41,20 @@
}
static inline void
+del_page_from_wired_list(struct zone *zone, struct page *page)
+{
+ list_del(&page->lru);
+ zone->nr_wired--;
+}
+
+static inline void
del_page_from_lru(struct zone *zone, struct page *page)
{
list_del(&page->lru);
- if (PageActive(page)) {
+ if(PageWired(page)){
+ ClearPageWired(page);
+ zone->nr_wired--;
+ } else if (PageActive(page)) {
ClearPageActive(page);
zone->nr_active--;
} else {
diff -urN --exclude-from=./exclude.files linux-2.6.15/include/linux/mmzone.h /home/backup/linux-2.6.15-release/include/linux/mmzone.h
--- linux-2.6.15/include/linux/mmzone.h 2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/include/linux/mmzone.h 2006-03-07 01:58:26.000000000 -0500
@@ -143,10 +143,12 @@
spinlock_t lru_lock;
struct list_head active_list;
struct list_head inactive_list;
+ struct list_head wired_list; /* Pages wired. */
unsigned long nr_scan_active;
unsigned long nr_scan_inactive;
unsigned long nr_active;
unsigned long nr_inactive;
+ unsigned long nr_wired;
unsigned long pages_scanned; /* since last reclaim */
int all_unreclaimable; /* All pages pinned */
@@ -315,9 +317,9 @@
extern struct pglist_data *pgdat_list;
void __get_zone_counts(unsigned long *active, unsigned long *inactive,
- unsigned long *free, struct pglist_data *pgdat);
+ unsigned long *wired, unsigned long *free, struct pglist_data *pgdat);
void get_zone_counts(unsigned long *active, unsigned long *inactive,
- unsigned long *free);
+ unsigned long *wired, unsigned long *free);
void build_all_zonelists(void);
void wakeup_kswapd(struct zone *zone, int order);
int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
diff -urN --exclude-from=./exclude.files linux-2.6.15/include/linux/page-flags.h /home/backup/linux-2.6.15-release/include/linux/page-flags.h
--- linux-2.6.15/include/linux/page-flags.h 2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/include/linux/page-flags.h 2006-03-06 06:30:07.000000000 -0500
@@ -76,6 +76,8 @@
#define PG_nosave_free 18 /* Free, should not be written */
#define PG_uncached 19 /* Page has been mapped as uncached */
+#define PG_wired 20 /* Page is on Wired list */
+
/*
* Global page accounting. One instance per CPU. Only unsigned longs are
* allowed.
@@ -198,7 +200,14 @@
#define __ClearPageDirty(page) __clear_bit(PG_dirty, &(page)->flags)
#define TestClearPageDirty(page) test_and_clear_bit(PG_dirty, &(page)->flags)
+#define SetPageWired(page) set_bit(PG_wired, &(page)->flags)
+#define ClearPageWired(page) clear_bit(PG_wired,&(page)->flags)
+#define PageWired(page) test_bit(PG_wired, &(page)->flags)
+#define TestSetPageWired(page) test_and_set_bit(PG_wired, &(page)->flags)
+#define TestClearPageWired(page) test_and_clear_bit(PG_wired, &(page)->flags)
+
#define SetPageLRU(page) set_bit(PG_lru, &(page)->flags)
+#define ClearPageLRU(page) clear_bit(PG_lru, &(page)->flags)
#define PageLRU(page) test_bit(PG_lru, &(page)->flags)
#define TestSetPageLRU(page) test_and_set_bit(PG_lru, &(page)->flags)
#define TestClearPageLRU(page) test_and_clear_bit(PG_lru, &(page)->flags)
diff -urN --exclude-from=./exclude.files linux-2.6.15/include/linux/rmap.h /home/backup/linux-2.6.15-release/include/linux/rmap.h
--- linux-2.6.15/include/linux/rmap.h 2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/include/linux/rmap.h 2006-03-06 06:30:07.000000000 -0500
@@ -71,8 +71,8 @@
* rmap interfaces called when adding or removing pte of page
*/
void page_add_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
-void page_add_file_rmap(struct page *);
-void page_remove_rmap(struct page *);
+void page_add_file_rmap(struct page *, struct vm_area_struct *);
+void page_remove_rmap(struct page *, struct vm_area_struct *);
/**
* page_dup_rmap - duplicate pte mapping to a page
diff -urN --exclude-from=./exclude.files linux-2.6.15/include/linux/swap.h /home/backup/linux-2.6.15-release/include/linux/swap.h
--- linux-2.6.15/include/linux/swap.h 2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/include/linux/swap.h 2006-03-06 06:30:07.000000000 -0500
@@ -165,6 +165,8 @@
extern void FASTCALL(lru_cache_add(struct page *));
extern void FASTCALL(lru_cache_add_active(struct page *));
extern void FASTCALL(activate_page(struct page *));
+extern void FASTCALL(wire_page(struct page *));
+extern void FASTCALL(unwire_page(struct page *));
extern void FASTCALL(mark_page_accessed(struct page *));
extern void lru_add_drain(void);
extern int rotate_reclaimable_page(struct page *page);
diff -urN --exclude-from=./exclude.files linux-2.6.15/kernel/futex.c /home/backup/linux-2.6.15-release/kernel/futex.c
--- linux-2.6.15/kernel/futex.c 2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/kernel/futex.c 2006-03-06 06:30:07.000000000 -0500
@@ -201,7 +201,7 @@
* from swap. But that's a lot of code to duplicate here
* for a rare case, so we simply fetch the page.
*/
- err = get_user_pages(current, mm, uaddr, 1, 0, 0, &page, NULL);
+ err = get_user_pages(current, mm, uaddr, 1, 0, 0, 0, &page, NULL);
if (err >= 0) {
key->shared.pgoff =
page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
diff -urN --exclude-from=./exclude.files linux-2.6.15/kernel/ptrace.c /home/backup/linux-2.6.15-release/kernel/ptrace.c
--- linux-2.6.15/kernel/ptrace.c 2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/kernel/ptrace.c 2006-03-06 06:30:07.000000000 -0500
@@ -228,7 +228,7 @@
void *maddr;
ret = get_user_pages(tsk, mm, addr, 1,
- write, 1, &page, &vma);
+ write, 1, 0, &page, &vma);
if (ret <= 0)
break;
diff -urN --exclude-from=./exclude.files linux-2.6.15/mm/filemap_xip.c /home/backup/linux-2.6.15-release/mm/filemap_xip.c
--- linux-2.6.15/mm/filemap_xip.c 2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/mm/filemap_xip.c 2006-03-06 06:30:07.000000000 -0500
@@ -189,7 +189,7 @@
/* Nuke the page table entry. */
flush_cache_page(vma, address, pte_pfn(*pte));
pteval = ptep_clear_flush(vma, address, pte);
- page_remove_rmap(page);
+ page_remove_rmap(page, vma);
dec_mm_counter(mm, file_rss);
BUG_ON(pte_dirty(pteval));
pte_unmap_unlock(pte, ptl);
diff -urN --exclude-from=./exclude.files linux-2.6.15/mm/fremap.c /home/backup/linux-2.6.15-release/mm/fremap.c
--- linux-2.6.15/mm/fremap.c 2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/mm/fremap.c 2006-03-06 06:30:07.000000000 -0500
@@ -33,7 +33,7 @@
if (page) {
if (pte_dirty(pte))
set_page_dirty(page);
- page_remove_rmap(page);
+ page_remove_rmap(page, vma);
page_cache_release(page);
}
} else {
@@ -80,7 +80,7 @@
flush_icache_page(vma, page);
set_pte_at(mm, addr, pte, mk_pte(page, prot));
- page_add_file_rmap(page);
+ page_add_file_rmap(page, vma);
pte_val = *pte;
update_mmu_cache(vma, addr, pte_val);
err = 0;
@@ -203,6 +203,8 @@
spin_unlock(&mapping->i_mmap_lock);
}
+ if(vma->vm_flags & VM_LOCKED)
+ flags &= ~MAP_NONBLOCK;
err = vma->vm_ops->populate(vma, start, size,
vma->vm_page_prot,
pgoff, flags & MAP_NONBLOCK);
diff -urN --exclude-from=./exclude.files linux-2.6.15/mm/memory.c /home/backup/linux-2.6.15-release/mm/memory.c
--- linux-2.6.15/mm/memory.c 2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/mm/memory.c 2006-03-07 11:14:59.000000000 -0500
@@ -656,7 +656,7 @@
mark_page_accessed(page);
file_rss--;
}
- page_remove_rmap(page);
+ page_remove_rmap(page, vma);
tlb_remove_page(tlb, page);
continue;
}
@@ -950,8 +950,30 @@
return page;
}
+void make_pages_unwired(struct mm_struct *mm,
+ unsigned long start,unsigned long end)
+{
+ struct vm_area_struct *vma;
+ struct page *page;
+ unsigned int foll_flags;
+
+ foll_flags =0;
+
+ vma=find_vma(mm,start);
+ if(!vma)
+ BUG();
+ if(is_vm_hugetlb_page(vma))
+ return;
+
+ for(; start<end ; start+=PAGE_SIZE) {
+ page=follow_page(vma,start,foll_flags);
+ if(page)
+ unwire_page(page);
+ }
+}
+
int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
- unsigned long start, int len, int write, int force,
+ unsigned long start, int len, int write,int force, int wire,
struct page **pages, struct vm_area_struct **vmas)
{
int i;
@@ -973,6 +995,7 @@
if (!vma && in_gate_area(tsk, start)) {
unsigned long pg = start & PAGE_MASK;
struct vm_area_struct *gate_vma = get_gate_vma(tsk);
+ struct page *page;
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
@@ -994,6 +1017,7 @@
pte_unmap(pte);
return i ? : -EFAULT;
}
+ page = vm_normal_page(gate_vma, start, *pte);
if (pages) {
struct page *page = vm_normal_page(gate_vma, start, *pte);
pages[i] = page;
@@ -1003,9 +1027,12 @@
pte_unmap(pte);
if (vmas)
vmas[i] = gate_vma;
+ if(wire)
+ wire_page(page);
i++;
start += PAGE_SIZE;
len--;
+
continue;
}
@@ -1013,6 +1040,7 @@
|| !(vm_flags & vma->vm_flags))
return i ? : -EFAULT;
+ /* We dont account wired HugeTLB pages */
if (is_vm_hugetlb_page(vma)) {
i = follow_hugetlb_page(mm, vma, pages, vmas,
&start, &len, i);
@@ -1067,17 +1095,20 @@
}
if (vmas)
vmas[i] = vma;
+ if(wire)
+ wire_page(page);
i++;
start += PAGE_SIZE;
len--;
} while (len && start < vma->vm_end);
} while (len);
+
return i;
}
EXPORT_SYMBOL(get_user_pages);
-static int zeromap_pte_range(struct mm_struct *mm, pmd_t *pmd,
- unsigned long addr, unsigned long end, pgprot_t prot)
+static int zeromap_pte_range(struct mm_struct *mm, struct vm_area_struct *vma,
+ pmd_t *pmd, unsigned long addr, unsigned long end, pgprot_t prot)
{
pte_t *pte;
spinlock_t *ptl;
@@ -1089,7 +1120,7 @@
struct page *page = ZERO_PAGE(addr);
pte_t zero_pte = pte_wrprotect(mk_pte(page, prot));
page_cache_get(page);
- page_add_file_rmap(page);
+ page_add_file_rmap(page,vma);
inc_mm_counter(mm, file_rss);
BUG_ON(!pte_none(*pte));
set_pte_at(mm, addr, pte, zero_pte);
@@ -1098,8 +1129,8 @@
return 0;
}
-static inline int zeromap_pmd_range(struct mm_struct *mm, pud_t *pud,
- unsigned long addr, unsigned long end, pgprot_t prot)
+static inline int zeromap_pmd_range(struct mm_struct *mm, struct vm_area_struct *vma,
+ pud_t *pud, unsigned long addr, unsigned long end, pgprot_t prot)
{
pmd_t *pmd;
unsigned long next;
@@ -1109,14 +1140,14 @@
return -ENOMEM;
do {
next = pmd_addr_end(addr, end);
- if (zeromap_pte_range(mm, pmd, addr, next, prot))
+ if (zeromap_pte_range(mm, vma, pmd, addr, next, prot))
return -ENOMEM;
} while (pmd++, addr = next, addr != end);
return 0;
}
-static inline int zeromap_pud_range(struct mm_struct *mm, pgd_t *pgd,
- unsigned long addr, unsigned long end, pgprot_t prot)
+static inline int zeromap_pud_range(struct mm_struct *mm, struct vm_area_struct *vma,
+ pgd_t *pgd, unsigned long addr, unsigned long end, pgprot_t prot)
{
pud_t *pud;
unsigned long next;
@@ -1126,7 +1157,7 @@
return -ENOMEM;
do {
next = pud_addr_end(addr, end);
- if (zeromap_pmd_range(mm, pud, addr, next, prot))
+ if (zeromap_pmd_range(mm, vma, pud, addr, next, prot))
return -ENOMEM;
} while (pud++, addr = next, addr != end);
return 0;
@@ -1146,7 +1177,7 @@
flush_cache_range(vma, addr, end);
do {
next = pgd_addr_end(addr, end);
- err = zeromap_pud_range(mm, pgd, addr, next, prot);
+ err = zeromap_pud_range(mm, vma, pgd, addr, next, prot);
if (err)
break;
} while (pgd++, addr = next, addr != end);
@@ -1172,7 +1203,8 @@
* old drivers should use this, and they needed to mark their
* pages reserved for the old functions anyway.
*/
-static int insert_page(struct mm_struct *mm, unsigned long addr, struct page *page, pgprot_t prot)
+static int insert_page(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long addr, struct page *page, pgprot_t prot)
{
int retval;
pte_t *pte;
@@ -1193,7 +1225,7 @@
/* Ok, finally just insert the thing.. */
get_page(page);
inc_mm_counter(mm, file_rss);
- page_add_file_rmap(page);
+ page_add_file_rmap(page,vma);
set_pte_at(mm, addr, pte, mk_pte(page, prot));
retval = 0;
@@ -1229,7 +1261,7 @@
if (!page_count(page))
return -EINVAL;
vma->vm_flags |= VM_INSERTPAGE;
- return insert_page(vma->vm_mm, addr, page, vma->vm_page_prot);
+ return insert_page(vma->vm_mm, vma, addr, page, vma->vm_page_prot);
}
EXPORT_SYMBOL(vm_insert_page);
@@ -1484,7 +1516,7 @@
page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
if (likely(pte_same(*page_table, orig_pte))) {
if (old_page) {
- page_remove_rmap(old_page);
+ page_remove_rmap(old_page, vma);
if (!PageAnon(old_page)) {
dec_mm_counter(mm, file_rss);
inc_mm_counter(mm, anon_rss);
@@ -1967,7 +1999,7 @@
if (!pte_none(*page_table))
goto release;
inc_mm_counter(mm, file_rss);
- page_add_file_rmap(page);
+ page_add_file_rmap(page, vma);
}
set_pte_at(mm, address, page_table, entry);
@@ -2089,7 +2121,7 @@
page_add_anon_rmap(new_page, vma, address);
} else {
inc_mm_counter(mm, file_rss);
- page_add_file_rmap(new_page);
+ page_add_file_rmap(new_page, vma);
}
} else {
/* One of our sibling threads was faster, back out. */
@@ -2306,10 +2338,13 @@
}
#endif /* __PAGETABLE_PMD_FOLDED */
-int make_pages_present(unsigned long addr, unsigned long end)
+int make_pages_wired(unsigned long addr, unsigned long end)
{
int ret, len, write;
+ struct page *page;
struct vm_area_struct * vma;
+ struct mm_struct *mm=current->mm;
+ int wire_change;
vma = find_vma(current->mm, addr);
if (!vma)
@@ -2320,13 +2355,26 @@
if (end > vma->vm_end)
BUG();
len = (end+PAGE_SIZE-1)/PAGE_SIZE-addr/PAGE_SIZE;
- ret = get_user_pages(current, current->mm, addr,
- len, write, 0, NULL, NULL);
- if (ret < 0)
- return ret;
- return ret == len ? 0 : -1;
+ wire_change = vma->vm_wire_change;
+ vma->vm_wire_change = 1;
+ ret = get_user_pages(current, mm, addr,
+ len, write, 1, 1, NULL, NULL); /* write,set_wire */
+ vma->vm_wire_change = wire_change;
+ if(ret < len) {
+ for(; addr< end ; addr += PAGE_SIZE) {
+ page=follow_page(vma,addr,0);
+ if(page)
+ unwire_page(page);
+ else
+ BUG();
+ }
+ return -1;
+ }
+ else
+ return 0;
}
+
/*
* Map a vmalloc()-space virtual address to the physical page.
*/
diff -urN --exclude-from=./exclude.files linux-2.6.15/mm/mempolicy.c /home/backup/linux-2.6.15-release/mm/mempolicy.c
--- linux-2.6.15/mm/mempolicy.c 2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/mm/mempolicy.c 2006-03-06 06:30:08.000000000 -0500
@@ -440,7 +440,7 @@
struct page *p;
int err;
- err = get_user_pages(current, mm, addr & PAGE_MASK, 1, 0, 0, &p, NULL);
+ err = get_user_pages(current, mm, addr & PAGE_MASK, 1, 0, 0, 0, &p, NULL);
if (err >= 0) {
err = page_to_nid(p);
put_page(p);
diff -urN --exclude-from=./exclude.files linux-2.6.15/mm/mlock.c /home/backup/linux-2.6.15-release/mm/mlock.c
--- linux-2.6.15/mm/mlock.c 2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/mm/mlock.c 2006-03-07 10:50:52.000000000 -0500
@@ -3,6 +3,7 @@
*
* (C) Copyright 1995 Linus Torvalds
* (C) Copyright 2002 Christoph Hellwig
+ * (C) Copyright 2006 Shaoping Wang
*/
#include <linux/mman.h>
@@ -10,72 +11,119 @@
#include <linux/mempolicy.h>
#include <linux/syscalls.h>
-
-static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
- unsigned long start, unsigned long end, unsigned int newflags)
+static int do_mlock(unsigned long start, size_t len,unsigned int jump_hole)
{
- struct mm_struct * mm = vma->vm_mm;
- pgoff_t pgoff;
- int pages;
- int ret = 0;
+ unsigned long end=0,vmoff=0;
+ unsigned long pages=0;
+ struct mm_struct *mm=current->mm;
+ struct vm_area_struct * vma, *prev, **pprev,*next;
+ int ret=0;
- if (newflags == vma->vm_flags) {
- *prev = vma;
- goto out;
- }
+ len = PAGE_ALIGN(len);
+ end = start + len;
+ if (end < start)
+ return -EINVAL;
+ if (end == start)
+ return 0;
- pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
- *prev = vma_merge(mm, *prev, start, end, newflags, vma->anon_vma,
- vma->vm_file, pgoff, vma_policy(vma));
- if (*prev) {
- vma = *prev;
- goto success;
+ vma = find_vma_prev(current->mm, start, &prev);
+ if (!vma || vma->vm_start > start)
+ return -ENOMEM;
+ else if (vma->vm_start < start){
+ prev=vma;
+ ret = split_vma(mm, prev, start, 0);
+ if(!ret)
+ vma=prev->vm_next;
+ else {
+ return ret;
+ }
}
- *prev = vma;
-
- if (start != vma->vm_start) {
- ret = split_vma(mm, vma, start, 1);
- if (ret)
+ while(vma->vm_start < end){
+ vmoff =vma->vm_end; /* Record where we have proceeded. */
+ if (vma->vm_end > end){
+ ret = split_vma(mm, vma, end, 0);
+ if (ret)
+ goto out;
+ }
+ if(vma->vm_flags & VM_LOCKED)
+ goto next;
+ vma->vm_flags |= VM_LOCKED;
+ vma->vm_wire_change =1;
+ pages += (vma->vm_end-vma->vm_start) >> PAGE_SHIFT;
+
+ if (!(vma->vm_flags & VM_IO)) {
+ ret = make_pages_wired(vma->vm_start, vma->vm_end);
+ if(ret<0){
+ vma->vm_flags &= ~VM_LOCKED;
+ vma->vm_wire_change =0;
+ goto out;
+ }
+ }
+next:
+ if(vma->vm_end ==end)
+ break;
+ prev =vma;
+ vma =vma->vm_next;
+
+ /* If called from do_mlockall,
+ * we may jump over holes.
+ */
+ if(jump_hole){
+ if(vma)
+ continue;
+ else
+ goto out;
+ }
+ else if (!vma || vma->vm_start != prev->vm_end){
+ ret = -ENOMEM;
goto out;
+ }
}
- if (end != vma->vm_end) {
- ret = split_vma(mm, vma, end, 0);
- if (ret)
- goto out;
- }
+out:
+ pprev =&prev;
+ vma = find_vma_prev(mm, start, pprev);
-success:
- /*
- * vm_flags is protected by the mmap_sem held in write mode.
- * It's okay if try_to_unmap_one unmaps a page just after we
- * set VM_LOCKED, make_pages_present below will bring it back.
- */
- vma->vm_flags = newflags;
-
- /*
- * Keep track of amount of locked VM.
- */
- pages = (end - start) >> PAGE_SHIFT;
- if (newflags & VM_LOCKED) {
- pages = -pages;
- if (!(newflags & VM_IO))
- ret = make_pages_present(start, end);
+ /* If error happened,do rollback.
+ * Whether success or not,try to merge the vmas.
+ */
+ while( vma && vma->vm_end <= vmoff ){
+ if(vma->vm_wire_change) {
+ if(ret){
+ make_pages_unwired(mm, vma->vm_start, vma->vm_end);
+ vma->vm_flags &= ~VM_LOCKED;
+ }
+ vma->vm_wire_change =0;
+ }
+ next=vma->vm_next;
+ if(next && next->vm_wire_change) {
+ if(ret){
+ make_pages_unwired(mm, next->vm_start, next->vm_end);
+ next->vm_flags &= ~VM_LOCKED;
+ }
+ next->vm_wire_change=0;
+ }
+ *pprev=vma_merge(mm, *pprev, vma->vm_start, vma->vm_end, vma->vm_flags,
+ vma->anon_vma,vma->vm_file, vma->vm_pgoff, vma_policy(vma));
+ if(*pprev)
+ vma =*pprev;
+ vma =vma->vm_next;
}
- vma->vm_mm->locked_vm -= pages;
-out:
- if (ret == -ENOMEM)
- ret = -EAGAIN;
+ if(!ret)
+ mm->locked_vm += pages;
return ret;
}
-static int do_mlock(unsigned long start, size_t len, int on)
+
+static int do_munlock(unsigned long start, size_t len, unsigned int jump_hole)
{
- unsigned long nstart, end, tmp;
- struct vm_area_struct * vma, * prev;
- int error;
+ unsigned long end=0,vmoff=0;
+ unsigned long pages=0;
+ struct mm_struct *mm=current->mm;
+ struct vm_area_struct * vma, *prev, **pprev, *next;
+ int ret=0;
len = PAGE_ALIGN(len);
end = start + len;
@@ -86,38 +134,81 @@
vma = find_vma_prev(current->mm, start, &prev);
if (!vma || vma->vm_start > start)
return -ENOMEM;
+ else if (vma->vm_start < start){
+ prev=vma;
+ ret = split_vma(mm, prev, start, 0);
+ if(!ret)
+ vma=prev->vm_next;
+ else
+ return ret;
+ }
- if (start > vma->vm_start)
- prev = vma;
+ while(vma->vm_start < end){
+ vmoff =vma->vm_end;
+ if (vma->vm_end > end){
+ ret = split_vma(mm, vma, end, 0);
+ if (ret)
+ goto out;
+ }
- for (nstart = start ; ; ) {
- unsigned int newflags;
+ if(!(vma->vm_flags & VM_LOCKED))
+ goto next;
- /* Here we know that vma->vm_start <= nstart < vma->vm_end. */
+ vma->vm_wire_change=1;
+ pages += (vma->vm_end -vma->vm_start) >>PAGE_SHIFT;
- newflags = vma->vm_flags | VM_LOCKED;
- if (!on)
- newflags &= ~VM_LOCKED;
-
- tmp = vma->vm_end;
- if (tmp > end)
- tmp = end;
- error = mlock_fixup(vma, &prev, nstart, tmp, newflags);
- if (error)
- break;
- nstart = tmp;
- if (nstart < prev->vm_end)
- nstart = prev->vm_end;
- if (nstart >= end)
+next:
+ if(vma->vm_end ==end)
break;
+ prev =vma;
+ vma =vma->vm_next;
- vma = prev->vm_next;
- if (!vma || vma->vm_start != nstart) {
- error = -ENOMEM;
- break;
+ /* If called from munlockall,
+ * we may jump over holes.
+ */
+ if(jump_hole){
+ if(!vma)
+ goto out;
+ else
+ continue;
+ }
+ else if (!vma || (vma->vm_start != prev->vm_end) ){
+ ret= -ENOMEM;
+ goto out;
}
}
- return error;
+
+out:
+ pprev =&prev;
+ vma = find_vma_prev(current->mm, start, pprev);
+
+ while( vma && vma->vm_end <= vmoff ){
+ if(!ret && vma->vm_wire_change){
+ if (!(vma->vm_flags & VM_IO))
+ make_pages_unwired(mm, vma->vm_start, vma->vm_end);
+ vma->vm_flags &=~VM_LOCKED;
+ }
+ vma->vm_wire_change =0;
+ next = vma->vm_next;
+ if(next){
+ if(!ret && next->vm_wire_change){
+ if (!(next->vm_flags & VM_IO))
+ make_pages_unwired(mm, next->vm_start,next->vm_end);
+ next->vm_flags &=~VM_LOCKED;
+ }
+ next->vm_wire_change =0;
+ }
+ *pprev =vma_merge(mm, *pprev, vma->vm_start, vma->vm_end, vma->vm_flags,
+ vma->anon_vma,vma->vm_file, vma->vm_pgoff, vma_policy(vma));
+ if(*pprev)
+ vma =*pprev;
+ vma =vma->vm_next;
+ }
+
+ if(!ret)
+ mm->locked_vm -= pages;
+
+ return ret;
}
asmlinkage long sys_mlock(unsigned long start, size_t len)
@@ -141,7 +232,7 @@
/* check against resource limits */
if ((locked <= lock_limit) || capable(CAP_IPC_LOCK))
- error = do_mlock(start, len, 1);
+ error = do_mlock(start, len, 0);
up_write(&current->mm->mmap_sem);
return error;
}
@@ -153,33 +244,41 @@
down_write(&current->mm->mmap_sem);
len = PAGE_ALIGN(len + (start & ~PAGE_MASK));
start &= PAGE_MASK;
- ret = do_mlock(start, len, 0);
+ ret = do_munlock(start, len,0);
up_write(&current->mm->mmap_sem);
return ret;
}
static int do_mlockall(int flags)
{
- struct vm_area_struct * vma, * prev = NULL;
+ struct mm_struct *mm = current->mm;
+ struct vm_area_struct * vma;
unsigned int def_flags = 0;
+ unsigned long start;
+ int ret = 0;
if (flags & MCL_FUTURE)
def_flags = VM_LOCKED;
- current->mm->def_flags = def_flags;
+ mm->def_flags = def_flags;
if (flags == MCL_FUTURE)
goto out;
+ vma=mm->mmap;
+ start = vma->vm_start;
+ ret=do_mlock(start,TASK_SIZE,1);
+out:
+ return ret;
+}
- for (vma = current->mm->mmap; vma ; vma = prev->vm_next) {
- unsigned int newflags;
-
- newflags = vma->vm_flags | VM_LOCKED;
- if (!(flags & MCL_CURRENT))
- newflags &= ~VM_LOCKED;
+static int do_munlockall(void)
+{
+ struct mm_struct *mm = current->mm;
+ struct vm_area_struct * vma;
+ unsigned long start;
+
+ vma=mm->mmap;
+ start = vma->vm_start;
+ do_munlock(start,TASK_SIZE,1);
- /* Ignore errors */
- mlock_fixup(vma, &prev, vma->vm_start, vma->vm_end, newflags);
- }
-out:
return 0;
}
@@ -214,7 +313,7 @@
int ret;
down_write(&current->mm->mmap_sem);
- ret = do_mlockall(0);
+ ret = do_munlockall();
up_write(&current->mm->mmap_sem);
return ret;
}
diff -urN --exclude-from=./exclude.files linux-2.6.15/mm/mmap.c /home/backup/linux-2.6.15-release/mm/mmap.c
--- linux-2.6.15/mm/mmap.c 2006-02-17 05:24:09.000000000 -0500
+++ /home/backup/linux-2.6.15-release/mm/mmap.c 2006-03-06 06:30:08.000000000 -0500
@@ -1119,7 +1119,7 @@
vm_stat_account(mm, vm_flags, file, len >> PAGE_SHIFT);
if (vm_flags & VM_LOCKED) {
mm->locked_vm += len >> PAGE_SHIFT;
- make_pages_present(addr, addr + len);
+ make_pages_wired(addr, addr + len);
}
if (flags & MAP_POPULATE) {
up_write(&mm->mmap_sem);
@@ -1551,7 +1551,7 @@
if (!prev || expand_stack(prev, addr))
return NULL;
if (prev->vm_flags & VM_LOCKED) {
- make_pages_present(addr, prev->vm_end);
+ make_pages_wired(addr, prev->vm_end);
}
return prev;
}
@@ -1614,7 +1614,7 @@
if (expand_stack(vma, addr))
return NULL;
if (vma->vm_flags & VM_LOCKED) {
- make_pages_present(addr, start);
+ make_pages_wired(addr, start);
}
return vma;
}
@@ -1921,7 +1921,7 @@
mm->total_vm += len >> PAGE_SHIFT;
if (flags & VM_LOCKED) {
mm->locked_vm += len >> PAGE_SHIFT;
- make_pages_present(addr, addr + len);
+ make_pages_wired(addr, addr + len);
}
return addr;
}
diff -urN --exclude-from=./exclude.files linux-2.6.15/mm/mremap.c /home/backup/linux-2.6.15-release/mm/mremap.c
--- linux-2.6.15/mm/mremap.c 2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/mm/mremap.c 2006-03-06 06:30:08.000000000 -0500
@@ -230,7 +230,7 @@
if (vm_flags & VM_LOCKED) {
mm->locked_vm += new_len >> PAGE_SHIFT;
if (new_len > old_len)
- make_pages_present(new_addr + old_len,
+ make_pages_wired(new_addr + old_len,
new_addr + new_len);
}
@@ -367,7 +367,7 @@
vm_stat_account(mm, vma->vm_flags, vma->vm_file, pages);
if (vma->vm_flags & VM_LOCKED) {
mm->locked_vm += pages;
- make_pages_present(addr + old_len,
+ make_pages_wired(addr + old_len,
addr + new_len);
}
ret = addr;
diff -urN --exclude-from=./exclude.files linux-2.6.15/mm/nommu.c /home/backup/linux-2.6.15-release/mm/nommu.c
--- linux-2.6.15/mm/nommu.c 2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/mm/nommu.c 2006-03-06 06:30:08.000000000 -0500
@@ -124,7 +124,7 @@
* The nommu dodgy version :-)
*/
int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
- unsigned long start, int len, int write, int force,
+ unsigned long start, int len, int write, int force, int wire,
struct page **pages, struct vm_area_struct **vmas)
{
int i;
diff -urN --exclude-from=./exclude.files linux-2.6.15/mm/page_alloc.c /home/backup/linux-2.6.15-release/mm/page_alloc.c
--- linux-2.6.15/mm/page_alloc.c 2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/mm/page_alloc.c 2006-03-06 06:30:08.000000000 -0500
@@ -348,7 +348,8 @@
1 << PG_slab |
1 << PG_swapcache |
1 << PG_writeback |
- 1 << PG_reserved )))
+ 1 << PG_reserved |
+ 1 << PG_wired )))
bad_page(function, page);
if (PageDirty(page))
__ClearPageDirty(page);
@@ -481,6 +482,7 @@
1 << PG_private |
1 << PG_locked |
1 << PG_active |
+ 1 << PG_wired |
1 << PG_dirty |
1 << PG_reclaim |
1 << PG_slab |
@@ -1252,35 +1254,39 @@
EXPORT_SYMBOL(__mod_page_state);
void __get_zone_counts(unsigned long *active, unsigned long *inactive,
- unsigned long *free, struct pglist_data *pgdat)
+ unsigned long *wired,unsigned long *free, struct pglist_data *pgdat)
{
struct zone *zones = pgdat->node_zones;
int i;
*active = 0;
*inactive = 0;
+ *wired = 0;
*free = 0;
for (i = 0; i < MAX_NR_ZONES; i++) {
*active += zones[i].nr_active;
*inactive += zones[i].nr_inactive;
+ *wired += zones[i].nr_wired;
*free += zones[i].free_pages;
}
}
void get_zone_counts(unsigned long *active,
- unsigned long *inactive, unsigned long *free)
+ unsigned long *inactive, unsigned long *wired, unsigned long *free)
{
struct pglist_data *pgdat;
*active = 0;
*inactive = 0;
+ *wired = 0;
*free = 0;
for_each_pgdat(pgdat) {
- unsigned long l, m, n;
- __get_zone_counts(&l, &m, &n, pgdat);
+ unsigned long l, m, n, o;
+ __get_zone_counts(&l, &m, &n, &o, pgdat);
*active += l;
*inactive += m;
- *free += n;
+ *wired += n;
+ *free += o;
}
}
@@ -1328,6 +1334,7 @@
int cpu, temperature;
unsigned long active;
unsigned long inactive;
+ unsigned long wired;
unsigned long free;
struct zone *zone;
@@ -1358,16 +1365,17 @@
}
get_page_state(&ps);
- get_zone_counts(&active, &inactive, &free);
+ get_zone_counts(&active, &inactive, &wired, &free);
printk("Free pages: %11ukB (%ukB HighMem)\n",
K(nr_free_pages()),
K(nr_free_highpages()));
- printk("Active:%lu inactive:%lu dirty:%lu writeback:%lu "
+ printk("Active:%lu inactive:%lu wired:%lu dirty:%lu writeback:%lu "
"unstable:%lu free:%u slab:%lu mapped:%lu pagetables:%lu\n",
active,
inactive,
+ wired,
ps.nr_dirty,
ps.nr_writeback,
ps.nr_unstable,
@@ -1387,6 +1395,7 @@
" high:%lukB"
" active:%lukB"
" inactive:%lukB"
+ " wired:%lukB"
" present:%lukB"
" pages_scanned:%lu"
" all_unreclaimable? %s"
@@ -1398,6 +1407,7 @@
K(zone->pages_high),
K(zone->nr_active),
K(zone->nr_inactive),
+ K(zone->nr_wired),
K(zone->present_pages),
zone->pages_scanned,
(zone->all_unreclaimable ? "yes" : "no")
@@ -2009,10 +2019,12 @@
zone_pcp_init(zone);
INIT_LIST_HEAD(&zone->active_list);
INIT_LIST_HEAD(&zone->inactive_list);
+ INIT_LIST_HEAD(&zone->wired_list);
zone->nr_scan_active = 0;
zone->nr_scan_inactive = 0;
zone->nr_active = 0;
zone->nr_inactive = 0;
+ zone->nr_wired = 0;
atomic_set(&zone->reclaim_in_progress, 0);
if (!size)
continue;
@@ -2161,6 +2173,7 @@
"\n high %lu"
"\n active %lu"
"\n inactive %lu"
+ "\n wired %lu"
"\n scanned %lu (a: %lu i: %lu)"
"\n spanned %lu"
"\n present %lu",
@@ -2170,6 +2183,7 @@
zone->pages_high,
zone->nr_active,
zone->nr_inactive,
+ zone->nr_wired,
zone->pages_scanned,
zone->nr_scan_active, zone->nr_scan_inactive,
zone->spanned_pages,
diff -urN --exclude-from=./exclude.files linux-2.6.15/mm/readahead.c /home/backup/linux-2.6.15-release/mm/readahead.c
--- linux-2.6.15/mm/readahead.c 2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/mm/readahead.c 2006-03-06 06:30:08.000000000 -0500
@@ -564,8 +564,9 @@
{
unsigned long active;
unsigned long inactive;
+ unsigned long wired;
unsigned long free;
- __get_zone_counts(&active, &inactive, &free, NODE_DATA(numa_node_id()));
+ __get_zone_counts(&active, &inactive, &wired, &free, NODE_DATA(numa_node_id()));
return min(nr, (inactive + free) / 2);
}
diff -urN --exclude-from=./exclude.files linux-2.6.15/mm/rmap.c /home/backup/linux-2.6.15-release/mm/rmap.c
--- linux-2.6.15/mm/rmap.c 2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/mm/rmap.c 2006-03-07 06:17:57.000000000 -0500
@@ -449,6 +449,8 @@
struct anon_vma *anon_vma = vma->anon_vma;
BUG_ON(!anon_vma);
+ if((vma->vm_flags & VM_LOCKED) && !vma->vm_wire_change)
+ wire_page(page);
anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
page->mapping = (struct address_space *) anon_vma;
@@ -465,11 +467,13 @@
*
* The caller needs to hold the pte lock.
*/
-void page_add_file_rmap(struct page *page)
+void page_add_file_rmap(struct page *page, struct vm_area_struct *vma)
{
BUG_ON(PageAnon(page));
BUG_ON(!pfn_valid(page_to_pfn(page)));
+ if((vma->vm_flags & VM_LOCKED) && !vma->vm_wire_change)
+ wire_page(page);
if (atomic_inc_and_test(&page->_mapcount))
inc_page_state(nr_mapped);
}
@@ -480,8 +484,11 @@
*
* The caller needs to hold the pte lock.
*/
-void page_remove_rmap(struct page *page)
+void page_remove_rmap(struct page *page, struct vm_area_struct *vma)
{
+ if(PageWired(page) && (vma->vm_flags&VM_LOCKED))
+ unwire_page(page);
+
if (atomic_add_negative(-1, &page->_mapcount)) {
BUG_ON(page_mapcount(page) < 0);
/*
@@ -562,7 +569,7 @@
} else
dec_mm_counter(mm, file_rss);
- page_remove_rmap(page);
+ page_remove_rmap(page, vma);
page_cache_release(page);
out_unmap:
@@ -652,7 +659,7 @@
if (pte_dirty(pteval))
set_page_dirty(page);
- page_remove_rmap(page);
+ page_remove_rmap(page, vma);
page_cache_release(page);
dec_mm_counter(mm, file_rss);
(*mapcount)--;
@@ -712,8 +719,10 @@
list_for_each_entry(vma, &mapping->i_mmap_nonlinear,
shared.vm_set.list) {
- if (vma->vm_flags & VM_LOCKED)
- continue;
+
+ /* If VM_LOCKED set, the page will be moved to Wired list.*/
+ if (vma->vm_flags & VM_LOCKED)
+ continue;
cursor = (unsigned long) vma->vm_private_data;
if (cursor > max_nl_cursor)
max_nl_cursor = cursor;
diff -urN --exclude-from=./exclude.files linux-2.6.15/mm/swap.c /home/backup/linux-2.6.15-release/mm/swap.c
--- linux-2.6.15/mm/swap.c 2006-01-02 22:21:10.000000000 -0500
+++ /home/backup/linux-2.6.15-release/mm/swap.c 2006-03-07 11:45:37.000000000 -0500
@@ -110,6 +110,44 @@
spin_unlock_irq(&zone->lru_lock);
}
+/* Wire the page; if the page is in LRU,
+ * try move it to Wired list.
+ */
+void fastcall wire_page(struct page *page)
+{
+ struct zone *zone = page_zone(page);
+
+ spin_lock_irq(&zone->lru_lock);
+ page->wired_count ++;
+ if(!PageWired(page)){
+ if(PageLRU(page)){
+ del_page_from_lru(zone, page);
+ add_page_to_wired_list(zone,page);
+ SetPageWired(page);
+ }
+ }
+ spin_unlock_irq(&zone->lru_lock);
+}
+
+/* Unwire the page.
+ * If it isnt wired by any process, try move it to active list.
+ */
+void fastcall unwire_page(struct page *page)
+{
+ struct zone *zone = page_zone(page);
+
+ spin_lock_irq(&zone->lru_lock);
+ page->wired_count --;
+ if(!page->wired_count){
+ if(PageLRU(page) && TestClearPageWired(page)){
+ del_page_from_wired_list(zone,page);
+ add_page_to_active_list(zone,page);
+ SetPageActive(page);
+ }
+ }
+ spin_unlock_irq(&zone->lru_lock);
+}
+
/*
* Mark a page as having seen activity.
*
@@ -119,11 +157,13 @@
*/
void fastcall mark_page_accessed(struct page *page)
{
- if (!PageActive(page) && PageReferenced(page) && PageLRU(page)) {
- activate_page(page);
- ClearPageReferenced(page);
- } else if (!PageReferenced(page)) {
- SetPageReferenced(page);
+ if(!PageWired(page)) {
+ if (!PageActive(page) && PageReferenced(page) && PageLRU(page)) {
+ activate_page(page);
+ ClearPageReferenced(page);
+ } else if (!PageReferenced(page)) {
+ SetPageReferenced(page);
+ }
}
}
@@ -178,13 +218,15 @@
struct zone *zone = page_zone(page);
spin_lock_irqsave(&zone->lru_lock, flags);
- if (TestClearPageLRU(page))
- del_page_from_lru(zone, page);
- if (page_count(page) != 0)
- page = NULL;
+ if(!PageWired(page)) {
+ if (TestClearPageLRU(page))
+ del_page_from_lru(zone, page);
+ if (page_count(page) != 0)
+ page = NULL;
+ if (page)
+ free_hot_page(page);
+ }
spin_unlock_irqrestore(&zone->lru_lock, flags);
- if (page)
- free_hot_page(page);
}
EXPORT_SYMBOL(__page_cache_release);
@@ -214,7 +256,8 @@
if (!put_page_testzero(page))
continue;
-
+ if(PageWired(page))
+ continue;
pagezone = page_zone(page);
if (pagezone != zone) {
if (zone)
@@ -301,7 +344,12 @@
}
if (TestSetPageLRU(page))
BUG();
- add_page_to_inactive_list(zone, page);
+ if(!page->wired_count)
+ add_page_to_inactive_list(zone, page);
+ else {
+ SetPageWired(page);
+ add_page_to_wired_list(zone,page);
+ }
}
if (zone)
spin_unlock_irq(&zone->lru_lock);
@@ -330,7 +378,12 @@
BUG();
if (TestSetPageActive(page))
BUG();
- add_page_to_active_list(zone, page);
+ if(!page->wired_count)
+ add_page_to_active_list(zone, page);
+ else{
+ SetPageWired(page);
+ add_page_to_wired_list(zone,page);
+ }
}
if (zone)
spin_unlock_irq(&zone->lru_lock);
* Re: [PATCH][0/8] (Targeting 2.6.17) Posix memory locking and balanced mlock-LRU semantic
2006-03-20 13:35 [PATCH][0/8] (Targeting 2.6.17) Posix memory locking and balanced mlock-LRU semantic Stone Wang
@ 2006-03-20 13:41 ` Arjan van de Ven
2006-03-20 23:52 ` Nate Diller
2006-03-20 17:27 ` Christoph Lameter
` (2 subsequent siblings)
3 siblings, 1 reply; 14+ messages in thread
From: Arjan van de Ven @ 2006-03-20 13:41 UTC (permalink / raw)
To: Stone Wang; +Cc: akpm, linux-kernel, linux-mm
> 1. Posix mlock/munlock/mlockall/munlockall.
> Bring mlock/munlock/mlockall/munlockall in line with the Posix definition:
> transaction-like, just as described in the mlock(2)/munlock(2)/mlockall(2)/
> munlockall(2) manpages. Thus users of the mlock system call family always
> have a clear map of their mlocked areas.
> 2. More consistent LRU semantics in memory management.
> Mlocked pages are placed on a separate LRU list: the Wired list.
please give this a more logical name, such as mlocked list or pinned
list
* Re: [PATCH][0/8] (Targeting 2.6.17) Posix memory locking and balanced mlock-LRU semantic
2006-03-20 13:41 ` Arjan van de Ven
@ 2006-03-20 23:52 ` Nate Diller
2006-03-21 7:10 ` Arjan van de Ven
2006-03-21 12:24 ` Nick Piggin
0 siblings, 2 replies; 14+ messages in thread
From: Nate Diller @ 2006-03-20 23:52 UTC (permalink / raw)
To: Arjan van de Ven; +Cc: Stone Wang, akpm, linux-kernel, linux-mm
On 3/20/06, Arjan van de Ven <arjan@infradead.org> wrote:
> > 1. Posix mlock/munlock/mlockall/munlockall.
> > Bring mlock/munlock/mlockall/munlockall in line with the Posix definition:
> > transaction-like, just as described in the mlock(2)/munlock(2)/mlockall(2)/
> > munlockall(2) manpages. Thus users of the mlock system call family always
> > have a clear map of their mlocked areas.
> > 2. More consistent LRU semantics in memory management.
> > Mlocked pages are placed on a separate LRU list: the Wired list.
>
> please give this a more logical name, such as mlocked list or pinned
> list
Shaoping, thanks for doing this work, it is something I have been
thinking about for the past few weeks. It's especially nice to be
able to see how many pages are pinned in this manner.
Might I suggest calling it the long_term_pinned list? It also might
be worth putting ramdisk pages on this list, since they cannot be
written out in response to memory pressure. This would eliminate the
need for AOP_WRITEPAGE_ACTIVATE.
NATE
* Re: [PATCH][0/8] (Targeting 2.6.17) Posix memory locking and balanced mlock-LRU semantic
2006-03-20 23:52 ` Nate Diller
@ 2006-03-21 7:10 ` Arjan van de Ven
2006-03-21 12:24 ` Nick Piggin
1 sibling, 0 replies; 14+ messages in thread
From: Arjan van de Ven @ 2006-03-21 7:10 UTC (permalink / raw)
To: Nate Diller; +Cc: Stone Wang, akpm, linux-kernel, linux-mm
On Mon, 2006-03-20 at 15:52 -0800, Nate Diller wrote:
> On 3/20/06, Arjan van de Ven <arjan@infradead.org> wrote:
> > > 1. Posix mlock/munlock/mlockall/munlockall.
> > > Bring mlock/munlock/mlockall/munlockall in line with the Posix definition:
> > > transaction-like, just as described in the mlock(2)/munlock(2)/mlockall(2)/
> > > munlockall(2) manpages. Thus users of the mlock system call family always
> > > have a clear map of their mlocked areas.
> > > 2. More consistent LRU semantics in memory management.
> > > Mlocked pages are placed on a separate LRU list: the Wired list.
> >
> > please give this a more logical name, such as mlocked list or pinned
> > list
>
> Shaoping, thanks for doing this work, it is something I have been
> thinking about for the past few weeks. It's especially nice to be
> able to see how many pages are pinned in this manner.
>
> Might I suggest calling it the long_term_pinned list? It also might
> be worth putting ramdisk pages on this list, since they cannot be
> written out in response to memory pressure. This would eliminate the
> need for AOP_WRITEPAGE_ACTIVATE.
I like that idea
* Re: [PATCH][0/8] (Targeting 2.6.17) Posix memory locking and balanced mlock-LRU semantic
2006-03-20 23:52 ` Nate Diller
2006-03-21 7:10 ` Arjan van de Ven
@ 2006-03-21 12:24 ` Nick Piggin
1 sibling, 0 replies; 14+ messages in thread
From: Nick Piggin @ 2006-03-21 12:24 UTC (permalink / raw)
To: Nate Diller; +Cc: Arjan van de Ven, Stone Wang, akpm, linux-kernel, linux-mm
Nate Diller wrote:
> Might I suggest calling it the long_term_pinned list? It also might
> be worth putting ramdisk pages on this list, since they cannot be
> written out in response to memory pressure. This would eliminate the
> need for AOP_WRITEPAGE_ACTIVATE.
>
They are for the ram filesystem, btw, and I don't think you can eliminate
AOP_WRITEPAGE_ACTIVATE, because it is needed for a number of reasons
(running out of swap space being one).
--
SUSE Labs, Novell Inc.
* Re: [PATCH][0/8] (Targeting 2.6.17) Posix memory locking and balanced mlock-LRU semantic
2006-03-20 13:35 [PATCH][0/8] (Targeting 2.6.17) Posix memory locking and balanced mlock-LRU semantic Stone Wang
2006-03-20 13:41 ` Arjan van de Ven
@ 2006-03-20 17:27 ` Christoph Lameter
2006-03-21 5:23 ` Stone Wang
` (2 more replies)
2006-03-21 12:20 ` Nick Piggin
2006-03-24 14:36 ` Andi Kleen
3 siblings, 3 replies; 14+ messages in thread
From: Christoph Lameter @ 2006-03-20 17:27 UTC (permalink / raw)
To: Stone Wang; +Cc: akpm, linux-kernel, linux-mm
On Mon, 20 Mar 2006, Stone Wang wrote:
> 2. More consistent LRU semantics in Memory Management.
> Mlocked pages is placed on a separate LRU list: Wired List.
> The pages dont take part in LRU algorithms,for they could never be swapped,
> until munlocked.
This also implies that dirty bits of the pte for mlocked pages are never
checked.
Currently light swapping (which is very common) will scan over all pages
and move the dirty bits from the pte into struct page. This may take
a while, but at least at some point we will write out dirtied pages.
The result of not scanning mlocked pages will be that mmapped files will
not be updated unless either the process terminates or msync() is called.
* Re: [PATCH][0/8] (Targeting 2.6.17) Posix memory locking and balanced mlock-LRU semantic
2006-03-20 17:27 ` Christoph Lameter
@ 2006-03-21 5:23 ` Stone Wang
2006-03-21 15:20 ` Stone Wang
2006-03-24 4:45 ` Rik van Riel
2 siblings, 0 replies; 14+ messages in thread
From: Stone Wang @ 2006-03-21 5:23 UTC (permalink / raw)
To: Christoph Lameter; +Cc: akpm, linux-kernel, linux-mm
I will check and fix it.
2006/3/20, Christoph Lameter <clameter@sgi.com>:
> On Mon, 20 Mar 2006, Stone Wang wrote:
>
> > 2. More consistent LRU semantics in Memory Management.
> > Mlocked pages are placed on a separate LRU list: Wired List.
> > The pages don't take part in LRU algorithms, for they could never be swapped,
> > until munlocked.
>
> This also implies that dirty bits of the pte for mlocked pages are never
> checked.
>
> Currently light swapping (which is very common) will scan over all pages
> and move the dirty bits from the pte into struct page. This may take
> awhile but at least at some point we will write out dirtied pages.
>
> The result of not scanning mlocked pages will be that mmapped files will
> not be updated unless either the process terminates or msync() is called.
>
>
* Re: [PATCH][0/8] (Targeting 2.6.17) Posix memory locking and balanced mlock-LRU semantic
2006-03-20 17:27 ` Christoph Lameter
2006-03-21 5:23 ` Stone Wang
@ 2006-03-21 15:20 ` Stone Wang
2006-03-24 4:45 ` Rik van Riel
2 siblings, 0 replies; 14+ messages in thread
From: Stone Wang @ 2006-03-21 15:20 UTC (permalink / raw)
To: Christoph Lameter; +Cc: akpm, linux-kernel, linux-mm
Checked: mlocked pages don't take part in swap writeback,
unlike normal mmapped pages:
linux-2.6.16/mm/rmap.c
try_to_unmap_one()
603 if ((vma->vm_flags & VM_LOCKED) ||
604 (ptep_clear_flush_young(vma, address, pte)
605 && !ignore_refs)) {
606 ret = SWAP_FAIL;
607 goto out_unmap;
608 }
609
610 /* Nuke the page table entry. */
611 flush_cache_page(vma, address, page_to_pfn(page));
612 pteval = ptep_clear_flush(vma, address, pte);
613
614 /* Move the dirty bit to the physical page now the pte is gone. */
615 if (pte_dirty(pteval))
616 set_page_dirty(page);
For a VM_LOCKED page, it bails out (line 607) without calling set_page_dirty() (line 616).
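One conceivable fix for Christoph's concern (a hypothetical, untested
sketch, not part of the posted series) would be to split the VM_LOCKED test
out of the combined condition at lines 603-605 and move the dirty bit over
before bailing out:

        if (vma->vm_flags & VM_LOCKED) {
                /* hypothetical: propagate the hardware dirty bit to the
                 * struct page before skipping the locked page, so that
                 * periodic writeback can still see it */
                if (pte_dirty(*pte))
                        set_page_dirty(page);
                ret = SWAP_FAIL;
                goto out_unmap;
        }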
2006/3/20, Christoph Lameter <clameter@sgi.com>:
> On Mon, 20 Mar 2006, Stone Wang wrote:
>
> > 2. More consistent LRU semantics in Memory Management.
> > Mlocked pages are placed on a separate LRU list: Wired List.
> > The pages don't take part in LRU algorithms, for they could never be swapped,
> > until munlocked.
>
> This also implies that dirty bits of the pte for mlocked pages are never
> checked.
>
> Currently light swapping (which is very common) will scan over all pages
> and move the dirty bits from the pte into struct page. This may take
> awhile but at least at some point we will write out dirtied pages.
>
> The result of not scanning mlocked pages will be that mmapped files will
> not be updated unless either the process terminates or msync() is called.
>
>
* Re: [PATCH][0/8] (Targeting 2.6.17) Posix memory locking and balanced mlock-LRU semantic
2006-03-20 17:27 ` Christoph Lameter
2006-03-21 5:23 ` Stone Wang
2006-03-21 15:20 ` Stone Wang
@ 2006-03-24 4:45 ` Rik van Riel
2 siblings, 0 replies; 14+ messages in thread
From: Rik van Riel @ 2006-03-24 4:45 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Stone Wang, akpm, linux-kernel, linux-mm
On Mon, 20 Mar 2006, Christoph Lameter wrote:
> The result of not scanning mlocked pages will be that mmapped files will
> not be updated unless either the process terminates or msync() is called.
That's ok. Light swapping on a system with non-mlocked
mmapped pages has the same result, since we won't scan
mapped pages most of the time...
--
All Rights Reversed
* Re: [PATCH][0/8] (Targeting 2.6.17) Posix memory locking and balanced mlock-LRU semantic
2006-03-20 13:35 [PATCH][0/8] (Targeting 2.6.17) Posix memory locking and balanced mlock-LRU semantic Stone Wang
2006-03-20 13:41 ` Arjan van de Ven
2006-03-20 17:27 ` Christoph Lameter
@ 2006-03-21 12:20 ` Nick Piggin
2006-03-24 15:05 ` Stone Wang
2006-03-24 14:36 ` Andi Kleen
3 siblings, 1 reply; 14+ messages in thread
From: Nick Piggin @ 2006-03-21 12:20 UTC (permalink / raw)
To: Stone Wang; +Cc: akpm, linux-kernel, linux-mm
Stone Wang wrote:
> Both a friend of mine (who is working on a DBMS derived from
> PostgreSQL) and I have encountered unexpected OOMs with mlock/mlockall.
>
I'm not sure this is a great idea. There are more conditions than just
mlock that prevent pages being reclaimed: running out of swap, for example,
or having no swap at all, or a page being temporarily pinned (in other
words -- anything from fleeting to permanent). I think something _much_
simpler and more general could be done, just teaching the VM to tolerate
these pages a bit better.
Also, supposing we do want this, I think there is a fairly significant
queue of mm stuff you need to line up behind... it is probably asking
too much to target 2.6.17 for such a significant change in any case.
But despite all that I looked through and have a few comments ;)
Kudos for jumping in and getting your hands dirty! It can be tricky code.
> The patch brings Linux with:
> 1. Posix mlock/munlock/mlockall/munlockall.
> Get mlock/munlock/mlockall/munlockall to the Posix definition: transaction-like,
> just as described in the manpage(2) of mlock/munlock/mlockall/munlockall.
> Thus users of the mlock system call series will always have a clear map of
> mlocked areas.
In what way are we not posix compliant now?
--
SUSE Labs, Novell Inc.
* Re: [PATCH][0/8] (Targeting 2.6.17) Posix memory locking and balanced mlock-LRU semantic
2006-03-21 12:20 ` Nick Piggin
@ 2006-03-24 15:05 ` Stone Wang
2006-03-24 16:57 ` Nick Piggin
0 siblings, 1 reply; 14+ messages in thread
From: Stone Wang @ 2006-03-24 15:05 UTC (permalink / raw)
To: Nick Piggin; +Cc: akpm, linux-kernel, linux-mm
2006/3/21, Nick Piggin <nickpiggin@yahoo.com.au>:
> Stone Wang wrote:
> > Both a friend of mine (who is working on a DBMS derived from
> > PostgreSQL) and I have encountered unexpected OOMs with mlock/mlockall.
> >
>
> I'm not sure this is a great idea. There are more conditions than just
> mlock that prevent pages being reclaimed: running out of swap, for
> example, or having no swap at all, or a page being temporarily pinned
> (in other words -- anything from fleeting to permanent). I think
> something _much_ simpler and more general could be done, just teaching
> the VM to tolerate these pages a bit better.
>
> Also, supposing we do want this, I think there is a fairly significant
> queue of mm stuff you need to line up behind... it is probably asking
> too much to target 2.6.17 for such a significant change in any case.
>
> But despite all that I looked through and have a few comments ;)
> Kudos for jumping in and getting your hands dirty! It can be tricky code.
>
> > The patch brings Linux with:
> > 1. Posix mlock/munlock/mlockall/munlockall.
> > Get mlock/munlock/mlockall/munlockall to the Posix definition: transaction-like,
> > just as described in the manpage(2) of mlock/munlock/mlockall/munlockall.
> > Thus users of the mlock system call series will always have a clear map of
> > mlocked areas.
>
> In what way are we not posix compliant now?
Currently, Linux's mlock(), for example, may fail with only part of its
work done. According to the POSIX definition, however:
man mlock(2)
"
RETURN VALUE
On success, mlock returns zero. On error, -1 is returned, errno is set
appropriately, and no changes are made to any locks in the address
space of the process.
"
Shaoping Wang
* Re: [PATCH][0/8] (Targeting 2.6.17) Posix memory locking and balanced mlock-LRU semantic
2006-03-24 15:05 ` Stone Wang
@ 2006-03-24 16:57 ` Nick Piggin
0 siblings, 0 replies; 14+ messages in thread
From: Nick Piggin @ 2006-03-24 16:57 UTC (permalink / raw)
To: Stone Wang; +Cc: akpm, linux-kernel, linux-mm
Stone Wang wrote:
> 2006/3/21, Nick Piggin <nickpiggin@yahoo.com.au>:
>>In what way are we not posix compliant now?
>
>
> Currently, Linux's mlock(), for example, may fail with only part of its
> work done.
>
> According to the POSIX definition, however:
>
> man mlock(2)
>
> "
> RETURN VALUE
> On success, mlock returns zero. On error, -1 is returned, errno is set
> appropriately, and no changes are made to any locks in the address
> space of the process.
> "
>
Looks like you're right, so good catch. You should probably try to submit your
posix mlock patch by itself then. Make sure you look at the coding standards
though, and try to _really_ follow the coding conventions of the file you're
modifying.
You should also make sure the patch works standalone (i.e. not just as part of
a set). Oh, and introducing a new field in vma for a flag is probably not the
best option if you still have room in the vm_flags field.
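That is, rather than growing struct vm_area_struct, the state can live in
a spare vm_flags bit. A sketch (the flag value here is hypothetical -- it
would have to be a bit actually unused in the target tree):

        /* hypothetical: pick a bit not already taken in vm_flags */
        #define VM_POSIX_MLOCKED        0x01000000UL

        static inline void mark_posix_mlocked(struct vm_area_struct *vma)
        {
                vma->vm_flags |= VM_POSIX_MLOCKED;      /* no new vma field */
        }

        static inline int vma_posix_mlocked(struct vm_area_struct *vma)
        {
                return !!(vma->vm_flags & VM_POSIX_MLOCKED);
        }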
And the patch changelog should contain the actual problem, and quote the
relevant part of the POSIX definition, if applicable.
Thanks,
Nick
--
SUSE Labs, Novell Inc.
* Re: [PATCH][0/8] (Targeting 2.6.17) Posix memory locking and balanced mlock-LRU semantic
2006-03-20 13:35 [PATCH][0/8] (Targeting 2.6.17) Posix memory locking and balanced mlock-LRU semantic Stone Wang
` (2 preceding siblings ...)
2006-03-21 12:20 ` Nick Piggin
@ 2006-03-24 14:36 ` Andi Kleen
2006-03-24 14:54 ` Stone Wang
3 siblings, 1 reply; 14+ messages in thread
From: Andi Kleen @ 2006-03-24 14:36 UTC (permalink / raw)
To: Stone Wang; +Cc: linux-kernel, linux-mm
"Stone Wang" <pwstone@gmail.com> writes:
> mlocked areas.
> 2. More consistent LRU semantics in Memory Management.
> Mlocked pages are placed on a separate LRU list: Wired List.
If it's mlocked, why don't you just call it the Mlocked list?
Does strange jargon make the patch cooler? Also in meminfo.
-Andi
* Re: [PATCH][0/8] (Targeting 2.6.17) Posix memory locking and balanced mlock-LRU semantic
2006-03-24 14:36 ` Andi Kleen
@ 2006-03-24 14:54 ` Stone Wang
0 siblings, 0 replies; 14+ messages in thread
From: Stone Wang @ 2006-03-24 14:54 UTC (permalink / raw)
To: Andi Kleen; +Cc: linux-kernel, linux-mm
I am preparing a patch for 2.6.16, replacing the name "wired" with "pinned".
Potentially, the list could be used for more purposes than just mlocked pages.
Shaoping Wang
24 Mar 2006 15:36:46 +0100, Andi Kleen <ak@suse.de>:
> "Stone Wang" <pwstone@gmail.com> writes:
> > mlocked areas.
> > 2. More consistent LRU semantics in Memory Management.
> > Mlocked pages are placed on a separate LRU list: Wired List.
>
> If it's mlocked, why don't you just call it the Mlocked list?
> Does strange jargon make the patch cooler? Also in meminfo.
>
> -Andi
>
Thread overview: 14+ messages
2006-03-20 13:35 [PATCH][0/8] (Targeting 2.6.17) Posix memory locking and balanced mlock-LRU semantic Stone Wang
2006-03-20 13:41 ` Arjan van de Ven
2006-03-20 23:52 ` Nate Diller
2006-03-21 7:10 ` Arjan van de Ven
2006-03-21 12:24 ` Nick Piggin
2006-03-20 17:27 ` Christoph Lameter
2006-03-21 5:23 ` Stone Wang
2006-03-21 15:20 ` Stone Wang
2006-03-24 4:45 ` Rik van Riel
2006-03-21 12:20 ` Nick Piggin
2006-03-24 15:05 ` Stone Wang
2006-03-24 16:57 ` Nick Piggin
2006-03-24 14:36 ` Andi Kleen
2006-03-24 14:54 ` Stone Wang