Re: Large memory system

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* Re: Large memory system
@ 1999-02-08 20:33 Manfred Spraul
  1999-02-10 14:25 ` Stephen C. Tweedie
  0 siblings, 1 reply; 11+ messages in thread
From: Manfred Spraul @ 1999-02-08 20:33 UTC (permalink / raw)
  To: Stephen C. Tweedie, Benjamin C.R. LaHaise; +Cc: Daniel Blakeley, linux-mm

[-- Attachment #1: Type: text/plain, Size: 1819 bytes --]

Stephen C. Tweedie wrote:
>
> Hi,
>
> On Sat, 30 Jan 1999 12:00:53 -0500 (EST), "Benjamin C.R. LaHaise"
> <blah@kvack.org> said:
>
> > Easily isn't a good way of putting it, unless you're talking about doing
> > something like mmap on /dev/mem, in which case you could make the
> > user/kernel virtual spilt weigh heavy on the user side and do memory
> > allocation yourself.  If you're talking about doing it transparently,
> > you're best bet is to do something like davem's suggested high mem
> > approach, and only use non-kernel mapped memory for user pages... if you
> > want to be able to support the page cache in high memory, things get
> > messy.
>
> No it doesn't!  The only tricky thing is IO, but we need to have bounce
> buffers to high memory anyway for swapping.  The page cache uses "struct
> page" addresses in preference to actual page data pointers almost
> everywhere anyway, and whenever we are doing something like read(2) or
> write(2) functions, we just need a single per-CPU virtual pte in the
> vmalloc region to temporarily map the page into memory while we copy to
> user space (and remember that we do this from the context of the user
> process anyway, so we don't have to remap the user page even if it is in
> high memory).
>
There is another possibility if you want to extend the page cache:
Add a 'second level cache':
if shrink_mmap wants to discard a page from the page cache, it is saved
in the physical memory cache.
if __find_page() can't find a page in the normal cache, it checks if it
is in the physical memory cache. If so, the entry is copied into the
normal cache. You only have to modify three or four lines in filemap.c &
pagemap.h.

I've attached a patch that extends the page cache, but it's incomplete:
there is no way to configure the cache, and it's ugly.

Manfred


[-- Attachment #2: patch2 --]
[-- Type: application/octet-stream, Size: 32193 bytes --]

diff -u -r -P 2.2.1/Documentation/Configure.help current/Documentation/Configure.help
--- 2.2.1/Documentation/Configure.help	Wed Jan 20 20:05:32 1999
+++ current/Documentation/Configure.help	Sun Feb  7 20:37:05 1999
@@ -229,6 +229,22 @@
 
   Most users will answer N here.
 
+Hugeram Ramdisk
+CONFIG_BLK_DEV_HUGERAMD
+  Saying Y here makes it possible to use the memory above 1 gigabyte
+  as a ramdisk. This only works if you have enabled CONFIG_HUGEMEM.
+  
+  The ramdisk can be accessed through block special files
+  /dev/hugeram0 ... /dev/hugeram7, with major number 126 and
+  minor numbers 0..7 (do "man mknod" for help how to create them)
+  
+  If you want to compile this driver as a module ( = code which can be
+  inserted in and removed from the running kernel whenever you want),
+  say M here and read Documentation/modules.txt. The module will be
+  called hugeramd.o.
+  
+  Most users will answer N here.
+
 Network Block Device support
 CONFIG_BLK_DEV_NBD
   Saying Y here will allow your computer to be a client for network
@@ -8366,6 +8382,25 @@
   just add about 3k to your kernel.
 
   See Documentation/mtrr.txt for more information.
+
+Huge memory support
+CONFIG_HUGEMEM
+  Enables kernel support for really huge amounts of RAM (beyond 1 Gigabyte)
+  
+  Linux currently supports up to 960 MB of RAM on Intel and compatible
+  computers.
+  If you have more memory, there are two things you can do:
+  
+  1: Increase manually the limit to 1984MB.
+     Read the note in include/linux/page.h to do this.
+  2: Use the RAM above the 960MB limit (or above 1984MB if you applied 1:)
+     for special devices such as a HugeRamD ramdisk.
+  
+  Saying Y here makes it possible to use the solution 2:
+  use the RAM above the limit as a ramdisk.
+  
+  Note that to actually enable this ramdisk, you must also say Y to the
+  "Huge Ramdisk support" below.
 
 Main CPU frequency, only for DEC alpha machine
 CONFIG_FT_ALPHA_CLOCK
diff -u -r -P 2.2.1/arch/i386/config.in current/arch/i386/config.in
--- 2.2.1/arch/i386/config.in	Wed Jan 20 19:18:53 1999
+++ current/arch/i386/config.in	Fri Feb  5 18:38:29 1999
@@ -36,6 +36,7 @@
 bool 'Math emulation' CONFIG_MATH_EMULATION
 bool 'MTRR (Memory Type Range Register) support' CONFIG_MTRR
 bool 'Symmetric multi-processing support' CONFIG_SMP
+bool 'Huge memory support' CONFIG_HUGEMEM
 endmenu
 
 mainmenu_option next_comment
diff -u -r -P 2.2.1/arch/i386/kernel/Makefile current/arch/i386/kernel/Makefile
--- 2.2.1/arch/i386/kernel/Makefile	Wed Jan 20 19:18:53 1999
+++ current/arch/i386/kernel/Makefile	Sun Feb  7 00:27:30 1999
@@ -26,6 +26,10 @@
 O_OBJS += mca.o
 endif
 
+ifdef CONFIG_HUGEMEM
+O_OBJS += hugemem.o
+endif
+
 ifeq ($(CONFIG_MTRR),y)
 OX_OBJS += mtrr.o
 else
diff -u -r -P 2.2.1/arch/i386/kernel/hugemem.c current/arch/i386/kernel/hugemem.c
--- 2.2.1/arch/i386/kernel/hugemem.c	Thu Jan  1 01:00:00 1970
+++ current/arch/i386/kernel/hugemem.c	Sun Feb  7 22:05:06 1999
@@ -0,0 +1,158 @@
+/*
+ * linux/arch/i386/kernel/hugemem.c
+ *
+ *   Written 1999 by Manfred Spraul <masp0008@stud.uni-sb.de>
+ */
+
+#include <linux/kernel.h>
+#include <linux/malloc.h>
+#include <linux/sched.h>
+#include <linux/module.h>
+#include <asm/hugemem.h>
+#include <asm/spinlock.h>
+
+/*
+	global variables:
+ */
+
+int hugemem_startpg;    /* number of the first page managed by hugemem */
+int hugemem_len = 0;
+unsigned char* hugemem_bitmap = NULL;
+
+spinlock_t hugemem_lock = SPIN_LOCK_UNLOCKED;
+
+/*
+	internal functions:
+ */
+
+int hm_init(void);
+
+/*
+  This function is called very early, it cannot call
+  any other kernel function.
+  Defer the actual initialization.
+ */
+ 
+void init_hugemem(int memstart, int memend)
+{
+	memstart = (memstart + PAGE_SIZE-1) & PAGE_MASK;
+	memend = memend & PAGE_MASK;
+
+	hugemem_startpg = memstart/PAGE_SIZE;
+	hugemem_len = (memend-memstart)/PAGE_SIZE/HM_PAGES_PER_BIT;
+}
+
+int alloc_hugemem(int size)
+{
+	int result;
+	int i, missing;
+	
+	if(hugemem_len == 0)
+		return -1;
+
+	if(size == 0)
+		return -1;
+	
+	spin_lock(&hugemem_lock);
+
+	if(hugemem_bitmap == NULL) {
+		if(!hm_init()) {
+			spin_unlock(&hugemem_lock);
+			return -1;
+		}
+	}
+	
+	size = (size+HM_PAGES_PER_BIT-1)/HM_PAGES_PER_BIT;
+	
+	result = -1;
+	missing = size;
+	for(i=0;i<hugemem_len;i++) {
+		if( (hugemem_bitmap[i/8] & (0x80>>(i&0x07))) == 0) {
+			if(result==-1) {
+				missing = size;
+				result = i;
+			}
+			missing--;
+			if(missing == 0)
+				break;
+		} else {
+			result = -1;
+		}
+	}
+	if(missing != 0)
+	{
+		spin_unlock(&hugemem_lock);
+		return -1;
+	}
+	for(i=result;i<result+size;i++)
+		hugemem_bitmap[i/8] |= 0x80>>(i&0x07);
+	
+	spin_unlock(&hugemem_lock);
+	return result*HM_PAGES_PER_BIT+hugemem_startpg;
+}
+
+void free_hugemem(int startpg, int size)
+{
+	startpg -= hugemem_startpg;
+	size = (size+HM_PAGES_PER_BIT-1)/HM_PAGES_PER_BIT;
+
+	if((size > hugemem_len) || (size==0) || (startpg&(HM_PAGES_PER_BIT-1))) {
+		printk(KERN_DEBUG "free_hugemem: invalid parameter.");
+		return;
+	}
+	if(hugemem_bitmap == NULL)
+		return;
+
+	spin_lock(&hugemem_lock);
+	
+	startpg /= HM_PAGES_PER_BIT;
+	while(size != 0) {
+		hugemem_bitmap[startpg/8] &= ~(0x80>>(startpg & 0x07));
+		startpg++;
+		size--;
+	}
+	spin_unlock(&hugemem_lock);
+	return;
+}
+
+/* pmax can be NULL, returns the size of the largest free area. */
+int getfree_hugemem(int* pmax)
+{
+	int i, found, max;
+
+	spin_lock(&hugemem_lock);
+
+	if(hugemem_bitmap == NULL) {
+		if(!hm_init()) {
+			spin_unlock(&hugemem_lock);
+			return -1;
+		}
+	}
+		
+	if(pmax != NULL)
+		*pmax = hugemem_len*HM_PAGES_PER_BIT;
+	found = 0;
+	max = 0;
+	
+	for(i=0;i<hugemem_len;i++) {
+		if( (hugemem_bitmap[i/8] & (0x80>>(i&0x07))) == 0) {
+			found++;
+			if(found > max)
+				max = found;
+		} else {
+			found = 0;;
+		}
+	}
+	spin_unlock(&hugemem_lock);
+	return max*HM_PAGES_PER_BIT;
+}
+
+int hm_init(void)
+{
+	hugemem_bitmap = kmalloc((hugemem_len+7)/8,GFP_ATOMIC);
+	if(hugemem_bitmap == NULL)
+		return 0;
+	memset(hugemem_bitmap,0,(hugemem_len+7)/8);
+	return 1;
+}
+
diff -u -r -P 2.2.1/arch/i386/kernel/i386_ksyms.c current/arch/i386/kernel/i386_ksyms.c
--- 2.2.1/arch/i386/kernel/i386_ksyms.c	Tue Jan 19 20:02:59 1999
+++ current/arch/i386/kernel/i386_ksyms.c	Fri Feb  5 18:38:29 1999
@@ -17,6 +17,7 @@
 #include <asm/hardirq.h>
 #include <asm/delay.h>
 #include <asm/irq.h>
+#include <asm/hugemem.h>
 
 extern void dump_thread(struct pt_regs *, struct user *);
 extern int dump_fpu(elf_fpregset_t *);
@@ -40,6 +41,7 @@
 EXPORT_SYMBOL(enable_irq);
 EXPORT_SYMBOL(disable_irq);
 EXPORT_SYMBOL(kernel_thread);
+EXPORT_SYMBOL(init_mm);
 
 EXPORT_SYMBOL_NOVERS(__down_failed);
 EXPORT_SYMBOL_NOVERS(__down_failed_interruptible);
@@ -109,4 +111,9 @@
 
 #ifdef CONFIG_VT
 EXPORT_SYMBOL(screen_info);
+#endif
+
+#ifdef CONFIG_HUGEMEM
+EXPORT_SYMBOL(alloc_hugemem);
+EXPORT_SYMBOL(free_hugemem);
 #endif
diff -u -r -P 2.2.1/arch/i386/kernel/setup.c current/arch/i386/kernel/setup.c
--- 2.2.1/arch/i386/kernel/setup.c	Thu Jan 21 20:28:40 1999
+++ current/arch/i386/kernel/setup.c	Mon Feb  8 20:17:01 1999
@@ -32,6 +32,9 @@
 #ifdef CONFIG_BLK_DEV_RAM
 #include <linux/blk.h>
 #endif
+#ifdef CONFIG_HUGEMEM
+#include <asm/hugemem.h>
+#endif
 #include <asm/processor.h>
 #include <linux/console.h>
 #include <asm/uaccess.h>
@@ -327,11 +330,20 @@
 	*to = '\0';
 	*cmdline_p = command_line;
 
+#ifdef CONFIG_HUGEMEM
+#define VMALLOC_RESERVE	(96<<20)	/* more memory for vmalloc */
+#else
 #define VMALLOC_RESERVE	(64 << 20)	/* 64MB for vmalloc */
+#endif
 #define MAXMEM	((unsigned long)(-PAGE_OFFSET-VMALLOC_RESERVE))
-
+/* FIXME: debug code. */
+#undef MAXMEM
+#define MAXMEM (32 << 20)
 	if (memory_end > MAXMEM)
 	{
+#ifdef CONFIG_HUGEMEM
+		init_hugemem(MAXMEM,memory_end);
+#endif
 		memory_end = MAXMEM;
 		printk(KERN_WARNING "Warning only %ldMB will be used.\n",
 			MAXMEM>>20);
diff -u -r -P 2.2.1/arch/i386/mm/Makefile current/arch/i386/mm/Makefile
--- 2.2.1/arch/i386/mm/Makefile	Fri Nov  1 10:56:43 1996
+++ current/arch/i386/mm/Makefile	Fri Feb  5 16:06:55 1999
@@ -10,4 +10,8 @@
 O_TARGET := mm.o
 O_OBJS	 := init.o fault.o ioremap.o extable.o
 
+ifdef CONFIG_HUGEMEM
+O_OBJS += hmcache.o
+endif
+
 include $(TOPDIR)/Rules.make
diff -u -r -P 2.2.1/arch/i386/mm/hmcache.c current/arch/i386/mm/hmcache.c
--- 2.2.1/arch/i386/mm/hmcache.c	Thu Jan  1 01:00:00 1970
+++ current/arch/i386/mm/hmcache.c	Mon Feb  8 20:32:25 1999
@@ -0,0 +1,378 @@
+/*
+ * arch/i386/mm/hmcache.o
+ *
+ * physical memory cache:
+ * 
+ * (C) Copyright 1999 Manfred Spraul <masp0008@stud.uni-sb.de>
+ *
+ * assumptions:
+ *	- the offset field to __find_page is a byte offset.
+ */
+
+#include <linux/vmalloc.h>
+#include <linux/pagemap.h>
+#include <linux/fs.h>
+#include <asm/pgtable.h>
+#include <asm/spinlock.h>
+#include <asm/io.h>
+#include <asm/hugemem.h>
+#include <asm/hmcache.h>
+
+/* the size of this structure must be 32 bytes. */
+typedef struct hmc_page hmc_page_t;
+struct hmc_page {
+	struct inode* inode;
+	unsigned long offset;
+	hmc_page_t *hash_next;
+	hmc_page_t *lru_newer;
+	/* 0x10: */
+	hmc_page_t *hash_prev;
+	hmc_page_t *lru_older;
+	hmc_page_t *ihash_next;
+	hmc_page_t *ihash_prev;
+};
+
+#define HMC_DESC_PER_PAGE	128
+
+unsigned long hmc_allocstart = 0;
+unsigned long hmc_alloclen = 0;
+unsigned long hmc_start;
+unsigned long hmc_size;
+hmc_page_t *hmc_desc = NULL;
+
+hmc_page_t *hmc_freelist = NULL;	/* linked with the lru-field  */
+hmc_page_t *hmc_lrunew = NULL;
+hmc_page_t *hmc_lruold = NULL;
+
+spinlock_t hmc_lock = SPIN_LOCK_UNLOCKED;
+
+#define HM_PHASH_BITS		14
+#define HM_PHASH_SIZE		(1 << HM_PHASH_BITS)
+#define HM_PHASH_ALLOCORDER	4
+
+#define HM_IHASH_BITS		10
+#define HM_IHASH_SIZE		(1 << HM_IHASH_BITS)
+#define HM_IHASH_ALLOCORDER	0
+
+hmc_page_t **hmc_hash_table = NULL;
+hmc_page_t ** hmc_ihash_table = NULL;
+void* hmc_mapwnd = NULL;
+
+
+/* #define HMC_DEBUG */
+#ifdef HMC_DEBUG
+
+static int ENABLE_HMC = 0;
+
+#define assert(x)	if( !(x) ) __asm__ __volatile__ ("int3");
+
+#else
+#define ENABLE_HMC 1
+#define assert(x)	do {  } while (0)
+#endif
+
+/* I never free the hash tables, they remain allocated. */
+static int _hmc_init(void)
+{
+	int i;
+	if(!ENABLE_HMC)
+		return -ENOMEM;
+
+	assert(sizeof(hmc_page_t) == 32);
+	
+	i = getfree_hugemem(NULL);
+	hmc_allocstart = alloc_hugemem(i);
+	if(hmc_allocstart == 0)
+		return 0;
+	hmc_alloclen = i;
+	
+	if(hmc_hash_table == NULL) {
+		hmc_hash_table = (void*)__get_free_pages(__GFP_MED,
+						HM_PHASH_ALLOCORDER);
+		if(hmc_hash_table == NULL)
+			goto out_mem;
+		memset(hmc_hash_table,0,HM_PHASH_SIZE*sizeof(unsigned long));
+	}
+	if(hmc_ihash_table == NULL) {
+		hmc_ihash_table = (void*)__get_free_pages(__GFP_MED,
+						HM_IHASH_ALLOCORDER);
+		if(hmc_ihash_table == NULL)
+			goto out_mem;
+		memset(hmc_ihash_table,0,HM_IHASH_SIZE*sizeof(unsigned long));
+	}
+	hmc_size = (i+HMC_DESC_PER_PAGE)/(HMC_DESC_PER_PAGE+1);
+	hmc_start = hmc_allocstart + hmc_size;
+	assert( (hmc_alloclen - hmc_size) <= hmc_size*HMC_DESC_PER_PAGE);
+	hmc_size = hmc_alloclen - hmc_size;
+	
+	/* add a window for the cache descriptors: up to 24 MB */
+	/* FIXME: ioremap should support 4 MB PTE's */
+	hmc_desc = ioremap(hmc_allocstart*PAGE_SIZE,hmc_size*sizeof(hmc_page_t));
+	if(hmc_desc == NULL)
+		goto out_mem;
+	if(hmc_mapwnd == NULL)
+		hmc_mapwnd = ioremap(hmc_allocstart*PAGE_SIZE,PAGE_SIZE);
+		
+	memset(hmc_desc,0,hmc_size*sizeof(hmc_page_t));
+	hmc_freelist = &hmc_desc[0];
+	hmc_desc[0].lru_older = NULL;
+	hmc_desc[0].lru_newer = &hmc_desc[1];
+	
+	for(i=1;i<hmc_size;i++)
+	{
+		hmc_desc[i].lru_older = &hmc_desc[i-1];
+		hmc_desc[i].lru_newer = &hmc_desc[i+1];
+	}
+	hmc_desc[hmc_size-1].lru_newer = NULL;
+	return 0;
+out_mem:
+	free_hugemem(hmc_allocstart, hmc_alloclen);
+	hmc_allocstart = 0;
+	return -ENOMEM;
+}
+
+static inline int hmc_init(void)
+{
+	if(hmc_allocstart == 0)
+		return _hmc_init();
+	return 0;
+}
+/*
+ * race protection:
+ * - functions that own hmc_lock() never schedule.
+ * - a page is either present in the normal mmap cache, or in the physical cache, but never in both.
+ */
+static inline unsigned long hmc_gethash(struct inode* inode, unsigned long offset, unsigned long hashbits)
+{
+#define i (((unsigned long) inode)/(sizeof(struct inode) & ~ (sizeof(struct inode) - 1)))
+#define o (offset >> PAGE_SHIFT)
+#define s(x) ((x)+((x)>>hashbits))
+	return s(i+o) & ((1<<hashbits)-1);
+#undef i
+#undef o
+#undef s
+}
+
+void hmc_copypg(hmc_page_t* phys, struct page* pg, int tophys)
+{
+	unsigned long linear = (unsigned long)hmc_mapwnd;
+	pte_t* pte = pte_offset(pmd_offset(pgd_offset_k(linear),linear), linear);
+	unsigned long physpg;
+	
+	physpg = phys-&hmc_desc[0];
+	physpg *= PAGE_SIZE;
+	physpg += PAGE_SIZE*hmc_start;
+	
+	set_pte(pte, mk_pte_phys(physpg, __pgprot(_PAGE_PRESENT| _PAGE_RW |
+				_PAGE_DIRTY | _PAGE_ACCESSED )));
+		
+	/* this call only affects the current cpu.
+	   This is not a problem, because only one cpu is allowed to execute
+	   these lines.
+	 */
+	__flush_tlb_one(linear);
+
+	if(tophys)
+		copy_page(hmc_mapwnd, page_address(pg));
+	 else
+	 	copy_page(page_address(pg), hmc_mapwnd);
+}
+
+static inline void hmc_checkpg(hmc_page_t* p)
+{
+	assert( (p->hash_prev == NULL) || (p->hash_prev->hash_next == p) );
+	assert( (p->hash_next == NULL) || (p->hash_next->hash_prev == p) );
+	assert( (p->ihash_next == NULL) || (p->ihash_next->ihash_prev == p) );
+	assert( (p->ihash_prev == NULL) || (p->ihash_prev->ihash_next == p) );
+	assert( (p->lru_newer == NULL) || (p->lru_newer->lru_older == p) );
+	assert( (p->lru_older == NULL) || (p->lru_older->lru_newer == p) );
+}
+
+
+void hmc_unlinkpg(hmc_page_t* p)
+{
+	hmc_checkpg(p);
+	
+	if(p->hash_next != NULL)
+		p->hash_next->hash_prev = p->hash_prev;
+	if(p->hash_prev == NULL)
+		hmc_hash_table[hmc_gethash(p->inode, p->offset, HM_PHASH_BITS)] = p->hash_next;
+	 else
+	 	p->hash_prev->hash_next = p->hash_next;
+	
+	if(p->ihash_next != NULL)
+		p->ihash_next->ihash_prev = p->ihash_prev;
+	if(p->ihash_prev == NULL)
+		hmc_ihash_table[hmc_gethash(p->inode, 0, HM_IHASH_BITS)] = p->ihash_next;
+	 else
+	 	p->ihash_prev->ihash_next = p->ihash_next;
+	
+	if(p->lru_newer == NULL)
+		hmc_lrunew = p->lru_older;
+	 else
+		p->lru_newer->lru_older = p->lru_older;
+	if(p->lru_older == NULL)
+		hmc_lruold = p->lru_newer;
+	 else
+	 	p->lru_older->lru_newer = p->lru_newer;
+}
+
+/* from mm/filemap.c */
+/* FIXME: race: which locks are required for changing these lists? */
+static inline void add_to_page_cache(struct page * page,
+	struct inode * inode, unsigned long offset,
+	struct page **hash)
+{
+	atomic_inc(&page->count);
+	page->flags = (page->flags & ~((1 << PG_uptodate) | (1 << PG_error))) | (1 << PG_referenced);
+	page->offset = offset;
+	add_page_to_inode_queue(inode, page);
+	__add_page_to_hash_queue(page, hash);
+}
+
+struct page* hmc_getpage(hmc_page_t* phys)
+{
+	struct page* pg = NULL;
+	unsigned long pgaddr;
+	
+	hmc_checkpg(phys);
+	if( (phys->offset >= phys->inode->i_size) &&
+	  ( (pgaddr = __get_free_page(__GFP_MED)) != 0) )
+	{
+		struct page** hash;
+	
+		pg = mem_map+ MAP_NR(pgaddr);
+		
+		hmc_copypg(phys, pg, 0);
+		hash = page_hash(phys->inode, phys->offset);
+		pg->inode = phys->inode;
+		pg->offset = phys->offset;
+		add_to_page_cache(pg, phys->inode, phys->offset, hash);	
+		__free_page(pg);
+	}
+	hmc_unlinkpg(phys);
+	phys->lru_older = NULL;
+	phys->lru_newer = hmc_freelist;
+	if(hmc_freelist!= NULL)
+		hmc_freelist->lru_older = phys;
+	hmc_freelist = phys;
+
+	return pg;
+}
+
+struct page* hmc_findpage(struct inode* inode, unsigned long offset)
+{
+	struct page* out = NULL;
+	hmc_page_t* p;
+
+	spin_lock(&hmc_lock);
+	if(!hmc_init())
+	{
+		/* scan through the cache */
+		p = hmc_hash_table[hmc_gethash(inode, offset, HM_PHASH_BITS)];	
+		while(p!=NULL)
+		{
+			hmc_checkpg(p);
+			if( (p->inode == inode) &&
+				(p->offset == offset))
+			{
+				out = hmc_getpage(p);
+				break;
+			}
+			p = p->hash_next;
+		}
+	}
+	spin_unlock(&hmc_lock);
+	return out;
+}
+
+void hmc_invalidate_inode_pages(struct inode* inode)
+{
+	hmc_truncate_inode_pages(inode, 0);
+}
+
+/* clearing partial pages is complicated, always discard the complete page. */
+void hmc_truncate_inode_pages(struct inode* inode, unsigned long start)
+{
+	spin_lock(&hmc_lock);
+	if(!hmc_init())
+	{
+		hmc_page_t *p, *next;
+
+assert(0);
+		start = start & PAGE_MASK;
+		p = hmc_ihash_table[hmc_gethash(inode, 0, HM_IHASH_BITS)];	
+		while(p!=NULL)
+		{
+			next = p->ihash_next;
+			if((p->inode == inode) &&
+				(p->offset >= start))
+			{
+				hmc_unlinkpg(p);
+				p->lru_newer = hmc_lruold;
+				p->lru_older = NULL;
+				hmc_lruold = p;
+			}
+			p = next;
+		}
+	}
+	spin_unlock(&hmc_lock);
+}
+
+/* the functions is called after the page was removed from the main cache.
+FIXME: race: a new page is created before spin_lock() returns. */
+
+extern struct inode swapper_inode;
+
+void hmc_add_page(struct page* page)
+{
+	if(page->flags & ((1 << PG_locked)|(1<<PG_error)|(1<<PG_dirty)|(1<<PG_skip)|(1<<PG_swap_cache)) )
+		return;
+	if(page->inode == &swapper_inode)
+		return;	// do not cache swapper entries.
+	spin_lock(&hmc_lock);
+	if(!hmc_init())
+	{
+		hmc_page_t* p;
+		int hash;
+		if(hmc_freelist == NULL) {
+			p = hmc_lruold;
+			hmc_unlinkpg(hmc_lruold);
+		} else {
+		 	p = hmc_freelist;
+			if(p->lru_newer != NULL)
+				p->lru_newer->lru_older = NULL;
+			hmc_freelist = p->lru_newer;
+		}
+
+		p->inode = page->inode;
+		p->offset = page->offset;
+		
+		hash = hmc_gethash(p->inode, p->offset, HM_PHASH_BITS);
+		if(hmc_hash_table[hash] != NULL)
+			hmc_hash_table[hash]->hash_prev = p;
+		p->hash_next = hmc_hash_table[hash];
+		p->hash_prev = NULL;
+		hmc_hash_table[hash] = p;
+	
+		hash = hmc_gethash(p->inode, 0, HM_IHASH_BITS);
+		if(hmc_ihash_table[hash] != NULL)
+			hmc_ihash_table[hash]->ihash_prev = p;
+		p->ihash_next = hmc_ihash_table[hash];
+		p->ihash_prev = NULL;
+		hmc_ihash_table[hash] = p;
+
+		if(hmc_lrunew != NULL)
+			hmc_lrunew->lru_newer = p;
+		p->lru_newer = NULL;
+		p->lru_older = hmc_lrunew;
+		hmc_lrunew = p;
+		if(hmc_lruold == NULL)
+			hmc_lruold = p;
+
+		hmc_copypg(p, page, 1);
+		hmc_checkpg(p);
+	}
+	spin_unlock(&hmc_lock);
+}
+
diff -u -r -P 2.2.1/drivers/block/Config.in current/drivers/block/Config.in
--- 2.2.1/drivers/block/Config.in	Tue Dec 29 20:21:49 1998
+++ current/drivers/block/Config.in	Sun Feb  7 20:37:05 1999
@@ -94,6 +94,9 @@
 comment 'Additional Block Devices'
 
 tristate 'Loopback device support' CONFIG_BLK_DEV_LOOP
+if [ "$CONFIG_EXPERIMENTAL" = "y" ];then
+  tristate 'Hugeram Ramdisk' CONFIG_BLK_DEV_HUGERAMD
+fi
 if [ "$CONFIG_NET" = "y" ]; then
   tristate 'Network block device support' CONFIG_BLK_DEV_NBD
 fi
diff -u -r -P 2.2.1/drivers/block/Makefile current/drivers/block/Makefile
--- 2.2.1/drivers/block/Makefile	Wed Sep 16 22:25:56 1998
+++ current/drivers/block/Makefile	Sun Feb  7 20:37:05 1999
@@ -94,6 +94,14 @@
   endif
 endif
 
+ifeq ($(CONFIG_BLK_DEV_HUGERAMD),y)
+L_OBJS += hugeramd.o
+else
+        ifeq ($(CONFIG_BLK_DEV_HUGERAMD),m)
+        M_OBJS += hugeramd.o
+        endif
+endif
+
 ifeq ($(CONFIG_BLK_DEV_HD),y)
 L_OBJS += hd.o
 endif
diff -u -r -P 2.2.1/drivers/block/hugeramd.c current/drivers/block/hugeramd.c
--- 2.2.1/drivers/block/hugeramd.c	Thu Jan  1 01:00:00 1970
+++ current/drivers/block/hugeramd.c	Sun Feb  7 20:37:05 1999
@@ -0,0 +1,331 @@
+/*
+ *  linux/drivers/block/hugeramd.c
+ *
+ *  Written by Manfred Spraul
+ * 
+ *  based on the loop driver written by Theodore Ts'o
+ */
+
+#include <linux/module.h>
+
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/ioctl.h>
+#include <asm/io.h>
+#include <asm/uaccess.h>
+#include <linux/errno.h>
+#include <linux/major.h>
+#include <linux/hugeramd.h>
+#include <asm/hugemem.h>
+
+#define MAJOR_NR        HUGERAMD_MAJOR
+#define DEVICE_NAME     "HugeRamD"
+#define DEVICE_NR(device)       MINOR(device)
+#define DEVICE_NO_RANDOM
+#define DEVICE_OFF(x)           do { } while (0)
+#define DEVICE_REQUEST  hugeramd_request
+#include <linux/blk.h>
+
+struct HRD_DEVICE {
+	spinlock_t lock;
+	void* remap_addr;
+	int start; /* in pages */
+	int size;
+	int refcount;
+} ;
+
+
+struct HRD_DEVICE hrd_devices[MAX_HUGERAMD];
+
+int hugeramd_bps[MAX_HUGERAMD] = { 0};
+int hugeramd_blockcount[MAX_HUGERAMD] = {0};
+#define FALSE 0
+#define TRUE (!FALSE)
+
+/*
+	internal prototypes
+*/
+
+int hrd_rw(int dev, unsigned long offset, char* buf, int len, int read);
+
+static int hrd_ioctl(struct inode * inode, struct file * file,
+	unsigned int cmd, unsigned long arg)
+{
+	struct HRD_DEVICE* hrd;
+	int dev;
+	int res;
+
+	if (!inode)
+		return -EINVAL;
+	if (MAJOR(inode->i_rdev) != HUGERAMD_MAJOR) {
+		printk(KERN_WARNING "hrd_ioctl: pseudo-major != %d\n", HUGERAMD_MAJOR);
+		return -ENODEV;
+	}
+	dev = MINOR(inode->i_rdev);
+	if (dev >= MAX_HUGERAMD)
+		return -ENODEV;
+       hrd = &hrd_devices[dev];
+	spin_lock(&hrd->lock);
+	res = 0;
+	switch (cmd) {
+	case BLKGETSIZE:   /* Return device size */
+		res =  put_user(hrd->size<<3, (long *) arg);
+		break;
+	case HRDSETBPS:
+		if(!capable(CAP_SYS_ADMIN)) {
+			res = -EACCES;
+			break;
+		}
+		if( (arg != 512) &&
+			(arg != 1024) &&
+			(arg != 2048) &&
+			(arg != 4096) ) {
+			res = -EINVAL;
+			break;
+		}
+		hugeramd_bps[dev] = arg;
+		hugeramd_blockcount[dev] = hrd->size*(PAGE_SIZE/hugeramd_bps[dev]);
+		break;
+	case HRDSETSIZE:
+		if(!capable(CAP_SYS_ADMIN)) {
+			res = -EACCES;
+			break;
+		}
+		if(hrd->size != 0)
+		{
+			free_hugemem(hrd->start,
+						hrd->size);
+			hrd->size = 0;
+		}
+		if(arg == 0)
+			break;
+		hrd->start = alloc_hugemem(arg);
+		if(hrd->start == -1) {
+			res = -ENOMEM;
+			break;
+		}
+		hrd->size = arg;
+		
+		if(hrd->remap_addr==NULL)
+			hrd->remap_addr = ioremap(hrd->start*PAGE_SIZE, PAGE_SIZE);
+		if(hrd->remap_addr == NULL) {
+			free_hugemem(hrd->start,
+							hrd->size);
+			hrd->size = 0;
+			res = -ENOMEM;
+			break;
+		}
+		hugeramd_blockcount[dev] = hrd->size*(PAGE_SIZE/hugeramd_bps[dev]);
+		break;
+		
+		/* FIXME: additional ioctl's required? */
+	default:
+		res =  -EINVAL;
+	}
+	spin_unlock(&hrd->lock);
+	return res;
+}
+
+static int hrd_open(struct inode *inode, struct file *file)
+{
+	struct HRD_DEVICE *hrd;
+	int     dev;
+
+	if (!inode)
+		return -EINVAL;
+	if (MAJOR(inode->i_rdev) != HUGERAMD_MAJOR) {
+		printk(KERN_WARNING "hrd_open: pseudo-major != %d\n", HUGERAMD_MAJOR);
+		return -ENODEV;
+	}
+	dev = MINOR(inode->i_rdev);
+	if (dev >= MAX_HUGERAMD) {
+		return -ENODEV;
+	}
+	hrd = &hrd_devices[dev];
+
+	spin_lock(&hrd->lock);
+	hrd->refcount++;
+	spin_unlock(&hrd->lock);
+	MOD_INC_USE_COUNT;
+	return 0;
+}
+
+static int hrd_release(struct inode *inode, struct file *file)
+{
+	struct HRD_DEVICE* hrd;
+	int     dev;
+
+	if (!inode)
+		return 0;
+	if (MAJOR(inode->i_rdev) != HUGERAMD_MAJOR) {
+		printk(KERN_WARNING "hrd_release: pseudo-major != %d\n", HUGERAMD_MAJOR);
+		return 0;
+	}
+	dev = MINOR(inode->i_rdev);
+	if (dev >= MAX_HUGERAMD)
+		return 0;
+	hrd = &hrd_devices[dev];
+	spin_lock(&hrd->lock);
+	
+	if (hrd->refcount <= 0)
+		printk(KERN_ERR "hrd_release: refcount(MINOR=%d) <= 0\n", dev);
+	else  {
+		hrd->refcount--;
+		if(hrd->refcount == 0)
+		MOD_DEC_USE_COUNT;
+	}
+	spin_unlock(&hrd->lock);
+	return 0;
+}
+
+int hrd_fsync(struct file* unused, struct dentry* unused2)
+{
+	/* syncing a ramdisk??? */
+	return 0;
+}
+
+void hugeramd_request(void)
+{
+	int dev;
+	int res;
+	
+	while(1) {
+		INIT_REQUEST
+	
+		/* FIXME: release the io lock? */
+		
+		dev = DEVICE_NR(CURRENT->rq_dev);
+		if(dev > MAX_HUGERAMD) {
+			res = 0;
+		} else
+		{
+			spin_lock(&hrd_devices[dev].lock);
+			res = hrd_rw(dev, CURRENT->sector, CURRENT->buffer, CURRENT->current_nr_sectors, (CURRENT->cmd != WRITE) );
+			spin_unlock(&hrd_devices[dev].lock);
+		}
+		/* FIXME: reacquire the io lock ? */
+		end_request(res);
+	}
+}
+
+static struct file_operations hrd_fops = {
+	NULL,                   /* lseek - default */
+	block_read,             /* read */ 
+	block_write,    /* write  */
+	NULL,                   /* readdir - bad */
+	NULL,                   /* poll */
+	hrd_ioctl,              /* ioctl */
+	NULL,                   /* mmap */
+	hrd_open,               /* open */
+	NULL,                   /* flush */
+	hrd_release,    /* release */
+	block_fsync             /* fsync */
+};
+
+/*
+ * And now the modules code and kernel interface.
+ */
+#ifdef MODULE
+#define hugeramd_init init_module
+#endif
+
+int hugeramd_init(void) 
+{
+	int     i;
+
+	if (register_blkdev(HUGERAMD_MAJOR, "hugeramd", &hrd_fops)) {
+		printk(KERN_WARNING "Unable to get major number %d for hugeramd device\n",
+		       HUGERAMD_MAJOR);
+		return -EIO;
+	}
+#ifndef MODULE
+	printk(KERN_INFO "hugeramd: registered device at major %d\n", HUGERAMD_MAJOR);
+#endif
+    /* FIXME: which global variables must be initialized? */
+
+	blksize_size[HUGERAMD_MAJOR] = hugeramd_bps;
+	blk_size[HUGERAMD_MAJOR] = hugeramd_blockcount;
+	read_ahead[HUGERAMD_MAJOR] = 0; /* no read ahead, since the seek time is 0 */
+	hardsect_size[HUGERAMD_MAJOR] = NULL;
+	blk_dev[HUGERAMD_MAJOR].request_fn = DEVICE_REQUEST;
+
+	for (i=0; i < MAX_HUGERAMD; i++) {
+		memset(&hrd_devices[i],0,sizeof(hrd_devices[i]));
+		hrd_devices[i].lock = SPIN_LOCK_UNLOCKED;
+		hugeramd_bps[i] = 1024;
+		hugeramd_blockcount[i] = 0;
+	}
+	return 0;
+}
+
+#ifdef MODULE
+void cleanup_module(void) 
+{
+	int i;
+	for(i=0;i<MAX_HUGERAMD;i++) {
+		if(hrd_devices[i].size != 0) {
+			free_hugemem(hrd_devices[i].start,
+					hrd_devices[i].size);
+			hrd_devices[i].size = 0;
+		}
+		if(hrd_devices[i].remap_addr != NULL) {
+			/* FIXME: test for memory leak */
+			iounmap(hrd_devices[i].remap_addr);
+			hrd_devices[i].remap_addr = NULL;
+		}
+		
+	}
+	if (unregister_blkdev(HUGERAMD_MAJOR, "hugeramd") != 0)
+		printk(KERN_WARNING "hugeramd: cannot unregister blkdev\n");
+}
+#endif
+
+int hrd_rw(int dev, unsigned long offset, char* buf, int len, int read)
+{
+	struct HRD_DEVICE* hrd = &hrd_devices[dev];
+	loff_t byteoff = offset*512;
+
+	
+	len *= 512;
+	if(byteoff > ((loff_t)hrd->size)*PAGE_SIZE)
+		return 0;
+	if(byteoff+len > ((loff_t)hrd->size)*PAGE_SIZE)
+		len = ((loff_t)hrd->size)*PAGE_SIZE-byteoff;
+	
+	while(len != 0) {
+		unsigned long linear = (unsigned long)hrd->remap_addr;
+		pte_t* pte = pte_offset(pmd_offset(pgd_offset_k(linear),linear), linear);
+		unsigned long physoffset;
+		int datalen;
+		int offset;
+	
+		physoffset = hrd->start + (byteoff>>PAGE_SHIFT);
+		physoffset *= PAGE_SIZE;
+		offset = byteoff & (PAGE_SIZE-1);
+		
+		set_pte(pte, mk_pte_phys(physoffset, __pgprot(_PAGE_PRESENT| _PAGE_RW |
+				_PAGE_DIRTY | _PAGE_ACCESSED )));
+		
+		/* this call only affects the current cpu.
+		   This is not a problem, because only one cpu is allowed to execute
+		   these lines.
+		 */
+		__flush_tlb_one(linear);
+
+		datalen = PAGE_SIZE-offset;
+		
+		if(datalen > len)
+			datalen = len;
+
+		if(read)
+			memcpy(buf,(char*)linear+offset,datalen);
+		 else
+			memcpy((char*)linear+offset, buf, datalen);
+
+		buf += datalen;
+		len -= datalen;
+		byteoff += datalen;
+	}
+	
+	return 1;
+}
diff -u -r -P 2.2.1/drivers/block/ll_rw_blk.c current/drivers/block/ll_rw_blk.c
--- 2.2.1/drivers/block/ll_rw_blk.c	Mon Dec 28 20:19:19 1998
+++ current/drivers/block/ll_rw_blk.c	Sun Feb  7 20:37:05 1999
@@ -740,6 +740,9 @@
 #ifdef CONFIG_BLK_DEV_RAM
 	rd_init();
 #endif
+#ifdef CONFIG_BLK_DEV_HUGERAMD
+	hugeramd_init();
+#endif
 #ifdef CONFIG_BLK_DEV_LOOP
 	loop_init();
 #endif
diff -u -r -P 2.2.1/include/asm-i386/hmcache.h current/include/asm-i386/hmcache.h
--- 2.2.1/include/asm-i386/hmcache.h	Thu Jan  1 01:00:00 1970
+++ current/include/asm-i386/hmcache.h	Mon Feb  8 19:45:41 1999
@@ -0,0 +1,16 @@
+/*
+ *
+ * (C) Manfred Spraul <masp0008@stud.uni-sb.de>
+ *
+ */
+
+#ifndef _HMCACHE_H
+#define _HMCACHE_H
+
+struct page* hmc_findpage(struct inode* inode, unsigned long offset);
+
+void hmc_invalidate_inode_pages(struct inode* inode);
+void hmc_truncate_inode_pages(struct inode* inode, unsigned long start);
+void hmc_add_page(struct page* page);
+
+#endif /* _HMCACHE_H */
diff -u -r -P 2.2.1/include/asm-i386/hugemem.h current/include/asm-i386/hugemem.h
--- 2.2.1/include/asm-i386/hugemem.h	Thu Jan  1 01:00:00 1970
+++ current/include/asm-i386/hugemem.h	Sun Feb  7 21:21:52 1999
@@ -0,0 +1,45 @@
+/*
+ * Huge memory support
+ * 
+ *              Copyright 1999 by Manfred Spraul <masp0008@stud.uni-sb.de>
+ *
+ *      This program is free software; you can redistribute it and/or modify
+ *      it under the terms of the GNU General Public License as published by
+ *      the Free Software Foundation; either version 2 of the License, or
+ *      (at your option) any later version.
+ *
+ *      This program is distributed in the hope that it will be useful,
+ *      but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *      MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *      GNU General Public License for more details.
+ *
+ *      You should have received a copy of the GNU General Public License
+ *      along with this program; if not, write to the Free Software
+ *      Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ *
+ * 
+ */
+#ifndef _HUGEMEM_H
+#define _HUGEMEM_H
+
+void init_hugemem(int memstart, int memend);
+
+/*
+   All size parameters are page counts.
+   
+   hugemem internally rounds all allocation
+   to this size. Use kmalloc if you need less
+   memory
+ */
+
+#define HM_PAGES_PER_BIT        64
+
+/* return value: start of the physical area, or 0 on error. */
+int alloc_hugemem(int size);
+
+void free_hugemem(int startpg, int size);
+
+/* pmax can be NULL, returns the size of the largest free area. */
+int getfree_hugemem(int* pmax);
+
+#endif /* _HUGEMEM_H */
diff -u -r -P 2.2.1/include/linux/hugeramd.h current/include/linux/hugeramd.h
--- 2.2.1/include/linux/hugeramd.h	Thu Jan  1 01:00:00 1970
+++ current/include/linux/hugeramd.h	Sun Feb  7 20:37:05 1999
@@ -0,0 +1,26 @@
+#ifndef _HUGERAMD_H
+#define _HUGERAMD_H
+
+/*
+ * Hugeramd: special ramdisk that uses memory from the hugeram memory pool.
+ *
+ */
+
+#include <linux/ioctl.h>
+
+#ifndef __i386__
+#error Hugeramd is only available on the i386 platform.
+#endif
+
+#define HUGERAMD_MAJOR  126             /* FIXME: normal MAJOR required. */
+
+#define MAX_HUGERAMD            8   /* maximum number of ramdisks. */
+
+/* parameter: new page count*/
+#define HRDSETSIZE  _IO(HUGERAMD_MAJOR,101)
+
+/* parameter: new block size */
+#define HRDSETBPS       _IO(HUGERAMD_MAJOR,102)
+
+
+#endif /* _HUGERAMD_H */
diff -u -r -P 2.2.1/include/linux/pagemap.h current/include/linux/pagemap.h
--- 2.2.1/include/linux/pagemap.h	Tue Jan 26 01:06:23 1999
+++ current/include/linux/pagemap.h	Mon Feb  8 20:10:15 1999
@@ -44,6 +44,10 @@
 
 #define page_hash(inode,offset) (page_hash_table+_page_hashfn(inode,offset))
 
+#ifdef CONFIG_HUGEMEM
+extern struct page* hmc_findpage(struct inode* inode, unsigned long offset);
+#endif
+
 static inline struct page * __find_page(struct inode * inode, unsigned long offset, struct page *page)
 {
 	goto inside;
@@ -61,6 +65,10 @@
 	atomic_inc(&page->count);
 	set_bit(PG_referenced, &page->flags);
 not_found:
+#ifdef CONFIG_HUGEMEM
+	if (page == NULL)
+		page = hmc_findpage(inode, offset);
+#endif
 	return page;
 }
 
diff -u -r -P 2.2.1/mm/filemap.c current/mm/filemap.c
--- 2.2.1/mm/filemap.c	Mon Jan 25 19:47:11 1999
+++ current/mm/filemap.c	Mon Feb  8 20:13:21 1999
@@ -22,7 +22,9 @@
 
 #include <asm/pgtable.h>
 #include <asm/uaccess.h>
-
+#ifdef __i386__
+#include <asm/hmcache.h>
+#endif
 /*
  * Shared mappings implemented 30.11.1994. It's not fully working yet,
  * though.
@@ -65,6 +67,9 @@
 		__free_page(page);
 		continue;
 	}
+#ifdef BUILD_HUGEMEM
+	hmc_invalidate_inode_pages(inode);
+#endif
 }
 
 /*
@@ -106,6 +111,9 @@
 			flush_page_to_ram(address);
 		}
 	}
+#ifdef CONFIG_HUGEMEM
+	hmc_truncate_inode_pages(inode, start);
+#endif
 }
 
 /*
@@ -114,6 +122,9 @@
 void remove_inode_page(struct page *page)
 {
 	remove_page_from_hash_queue(page);
+#ifdef CONFIG_HUGEMEM 
+	hmc_add_page(page);
+#endif
 	remove_page_from_inode_queue(page);
 	__free_page(page);
 }

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Large memory system
  1999-02-08 20:33 Large memory system Manfred Spraul
@ 1999-02-10 14:25 ` Stephen C. Tweedie
  0 siblings, 0 replies; 11+ messages in thread
From: Stephen C. Tweedie @ 1999-02-10 14:25 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Stephen C. Tweedie, Benjamin C.R. LaHaise, Daniel Blakeley, linux-mm

Hi,

On Mon, 8 Feb 1999 21:33:09 +0100, "Manfred Spraul"
<manfreds@colorfullife.com> said:

> There is another possibility if you want to extend the page cache:
> Add a 'second level cache':

The primary reason for adding more memory is for process anonymous
pages, not for cache, so this is really of limited value on its own.

--Stephen
--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Large memory system
@ 1999-02-10 17:02 Manfred Spraul
  1999-02-11 11:12 ` Stephen C. Tweedie
  0 siblings, 1 reply; 11+ messages in thread
From: Manfred Spraul @ 1999-02-10 17:02 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: linux-mm

>The primary reason for adding more memory is for process anonymous
>pages, not for cache, so this is really of limited value on its own.

This was not intended as a solution, but as a new idea:
- the memory > 1 GB is allocated one page at a time.
- some 'struct page' fields are useless for high memory.
- if someone who is not prepared to handle high memory finds such a page,
the computer will crash anyway.
- high memory needs bounce buffers, so a special if(highmem()) is required.
---> no need to use mem_map, add an independant array for high_mem.

The advantage is that you can add new fields to such an array (e.g. true
LRU for a cache), without causing problems in the remaining kernel.

If you restrict the remaining memory to unshared pages (i.e. no COW), then
the implementation should be really simple:

* all page-in's go to normal memory (i.e. < 1 GB) (swap cache compatible)
* if try_to_swap_out() want's to discard a page, it is first moved to high
memory.
(this break's any COW links.)
* if <shrink_highmem> decides that a page should be discarded, then the page
is removed from the vma, a bounce buffer is created, written out & added to
the swap cache.

I'm sure that this could be extended to COW pages, but I haven't yet
understood the COW implementation :=)

Regards,
    Manfred

--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Large memory system
  1999-02-10 17:02 Manfred Spraul
@ 1999-02-11 11:12 ` Stephen C. Tweedie
  0 siblings, 0 replies; 11+ messages in thread
From: Stephen C. Tweedie @ 1999-02-11 11:12 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: Stephen C. Tweedie, linux-mm

Hi,

On Wed, 10 Feb 1999 18:02:32 +0100, "Manfred Spraul"
<masp0008@stud.uni-sb.de> said:

> This was not intended as a solution, but as a new idea:
> - the memory > 1 GB is allocated one page at a time.
> - some 'struct page' fields are useless for high memory.
> - if someone who is not prepared to handle high memory finds such a page,
> the computer will crash anyway.
> - high memory needs bounce buffers, so a special if(highmem()) is
> required.

All of this is already in the design.

> ---> no need to use mem_map, add an independant array for high_mem.

No, it makes no sense at all to do this, because you'd have to
implement two separate page caches if you wanted both low-mem and
high-mem cached pages.  It makes far, far more sense to simply expand
mem_map. 

> The advantage is that you can add new fields to such an array (e.g. true
> LRU for a cache), without causing problems in the remaining kernel.

That's really not a problem.  As long as we never hand out a high-mem
page to the kernel unless the kernel explicitly asks for one (for
anonymous pages or page cache), the kernel can never get so confused
anyway. 

> If you restrict the remaining memory to unshared pages (i.e. no COW), then
> the implementation should be really simple:

There is no reason to make this restriction, COW is dead easy.

--Stephen
--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Large memory system
@ 1999-01-30 13:36 Daniel Blakeley
  1999-01-30 17:00 ` Benjamin C.R. LaHaise
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Daniel Blakeley @ 1999-01-30 13:36 UTC (permalink / raw)
  To: linux-mm

Hi,

I've jumped the gun a little bit and recommended a Professor buy 4GB
of RAM on a Xeon machine to run Linux on and he did.  After he got it
I read the large memory howto which states that the max memory size
for Linux 2.2.x is 2GB physical/2GB virtual.  The memory size seems to
limited by the 32bit nature of the x86 architecture.  The Xeon seems
to have a 36bit memory addressing mode.  Can Linux be easily expanded
to use the 36bit addressing?

Thanks for any info on the subject.

- Daniel (Who needs to read more before recommending computers.)

--
Daniel Blakeley (N2YEN)     Cornell Center for Materials Research
daniel@msc.cornell.edu      E20 Clark Hall
--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Large memory system
  1999-01-30 13:36 Daniel Blakeley
@ 1999-01-30 17:00 ` Benjamin C.R. LaHaise
  1999-02-08 11:24   ` Stephen C. Tweedie
  1999-02-01 15:59 ` Rik van Riel
  1999-02-08 11:22 ` Stephen C. Tweedie
  2 siblings, 1 reply; 11+ messages in thread
From: Benjamin C.R. LaHaise @ 1999-01-30 17:00 UTC (permalink / raw)
  To: Daniel Blakeley; +Cc: linux-mm

On Sat, 30 Jan 1999, Daniel Blakeley wrote:

> Hi,
> 
> I've jumped the gun a little bit and recommended a Professor buy 4GB
> of RAM on a Xeon machine to run Linux on and he did.  After he got it
> I read the large memory howto which states that the max memory size
> for Linux 2.2.x is 2GB physical/2GB virtual.  The memory size seems to
> limited by the 32bit nature of the x86 architecture.  The Xeon seems
> to have a 36bit memory addressing mode.  Can Linux be easily expanded
> to use the 36bit addressing?

Easily isn't a good way of putting it, unless you're talking about doing
something like mmap on /dev/mem, in which case you could make the
user/kernel virtual spilt weigh heavy on the user side and do memory
allocation yourself.  If you're talking about doing it transparently,
you're best bet is to do something like davem's suggested high mem
approach, and only use non-kernel mapped memory for user pages... if you
want to be able to support the page cache in high memory, things get
messy.

		-ben

--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Large memory system
  1999-01-30 17:00 ` Benjamin C.R. LaHaise
@ 1999-02-08 11:24   ` Stephen C. Tweedie
  1999-02-08 15:31     ` Eric W. Biederman
  0 siblings, 1 reply; 11+ messages in thread
From: Stephen C. Tweedie @ 1999-02-08 11:24 UTC (permalink / raw)
  To: Benjamin C.R. LaHaise; +Cc: Daniel Blakeley, linux-mm

Hi,

On Sat, 30 Jan 1999 12:00:53 -0500 (EST), "Benjamin C.R. LaHaise"
<blah@kvack.org> said:

> Easily isn't a good way of putting it, unless you're talking about doing
> something like mmap on /dev/mem, in which case you could make the
> user/kernel virtual spilt weigh heavy on the user side and do memory
> allocation yourself.  If you're talking about doing it transparently,
> you're best bet is to do something like davem's suggested high mem
> approach, and only use non-kernel mapped memory for user pages... if you
> want to be able to support the page cache in high memory, things get
> messy.

No it doesn't!  The only tricky thing is IO, but we need to have bounce
buffers to high memory anyway for swapping.  The page cache uses "struct
page" addresses in preference to actual page data pointers almost
everywhere anyway, and whenever we are doing something like read(2) or
write(2) functions, we just need a single per-CPU virtual pte in the
vmalloc region to temporarily map the page into memory while we copy to
user space (and remember that we do this from the context of the user
process anyway, so we don't have to remap the user page even if it is in
high memory).

--Stephen
--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Large memory system
  1999-02-08 11:24   ` Stephen C. Tweedie
@ 1999-02-08 15:31     ` Eric W. Biederman
  1999-02-09 22:57       ` Stephen C. Tweedie
  0 siblings, 1 reply; 11+ messages in thread
From: Eric W. Biederman @ 1999-02-08 15:31 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Benjamin C.R. LaHaise, Daniel Blakeley, linux-mm

>>>>> "ST" == Stephen C Tweedie <sct@redhat.com> writes:

ST> Hi,
ST> On Sat, 30 Jan 1999 12:00:53 -0500 (EST), "Benjamin C.R. LaHaise"
ST> <blah@kvack.org> said:

>> Easily isn't a good way of putting it, unless you're talking about doing
>> something like mmap on /dev/mem, in which case you could make the
>> user/kernel virtual spilt weigh heavy on the user side and do memory
>> allocation yourself.  If you're talking about doing it transparently,
>> you're best bet is to do something like davem's suggested high mem
>> approach, and only use non-kernel mapped memory for user pages... if you
>> want to be able to support the page cache in high memory, things get
>> messy.

ST> No it doesn't!  The only tricky thing is IO, but we need to have bounce
ST> buffers to high memory anyway for swapping.  The page cache uses "struct
ST> page" addresses in preference to actual page data pointers almost
ST> everywhere anyway, and whenever we are doing something like read(2) or
ST> write(2) functions, we just need a single per-CPU virtual pte in the
ST> vmalloc region to temporarily map the page into memory while we copy to
ST> user space (and remember that we do this from the context of the user
ST> process anyway, so we don't have to remap the user page even if it is in
ST> high memory).

Cool.  We now have an idea that sounds possible.

The only remaining question is how much of a performance hit would changing 
the contents of a pte around all of the time be?

Every single page read/write syscall, as well as copying down to I/O bounce buffers
sounds common enough that we probably would see a performance hit.

The other thing that happens is we start breaking assumptions about fixed limits
based on architecture size.  Things like the swap entry may need to be expanded.

Eric
--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Large memory system
  1999-02-08 15:31     ` Eric W. Biederman
@ 1999-02-09 22:57       ` Stephen C. Tweedie
  0 siblings, 0 replies; 11+ messages in thread
From: Stephen C. Tweedie @ 1999-02-09 22:57 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Stephen C. Tweedie, Benjamin C.R. LaHaise, Daniel Blakeley, linux-mm

Hi,

On 08 Feb 1999 09:31:11 -0600, ebiederm+eric@ccr.net (Eric W. Biederman)
said:

> Cool.  We now have an idea that sounds possible.

> The only remaining question is how much of a performance hit would
> changing the contents of a pte around all of the time be?

Very little: there's the cost of the page invalidate (a couple of
cycles), plus the cost of the CPU refilling that tlb from the page
tables.  It's completely lost in the noise compared to the cost of
transferring an entire page of data to/from user space.

> Every single page read/write syscall, as well as copying down to I/O
> bounce buffers sounds common enough that we probably would see a
> performance hit.

I doubt that it would be measurable.

> The other thing that happens is we start breaking assumptions about
> fixed limits based on architecture size.  Things like the swap entry
> may need to be expanded.

The swap entry can probably stay completely independent; most people
with 8G of ram are going to be trying hard never to hit swap anyway. :)
Besides, we already have support for 16G of swap as things stand.

--Stephen
--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Large memory system
  1999-01-30 13:36 Daniel Blakeley
  1999-01-30 17:00 ` Benjamin C.R. LaHaise
@ 1999-02-01 15:59 ` Rik van Riel
  1999-02-08 11:22 ` Stephen C. Tweedie
  2 siblings, 0 replies; 11+ messages in thread
From: Rik van Riel @ 1999-02-01 15:59 UTC (permalink / raw)
  To: Daniel Blakeley; +Cc: linux-mm

On Sat, 30 Jan 1999, Daniel Blakeley wrote:

> I've jumped the gun a little bit and recommended a Professor buy
> 4GB of RAM on a Xeon machine to run Linux on and he did.  After he
> got it I read the large memory howto which states that the max
> memory size for Linux 2.2.x is 2GB physical/2GB virtual.  The
> memory size seems to limited by the 32bit nature of the x86
> architecture.  The Xeon seems to have a 36bit memory addressing
> mode.  Can Linux be easily expanded to use the 36bit addressing?

Just today there was a patch on linux-kernel with a
patch that allows you to use the top 2 GB as a RAM
disk or something like that.

You can use that for swap and to mmap() stuff on.
I think this could be quite useful for large simulations
and stuff like that.

36-bit addressing is a bit difficult at the moment, but
undoubtedly someone will code up something like that for
Linux 2.3 (maybe the prof could let some (under)graduate
student do this as a major project?).

succes,

Rik -- If a Microsoft product fails, who do you sue?
+-------------------------------------------------------------------+
| Linux Memory Management site:  http://humbolt.geo.uu.nl/Linux-MM/ |
| Nederlandse Linux documentatie:          http://www.nl.linux.org/ |
+-------------------------------------------------------------------+

--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Large memory system
  1999-01-30 13:36 Daniel Blakeley
  1999-01-30 17:00 ` Benjamin C.R. LaHaise
  1999-02-01 15:59 ` Rik van Riel
@ 1999-02-08 11:22 ` Stephen C. Tweedie
  2 siblings, 0 replies; 11+ messages in thread
From: Stephen C. Tweedie @ 1999-02-08 11:22 UTC (permalink / raw)
  To: Daniel Blakeley; +Cc: linux-mm, Stephen Tweedie

Hi,

On Sat, 30 Jan 1999 08:36:31 -0500, Daniel Blakeley
<daniel@msc.cornell.edu> said:

> I've jumped the gun a little bit and recommended a Professor buy 4GB
> of RAM on a Xeon machine to run Linux on and he did.  After he got it
> I read the large memory howto which states that the max memory size
> for Linux 2.2.x is 2GB physical/2GB virtual.  The memory size seems to
> limited by the 32bit nature of the x86 architecture.  The Xeon seems
> to have a 36bit memory addressing mode.  Can Linux be easily expanded
> to use the 36bit addressing?

It's not exactly trivial, but it can (and will) be done.  For now, you
can only use 4G on a 64-bit architecture (Alpha or Sparc64), but
basically we know how to address it on Intel too, transparently to the
user.

--Stephen
--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~1999-02-11 11:12 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
1999-02-08 20:33 Large memory system Manfred Spraul
1999-02-10 14:25 ` Stephen C. Tweedie
  -- strict thread matches above, loose matches on Subject: below --
1999-02-10 17:02 Manfred Spraul
1999-02-11 11:12 ` Stephen C. Tweedie
1999-01-30 13:36 Daniel Blakeley
1999-01-30 17:00 ` Benjamin C.R. LaHaise
1999-02-08 11:24   ` Stephen C. Tweedie
1999-02-08 15:31     ` Eric W. Biederman
1999-02-09 22:57       ` Stephen C. Tweedie
1999-02-01 15:59 ` Rik van Riel
1999-02-08 11:22 ` Stephen C. Tweedie

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox