linux-mm.kvack.org archive mirror
* 2.5.53-mm2
@ 2002-12-29  0:52 Andrew Morton
  2003-01-02  4:53 ` 2.5.53-mm2 William Lee Irwin III
  0 siblings, 1 reply; 3+ messages in thread
From: Andrew Morton @ 2002-12-29  0:52 UTC (permalink / raw)
  To: lkml, linux-mm

http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.53/2.5.53-mm2/

Mainly stability work:

. If pte_chain_alloc() fails to allocate GFP_ATOMIC memory, the kernel
  oopses.  This is a long-standing rmap problem.  Present also in the 2.4
  rmap patches and, as far as I know, production Red Hat kernels.

  So it is clearly a very rare problem but it is not acceptable to have
  an unchecked kmalloc in the core of the 2.5 VM.

  The approach I took was to change the page_add_rmap() API to
  require the caller to pass in a preallocated pte_chain, and to
  change all callers to allocate their pte_chains with GFP_KERNEL.

  This change is fairly ugly, but every other hare-brained scheme I
  could come up with had holes.  This one adds maybe 20 instructions
  to pagefaults and works...

  The swapoff path has not yet been converted - this can still oops.

  The locking isn't quite right yet, if shared pagetables are enabled.
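  The preallocate-then-lock pattern described above can be modeled in plain,
  user-space C.  This is only an illustrative sketch: the names mirror the
  kernel API, but the types and bodies here are stand-ins, not the kernel's
  actual implementation.

  ```c
  #include <stdio.h>
  #include <stdlib.h>

  /* Stand-in for the kernel's pte_chain; fields are illustrative only. */
  struct pte_chain { struct pte_chain *next; };

  /* Model of pte_chain_alloc(GFP_KERNEL): may sleep, may fail cleanly. */
  static struct pte_chain *pte_chain_alloc(void)
  {
  	return malloc(sizeof(struct pte_chain));
  }

  /*
   * Model of the reworked page_add_rmap(): the caller hands in a
   * preallocated chain, so no allocation happens with locks held
   * and this function itself can no longer fail.
   */
  static void page_add_rmap(struct pte_chain *prealloc)
  {
  	/* ...link prealloc into the page's chain; cannot fail here... */
  	(void)prealloc;
  }

  int main(void)
  {
  	/* Fault path: allocate with GFP_KERNEL semantics *before* locking. */
  	struct pte_chain *pc = pte_chain_alloc();
  	if (!pc)
  		return 1;   /* fail the fault gracefully; no oops */
  	page_add_rmap(pc);  /* guaranteed not to fail */
  	puts("fault handled");
  	free(pc);
  	return 0;
  }
  ```

  The cost of the extra argument is the "maybe 20 instructions" mentioned
  above; in exchange, the unsleepable path holds no allocation that can fail.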

. If radix_tree_insert() fails to allocate GFP_ATOMIC memory, a system call
  will return -ENOMEM, resulting in application failure.

  This was fixed by implementing a reservation API within the slab allocator.
  Before taking locks the caller of radix_tree_insert will ask slab to preallocate
  sufficient objects in this CPU's slab head array to guarantee that the
  allocation of up to seven (on ia32) radix_tree_nodes cannot fail.

  This permitted the removal of the radix tree mempool.  That's a 130 kbyte
  saving.  (260k on 64-bit).
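  The reservation idea can likewise be sketched in user-space C.  The pool
  below is a toy stand-in for the per-CPU slab head array, and the depth of
  seven matches the ia32 worst-case figure above; none of these names are
  the kernel's real API.

  ```c
  #include <stdio.h>
  #include <stdlib.h>

  /* Toy per-CPU head array: a fixed stack of preallocated objects. */
  #define RESERVE_DEPTH 7

  static void *reserve[RESERVE_DEPTH];
  static int navail;

  /* Model of the reservation call: top up the pool before taking locks. */
  static int slab_reserve(size_t objsize, int count)
  {
  	while (navail < count) {
  		void *obj = malloc(objsize);
  		if (!obj)
  			return -1;      /* -ENOMEM, but no locks held yet */
  		reserve[navail++] = obj;
  	}
  	return 0;
  }

  /* Model of the atomic-context allocation: cannot fail after reservation. */
  static void *reserved_alloc(void)
  {
  	return navail ? reserve[--navail] : NULL;
  }

  int main(void)
  {
  	if (slab_reserve(64, RESERVE_DEPTH))    /* before any spinlock */
  		return 1;
  	for (int i = 0; i < RESERVE_DEPTH; i++) {
  		void *node = reserved_alloc();  /* "GFP_ATOMIC", guaranteed */
  		if (!node)
  			return 1;
  		free(node);
  	}
  	puts("insert path cannot fail");
  	return 0;
  }
  ```

  Because a failed reservation is reported before any lock is taken, the
  caller can simply sleep and retry rather than unwind a half-done insert.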

. Some aggressive pruning of various system-wide memory reserve settings:

  - The page reservation limits in the page allocator have been reduced
    from ~256 pages per zone to ~4 pages per zone.

  - The preallocation levels in the slab head arrays (which were ridiculously
    large) have been reduced from 32k-128k to, typically, a single page.

  - The per-cpu-pages head arrays in the page allocator have been reduced
    from ~64 pages to 2 pages.

  The net effect of these changes is to remove almost all of the kernel's
  reserved memory buffers.  Instead of maintaining several megabytes of
  free memory the kernel will only maintain some tens of kilobytes.

  And guess what?   Everything still works.

  I won't be submitting these changes - they are here for robustness testing.
  But it certainly does indicate that the settings of these thresholds need
  to be reviewed.  And that there don't appear to be any low-on-memory deadlocks
  in the VM (with ext2, at least..)

. An updated dcache_rcu patch which should fix a rename-related race which
  Al Viro noted.




Changes since 2.5.53-mm1:

+linus.patch

 Latest -BK

+aic-bounce.patch

 aic7xxx highmem IO fix

+misc.patch

 triviata

+devfs-fix.patch

 A partial fix for a CONFIG_DEVFS=y boot problem

+copy_page_range-cleanup.patch

 Small cleanups, partly to ease the maintenance of the shared pagetable
 diff.

+pte_chain_alloc-fix.patch

 Infrastructure for handling pte_chain_alloc() failures.

+page_add_rmap-rework.patch

 Handle pte_chain_alloc() failures.

 shpte-ng.patch

 Lots of changes to handle pte_chain_alloc() failures.

+slab-preallocation.patch

 Add an API to slab to reserve objects in the per-CPU head arrays.

+slab-export-tuning.patch

 Export the slab head-array tuning functions.

+rat-preallocation.patch

 Add a reservation API to the radix_tree code.

+use-rat-preallocation.patch

 Use the reservation API to avoid radix_tree allocation failures.

+teeny-mem-limits.patch

 Remove most of the page allocator page reserves.

+smaller-head-arrays.patch

 Remove most of the slab memory reserves.

+remove-hugetlb-syscalls.patch

 Remove the hugetlb system calls.  hugetlbfs is suitable.





All 72 patches:

linus.patch
  cset-1.951-to-1.1030.txt.gz

kgdb.patch

aic-bounce.patch

rcf.patch
  run-child-first after fork

ga2.patch
  don't call console drivers on non-online CPUs

misc.patch
  misc fixes

devfs-fix.patch

dio-return-partial-result.patch

aio-direct-io-infrastructure.patch
  AIO support for raw/O_DIRECT

deferred-bio-dirtying.patch
  bio dirtying infrastructure

aio-direct-io.patch
  AIO support for raw/O_DIRECT

aio-dio-debug.patch

dio-reduce-context-switch-rate.patch
  Reduced wakeup rate in direct-io code

cputimes_stat.patch
  Restore per-cpu time accounting, with a config option

reduce-random-context-switch-rate.patch
  Reduce context switch rate due to the random driver

inlines-net.patch

rbtree-iosched.patch
  rbtree-based IO scheduler

deadsched-fix.patch
  deadline scheduler fix

quota-smp-locks.patch
  Subject: [PATCH] Quota SMP locks

copy_page_range-cleanup.patch
  copy_page_range: minor cleanup

pte_chain_alloc-fix.patch

page_add_rmap-rework.patch

shpte-ng.patch
  pagetable sharing for ia32

slab-preallocation.patch

slab-export-tuning.patch

rat-preallocation.patch

use-rat-preallocation.patch

teeny-mem-limits.patch

smaller-head-arrays.patch

ptrace-flush.patch
  Subject: [PATCH] ptrace on 2.5.44

buffer-debug.patch
  buffer.c debugging

warn-null-wakeup.patch

pentium-II.patch
  Pentium-II support bits

rcu-stats.patch
  RCU statistics reporting

auto-unplug.patch
  self-unplugging request queues

less-unplugging.patch
  Remove most of the blk_run_queues() calls

ext3-fsync-speedup.patch
  Clean up ext3_sync_file()

lockless-current_kernel_time.patch
  Lockless current_kernel_time()

scheduler-tunables.patch
  scheduler tunables

dio-always-kmalloc.patch
  direct-io: dynamically allocate struct dio

file-nr-doc-fix.patch
  Docs: fix explanation of file-nr

set_page_dirty_lock.patch
  fix set_page_dirty vs truncate&free races

remove-memshared.patch
  Remove /proc/meminfo:MemShared

bin2bcd.patch
  BIN_TO_BCD consolidation

log_buf_size.patch
  move LOG_BUF_SIZE to header/config

semtimedop-update.patch
  Enable semtimedop for ia64 32-bit emulation.

drain_local_pages.patch
  add drain_local_pages() for CONFIG_SOFTWARE_SUSPEND

htlb-2.patch
  hugetlb: fix MAP_FIXED handling

kmalloc_percpu.patch
  kmalloc_percpu -- stripped down version

config_page_offset.patch
  Configurable kernel/user memory split

config_hz.patch
  CONFIGurable HZ

dont-aligns-vmas.patch
  Don't cacheline-align vm_area_struct

remove-swappable.patch
  remove task_struct.swappable

remove-hugetlb-syscalls.patch
  Subject: [hugetlb] remove hugetlb syscalls

wli-01_numaq_io.patch
  (undescribed patch)

wli-02_do_sak.patch
  (undescribed patch)

wli-03_proc_super.patch
  (undescribed patch)

wli-06_uml_get_task.patch
  (undescribed patch)

wli-07_numaq_mem_map.patch
  (undescribed patch)

wli-08_numaq_pgdat.patch
  (undescribed patch)

wli-09_has_stopped_jobs.patch
  (undescribed patch)

wli-10_inode_wait.patch
  (undescribed patch)

wli-11_pgd_ctor.patch
  (undescribed patch)

wli-12_pidhash_size.patch
  (undescribed patch)

wli-13_rmap_nrpte.patch
  (undescribed patch)

dcache_rcu-2.patch
  dcache_rcu-2-2.5.51.patch

dcache_rcu-3.patch
  dcache_rcu-3-2.5.51.patch

page-walk-api.patch

page-walk-scsi.patch

page-walk-api-update.patch
  pagewalk API update

gup-check-valid.patch
  valid page test in get_user_pages()
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/


* Re: 2.5.53-mm2
  2002-12-29  0:52 2.5.53-mm2 Andrew Morton
@ 2003-01-02  4:53 ` William Lee Irwin III
  2003-01-02  5:25   ` 2.5.53-mm2 William Lee Irwin III
  0 siblings, 1 reply; 3+ messages in thread
From: William Lee Irwin III @ 2003-01-02  4:53 UTC (permalink / raw)
  To: Andrew Morton; +Cc: lkml, linux-mm

On Sat, Dec 28, 2002 at 04:52:20PM -0800, Andrew Morton wrote:
> wli-11_pgd_ctor.patch
>   (undescribed patch)

A moment's reflection on the subject suggests to me it's worthwhile to
generalize pgd_ctor support so it works (without #ifdefs!) on both PAE
and non-PAE. This tiny tweak is actually more noticeably beneficial
on non-PAE systems but only really because pgd_alloc() is more visible;
the most likely reason it's less visible on PAE is "other overhead".
It looks particularly nice since it removes more code than it adds.

Touch tested on NUMA-Q (PAE). OFTC #kn testers are testing the non-PAE case.

 arch/i386/mm/init.c               |   36 +++++++++++++----------
 arch/i386/mm/pgtable.c            |   58 ++++++++++++--------------------------
 include/asm-i386/pgtable-3level.h |    2 -
 include/asm-i386/pgtable.h        |   13 +-------
 4 files changed, 41 insertions(+), 68 deletions(-)


diff -urpN mm3-2.5.53-1/arch/i386/mm/init.c mm3-2.5.53-2/arch/i386/mm/init.c
--- mm3-2.5.53-1/arch/i386/mm/init.c	2003-01-01 18:49:19.000000000 -0800
+++ mm3-2.5.53-2/arch/i386/mm/init.c	2003-01-01 18:51:17.000000000 -0800
@@ -504,32 +504,36 @@ void __init mem_init(void)
 #endif
 }
 
-#if CONFIG_X86_PAE
 #include <linux/slab.h>
 
-kmem_cache_t *pae_pmd_cachep;
-kmem_cache_t *pae_pgd_cachep;
+kmem_cache_t *pmd_cache;
+kmem_cache_t *pgd_cache;
 
-void pae_pmd_ctor(void *, kmem_cache_t *, unsigned long);
-void pae_pgd_ctor(void *, kmem_cache_t *, unsigned long);
+void pmd_ctor(void *, kmem_cache_t *, unsigned long);
+void pgd_ctor(void *, kmem_cache_t *, unsigned long);
 
 void __init pgtable_cache_init(void)
 {
+	if (PTRS_PER_PMD > 1) {
+		pmd_cache = kmem_cache_create("pae_pmd",
+						PTRS_PER_PMD*sizeof(pmd_t),
+						0,
+						SLAB_HWCACHE_ALIGN | SLAB_MUST_HWCACHE_ALIGN,
+						pmd_ctor,
+						NULL);
+
+		if (!pmd_cache)
+			panic("pgtable_cache_init(): cannot create pmd cache");
+	}
+
         /*
          * PAE pgds must be 16-byte aligned:
          */
-	pae_pmd_cachep = kmem_cache_create("pae_pmd", 4096, 0,
-		SLAB_HWCACHE_ALIGN | SLAB_MUST_HWCACHE_ALIGN, pae_pmd_ctor, NULL);
-
-	if (!pae_pmd_cachep)
-		panic("init_pae(): cannot allocate pae_pmd SLAB cache");
-
-        pae_pgd_cachep = kmem_cache_create("pae_pgd", 32, 0,
-                SLAB_HWCACHE_ALIGN | SLAB_MUST_HWCACHE_ALIGN, pae_pgd_ctor, NULL);
-        if (!pae_pgd_cachep)
-                panic("init_pae(): Cannot alloc pae_pgd SLAB cache");
+        pgd_cache = kmem_cache_create("pgd", PTRS_PER_PGD*sizeof(pgd_t), 0,
+                SLAB_HWCACHE_ALIGN | SLAB_MUST_HWCACHE_ALIGN, pgd_ctor, NULL);
+        if (!pgd_cache)
+                panic("pgtable_cache_init(): Cannot create pgd cache");
 }
-#endif
 
 /* Put this after the callers, so that it cannot be inlined */
 static int do_test_wp_bit(void)
diff -urpN mm3-2.5.53-1/arch/i386/mm/pgtable.c mm3-2.5.53-2/arch/i386/mm/pgtable.c
--- mm3-2.5.53-1/arch/i386/mm/pgtable.c	2003-01-01 18:49:19.000000000 -0800
+++ mm3-2.5.53-2/arch/i386/mm/pgtable.c	2003-01-01 18:51:17.000000000 -0800
@@ -166,19 +166,20 @@ struct page *pte_alloc_one(struct mm_str
 	return pte;
 }
 
-#if CONFIG_X86_PAE
+extern kmem_cache_t *pmd_cache;
+extern kmem_cache_t *pgd_cache;
 
-extern kmem_cache_t *pae_pmd_cachep;
-
-void pae_pmd_ctor(void *__pmd, kmem_cache_t *pmd_cache, unsigned long flags)
+void pmd_ctor(void *__pmd, kmem_cache_t *pmd_cache, unsigned long flags)
 {
 	clear_page(__pmd);
 }
 
-void pae_pgd_ctor(void *__pgd, kmem_cache_t *pgd_cache, unsigned long flags)
+void pgd_ctor(void *__pgd, kmem_cache_t *pgd_cache, unsigned long flags)
 {
 	pgd_t *pgd = __pgd;
 
+	if (PTRS_PER_PMD == 1)
+		memset(pgd, 0, USER_PTRS_PER_PGD * sizeof(pgd_t));
 	memcpy(pgd + USER_PTRS_PER_PGD,
 		swapper_pg_dir + USER_PTRS_PER_PGD,
 		(PTRS_PER_PGD - USER_PTRS_PER_PGD) * sizeof(pgd_t));
@@ -187,59 +188,38 @@ void pae_pgd_ctor(void *__pgd, kmem_cach
 pgd_t *pgd_alloc(struct mm_struct *mm)
 {
 	int i;
-	pgd_t *pgd = kmem_cache_alloc(pae_pgd_cachep, SLAB_KERNEL);
+	pgd_t *pgd = kmem_cache_alloc(pgd_cache, SLAB_KERNEL);
 
-	if (!pgd)
+	if (PTRS_PER_PMD == 1)
+		return pgd;
+	else if (!pgd)
 		return NULL;
 
 	for (i = 0; i < USER_PTRS_PER_PGD; ++i) {
-		pmd_t *pmd = kmem_cache_alloc(pae_pmd_cachep, SLAB_KERNEL);
+		pmd_t *pmd = kmem_cache_alloc(pmd_cache, SLAB_KERNEL);
 		if (!pmd)
 			goto out_oom;
-		else if ((unsigned long)pmd & ~PAGE_MASK) {
-			printk("kmem_cache_alloc did wrong! death ensues!\n");
-			goto out_oom;
-		}
 		set_pgd(pgd + i, __pgd(1 + __pa((unsigned long long)((unsigned long)pmd))));
 	}
 	return pgd;
 
 out_oom:
 	for (i--; i >= 0; --i)
-		kmem_cache_free(pae_pmd_cachep, (void *)__va(pgd_val(pgd[i])-1));
-	kmem_cache_free(pae_pgd_cachep, (void *)pgd);
+		kmem_cache_free(pmd_cache, (void *)__va(pgd_val(pgd[i])-1));
+	kmem_cache_free(pgd_cache, (void *)pgd);
 	return NULL;
 }
 
 void pgd_free(pgd_t *pgd)
 {
 	int i;
-	for (i = 0; i < USER_PTRS_PER_PGD; ++i) {
-		kmem_cache_free(pae_pmd_cachep, (void *)__va(pgd_val(pgd[i])-1));
-		set_pgd(pgd + i, __pgd(0));
-	}
-	kmem_cache_free(pae_pgd_cachep, (void *)pgd);
-}
-
-#else
 
-pgd_t *pgd_alloc(struct mm_struct *mm)
-{
-	pgd_t *pgd = (pgd_t *)__get_free_page(GFP_KERNEL);
-
-	if (pgd) {
-		memset(pgd, 0, USER_PTRS_PER_PGD * sizeof(pgd_t));
-		memcpy(pgd + USER_PTRS_PER_PGD,
-			swapper_pg_dir + USER_PTRS_PER_PGD,
-			(PTRS_PER_PGD - USER_PTRS_PER_PGD) * sizeof(pgd_t));
+	if (PTRS_PER_PMD > 1) {
+		for (i = 0; i < USER_PTRS_PER_PGD; ++i) {
+			kmem_cache_free(pmd_cache, (void *)__va(pgd_val(pgd[i])-1));
+			set_pgd(pgd + i, __pgd(0));
+		}
 	}
-	return pgd;
-}
 
-void pgd_free(pgd_t *pgd)
-{
-	free_page((unsigned long)pgd);
+	kmem_cache_free(pgd_cache, (void *)pgd);
 }
-
-#endif /* CONFIG_X86_PAE */
-
diff -urpN mm3-2.5.53-1/include/asm-i386/pgtable-3level.h mm3-2.5.53-2/include/asm-i386/pgtable-3level.h
--- mm3-2.5.53-1/include/asm-i386/pgtable-3level.h	2002-12-23 21:21:07.000000000 -0800
+++ mm3-2.5.53-2/include/asm-i386/pgtable-3level.h	2003-01-01 18:51:17.000000000 -0800
@@ -106,6 +106,4 @@ static inline pmd_t pfn_pmd(unsigned lon
 	return __pmd(((unsigned long long)page_nr << PAGE_SHIFT) | pgprot_val(pgprot));
 }
 
-extern struct kmem_cache_s *pae_pgd_cachep;
-
 #endif /* _I386_PGTABLE_3LEVEL_H */
diff -urpN mm3-2.5.53-1/include/asm-i386/pgtable.h mm3-2.5.53-2/include/asm-i386/pgtable.h
--- mm3-2.5.53-1/include/asm-i386/pgtable.h	2003-01-01 18:49:21.000000000 -0800
+++ mm3-2.5.53-2/include/asm-i386/pgtable.h	2003-01-01 18:51:17.000000000 -0800
@@ -41,22 +41,13 @@ extern unsigned long empty_zero_page[102
 #ifndef __ASSEMBLY__
 #if CONFIG_X86_PAE
 # include <asm/pgtable-3level.h>
-
-/*
- * Need to initialise the X86 PAE caches
- */
-extern void pgtable_cache_init(void);
-
 #else
 # include <asm/pgtable-2level.h>
+#endif
 
-/*
- * No page table caches to initialise
- */
-#define pgtable_cache_init()	do { } while (0)
+void pgtable_cache_init(void);
 
 #endif
-#endif
 
 #define __beep() asm("movb $0x3,%al; outb %al,$0x61")
 


* Re: 2.5.53-mm2
  2003-01-02  4:53 ` 2.5.53-mm2 William Lee Irwin III
@ 2003-01-02  5:25   ` William Lee Irwin III
  0 siblings, 0 replies; 3+ messages in thread
From: William Lee Irwin III @ 2003-01-02  5:25 UTC (permalink / raw)
  To: Andrew Morton, lkml, linux-mm

On Sat, Dec 28, 2002 at 04:52:20PM -0800, Andrew Morton wrote:
>> wli-11_pgd_ctor.patch

On Wed, Jan 01, 2003 at 08:53:27PM -0800, William Lee Irwin III wrote:
> A moment's reflection on the subject suggests to me it's worthwhile to
> generalize pgd_ctor support so it works (without #ifdefs!) on both PAE
> and non-PAE. This tiny tweak is actually more noticeably beneficial
> on non-PAE systems but only really because pgd_alloc() is more visible;
> the most likely reason it's less visible on PAE is "other overhead".
> It looks particularly nice since it removes more code than it adds.
> Touch tested on NUMA-Q (PAE). OFTC #kn testers testing the non-PAE case.

For those needing more interpretation, this is essentially a reinstatement
of the 2.4.x-style pgd/pmd cache optimization in a leak-free and accounted
(in /proc/slabinfo) manner.

The point of the optimizations is that these initializations are large
cache hits to take in a single shot, and in the PAE case, amount to a
full L1 cache flush as they traverse almost an entire 16K.
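  The win comes from the slab-constructor pattern: the expensive
  initialization runs once, when an object is first constructed, and freed
  objects are returned to the cache already initialized, so reallocation
  skips it.  A toy user-space sketch (not the kernel's slab internals):

  ```c
  #include <stdio.h>
  #include <string.h>

  #define NOBJ 4
  #define OBJSZ 32

  /* Toy slab: objects are constructed once, when first handed out,
   * not on every allocation. */
  static char pool[NOBJ][OBJSZ];
  static int constructed, free_top = -1;
  static int free_stack[NOBJ];
  static int ctor_calls;

  static void pgd_ctor(void *obj)
  {
  	memset(obj, 0, OBJSZ);  /* the expensive init: once per object */
  	ctor_calls++;
  }

  static void *cache_alloc(void)
  {
  	if (free_top >= 0)
  		return pool[free_stack[free_top--]];    /* reuse: no ctor */
  	if (constructed < NOBJ) {
  		pgd_ctor(pool[constructed]);
  		return pool[constructed++];
  	}
  	return NULL;
  }

  static void cache_free(void *obj)
  {
  	/* freed objects must go back in constructed (zeroed) state */
  	memset(obj, 0, OBJSZ);
  	free_stack[++free_top] = (int)(((char (*)[OBJSZ])obj) - pool);
  }

  int main(void)
  {
  	void *a = cache_alloc();
  	cache_free(a);
  	cache_alloc();          /* reused object: ctor not run again */
  	printf("ctor calls: %d\n", ctor_calls);
  	return 0;
  }
  ```

  In the real patch the constructor's memset/memcpy is what would otherwise
  be repeated on every pgd_alloc(), which for PAE touches nearly 16K.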

No rigorous benchmarking has been done yet.

Bill

