* [PATCH 1/7] Fragmentation Avoidance V19: 001_antidefrag_flags
2005-10-30 18:33 [PATCH 0/7] Fragmentation Avoidance V19 Mel Gorman
@ 2005-10-30 18:34 ` Mel Gorman
2005-10-30 18:34 ` [PATCH 2/7] Fragmentation Avoidance V19: 002_usemap Mel Gorman
` (6 subsequent siblings)
7 siblings, 0 replies; 253+ messages in thread
From: Mel Gorman @ 2005-10-30 18:34 UTC (permalink / raw)
To: akpm; +Cc: linux-mm, lhms-devel, linux-kernel, Mel Gorman
This patch adds two flags, __GFP_EASYRCLM and __GFP_KERNRCLM, that are used
to track the type of allocation the caller is making. Allocations using the
__GFP_EASYRCLM flag are expected to be easily reclaimed by syncing with
backing storage (be it a file or swap) or by cleaning the buffers and
discarding. Allocations using the __GFP_KERNRCLM flag belong to slab caches
that can be shrunk by the kernel.
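For illustration only, the intended classification can be sketched in a
standalone (non-kernel) translation unit. The flag values are copied from
the gfp.h hunk in this patch; the helper names is_easy_rclm()/is_kern_rclm()
are invented for the sketch and gfp_t is stood in for by unsigned int:

```c
/*
 * Standalone sketch, not kernel code: the two new flag bits, with
 * values taken from the gfp.h hunk in this patch.
 */
#define __GFP_EASYRCLM	0x80000u	/* user and other easily reclaimed pages */
#define __GFP_KERNRCLM	0x100000u	/* reclaimable kernel (slab) pages */

/* Hypothetical helpers showing how call sites are tagged */
static inline int is_easy_rclm(unsigned int gfp_flags)
{
	return (gfp_flags & __GFP_EASYRCLM) != 0;
}

static inline int is_kern_rclm(unsigned int gfp_flags)
{
	return (gfp_flags & __GFP_KERNRCLM) != 0;
}
```

A caller simply ORs one (never both) of the bits into its existing mask, as
the GFP_NOFS|__GFP_EASYRCLM and gfp_flags|__GFP_KERNRCLM hunks do.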
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Mike Kravetz <kravetz@us.ibm.com>
Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/fs/buffer.c linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/buffer.c
--- linux-2.6.14-rc5-mm1-clean/fs/buffer.c 2005-10-30 13:19:59.000000000 +0000
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/buffer.c 2005-10-30 13:34:50.000000000 +0000
@@ -1119,7 +1119,8 @@ grow_dev_page(struct block_device *bdev,
struct page *page;
struct buffer_head *bh;
- page = find_or_create_page(inode->i_mapping, index, GFP_NOFS);
+ page = find_or_create_page(inode->i_mapping, index,
+ GFP_NOFS|__GFP_EASYRCLM);
if (!page)
return NULL;
@@ -3058,7 +3059,8 @@ static void recalc_bh_state(void)
struct buffer_head *alloc_buffer_head(gfp_t gfp_flags)
{
- struct buffer_head *ret = kmem_cache_alloc(bh_cachep, gfp_flags);
+ struct buffer_head *ret = kmem_cache_alloc(bh_cachep,
+ gfp_flags|__GFP_KERNRCLM);
if (ret) {
get_cpu_var(bh_accounting).nr++;
recalc_bh_state();
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/fs/compat.c linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/compat.c
--- linux-2.6.14-rc5-mm1-clean/fs/compat.c 2005-10-30 13:19:59.000000000 +0000
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/compat.c 2005-10-30 13:34:50.000000000 +0000
@@ -1363,7 +1363,7 @@ static int compat_copy_strings(int argc,
page = bprm->page[i];
new = 0;
if (!page) {
- page = alloc_page(GFP_HIGHUSER);
+ page = alloc_page(GFP_HIGHUSER|__GFP_EASYRCLM);
bprm->page[i] = page;
if (!page) {
ret = -ENOMEM;
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/fs/dcache.c linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/dcache.c
--- linux-2.6.14-rc5-mm1-clean/fs/dcache.c 2005-10-30 13:19:59.000000000 +0000
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/dcache.c 2005-10-30 13:34:50.000000000 +0000
@@ -878,7 +878,7 @@ struct dentry *d_alloc(struct dentry * p
struct dentry *dentry;
char *dname;
- dentry = kmem_cache_alloc(dentry_cache, GFP_KERNEL);
+ dentry = kmem_cache_alloc(dentry_cache, GFP_KERNEL|__GFP_KERNRCLM);
if (!dentry)
return NULL;
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/fs/exec.c linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/exec.c
--- linux-2.6.14-rc5-mm1-clean/fs/exec.c 2005-10-30 13:19:59.000000000 +0000
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/exec.c 2005-10-30 13:34:50.000000000 +0000
@@ -237,7 +237,7 @@ static int copy_strings(int argc, char _
page = bprm->page[i];
new = 0;
if (!page) {
- page = alloc_page(GFP_HIGHUSER);
+ page = alloc_page(GFP_HIGHUSER|__GFP_EASYRCLM);
bprm->page[i] = page;
if (!page) {
ret = -ENOMEM;
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/fs/ext2/super.c linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/ext2/super.c
--- linux-2.6.14-rc5-mm1-clean/fs/ext2/super.c 2005-10-20 07:23:05.000000000 +0100
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/ext2/super.c 2005-10-30 13:34:50.000000000 +0000
@@ -141,7 +141,8 @@ static kmem_cache_t * ext2_inode_cachep;
static struct inode *ext2_alloc_inode(struct super_block *sb)
{
struct ext2_inode_info *ei;
- ei = (struct ext2_inode_info *)kmem_cache_alloc(ext2_inode_cachep, SLAB_KERNEL);
+ ei = (struct ext2_inode_info *)kmem_cache_alloc(ext2_inode_cachep,
+ SLAB_KERNEL|__GFP_KERNRCLM);
if (!ei)
return NULL;
#ifdef CONFIG_EXT2_FS_POSIX_ACL
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/fs/ext3/super.c linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/ext3/super.c
--- linux-2.6.14-rc5-mm1-clean/fs/ext3/super.c 2005-10-30 13:20:00.000000000 +0000
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/ext3/super.c 2005-10-30 13:34:50.000000000 +0000
@@ -444,7 +444,7 @@ static struct inode *ext3_alloc_inode(st
{
struct ext3_inode_info *ei;
- ei = kmem_cache_alloc(ext3_inode_cachep, SLAB_NOFS);
+ ei = kmem_cache_alloc(ext3_inode_cachep, SLAB_NOFS|__GFP_KERNRCLM);
if (!ei)
return NULL;
#ifdef CONFIG_EXT3_FS_POSIX_ACL
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/fs/inode.c linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/inode.c
--- linux-2.6.14-rc5-mm1-clean/fs/inode.c 2005-10-20 07:23:05.000000000 +0100
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/inode.c 2005-10-30 13:34:50.000000000 +0000
@@ -146,7 +146,7 @@ static struct inode *alloc_inode(struct
mapping->a_ops = &empty_aops;
mapping->host = inode;
mapping->flags = 0;
- mapping_set_gfp_mask(mapping, GFP_HIGHUSER);
+ mapping_set_gfp_mask(mapping, GFP_HIGHUSER|__GFP_EASYRCLM);
mapping->assoc_mapping = NULL;
mapping->backing_dev_info = &default_backing_dev_info;
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/fs/ntfs/inode.c linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/ntfs/inode.c
--- linux-2.6.14-rc5-mm1-clean/fs/ntfs/inode.c 2005-10-30 13:20:01.000000000 +0000
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/ntfs/inode.c 2005-10-30 13:34:50.000000000 +0000
@@ -318,7 +318,7 @@ struct inode *ntfs_alloc_big_inode(struc
ntfs_inode *ni;
ntfs_debug("Entering.");
- ni = kmem_cache_alloc(ntfs_big_inode_cache, SLAB_NOFS);
+ ni = kmem_cache_alloc(ntfs_big_inode_cache, SLAB_NOFS|__GFP_KERNRCLM);
if (likely(ni != NULL)) {
ni->state = 0;
return VFS_I(ni);
@@ -343,7 +343,7 @@ static inline ntfs_inode *ntfs_alloc_ext
ntfs_inode *ni;
ntfs_debug("Entering.");
- ni = kmem_cache_alloc(ntfs_inode_cache, SLAB_NOFS);
+ ni = kmem_cache_alloc(ntfs_inode_cache, SLAB_NOFS|__GFP_KERNRCLM);
if (likely(ni != NULL)) {
ni->state = 0;
return ni;
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/include/asm-i386/page.h linux-2.6.14-rc5-mm1-001_antidefrag_flags/include/asm-i386/page.h
--- linux-2.6.14-rc5-mm1-clean/include/asm-i386/page.h 2005-10-20 07:23:05.000000000 +0100
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/include/asm-i386/page.h 2005-10-30 13:34:50.000000000 +0000
@@ -36,7 +36,8 @@
#define clear_user_page(page, vaddr, pg) clear_page(page)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
-#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define alloc_zeroed_user_highpage(vma, vaddr) \
+ alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | __GFP_EASYRCLM, vma, vaddr)
#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
/*
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/include/linux/gfp.h linux-2.6.14-rc5-mm1-001_antidefrag_flags/include/linux/gfp.h
--- linux-2.6.14-rc5-mm1-clean/include/linux/gfp.h 2005-10-30 13:20:05.000000000 +0000
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/include/linux/gfp.h 2005-10-30 13:34:50.000000000 +0000
@@ -50,14 +50,27 @@ struct vm_area_struct;
#define __GFP_HARDWALL 0x40000u /* Enforce hardwall cpuset memory allocs */
#define __GFP_VALID 0x80000000u /* valid GFP flags */
-#define __GFP_BITS_SHIFT 20 /* Room for 20 __GFP_FOO bits */
+/*
+ * Allocation type modifiers, these are required to be adjacent
+ * __GFP_EASYRCLM: Easily reclaimed pages like userspace or buffer pages
+ * __GFP_KERNRCLM: Short-lived or reclaimable kernel allocation
+ * Both bits off: Kernel non-reclaimable or very hard to reclaim
+ * __GFP_EASYRCLM and __GFP_KERNRCLM should not be specified at the same time
+ * RCLM_SHIFT (defined elsewhere) depends on the location of these bits
+ */
+#define __GFP_EASYRCLM 0x80000u /* User and other easily reclaimed pages */
+#define __GFP_KERNRCLM 0x100000u /* Kernel page that is reclaimable */
+#define __GFP_RCLM_BITS (__GFP_EASYRCLM|__GFP_KERNRCLM)
+
+#define __GFP_BITS_SHIFT 21 /* Room for 21 __GFP_FOO bits */
#define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1)
/* if you forget to add the bitmask here kernel will crash, period */
#define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
__GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
__GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP| \
- __GFP_NOMEMALLOC|__GFP_NORECLAIM|__GFP_HARDWALL)
+ __GFP_NOMEMALLOC|__GFP_NORECLAIM|__GFP_HARDWALL| \
+ __GFP_EASYRCLM|__GFP_KERNRCLM)
#define GFP_ATOMIC (__GFP_VALID | __GFP_HIGH)
#define GFP_NOIO (__GFP_VALID | __GFP_WAIT)
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/include/linux/highmem.h linux-2.6.14-rc5-mm1-001_antidefrag_flags/include/linux/highmem.h
--- linux-2.6.14-rc5-mm1-clean/include/linux/highmem.h 2005-10-20 07:23:05.000000000 +0100
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/include/linux/highmem.h 2005-10-30 13:34:50.000000000 +0000
@@ -47,7 +47,8 @@ static inline void clear_user_highpage(s
static inline struct page *
alloc_zeroed_user_highpage(struct vm_area_struct *vma, unsigned long vaddr)
{
- struct page *page = alloc_page_vma(GFP_HIGHUSER, vma, vaddr);
+ struct page *page = alloc_page_vma(GFP_HIGHUSER|__GFP_EASYRCLM,
+ vma, vaddr);
if (page)
clear_user_highpage(page, vaddr);
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/mm/memory.c linux-2.6.14-rc5-mm1-001_antidefrag_flags/mm/memory.c
--- linux-2.6.14-rc5-mm1-clean/mm/memory.c 2005-10-30 13:20:06.000000000 +0000
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/mm/memory.c 2005-10-30 13:34:50.000000000 +0000
@@ -1295,7 +1295,8 @@ static int do_wp_page(struct mm_struct *
if (!new_page)
goto oom;
} else {
- new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+ new_page = alloc_page_vma(GFP_HIGHUSER|__GFP_EASYRCLM,
+ vma, address);
if (!new_page)
goto oom;
copy_user_highpage(new_page, old_page, address);
@@ -1858,7 +1859,8 @@ retry:
if (unlikely(anon_vma_prepare(vma)))
goto oom;
- page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+ page = alloc_page_vma(GFP_HIGHUSER|__GFP_EASYRCLM,
+ vma, address);
if (!page)
goto oom;
copy_user_highpage(page, new_page, address);
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/mm/shmem.c linux-2.6.14-rc5-mm1-001_antidefrag_flags/mm/shmem.c
--- linux-2.6.14-rc5-mm1-clean/mm/shmem.c 2005-10-30 13:20:06.000000000 +0000
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/mm/shmem.c 2005-10-30 13:34:50.000000000 +0000
@@ -906,7 +906,7 @@ shmem_alloc_page(unsigned long gfp, stru
pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, idx);
pvma.vm_pgoff = idx;
pvma.vm_end = PAGE_SIZE;
- page = alloc_page_vma(gfp | __GFP_ZERO, &pvma, 0);
+ page = alloc_page_vma(gfp | __GFP_ZERO | __GFP_EASYRCLM, &pvma, 0);
mpol_free(pvma.vm_policy);
return page;
}
@@ -921,7 +921,7 @@ shmem_swapin(struct shmem_inode_info *in
static inline struct page *
shmem_alloc_page(gfp_t gfp,struct shmem_inode_info *info, unsigned long idx)
{
- return alloc_page(gfp | __GFP_ZERO);
+ return alloc_page(gfp | __GFP_ZERO | __GFP_EASYRCLM);
}
#endif
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/mm/swap_state.c linux-2.6.14-rc5-mm1-001_antidefrag_flags/mm/swap_state.c
--- linux-2.6.14-rc5-mm1-clean/mm/swap_state.c 2005-10-30 13:20:06.000000000 +0000
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/mm/swap_state.c 2005-10-30 13:34:50.000000000 +0000
@@ -341,7 +341,8 @@ struct page *read_swap_cache_async(swp_e
* Get a new page to read into from swap.
*/
if (!new_page) {
- new_page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
+ new_page = alloc_page_vma(GFP_HIGHUSER|__GFP_EASYRCLM,
+ vma, addr);
if (!new_page)
break; /* Out of memory */
}
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* [PATCH 2/7] Fragmentation Avoidance V19: 002_usemap
From: Mel Gorman @ 2005-10-30 18:34 UTC (permalink / raw)
To: akpm; +Cc: linux-mm, Mel Gorman, linux-kernel, lhms-devel
This patch adds a "usemap" to the allocator. When a PAGES_PER_MAXORDER block
of pages (i.e. 2^(MAX_ORDER-1) pages) is split, the usemap records the type
of allocation the block is being split for. This information is used in a
later anti-fragmentation patch to group related allocation types together.
The __GFP_EASYRCLM and __GFP_KERNRCLM bits are used to enumerate three
allocation types:
RCLM_NORCLM: Kernel allocations that cannot be reclaimed on demand.
RCLM_EASY: Pages allocated with the __GFP_EASYRCLM flag set. These are
considered to be user and other easily reclaimed pages, such
as buffers.
RCLM_KERN: Allocated for the kernel, but from caches that can be reclaimed
on demand.
gfpflags_to_rclmtype() converts gfp_flags to the corresponding RCLM_TYPE
by masking out the irrelevant bits and shifting the result right by
RCLM_SHIFT. Compile-time checks are made on RCLM_SHIFT so that
gfpflags_to_rclmtype() keeps working if the flag bits move. ffz() could be
used to avoid the static checks, but it would add runtime overhead for what
is a compile-time constant.
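The mask-and-shift mapping can be checked in a standalone (non-kernel)
sketch; the constants below are copied from the gfp.h and mmzone.h hunks,
RCLM_SHIFT = 19 as in the page_alloc.c hunk, and gfp_t is stood in for by
unsigned int:

```c
/* Standalone sketch of gfpflags_to_rclmtype(), not kernel code */
#define __GFP_EASYRCLM	0x80000u
#define __GFP_KERNRCLM	0x100000u
#define __GFP_RCLM_BITS	(__GFP_EASYRCLM|__GFP_KERNRCLM)

#define RCLM_NORCLM	0
#define RCLM_EASY	1
#define RCLM_KERN	2

#define RCLM_SHIFT	19

/* Static checks equivalent to the #error tests in the patch */
#if (__GFP_EASYRCLM >> RCLM_SHIFT) != RCLM_EASY
#error __GFP_EASYRCLM not mapping to RCLM_EASY
#endif
#if (__GFP_KERNRCLM >> RCLM_SHIFT) != RCLM_KERN
#error __GFP_KERNRCLM not mapping to RCLM_KERN
#endif

/* Mask out the irrelevant bits, then shift down to a small index */
static inline int gfpflags_to_rclmtype(unsigned int gfp_flags)
{
	return (int)((gfp_flags & __GFP_RCLM_BITS) >> RCLM_SHIFT);
}
```

With neither bit set the result is RCLM_NORCLM (0), so untagged kernel
allocations fall through to the non-reclaimable type by default.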
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Mike Kravetz <kravetz@us.ibm.com>
Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-001_antidefrag_flags/include/linux/mm.h linux-2.6.14-rc5-mm1-002_usemap/include/linux/mm.h
--- linux-2.6.14-rc5-mm1-001_antidefrag_flags/include/linux/mm.h 2005-10-30 13:20:05.000000000 +0000
+++ linux-2.6.14-rc5-mm1-002_usemap/include/linux/mm.h 2005-10-30 13:35:31.000000000 +0000
@@ -529,6 +529,12 @@ static inline void set_page_links(struct
extern struct page *mem_map;
#endif
+/*
+ * Return what type of page this 2^(MAX_ORDER-1) block of pages is being
+ * used for. Return value is one of the RCLM_X types
+ */
+extern int get_pageblock_type(struct zone *zone, struct page *page);
+
static inline void *lowmem_page_address(struct page *page)
{
return __va(page_to_pfn(page) << PAGE_SHIFT);
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-001_antidefrag_flags/include/linux/mmzone.h linux-2.6.14-rc5-mm1-002_usemap/include/linux/mmzone.h
--- linux-2.6.14-rc5-mm1-001_antidefrag_flags/include/linux/mmzone.h 2005-10-30 13:20:05.000000000 +0000
+++ linux-2.6.14-rc5-mm1-002_usemap/include/linux/mmzone.h 2005-10-30 13:35:31.000000000 +0000
@@ -21,6 +21,17 @@
#else
#define MAX_ORDER CONFIG_FORCE_MAX_ZONEORDER
#endif
+#define PAGES_PER_MAXORDER (1 << (MAX_ORDER-1))
+
+/*
+ * The two-bit field __GFP_RCLM_BITS enumerates the following types of
+ * page reclaimability.
+ */
+#define RCLM_NORCLM 0
+#define RCLM_EASY 1
+#define RCLM_KERN 2
+#define RCLM_TYPES 3
+#define BITS_PER_RCLM_TYPE 2
struct free_area {
struct list_head free_list;
@@ -146,6 +157,13 @@ struct zone {
#endif
struct free_area free_area[MAX_ORDER];
+#ifndef CONFIG_SPARSEMEM
+ /*
+ * The map tracks what each 2^MAX_ORDER-1 sized block is being used for.
+ * Each PAGES_PER_MAXORDER block of pages use BITS_PER_RCLM_TYPE bits
+ */
+ unsigned long *free_area_usemap;
+#endif
ZONE_PADDING(_pad1_)
@@ -501,9 +519,14 @@ extern struct pglist_data contig_page_da
#define PAGES_PER_SECTION (1UL << PFN_SECTION_SHIFT)
#define PAGE_SECTION_MASK (~(PAGES_PER_SECTION-1))
+#define FREE_AREA_BITS 64
+
#if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
#error Allocator MAX_ORDER exceeds SECTION_SIZE
#endif
+#if ((SECTION_SIZE_BITS - MAX_ORDER) * BITS_PER_RCLM_TYPE) > FREE_AREA_BITS
+#error free_area_usemap is not big enough
+#endif
struct page;
struct mem_section {
@@ -516,6 +539,7 @@ struct mem_section {
* before using it wrong.
*/
unsigned long section_mem_map;
+ DECLARE_BITMAP(free_area_usemap, FREE_AREA_BITS);
};
#ifdef CONFIG_SPARSEMEM_EXTREME
@@ -584,6 +608,18 @@ static inline struct mem_section *__pfn_
return __nr_to_section(pfn_to_section_nr(pfn));
}
+static inline unsigned long *pfn_to_usemap(struct zone *zone,
+ unsigned long pfn)
+{
+ return &__pfn_to_section(pfn)->free_area_usemap[0];
+}
+
+static inline int pfn_to_bitidx(struct zone *zone, unsigned long pfn)
+{
+ pfn &= (PAGES_PER_SECTION-1);
+ return (pfn >> (MAX_ORDER-1)) * BITS_PER_RCLM_TYPE;
+}
+
#define pfn_to_page(pfn) \
({ \
unsigned long __pfn = (pfn); \
@@ -621,6 +657,17 @@ void sparse_init(void);
#else
#define sparse_init() do {} while (0)
#define sparse_index_init(_sec, _nid) do {} while (0)
+static inline unsigned long *pfn_to_usemap(struct zone *zone,
+ unsigned long pfn)
+{
+ return zone->free_area_usemap;
+}
+
+static inline int pfn_to_bitidx(struct zone *zone, unsigned long pfn)
+{
+ pfn = pfn - zone->zone_start_pfn;
+ return (pfn >> (MAX_ORDER-1)) * BITS_PER_RCLM_TYPE;
+}
#endif /* CONFIG_SPARSEMEM */
#ifdef CONFIG_NODES_SPAN_OTHER_NODES
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-001_antidefrag_flags/mm/page_alloc.c linux-2.6.14-rc5-mm1-002_usemap/mm/page_alloc.c
--- linux-2.6.14-rc5-mm1-001_antidefrag_flags/mm/page_alloc.c 2005-10-30 13:20:06.000000000 +0000
+++ linux-2.6.14-rc5-mm1-002_usemap/mm/page_alloc.c 2005-10-30 13:35:31.000000000 +0000
@@ -69,6 +69,99 @@ int sysctl_lowmem_reserve_ratio[MAX_NR_Z
EXPORT_SYMBOL(totalram_pages);
/*
+ * RCLM_SHIFT is the number of bits that a gfp_mask has to be shifted right
+ * to have just the __GFP_EASYRCLM and __GFP_KERNRCLM bits. The static check
+ * is made afterwards in case the GFP flags are updated without updating
+ * this number
+ */
+#define RCLM_SHIFT 19
+#if (__GFP_EASYRCLM >> RCLM_SHIFT) != RCLM_EASY
+#error __GFP_EASYRCLM not mapping to RCLM_EASY
+#endif
+#if (__GFP_KERNRCLM >> RCLM_SHIFT) != RCLM_KERN
+#error __GFP_KERNRCLM not mapping to RCLM_KERN
+#endif
+
+/*
+ * This function maps gfpflags to their RCLM_TYPE. It makes assumptions
+ * on the location of the GFP flags.
+ */
+static inline int gfpflags_to_rclmtype(gfp_t gfp_flags)
+{
+ unsigned long rclmbits = gfp_flags & __GFP_RCLM_BITS;
+
+ /* Specifying both RCLM flags makes no sense */
+ if (unlikely(rclmbits == __GFP_RCLM_BITS)) {
+ printk(KERN_WARNING "Multiple RCLM GFP flags specified\n");
+ dump_stack();
+ return RCLM_TYPES;
+ }
+
+ return rclmbits >> RCLM_SHIFT;
+}
+
+/*
+ * copy_bits - Copy bits between bitmaps
+ * @dstaddr: The destination bitmap to copy to
+ * @srcaddr: The source bitmap to copy from
+ * @sindex_dst: The start bit index within the destination map to copy to
+ * @sindex_src: The start bit index within the source map to copy from
+ * @nr: The number of bits to copy
+ *
+ * Note that this method is slow and makes no guarantees for atomicity.
+ * It depends on being called with the zone spinlock held to ensure data
+ * safety
+ */
+static inline void copy_bits(unsigned long *dstaddr,
+ unsigned long *srcaddr,
+ int sindex_dst,
+ int sindex_src,
+ int nr)
+{
+ /*
+ * Written like this to take advantage of arch-specific
+ * set_bit() and clear_bit() functions
+ */
+ for (nr = nr - 1; nr >= 0; nr--) {
+ int bit = test_bit(sindex_src + nr, srcaddr);
+ if (bit)
+ set_bit(sindex_dst + nr, dstaddr);
+ else
+ clear_bit(sindex_dst + nr, dstaddr);
+ }
+}
+
+int get_pageblock_type(struct zone *zone, struct page *page)
+{
+ unsigned long pfn = page_to_pfn(page);
+ unsigned long type = 0;
+ unsigned long *usemap;
+ int bitidx;
+
+ bitidx = pfn_to_bitidx(zone, pfn);
+ usemap = pfn_to_usemap(zone, pfn);
+
+ copy_bits(&type, usemap, 0, bitidx, BITS_PER_RCLM_TYPE);
+
+ return type;
+}
+
+/* Reserve a block of pages for an allocation type */
+static inline void set_pageblock_type(struct zone *zone, struct page *page,
+ int type)
+{
+ unsigned long pfn = page_to_pfn(page);
+ unsigned long *usemap;
+ unsigned long ltype = type;
+ int bitidx;
+
+ bitidx = pfn_to_bitidx(zone, pfn);
+ usemap = pfn_to_usemap(zone, pfn);
+
+ copy_bits(usemap, &ltype, bitidx, 0, BITS_PER_RCLM_TYPE);
+}
+
+/*
* Used by page_zone() to look up the address of the struct zone whose
* id is encoded in the upper bits of page->flags
*/
@@ -498,7 +591,8 @@ static void prep_new_page(struct page *p
* Do the hard work of removing an element from the buddy allocator.
* Call me with the zone->lock already held.
*/
-static struct page *__rmqueue(struct zone *zone, unsigned int order)
+static struct page *__rmqueue(struct zone *zone, unsigned int order,
+ int alloctype)
{
struct free_area * area;
unsigned int current_order;
@@ -514,6 +608,14 @@ static struct page *__rmqueue(struct zon
rmv_page_order(page);
area->nr_free--;
zone->free_pages -= 1UL << order;
+
+ /*
+ * If splitting a large block, record what the block is being
+ * used for in the usemap
+ */
+ if (current_order == MAX_ORDER-1)
+ set_pageblock_type(zone, page, alloctype);
+
return expand(zone, page, order, current_order, area);
}
@@ -526,7 +628,8 @@ static struct page *__rmqueue(struct zon
* Returns the number of new pages which were placed at *list.
*/
static int rmqueue_bulk(struct zone *zone, unsigned int order,
- unsigned long count, struct list_head *list)
+ unsigned long count, struct list_head *list,
+ int alloctype)
{
unsigned long flags;
int i;
@@ -535,7 +638,7 @@ static int rmqueue_bulk(struct zone *zon
spin_lock_irqsave(&zone->lock, flags);
for (i = 0; i < count; ++i) {
- page = __rmqueue(zone, order);
+ page = __rmqueue(zone, order, alloctype);
if (page == NULL)
break;
allocated++;
@@ -719,6 +822,11 @@ buffered_rmqueue(struct zone *zone, int
unsigned long flags;
struct page *page = NULL;
int cold = !!(gfp_flags & __GFP_COLD);
+ int alloctype = gfpflags_to_rclmtype(gfp_flags);
+
+ /* If the alloctype is RCLM_TYPES, the gfp_flags make no sense */
+ if (alloctype == RCLM_TYPES)
+ return NULL;
if (order == 0) {
struct per_cpu_pages *pcp;
@@ -727,7 +835,8 @@ buffered_rmqueue(struct zone *zone, int
local_irq_save(flags);
if (pcp->count <= pcp->low)
pcp->count += rmqueue_bulk(zone, 0,
- pcp->batch, &pcp->list);
+ pcp->batch, &pcp->list,
+ alloctype);
if (pcp->count) {
page = list_entry(pcp->list.next, struct page, lru);
list_del(&page->lru);
@@ -739,7 +848,7 @@ buffered_rmqueue(struct zone *zone, int
if (page == NULL) {
spin_lock_irqsave(&zone->lock, flags);
- page = __rmqueue(zone, order);
+ page = __rmqueue(zone, order, alloctype);
spin_unlock_irqrestore(&zone->lock, flags);
}
@@ -1866,6 +1975,38 @@ inline void setup_pageset(struct per_cpu
INIT_LIST_HEAD(&pcp->list);
}
+#ifndef CONFIG_SPARSEMEM
+#define roundup(x, y) ((((x)+((y)-1))/(y))*(y))
+/*
+ * Calculate the size of the zone->usemap in bytes rounded to an unsigned long
+ * Start by making sure zonesize is a multiple of MAX_ORDER-1 by rounding up
+ * Then figure 1 RCLM_TYPE worth of bits per MAX_ORDER-1, finally round up
+ * what is now in bits to nearest long in bits, then return it in bytes.
+ */
+static unsigned long __init usemap_size(unsigned long zonesize)
+{
+ unsigned long usemapsize;
+
+ usemapsize = roundup(zonesize, PAGES_PER_MAXORDER);
+ usemapsize = usemapsize >> (MAX_ORDER-1);
+ usemapsize *= BITS_PER_RCLM_TYPE;
+ usemapsize = roundup(usemapsize, 8 * sizeof(unsigned long));
+
+ return usemapsize / 8;
+}
+
+static void __init setup_usemap(struct pglist_data *pgdat,
+ struct zone *zone, unsigned long zonesize)
+{
+ unsigned long usemapsize = usemap_size(zonesize);
+ zone->free_area_usemap = alloc_bootmem_node(pgdat, usemapsize);
+ memset(zone->free_area_usemap, RCLM_NORCLM, usemapsize);
+}
+#else
+static void inline setup_usemap(struct pglist_data *pgdat,
+ struct zone *zone, unsigned long zonesize) {}
+#endif /* CONFIG_SPARSEMEM */
+
#ifdef CONFIG_NUMA
/*
* Boot pageset table. One per cpu which is going to be used for all
@@ -2079,6 +2220,7 @@ static void __init free_area_init_core(s
zonetable_add(zone, nid, j, zone_start_pfn, size);
init_currently_empty_zone(zone, zone_start_pfn, size);
zone_start_pfn += size;
+ setup_usemap(pgdat, zone, size);
}
}
--
* [PATCH 3/7] Fragmentation Avoidance V19: 003_fragcore
From: Mel Gorman @ 2005-10-30 18:34 UTC (permalink / raw)
To: akpm; +Cc: linux-mm, lhms-devel, linux-kernel, Mel Gorman
This patch adds the core of the anti-fragmentation strategy. It works by
grouping related allocation types together. The idea is that large groups of
pages that may be reclaimed are placed near each other. The zone->free_area
list is broken into one set of free lists per RCLM_TYPE.
This section of the patch looks superfluous, but it is needed to suppress a
compiler warning. Suggestions to make it better looking are welcome.
- struct free_area * area;
+ struct free_area * area = NULL;
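The fallback behaviour introduced here can be sketched standalone. This is
not the kernel code: find_steal_type() and free_area_sketch are invented
names, the list heads are reduced to free counters, and MAX_ORDER is
assumed to be 11 as on common configurations:

```c
/*
 * Standalone sketch of the scan done by steal_maxorder_block(): each
 * RCLM type owns its own free lists, and when the preferred type has
 * no MAX_ORDER-1 block free, the other types are scanned in order.
 */
#define MAX_ORDER	11	/* assumed; configuration-dependent */
#define RCLM_TYPES	3

struct free_area_sketch {
	unsigned long nr_free;	/* stand-in for the real list head */
};

static struct free_area_sketch free_area_lists[RCLM_TYPES][MAX_ORDER];

/* Return the type to steal a MAX_ORDER-1 block from, or -1 if none */
static int find_steal_type(int alloctype)
{
	int i;

	for (i = 0; i < RCLM_TYPES; i++) {
		if (i == alloctype)
			continue;	/* own lists already checked */
		if (free_area_lists[i][MAX_ORDER-1].nr_free)
			return i;
	}
	return -1;
}
```

The real steal_maxorder_block() additionally calls set_pageblock_type() on
the stolen block, so the usemap reflects the new owner before the block is
split by expand().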
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Mike Kravetz <kravetz@us.ibm.com>
Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-002_usemap/include/linux/mmzone.h linux-2.6.14-rc5-mm1-003_fragcore/include/linux/mmzone.h
--- linux-2.6.14-rc5-mm1-002_usemap/include/linux/mmzone.h 2005-10-30 13:35:31.000000000 +0000
+++ linux-2.6.14-rc5-mm1-003_fragcore/include/linux/mmzone.h 2005-10-30 13:36:16.000000000 +0000
@@ -33,6 +33,10 @@
#define RCLM_TYPES 3
#define BITS_PER_RCLM_TYPE 2
+#define for_each_rclmtype_order(type, order) \
+ for (order = 0; order < MAX_ORDER; order++) \
+ for (type = 0; type < RCLM_TYPES; type++)
+
struct free_area {
struct list_head free_list;
unsigned long nr_free;
@@ -155,7 +159,6 @@ struct zone {
/* see spanned/present_pages for more description */
seqlock_t span_seqlock;
#endif
- struct free_area free_area[MAX_ORDER];
#ifndef CONFIG_SPARSEMEM
/*
@@ -165,6 +168,8 @@ struct zone {
unsigned long *free_area_usemap;
#endif
+ struct free_area free_area_lists[RCLM_TYPES][MAX_ORDER];
+
ZONE_PADDING(_pad1_)
/* Fields commonly accessed by the page reclaim scanner */
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-002_usemap/mm/page_alloc.c linux-2.6.14-rc5-mm1-003_fragcore/mm/page_alloc.c
--- linux-2.6.14-rc5-mm1-002_usemap/mm/page_alloc.c 2005-10-30 13:35:31.000000000 +0000
+++ linux-2.6.14-rc5-mm1-003_fragcore/mm/page_alloc.c 2005-10-30 13:36:16.000000000 +0000
@@ -352,6 +352,15 @@ __find_combined_index(unsigned long page
}
/*
+ * Return the free list for a given page within a zone
+ */
+static inline struct free_area *__page_find_freelist(struct zone *zone,
+ struct page *page)
+{
+ return zone->free_area_lists[get_pageblock_type(zone, page)];
+}
+
+/*
* This function checks whether a page is free && is the buddy
* we can do coalesce a page and its buddy if
* (a) the buddy is free &&
@@ -398,6 +407,8 @@ static inline void __free_pages_bulk (st
{
unsigned long page_idx;
int order_size = 1 << order;
+ struct free_area *area;
+ struct free_area *freelist;
if (unlikely(order))
destroy_compound_page(page, order);
@@ -407,10 +418,11 @@ static inline void __free_pages_bulk (st
BUG_ON(page_idx & (order_size - 1));
BUG_ON(bad_range(zone, page));
+ freelist = __page_find_freelist(zone, page);
+
zone->free_pages += order_size;
while (order < MAX_ORDER-1) {
unsigned long combined_idx;
- struct free_area *area;
struct page *buddy;
combined_idx = __find_combined_index(page_idx, order);
@@ -421,7 +433,7 @@ static inline void __free_pages_bulk (st
if (!page_is_buddy(buddy, order))
break; /* Move the buddy up one level. */
list_del(&buddy->lru);
- area = zone->free_area + order;
+ area = &freelist[order];
area->nr_free--;
rmv_page_order(buddy);
page = page + (combined_idx - page_idx);
@@ -429,8 +441,8 @@ static inline void __free_pages_bulk (st
order++;
}
set_page_order(page, order);
- list_add(&page->lru, &zone->free_area[order].free_list);
- zone->free_area[order].nr_free++;
+ list_add_tail(&page->lru, &freelist[order].free_list);
+ freelist[order].nr_free++;
}
static inline void free_pages_check(const char *function, struct page *page)
@@ -587,6 +599,45 @@ static void prep_new_page(struct page *p
kernel_map_pages(page, 1 << order, 1);
}
+/*
+ * Find a list that has a 2^MAX_ORDER-1 block of pages available and
+ * return it
+ */
+struct page *steal_maxorder_block(struct zone *zone, int alloctype)
+{
+ struct page *page;
+ struct free_area *area = NULL;
+ int i;
+
+ for(i = 0; i < RCLM_TYPES; i++) {
+ if (i == alloctype)
+ continue;
+
+ area = &zone->free_area_lists[i][MAX_ORDER-1];
+ if (!list_empty(&area->free_list))
+ break;
+ }
+ if (i == RCLM_TYPES)
+ return NULL;
+
+ page = list_entry(area->free_list.next, struct page, lru);
+ area->nr_free--;
+
+ set_pageblock_type(zone, page, alloctype);
+
+ return page;
+}
+
+static inline struct page *
+remove_page(struct zone *zone, struct page *page, unsigned int order,
+ unsigned int current_order, struct free_area *area)
+{
+ list_del(&page->lru);
+ rmv_page_order(page);
+ zone->free_pages -= 1UL << order;
+ return expand(zone, page, order, current_order, area);
+}
+
/*
* Do the hard work of removing an element from the buddy allocator.
* Call me with the zone->lock already held.
@@ -594,31 +645,25 @@ static void prep_new_page(struct page *p
static struct page *__rmqueue(struct zone *zone, unsigned int order,
int alloctype)
{
- struct free_area * area;
+ struct free_area * area = NULL;
unsigned int current_order;
struct page *page;
for (current_order = order; current_order < MAX_ORDER; ++current_order) {
- area = zone->free_area + current_order;
+ area = &zone->free_area_lists[alloctype][current_order];
if (list_empty(&area->free_list))
continue;
page = list_entry(area->free_list.next, struct page, lru);
- list_del(&page->lru);
- rmv_page_order(page);
area->nr_free--;
- zone->free_pages -= 1UL << order;
-
- /*
- * If splitting a large block, record what the block is being
- * used for in the usemap
- */
- if (current_order == MAX_ORDER-1)
- set_pageblock_type(zone, page, alloctype);
-
- return expand(zone, page, order, current_order, area);
+ return remove_page(zone, page, order, current_order, area);
}
+ /* Allocate a MAX_ORDER block */
+ page = steal_maxorder_block(zone, alloctype);
+ if (page != NULL)
+ return remove_page(zone, page, order, MAX_ORDER-1, area);
+
return NULL;
}
@@ -704,9 +749,9 @@ static void __drain_pages(unsigned int c
void mark_free_pages(struct zone *zone)
{
unsigned long zone_pfn, flags;
- int order;
+ int order, t;
+ unsigned long start_pfn, i;
struct list_head *curr;
-
if (!zone->spanned_pages)
return;
@@ -714,14 +759,12 @@ void mark_free_pages(struct zone *zone)
for (zone_pfn = 0; zone_pfn < zone->spanned_pages; ++zone_pfn)
ClearPageNosaveFree(pfn_to_page(zone_pfn + zone->zone_start_pfn));
- for (order = MAX_ORDER - 1; order >= 0; --order)
- list_for_each(curr, &zone->free_area[order].free_list) {
- unsigned long start_pfn, i;
-
+ for_each_rclmtype_order(t, order) {
+ list_for_each(curr,&zone->free_area_lists[t][order].free_list) {
start_pfn = page_to_pfn(list_entry(curr, struct page, lru));
-
for (i=0; i < (1<<order); i++)
SetPageNosaveFree(pfn_to_page(start_pfn+i));
+ }
}
spin_unlock_irqrestore(&zone->lock, flags);
}
@@ -876,6 +919,7 @@ int zone_watermark_ok(struct zone *z, in
/* free_pages my go negative - that's OK */
long min = mark, free_pages = z->free_pages - (1 << order) + 1;
int o;
+ struct free_area *kernnorclm, *kernrclm, *easyrclm;
if (gfp_high)
min -= min / 2;
@@ -884,15 +928,22 @@ int zone_watermark_ok(struct zone *z, in
if (free_pages <= min + z->lowmem_reserve[classzone_idx])
goto out_failed;
+ kernnorclm = z->free_area_lists[RCLM_NORCLM];
+ easyrclm = z->free_area_lists[RCLM_EASY];
+ kernrclm = z->free_area_lists[RCLM_KERN];
for (o = 0; o < order; o++) {
/* At the next order, this order's pages become unavailable */
- free_pages -= z->free_area[o].nr_free << o;
+ free_pages -= (kernnorclm->nr_free + kernrclm->nr_free +
+ easyrclm->nr_free) << o;
/* Require fewer higher order pages to be free */
min >>= 1;
if (free_pages <= min)
goto out_failed;
+ kernnorclm++;
+ easyrclm++;
+ kernrclm++;
}
return 1;
@@ -1496,6 +1547,7 @@ void show_free_areas(void)
unsigned long inactive;
unsigned long free;
struct zone *zone;
+ int type;
for_each_zone(zone) {
show_node(zone);
@@ -1575,7 +1627,9 @@ void show_free_areas(void)
}
for_each_zone(zone) {
- unsigned long nr, flags, order, total = 0;
+ unsigned long nr = 0;
+ unsigned long total = 0;
+ unsigned long flags,order;
show_node(zone);
printk("%s: ", zone->name);
@@ -1585,10 +1639,18 @@ void show_free_areas(void)
}
spin_lock_irqsave(&zone->lock, flags);
- for (order = 0; order < MAX_ORDER; order++) {
- nr = zone->free_area[order].nr_free;
+ for_each_rclmtype_order(type, order) {
+ nr += zone->free_area_lists[type][order].nr_free;
total += nr << order;
- printk("%lu*%lukB ", nr, K(1UL) << order);
+
+ /*
+ * Once type reaches RCLM_TYPES-1, the free pages
+ * for this order have been summed up
+ */
+ if (type == RCLM_TYPES-1) {
+ printk("%lu*%lukB ", nr, K(1UL) << order);
+ nr = 0;
+ }
}
spin_unlock_irqrestore(&zone->lock, flags);
printk("= %lukB\n", K(total));
@@ -1899,9 +1961,14 @@ void zone_init_free_lists(struct pglist_
unsigned long size)
{
int order;
- for (order = 0; order < MAX_ORDER ; order++) {
- INIT_LIST_HEAD(&zone->free_area[order].free_list);
- zone->free_area[order].nr_free = 0;
+ int type;
+ struct free_area *area;
+
+ /* Initialise the three size-ordered lists of free_areas */
+ for_each_rclmtype_order(type, order) {
+ area = &(zone->free_area_lists[type][order]);
+ INIT_LIST_HEAD(&area->free_list);
+ area->nr_free = 0;
}
}
@@ -2314,16 +2381,26 @@ static int frag_show(struct seq_file *m,
struct zone *zone;
struct zone *node_zones = pgdat->node_zones;
unsigned long flags;
- int order;
+ int order, t;
+ struct free_area *area;
+ unsigned long nr_bufs = 0;
for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) {
if (!zone->present_pages)
continue;
spin_lock_irqsave(&zone->lock, flags);
- seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name);
- for (order = 0; order < MAX_ORDER; ++order)
- seq_printf(m, "%6lu ", zone->free_area[order].nr_free);
+ seq_printf(m, "Node %d, zone %8s", pgdat->node_id, zone->name);
+ for_each_rclmtype_order(t, order) {
+ area = &(zone->free_area_lists[t][order]);
+ nr_bufs += area->nr_free;
+
+ if (t == RCLM_TYPES-1) {
+ seq_printf(m, "%6lu ", nr_bufs);
+ nr_bufs = 0;
+ }
+ }
+
spin_unlock_irqrestore(&zone->lock, flags);
seq_putc(m, '\n');
}
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 253+ messages in thread

* [PATCH 4/7] Fragmentation Avoidance V19: 004_fallback
2005-10-30 18:33 [PATCH 0/7] Fragmentation Avoidance V19 Mel Gorman
` (2 preceding siblings ...)
2005-10-30 18:34 ` [PATCH 3/7] Fragmentation Avoidance V19: 003_fragcore Mel Gorman
@ 2005-10-30 18:34 ` Mel Gorman
2005-10-30 18:34 ` [PATCH 5/7] Fragmentation Avoidance V19: 005_largealloc_tryharder Mel Gorman
` (3 subsequent siblings)
7 siblings, 0 replies; 253+ messages in thread
From: Mel Gorman @ 2005-10-30 18:34 UTC (permalink / raw)
To: akpm; +Cc: linux-mm, Mel Gorman, linux-kernel, lhms-devel
This patch implements fallback logic. In the event there are no 2^(MAX_ORDER-1)
blocks of pages left, this logic helps the system decide which list to use. The
highlights of the patch are:
o Define a RCLM_FALLBACK type for fallbacks
o Use a percentage of each zone for fallbacks. When a reserved pool of pages
is depleted, the allocator will try to use RCLM_FALLBACK before using anything
else. This greatly reduces the number of fallbacks causing fragmentation
without needing complex balancing algorithms
o Add a fallback_reserve that records how much of the zone is currently used
for allocations falling back to RCLM_FALLBACK
o Add a fallback_allocs[] array that determines the order in which freelists
are used for each allocation type
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Mike Kravetz <kravetz@us.ibm.com>
Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-003_fragcore/include/linux/mmzone.h linux-2.6.14-rc5-mm1-004_fallback/include/linux/mmzone.h
--- linux-2.6.14-rc5-mm1-003_fragcore/include/linux/mmzone.h 2005-10-30 13:36:16.000000000 +0000
+++ linux-2.6.14-rc5-mm1-004_fallback/include/linux/mmzone.h 2005-10-30 13:36:56.000000000 +0000
@@ -30,7 +30,8 @@
#define RCLM_NORCLM 0
#define RCLM_EASY 1
#define RCLM_KERN 2
-#define RCLM_TYPES 3
+#define RCLM_FALLBACK 3
+#define RCLM_TYPES 4
#define BITS_PER_RCLM_TYPE 2
#define for_each_rclmtype_order(type, order) \
@@ -168,8 +169,17 @@ struct zone {
unsigned long *free_area_usemap;
#endif
+ /*
+ * With allocation fallbacks, the nr_free count for each RCLM_TYPE must
+ * be added together to get the correct count of free pages for a given
+ * order. Individually, the nr_free count in a free_area may not match
+ * the number of pages in the free_list.
+ */
struct free_area free_area_lists[RCLM_TYPES][MAX_ORDER];
+ /* Number of pages currently used for RCLM_FALLBACK */
+ unsigned long fallback_reserve;
+
ZONE_PADDING(_pad1_)
/* Fields commonly accessed by the page reclaim scanner */
@@ -292,6 +302,17 @@ struct zonelist {
struct zone *zones[MAX_NUMNODES * MAX_NR_ZONES + 1]; // NULL delimited
};
+static inline void inc_reserve_count(struct zone *zone, int type)
+{
+ if (type == RCLM_FALLBACK)
+ zone->fallback_reserve += PAGES_PER_MAXORDER;
+}
+
+static inline void dec_reserve_count(struct zone *zone, int type)
+{
+ if (type == RCLM_FALLBACK && zone->fallback_reserve)
+ zone->fallback_reserve -= PAGES_PER_MAXORDER;
+}
/*
* The pg_data_t structure is used in machines with CONFIG_DISCONTIGMEM
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-003_fragcore/mm/page_alloc.c linux-2.6.14-rc5-mm1-004_fallback/mm/page_alloc.c
--- linux-2.6.14-rc5-mm1-003_fragcore/mm/page_alloc.c 2005-10-30 13:36:16.000000000 +0000
+++ linux-2.6.14-rc5-mm1-004_fallback/mm/page_alloc.c 2005-10-30 13:36:56.000000000 +0000
@@ -54,6 +54,22 @@ unsigned long totalhigh_pages __read_mos
long nr_swap_pages;
/*
+ * fallback_allocs contains the fallback types for low memory conditions
+ * where the preferred allocation type is not available.
+ */
+int fallback_allocs[RCLM_TYPES-1][RCLM_TYPES+1] = {
+ {RCLM_NORCLM, RCLM_FALLBACK, RCLM_KERN, RCLM_EASY, RCLM_TYPES},
+ {RCLM_EASY, RCLM_FALLBACK, RCLM_NORCLM, RCLM_KERN, RCLM_TYPES},
+ {RCLM_KERN, RCLM_FALLBACK, RCLM_NORCLM, RCLM_EASY, RCLM_TYPES}
+};
+
+/* Returns 1 if the needed percentage of the zone is reserved for fallbacks */
+static inline int min_fallback_reserved(struct zone *zone)
+{
+ return zone->fallback_reserve >= zone->present_pages >> 3;
+}
+
+/*
* results with 256, 32 in the lowmem_reserve sysctl:
* 1G machine -> (16M dma, 800M-16M normal, 1G-800M high)
* 1G machine -> (16M dma, 784M normal, 224M high)
@@ -623,7 +639,12 @@ struct page *steal_maxorder_block(struct
page = list_entry(area->free_list.next, struct page, lru);
area->nr_free--;
+ if (!min_fallback_reserved(zone))
+ alloctype = RCLM_FALLBACK;
+
set_pageblock_type(zone, page, alloctype);
+ dec_reserve_count(zone, i);
+ inc_reserve_count(zone, alloctype);
return page;
}
@@ -638,6 +659,78 @@ remove_page(struct zone *zone, struct pa
return expand(zone, page, order, current_order, area);
}
+/*
+ * If we are falling back, and the allocation is KERNNORCLM,
+ * then reserve any buddies for the KERNNORCLM pool. These
+ * allocations cause the worst fragmentation, so this helps keep
+ * them in one place
+ */
+static inline struct free_area *
+fallback_buddy_reserve(int start_alloctype, struct zone *zone,
+ unsigned int current_order, struct page *page,
+ struct free_area *area)
+{
+ if (start_alloctype != RCLM_NORCLM)
+ return area;
+
+ area = &zone->free_area_lists[RCLM_NORCLM][current_order];
+
+ /* Reserve the whole block if this is a large split */
+ if (current_order >= MAX_ORDER / 2) {
+ int reserve_type = RCLM_NORCLM;
+ if (!min_fallback_reserved(zone))
+ reserve_type = RCLM_FALLBACK;
+
+ dec_reserve_count(zone, get_pageblock_type(zone,page));
+ set_pageblock_type(zone, page, reserve_type);
+ inc_reserve_count(zone, reserve_type);
+ }
+ return area;
+}
+
+static struct page *
+fallback_alloc(int alloctype, struct zone *zone, unsigned int order)
+{
+ int *fallback_list;
+ int start_alloctype = alloctype;
+ struct free_area *area;
+ unsigned int current_order;
+ struct page *page;
+ int i;
+
+ /* Ok, pick the fallback order based on the type */
+ BUG_ON(alloctype >= RCLM_TYPES);
+ fallback_list = fallback_allocs[alloctype];
+
+ /*
+ * Here, the alloc type lists have been depleted, as well as the global
+ * pool, so fallback. When falling back, the largest possible block
+ * will be taken to keep the fallbacks clustered if possible
+ */
+ for (i = 0; fallback_list[i] != RCLM_TYPES; i++) {
+ alloctype = fallback_list[i];
+
+ /* Find a block to allocate */
+ area = &zone->free_area_lists[alloctype][MAX_ORDER-1];
+ for (current_order = MAX_ORDER - 1; current_order > order;
+ current_order--, area--) {
+ if (list_empty(&area->free_list))
+ continue;
+
+ page = list_entry(area->free_list.next,
+ struct page, lru);
+ area->nr_free--;
+ area = fallback_buddy_reserve(start_alloctype, zone,
+ current_order, page, area);
+ return remove_page(zone, page, order,
+ current_order, area);
+
+ }
+ }
+
+ return NULL;
+}
+
/*
* Do the hard work of removing an element from the buddy allocator.
* Call me with the zone->lock already held.
@@ -664,7 +757,8 @@ static struct page *__rmqueue(struct zon
if (page != NULL)
return remove_page(zone, page, order, MAX_ORDER-1, area);
- return NULL;
+ /* Try falling back */
+ return fallback_alloc(alloctype, zone, order);
}
/*
@@ -2270,6 +2364,7 @@ static void __init free_area_init_core(s
zone_seqlock_init(zone);
zone->zone_pgdat = pgdat;
zone->free_pages = 0;
+ zone->fallback_reserve = 0;
zone->temp_priority = zone->prev_priority = DEF_PRIORITY;
* [PATCH 5/7] Fragmentation Avoidance V19: 005_largealloc_tryharder
2005-10-30 18:33 [PATCH 0/7] Fragmentation Avoidance V19 Mel Gorman
` (3 preceding siblings ...)
2005-10-30 18:34 ` [PATCH 4/7] Fragmentation Avoidance V19: 004_fallback Mel Gorman
@ 2005-10-30 18:34 ` Mel Gorman
2005-10-30 18:34 ` [PATCH 6/7] Fragmentation Avoidance V19: 006_percpu Mel Gorman
` (2 subsequent siblings)
7 siblings, 0 replies; 253+ messages in thread
From: Mel Gorman @ 2005-10-30 18:34 UTC (permalink / raw)
To: akpm; +Cc: linux-mm, lhms-devel, linux-kernel, Mel Gorman
Fragmentation avoidance patches increase our chances of satisfying high
order allocations. This patch therefore makes more than one attempt to
fulfill those allocations because, unlike before, the extra iterations
are often useful.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Mike Kravetz <kravetz@us.ibm.com>
Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-004_fallback/mm/page_alloc.c linux-2.6.14-rc5-mm1-005_largealloc_tryharder/mm/page_alloc.c
--- linux-2.6.14-rc5-mm1-004_fallback/mm/page_alloc.c 2005-10-30 13:36:56.000000000 +0000
+++ linux-2.6.14-rc5-mm1-005_largealloc_tryharder/mm/page_alloc.c 2005-10-30 13:37:34.000000000 +0000
@@ -1127,6 +1127,7 @@ __alloc_pages(gfp_t gfp_mask, unsigned i
int do_retry;
int can_try_harder;
int did_some_progress;
+ int highorder_retry = 3;
might_sleep_if(wait);
@@ -1275,7 +1276,17 @@ rebalance:
goto got_pg;
}
- out_of_memory(gfp_mask, order);
+ if (order < MAX_ORDER / 2)
+ out_of_memory(gfp_mask, order);
+
+ /*
+ * Due to the fragmentation avoidance efforts, we try a little
+ * harder to satisfy high order allocations and only
+ * go OOM for low-order allocations
+ */
+ if (order >= MAX_ORDER/2 && --highorder_retry > 0)
+ goto rebalance;
+
goto restart;
}
@@ -1292,6 +1303,8 @@ rebalance:
do_retry = 1;
if (gfp_mask & __GFP_NOFAIL)
do_retry = 1;
+ if (order >= MAX_ORDER/2 && --highorder_retry > 0)
+ do_retry = 1;
}
if (do_retry) {
blk_congestion_wait(WRITE, HZ/50);
* [PATCH 6/7] Fragmentation Avoidance V19: 006_percpu
2005-10-30 18:33 [PATCH 0/7] Fragmentation Avoidance V19 Mel Gorman
` (4 preceding siblings ...)
2005-10-30 18:34 ` [PATCH 5/7] Fragmentation Avoidance V19: 005_largealloc_tryharder Mel Gorman
@ 2005-10-30 18:34 ` Mel Gorman
2005-10-30 18:34 ` [PATCH 7/7] Fragmentation Avoidance V19: 007_stats Mel Gorman
2005-10-31 5:57 ` [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 Mike Kravetz
7 siblings, 0 replies; 253+ messages in thread
From: Mel Gorman @ 2005-10-30 18:34 UTC (permalink / raw)
To: akpm; +Cc: linux-mm, Mel Gorman, linux-kernel, lhms-devel
The freelists for each allocation type can slowly become corrupted due to
the per-cpu list. Consider the following sequence of events:
1. A 2^(MAX_ORDER-1) block is reserved for __GFP_EASYRCLM pages
2. An order-0 page is allocated from the newly reserved block
3. The page is freed and placed on the per-cpu list
4. alloc_page() is called with GFP_KERNEL as the gfp_mask
5. The per-cpu list is used to satisfy the allocation
Now a kernel page sits in the middle of a __GFP_EASYRCLM block. This means
that over long periods of time, the anti-fragmentation scheme slowly
degrades to the standard allocator.
This patch divides the per-cpu lists into Kernel and User lists. RCLM_NORCLM
and RCLM_KERN use the Kernel list and RCLM_EASY uses the user list. Strictly
speaking, there should be three lists but as little effort is made to reclaim
RCLM_KERN pages, it is not worth the overhead *yet*.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-005_largealloc_tryharder/include/linux/mmzone.h linux-2.6.14-rc5-mm1-006_percpu/include/linux/mmzone.h
--- linux-2.6.14-rc5-mm1-005_largealloc_tryharder/include/linux/mmzone.h 2005-10-30 13:36:56.000000000 +0000
+++ linux-2.6.14-rc5-mm1-006_percpu/include/linux/mmzone.h 2005-10-30 13:38:14.000000000 +0000
@@ -60,12 +60,21 @@ struct zone_padding {
#define ZONE_PADDING(name)
#endif
+/*
+ * Indices into pcpu_list
+ * PCPU_KERNEL: For RCLM_NORCLM and RCLM_KERN allocations
+ * PCPU_EASY: For RCLM_EASY allocations
+ */
+#define PCPU_KERNEL 0
+#define PCPU_EASY 1
+#define PCPU_TYPES 2
+
struct per_cpu_pages {
- int count; /* number of pages in the list */
+ int count[PCPU_TYPES]; /* Number of pages on each list */
int low; /* low watermark, refill needed */
int high; /* high watermark, emptying needed */
int batch; /* chunk size for buddy add/remove */
- struct list_head list; /* the list of pages */
+ struct list_head list[PCPU_TYPES]; /* the lists of pages */
};
struct per_cpu_pageset {
@@ -80,6 +89,10 @@ struct per_cpu_pageset {
#endif
} ____cacheline_aligned_in_smp;
+/* Helpers for per_cpu_pages */
+#define pset_count(pset) (pset.count[PCPU_KERNEL] + pset.count[PCPU_EASY])
+#define for_each_pcputype(pindex) \
+ for (pindex = 0; pindex < PCPU_TYPES; pindex++)
#ifdef CONFIG_NUMA
#define zone_pcp(__z, __cpu) ((__z)->pageset[(__cpu)])
#else
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-005_largealloc_tryharder/mm/page_alloc.c linux-2.6.14-rc5-mm1-006_percpu/mm/page_alloc.c
--- linux-2.6.14-rc5-mm1-005_largealloc_tryharder/mm/page_alloc.c 2005-10-30 13:37:34.000000000 +0000
+++ linux-2.6.14-rc5-mm1-006_percpu/mm/page_alloc.c 2005-10-30 13:38:14.000000000 +0000
@@ -792,7 +792,7 @@ static int rmqueue_bulk(struct zone *zon
void drain_remote_pages(void)
{
struct zone *zone;
- int i;
+ int i, pindex;
unsigned long flags;
local_irq_save(flags);
@@ -808,9 +808,16 @@ void drain_remote_pages(void)
struct per_cpu_pages *pcp;
pcp = &pset->pcp[i];
- if (pcp->count)
- pcp->count -= free_pages_bulk(zone, pcp->count,
- &pcp->list, 0);
+ for_each_pcputype(pindex) {
+ if (!pcp->count[pindex])
+ continue;
+
+ /* Try to remove all pages from the pcpu list */
+ pcp->count[pindex] -=
+ free_pages_bulk(zone,
+ pcp->count[pindex],
+ &pcp->list[pindex], 0);
+ }
}
}
local_irq_restore(flags);
@@ -821,7 +828,7 @@ void drain_remote_pages(void)
static void __drain_pages(unsigned int cpu)
{
struct zone *zone;
- int i;
+ int i, pindex;
for_each_zone(zone) {
struct per_cpu_pageset *pset;
@@ -831,8 +838,16 @@ static void __drain_pages(unsigned int c
struct per_cpu_pages *pcp;
pcp = &pset->pcp[i];
- pcp->count -= free_pages_bulk(zone, pcp->count,
- &pcp->list, 0);
+ for_each_pcputype(pindex) {
+ if (!pcp->count[pindex])
+ continue;
+
+ /* Try to remove all pages from the pcpu list */
+ pcp->count[pindex] -=
+ free_pages_bulk(zone,
+ pcp->count[pindex],
+ &pcp->list[pindex], 0);
+ }
}
}
}
@@ -911,6 +926,7 @@ static void fastcall free_hot_cold_page(
struct zone *zone = page_zone(page);
struct per_cpu_pages *pcp;
unsigned long flags;
+ int pindex;
arch_free_page(page, 0);
@@ -920,11 +936,21 @@ static void fastcall free_hot_cold_page(
page->mapping = NULL;
free_pages_check(__FUNCTION__, page);
pcp = &zone_pcp(zone, get_cpu())->pcp[cold];
+
+ /*
+ * Strictly speaking, we should not be accessing the zone information
+ * here. In this case, it does not matter if the read is incorrect
+ */
+ if (get_pageblock_type(zone, page) == RCLM_EASY)
+ pindex = PCPU_EASY;
+ else
+ pindex = PCPU_KERNEL;
local_irq_save(flags);
- list_add(&page->lru, &pcp->list);
- pcp->count++;
- if (pcp->count >= pcp->high)
- pcp->count -= free_pages_bulk(zone, pcp->batch, &pcp->list, 0);
+ list_add(&page->lru, &pcp->list[pindex]);
+ pcp->count[pindex]++;
+ if (pcp->count[pindex] >= pcp->high)
+ pcp->count[pindex] -= free_pages_bulk(zone, pcp->batch,
+ &pcp->list[pindex], 0);
local_irq_restore(flags);
put_cpu();
}
@@ -967,17 +993,23 @@ buffered_rmqueue(struct zone *zone, int
if (order == 0) {
struct per_cpu_pages *pcp;
+ int pindex = PCPU_KERNEL;
+ if (alloctype == RCLM_EASY)
+ pindex = PCPU_EASY;
pcp = &zone_pcp(zone, get_cpu())->pcp[cold];
local_irq_save(flags);
- if (pcp->count <= pcp->low)
- pcp->count += rmqueue_bulk(zone, 0,
- pcp->batch, &pcp->list,
- alloctype);
- if (pcp->count) {
- page = list_entry(pcp->list.next, struct page, lru);
+ if (pcp->count[pindex] <= pcp->low)
+ pcp->count[pindex] += rmqueue_bulk(zone,
+ 0, pcp->batch,
+ &(pcp->list[pindex]),
+ alloctype);
+
+ if (pcp->count[pindex]) {
+ page = list_entry(pcp->list[pindex].next,
+ struct page, lru);
list_del(&page->lru);
- pcp->count--;
+ pcp->count[pindex]--;
}
local_irq_restore(flags);
put_cpu();
@@ -1678,7 +1710,7 @@ void show_free_areas(void)
pageset->pcp[temperature].low,
pageset->pcp[temperature].high,
pageset->pcp[temperature].batch,
- pageset->pcp[temperature].count);
+ pset_count(pageset->pcp[temperature]));
}
}
@@ -2135,18 +2167,22 @@ inline void setup_pageset(struct per_cpu
struct per_cpu_pages *pcp;
pcp = &p->pcp[0]; /* hot */
- pcp->count = 0;
+ pcp->count[PCPU_KERNEL] = 0;
+ pcp->count[PCPU_EASY] = 0;
pcp->low = 0;
- pcp->high = 6 * batch;
+ pcp->high = 3 * batch;
pcp->batch = max(1UL, 1 * batch);
- INIT_LIST_HEAD(&pcp->list);
+ INIT_LIST_HEAD(&pcp->list[PCPU_KERNEL]);
+ INIT_LIST_HEAD(&pcp->list[PCPU_EASY]);
pcp = &p->pcp[1]; /* cold*/
- pcp->count = 0;
+ pcp->count[PCPU_KERNEL] = 0;
+ pcp->count[PCPU_EASY] = 0;
pcp->low = 0;
- pcp->high = 2 * batch;
+ pcp->high = batch;
pcp->batch = max(1UL, batch/2);
- INIT_LIST_HEAD(&pcp->list);
+ INIT_LIST_HEAD(&pcp->list[PCPU_KERNEL]);
+ INIT_LIST_HEAD(&pcp->list[PCPU_EASY]);
}
#ifndef CONFIG_SPARSEMEM
@@ -2574,7 +2610,7 @@ static int zoneinfo_show(struct seq_file
pageset = zone_pcp(zone, i);
for (j = 0; j < ARRAY_SIZE(pageset->pcp); j++) {
- if (pageset->pcp[j].count)
+ if (pset_count(pageset->pcp[j]))
break;
}
if (j == ARRAY_SIZE(pageset->pcp))
@@ -2587,7 +2623,7 @@ static int zoneinfo_show(struct seq_file
"\n high: %i"
"\n batch: %i",
i, j,
- pageset->pcp[j].count,
+ pset_count(pageset->pcp[j]),
pageset->pcp[j].low,
pageset->pcp[j].high,
pageset->pcp[j].batch);
* [PATCH 7/7] Fragmentation Avoidance V19: 007_stats
2005-10-30 18:33 [PATCH 0/7] Fragmentation Avoidance V19 Mel Gorman
` (5 preceding siblings ...)
2005-10-30 18:34 ` [PATCH 6/7] Fragmentation Avoidance V19: 006_percpu Mel Gorman
@ 2005-10-30 18:34 ` Mel Gorman
2005-10-31 5:57 ` [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 Mike Kravetz
7 siblings, 0 replies; 253+ messages in thread
From: Mel Gorman @ 2005-10-30 18:34 UTC (permalink / raw)
To: akpm; +Cc: linux-mm, lhms-devel, linux-kernel, Mel Gorman
This patch is optional; it is not needed to get the anti-fragmentation
code. It adds a new config option called CONFIG_ALLOCSTATS. If set, a
number of new beancounters are added that track the behaviour of the
anti-fragmentation code. The information is exported via /proc/buddyinfo
and is very useful when debugging why high-order pages are not available
for allocation.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-006_percpu/include/linux/mmzone.h linux-2.6.14-rc5-mm1-007_stats/include/linux/mmzone.h
--- linux-2.6.14-rc5-mm1-006_percpu/include/linux/mmzone.h 2005-10-30 13:38:14.000000000 +0000
+++ linux-2.6.14-rc5-mm1-007_stats/include/linux/mmzone.h 2005-10-30 13:38:56.000000000 +0000
@@ -193,6 +193,17 @@ struct zone {
/* Number of pages currently used for RCLM_FALLBACK */
unsigned long fallback_reserve;
+#ifdef CONFIG_ALLOCSTATS
+ /*
+ * These are beancounters that track how the placement policy
+ * of the buddy allocator is performing
+ */
+ unsigned long fallback_count[RCLM_TYPES];
+ unsigned long alloc_count[RCLM_TYPES];
+ unsigned long reserve_count[RCLM_TYPES];
+ unsigned long kernnorclm_full_steal;
+ unsigned long kernnorclm_partial_steal;
+#endif
ZONE_PADDING(_pad1_)
/* Fields commonly accessed by the page reclaim scanner */
@@ -292,6 +303,17 @@ struct zone {
char *name;
} ____cacheline_maxaligned_in_smp;
+#ifdef CONFIG_ALLOCSTATS
+#define inc_fallback_count(zone, type) zone->fallback_count[type]++
+#define inc_alloc_count(zone, type) zone->alloc_count[type]++
+#define inc_kernnorclm_partial_steal(zone) zone->kernnorclm_partial_steal++
+#define inc_kernnorclm_full_steal(zone) zone->kernnorclm_full_steal++
+#else
+#define inc_fallback_count(zone, type) do {} while (0)
+#define inc_alloc_count(zone, type) do {} while (0)
+#define inc_kernnorclm_partial_steal(zone) do {} while (0)
+#define inc_kernnorclm_full_steal(zone) do {} while (0)
+#endif
/*
* The "priority" of VM scanning is how much of the queues we will scan in one
@@ -319,12 +341,19 @@ static inline void inc_reserve_count(str
{
if (type == RCLM_FALLBACK)
zone->fallback_reserve += PAGES_PER_MAXORDER;
+#ifdef CONFIG_ALLOCSTATS
+ zone->reserve_count[type]++;
+#endif
}
static inline void dec_reserve_count(struct zone *zone, int type)
{
if (type == RCLM_FALLBACK && zone->fallback_reserve)
zone->fallback_reserve -= PAGES_PER_MAXORDER;
+#ifdef CONFIG_ALLOCSTATS
+ if (zone->reserve_count[type] > 0)
+ zone->reserve_count[type]--;
+#endif
}
/*
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-006_percpu/lib/Kconfig.debug linux-2.6.14-rc5-mm1-007_stats/lib/Kconfig.debug
--- linux-2.6.14-rc5-mm1-006_percpu/lib/Kconfig.debug 2005-10-30 13:20:06.000000000 +0000
+++ linux-2.6.14-rc5-mm1-007_stats/lib/Kconfig.debug 2005-10-30 13:38:56.000000000 +0000
@@ -77,6 +77,17 @@ config SCHEDSTATS
application, you can say N to avoid the very slight overhead
this adds.
+config ALLOCSTATS
+ bool "Collect buddy allocator statistics"
+ depends on DEBUG_KERNEL && PROC_FS
+ help
+ If you say Y here, additional code will be inserted into the
+ page allocator routines to collect statistics on the allocator
+ behavior and provide them in /proc/buddyinfo. These stats are
+ useful for measuring fragmentation in the buddy allocator. If
+ you are not debugging or measuring the allocator, you can say N
+ to avoid the slight overhead this adds.
+
config DEBUG_SLAB
bool "Debug memory allocations"
depends on DEBUG_KERNEL
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-006_percpu/mm/page_alloc.c linux-2.6.14-rc5-mm1-007_stats/mm/page_alloc.c
--- linux-2.6.14-rc5-mm1-006_percpu/mm/page_alloc.c 2005-10-30 13:38:14.000000000 +0000
+++ linux-2.6.14-rc5-mm1-007_stats/mm/page_alloc.c 2005-10-30 13:38:56.000000000 +0000
@@ -187,6 +187,11 @@ EXPORT_SYMBOL(zone_table);
static char *zone_names[MAX_NR_ZONES] = { "DMA", "DMA32", "Normal", "HighMem" };
int min_free_kbytes = 1024;
+#ifdef CONFIG_ALLOCSTATS
+static char *type_names[RCLM_TYPES] = { "KernNoRclm", "EasyRclm",
+ "KernRclm", "Fallback"};
+#endif /* CONFIG_ALLOCSTATS */
+
unsigned long __initdata nr_kernel_pages;
unsigned long __initdata nr_all_pages;
@@ -684,6 +689,9 @@ fallback_buddy_reserve(int start_allocty
dec_reserve_count(zone, get_pageblock_type(zone,page));
set_pageblock_type(zone, page, reserve_type);
inc_reserve_count(zone, reserve_type);
+ inc_kernnorclm_full_steal(zone);
+ } else {
+ inc_kernnorclm_partial_steal(zone);
}
return area;
}
@@ -726,6 +734,15 @@ fallback_alloc(int alloctype, struct zon
current_order, area);
}
+
+ /*
+ * If the current alloctype is RCLM_FALLBACK, it means
+ * that the requested pool and fallback pool are both
+ * depleted and we are falling back to other pools.
+ * At this point, pools are starting to get fragmented
+ */
+ if (alloctype == RCLM_FALLBACK)
+ inc_fallback_count(zone, start_alloctype);
}
return NULL;
@@ -742,6 +759,8 @@ static struct page *__rmqueue(struct zon
unsigned int current_order;
struct page *page;
+ inc_alloc_count(zone, alloctype);
+
for (current_order = order; current_order < MAX_ORDER; ++current_order) {
area = &zone->free_area_lists[alloctype][current_order];
if (list_empty(&area->free_list))
@@ -2373,6 +2392,9 @@ static __devinit void init_currently_emp
memmap_init(size, pgdat->node_id, zone_idx(zone), zone_start_pfn);
zone_init_free_lists(pgdat, zone, zone->spanned_pages);
+#ifdef CONFIG_ALLOCSTATS
+ zone->reserve_count[RCLM_NORCLM] = zone->present_pages >> (MAX_ORDER-1);
+#endif /* CONFIG_ALLOCSTATS */
}
/*
@@ -2528,6 +2550,18 @@ static int frag_show(struct seq_file *m,
int order, t;
struct free_area *area;
unsigned long nr_bufs = 0;
+#ifdef CONFIG_ALLOCSTATS
+ int i;
+ unsigned long kernnorclm_full_steal = 0;
+ unsigned long kernnorclm_partial_steal = 0;
+ unsigned long reserve_count[RCLM_TYPES];
+ unsigned long fallback_count[RCLM_TYPES];
+ unsigned long alloc_count[RCLM_TYPES];
+
+ memset(reserve_count, 0, sizeof(reserve_count));
+ memset(fallback_count, 0, sizeof(fallback_count));
+ memset(alloc_count, 0, sizeof(alloc_count));
+#endif
for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) {
if (!zone->present_pages)
@@ -2548,6 +2582,86 @@ static int frag_show(struct seq_file *m,
spin_unlock_irqrestore(&zone->lock, flags);
seq_putc(m, '\n');
}
+
+#ifdef CONFIG_ALLOCSTATS
+ /* Show statistics for each allocation type */
+ seq_printf(m, "\nPer-allocation-type statistics");
+ for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) {
+ if (!zone->present_pages)
+ continue;
+
+ spin_lock_irqsave(&zone->lock, flags);
+ for (t = 0; t < RCLM_TYPES; t++) {
+ struct list_head *elem;
+ seq_printf(m, "\nNode %d, zone %8s, type %10s ",
+ pgdat->node_id, zone->name,
+ type_names[t]);
+ for (order = 0; order < MAX_ORDER; ++order) {
+ nr_bufs = 0;
+
+ list_for_each(elem, &zone->free_area_lists[t][order].free_list)
+ ++nr_bufs;
+ seq_printf(m, "%6lu ", nr_bufs);
+ }
+ }
+
+ /* Scan global list */
+ seq_printf(m, "\n");
+ seq_printf(m, "Node %d, zone %8s, type %10s",
+ pgdat->node_id, zone->name,
+ "MAX_ORDER");
+ nr_bufs = 0;
+ for (t = 0; t < RCLM_TYPES; t++) {
+ nr_bufs +=
+ zone->free_area_lists[t][MAX_ORDER-1].nr_free;
+ }
+ seq_printf(m, "%6lu ", nr_bufs);
+ seq_printf(m, "\n");
+
+ seq_printf(m, "%s Zone beancounters\n", zone->name);
+ seq_printf(m, "Fallback reserve: %lu (%lu blocks)\n",
+ zone->fallback_reserve,
+ zone->fallback_reserve >> (MAX_ORDER-1));
+ seq_printf(m, "Fallback needed: %lu (%lu blocks)\n",
+ zone->present_pages >> 3,
+ (zone->present_pages >> 3) >> (MAX_ORDER-1));
+ seq_printf(m, "Partial steal: %lu\n",
+ zone->kernnorclm_partial_steal);
+ seq_printf(m, "Full steal: %lu\n",
+ zone->kernnorclm_full_steal);
+
+ kernnorclm_partial_steal += zone->kernnorclm_partial_steal;
+ kernnorclm_full_steal += zone->kernnorclm_full_steal;
+ seq_putc(m, '\n');
+
+ for (i = 0; i< RCLM_TYPES; i++) {
+ seq_printf(m, "%-10s Allocs: %-10lu Reserve: %-10lu Fallbacks: %-10lu\n",
+ type_names[i],
+ zone->alloc_count[i],
+ zone->reserve_count[i],
+ zone->fallback_count[i]);
+ alloc_count[i] += zone->alloc_count[i];
+ reserve_count[i] += zone->reserve_count[i];
+ fallback_count[i] += zone->fallback_count[i];
+ }
+
+ spin_unlock_irqrestore(&zone->lock, flags);
+ }
+
+
+ /* Show bean counters */
+ seq_printf(m, "\nGlobal beancounters\n");
+ seq_printf(m, "Partial steal: %lu\n", kernnorclm_partial_steal);
+ seq_printf(m, "Full steal: %lu\n", kernnorclm_full_steal);
+
+ for (i = 0; i< RCLM_TYPES; i++) {
+ seq_printf(m, "%-10s Allocs: %-10lu Reserve: %-10lu Fallbacks: %-10lu\n",
+ type_names[i],
+ alloc_count[i],
+ reserve_count[i],
+ fallback_count[i]);
+ }
+#endif /* CONFIG_ALLOCSTATS */
return 0;
}
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-10-30 18:33 [PATCH 0/7] Fragmentation Avoidance V19 Mel Gorman
` (6 preceding siblings ...)
2005-10-30 18:34 ` [PATCH 7/7] Fragmentation Avoidance V19: 007_stats Mel Gorman
@ 2005-10-31 5:57 ` Mike Kravetz
2005-10-31 6:37 ` Nick Piggin
7 siblings, 1 reply; 253+ messages in thread
From: Mike Kravetz @ 2005-10-31 5:57 UTC (permalink / raw)
To: Mel Gorman; +Cc: akpm, linux-mm, linux-kernel, lhms-devel
On Sun, Oct 30, 2005 at 06:33:55PM +0000, Mel Gorman wrote:
> Here are a few brief reasons why this set of patches is useful;
>
> o Reduced fragmentation improves the chance a large order allocation succeeds
> o General-purpose memory hotplug needs the page/memory groupings provided
> o Reduces the number of badly-placed pages that page migration mechanism must
> deal with. This also applies to any active page defragmentation mechanism.
I can say that this patch set makes hotplug memory remove be of
value on ppc64. My system has 6GB of memory and I would 'load
it up' to the point where it would just start to swap and let it
run for an hour. Without these patches, it was almost impossible
to find a section that could be offlined. With the patches, I
can consistently reduce memory to somewhere between 512MB and 1GB.
Of course, results will vary based on workload. Also, this is
most advantageous for memory hotplug on ppc64 due to relatively
small section size (16MB) as compared to the page grouping size
(8MB). A more general purpose solution is needed for memory hotplug
support on architectures with larger section sizes.
Just another data point,
--
Mike
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-10-31 5:57 ` [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 Mike Kravetz
@ 2005-10-31 6:37 ` Nick Piggin
2005-10-31 7:54 ` Andrew Morton
0 siblings, 1 reply; 253+ messages in thread
From: Nick Piggin @ 2005-10-31 6:37 UTC (permalink / raw)
To: Mike Kravetz; +Cc: Mel Gorman, akpm, linux-mm, linux-kernel, lhms-devel
Mike Kravetz wrote:
> On Sun, Oct 30, 2005 at 06:33:55PM +0000, Mel Gorman wrote:
>
>>Here are a few brief reasons why this set of patches is useful;
>>
>>o Reduced fragmentation improves the chance a large order allocation succeeds
>>o General-purpose memory hotplug needs the page/memory groupings provided
>>o Reduces the number of badly-placed pages that page migration mechanism must
>> deal with. This also applies to any active page defragmentation mechanism.
>
>
> I can say that this patch set makes hotplug memory remove be of
> value on ppc64. My system has 6GB of memory and I would 'load
> it up' to the point where it would just start to swap and let it
> run for an hour. Without these patches, it was almost impossible
> to find a section that could be offlined. With the patches, I
> can consistently reduce memory to somewhere between 512MB and 1GB.
> Of course, results will vary based on workload. Also, this is
> most advantageous for memory hotplug on ppc64 due to relatively
> small section size (16MB) as compared to the page grouping size
> (8MB). A more general purpose solution is needed for memory hotplug
> support on architectures with larger section sizes.
>
> Just another data point,
Despite what people were trying to tell me at Ottawa, this patch
set really does add quite a lot of complexity to the page
allocator, and it seems to be increasingly only of benefit to
dynamically allocating hugepages and memory hot unplug.
If that is the case, do we really want to make such sacrifices
for the huge machines that want these things? What about just
making an extra zone for easy-to-reclaim things to live in?
This could possibly even be resized at runtime according to
demand with the memory hotplug stuff (though I haven't been
following that).
Don't take this as criticism of the actual implementation or its
effectiveness.
Nick
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-10-31 6:37 ` Nick Piggin
@ 2005-10-31 7:54 ` Andrew Morton
2005-10-31 7:11 ` Nick Piggin
2005-10-31 14:34 ` Martin J. Bligh
0 siblings, 2 replies; 253+ messages in thread
From: Andrew Morton @ 2005-10-31 7:54 UTC (permalink / raw)
To: Nick Piggin; +Cc: kravetz, mel, linux-mm, linux-kernel, lhms-devel
Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
> Mike Kravetz wrote:
> > On Sun, Oct 30, 2005 at 06:33:55PM +0000, Mel Gorman wrote:
> >
> >>Here are a few brief reasons why this set of patches is useful;
> >>
> >>o Reduced fragmentation improves the chance a large order allocation succeeds
> >>o General-purpose memory hotplug needs the page/memory groupings provided
> >>o Reduces the number of badly-placed pages that page migration mechanism must
> >> deal with. This also applies to any active page defragmentation mechanism.
> >
> >
> > I can say that this patch set makes hotplug memory remove be of
> > value on ppc64. My system has 6GB of memory and I would 'load
> > it up' to the point where it would just start to swap and let it
> > run for an hour. Without these patches, it was almost impossible
> > to find a section that could be offlined. With the patches, I
> > can consistently reduce memory to somewhere between 512MB and 1GB.
> > Of course, results will vary based on workload. Also, this is
> > most advantageous for memory hotplug on ppc64 due to relatively
> > small section size (16MB) as compared to the page grouping size
> > (8MB). A more general purpose solution is needed for memory hotplug
> > support on architectures with larger section sizes.
> >
> > Just another data point,
>
> Despite what people were trying to tell me at Ottawa, this patch
> set really does add quite a lot of complexity to the page
> allocator, and it seems to be increasingly only of benefit to
> dynamically allocating hugepages and memory hot unplug.
Remember that Rohit is seeing ~10% variation between runs of scientific
software, and that his patch to use higher-order pages to preload the
percpu-pages magazines fixed that up. I assume this means that it provided
up to 10% speedup, which is a lot.
But the patch caused page allocator fragmentation and several reports of
gigE Tx buffer allocation failures, so I dropped it.
We think that Mel's patches will allow us to reintroduce Rohit's
optimisation.
> If that is the case, do we really want to make such sacrifices
> for the huge machines that want these things? What about just
> making an extra zone for easy-to-reclaim things to live in?
>
> This could possibly even be resized at runtime according to
> demand with the memory hotplug stuff (though I haven't been
> following that).
>
> Don't take this as criticism of the actual implementation or its
> effectiveness.
>
But yes, adding additional complexity is a black mark, and these patches
add quite a bit. (Ditto the fine-looking adaptive readahead patches, btw).
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-10-31 7:54 ` Andrew Morton
@ 2005-10-31 7:11 ` Nick Piggin
2005-10-31 16:19 ` Mel Gorman
2005-10-31 14:34 ` Martin J. Bligh
1 sibling, 1 reply; 253+ messages in thread
From: Nick Piggin @ 2005-10-31 7:11 UTC (permalink / raw)
To: Andrew Morton; +Cc: kravetz, mel, linux-mm, linux-kernel, lhms-devel
Andrew Morton wrote:
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>>Despite what people were trying to tell me at Ottawa, this patch
>>set really does add quite a lot of complexity to the page
>>allocator, and it seems to be increasingly only of benefit to
>>dynamically allocating hugepages and memory hot unplug.
>
>
> Remember that Rohit is seeing ~10% variation between runs of scientific
> software, and that his patch to use higher-order pages to preload the
> percpu-pages magazines fixed that up. I assume this means that it provided
> up to 10% speedup, which is a lot.
>
OK, I wasn't aware of this. I wonder what other approaches we could
try to add a bit of colour to our pages? I bet something simple like
trying to hand out alternate odd/even pages per task might help.
> But the patch caused page allocator fragmentation and several reports of
> gigE Tx buffer allocation failures, so I dropped it.
>
> We think that Mel's patches will allow us to reintroduce Rohit's
> optimisation.
>
>
>>If that is the case, do we really want to make such sacrifices
>>for the huge machines that want these things? What about just
>>making an extra zone for easy-to-reclaim things to live in?
>>
>>This could possibly even be resized at runtime according to
>>demand with the memory hotplug stuff (though I haven't been
>>following that).
>>
>>Don't take this as criticism of the actual implementation or its
>>effectiveness.
>>
>
>
> But yes, adding additional complexity is a black mark, and these patches
> add quite a bit. (Ditto the fine-looking adaptive readahead patches, btw).
>
They do look quite fine. They seem to get their claws pretty deep
into page reclaim, but I guess that is to be expected if we want
to increase readahead smarts much more.
However, I'm hoping bits of that can be merged at a time, and
interfaces and page reclaim stuff can be discussed and the best
option taken. No such luck with these patches AFAIKS - simply
adding another level of page groups, and another level of
heuristics to the page allocator is going to hurt. By definition.
I do wonder why zones can't be used... though I'm sure there are
good reasons.
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-10-31 7:11 ` Nick Piggin
@ 2005-10-31 16:19 ` Mel Gorman
2005-10-31 23:54 ` Nick Piggin
0 siblings, 1 reply; 253+ messages in thread
From: Mel Gorman @ 2005-10-31 16:19 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel
On Mon, 31 Oct 2005, Nick Piggin wrote:
> Andrew Morton wrote:
> > Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
> > > Despite what people were trying to tell me at Ottawa, this patch
> > > set really does add quite a lot of complexity to the page
> > > allocator, and it seems to be increasingly only of benefit to
> > > dynamically allocating hugepages and memory hot unplug.
> >
> >
> > Remember that Rohit is seeing ~10% variation between runs of scientific
> > software, and that his patch to use higher-order pages to preload the
> > percpu-pages magazines fixed that up. I assume this means that it provided
> > up to 10% speedup, which is a lot.
> >
>
> OK, I wasn't aware of this. I wonder what other approaches we could
> try to add a bit of colour to our pages? I bet something simple like
> trying to hand out alternate odd/even pages per task might help.
>
Reading through the kernel archives, it appears that any page colouring
scheme was getting rejected because it slowed down workloads like kernel
compiles that were not very cache sensitive. Where an approach didn't
suffer from that problem, there was disagreement over whether there was a
general performance improvement or not.
I recall Rohit's patch from an earlier -mm. Without knowing anything about
his test, I am guessing he is getting cheap page colouring by preloading
the per-cpu cache with contiguous pages and his workload is faulting in
the batch of pages immediately by doing something like linearly reading a
large array. Hence, the mappings of his workload are getting the right
colour pages. This makes his workload a "lucky" workload. The general
benefit of preloading the percpu magazines is that there is a chance the
allocator only has to be called once, not pcp->batch times.
An odd/even allocation scheme could be provided by having two free_lists
in a free_area. One list for the "left buddy" and the other list for the
"right buddy". However, at best, that would provide two colours. I'm not
sure how much benefit it would give for the cost of more linked lists.
> > gigE Tx buffer allocation failures, so I dropped it.
> >
> > We think that Mel's patches will allow us to reintroduce Rohit's
> > optimisation.
> >
> >
> > > If that is the case, do we really want to make such sacrifices
> > > for the huge machines that want these things? What about just
> > > making an extra zone for easy-to-reclaim things to live in?
> > >
> > > This could possibly even be resized at runtime according to
> > > demand with the memory hotplug stuff (though I haven't been
> > > following that).
> > >
> > > Don't take this as criticism of the actual implementation or its
> > > effectiveness.
> > >
> >
> >
> > But yes, adding additional complexity is a black mark, and these patches
> > add quite a bit. (Ditto the fine-looking adaptive readahead patches, btw).
> >
>
> They do look quite fine. They seem to get their claws pretty deep
> into page reclaim, but I guess that is to be expected if we want
> to increase readahead smarts much more.
>
> However, I'm hoping bits of that can be merged at a time, and
> interfaces and page reclaim stuff can be discussed and the best
> option taken. No such luck with these patches AFAIKS - simply
> adding another level of page groups, and another level of
> heuristics to the page allocator is going to hurt. By definition.
> I do wonder why zones can't be used... though I'm sure there are
> good reasons.
>
Granted, the patch set does add complexity even though I tried to keep it
as simple as possible. Benchmarks were posted with each patchset to show
that it was not suffering in real performance even if the code is a bit
less approachable.
Doing something similar with zones is an old idea that was brought up
specifically for memory hotplug. In those implementations, the zone was
called ZONE_HOTREMOVABLE or something similar. In my opinion, replicating
the effect of this set of patches with zones introduces its own set of
headaches and ends up being far more complicated. Hopefully, someone will
point out if I am missing historical context here, am rehashing old
arguments or am just plain wrong :)
To replicate the functionality of these patches with zones would require
two additional zones for NormalEasy and HighmemEasy (I suck at naming
things). The plus side is that once the zone fallback lists are updated,
the page allocator remains more or less the same as it is today. Then the
headaches start.
Problem 1: Zone fallback lists are "one-way" and per-node. Let's assume a
fallback list of HighMemEasy, HighMem, NormalEasy, Normal, DMA. Assuming
we are allocating PTEs from high memory, we could fallback to the Normal
zone even if highmem pages are available because the HighMem zone was out
of pages. It will require very different fallback logic to say that
HighMem allocations can also use HighMemEasy rather than falling back to
Normal.
Problem 2: Setting the zone size will be a very difficult tunable to get
right. Right off, we are introducing a tunable which will make
foreheads furrow. If the tunable is set wrong, system performance will
suffer and we could see situations where kernel allocations fail because
its zone got depleted.
Problem 3: To get rid of the tunable, we could try resizing the zones
dynamically but that will be hard. Obviously, the zones are going to be
physically adjacent to each other. To resize the zone, the pages at one
end of the zone will need to be free. Shrinking the NormalEasy zone would
be easy enough, but shrinking the Normal zone with kernel pages in it
would be considerably harder, if not outright impossible. One page in the
wrong place will mean the zone cannot be resized.
Problem 4: Page reclaim would have two new zones to deal with, bringing
with it a new set of zone balancing problems. That brings its own special
brand of fun.
There may be more problems but these 4 are fairly important. This patchset
does not suffer from the same problems.
Problem 1: This patchset has a fallback list for each allocation type. So
EasyRclm allocations can just as easily use an area reserved for kernel
allocations and vice versa. Obviously we don't like it when this happens, but
when it does, things start fragmenting rather than breaking.
Problem 2: The number of pages that get reserved for each type grows and
shrinks on demand. There is no tunable and no need for one.
Problem 3: Problem doesn't exist for this patchset
Problem 4: Problem doesn't exist for this patchset.
Bottom line, using zones will be more complex than this set of patches and
bring a lot of tricky issues with it.
--
Mel Gorman
Part-time Phd Student Java Applications Developer
University of Limerick IBM Dublin Software Lab
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-10-31 16:19 ` Mel Gorman
@ 2005-10-31 23:54 ` Nick Piggin
2005-11-01 1:28 ` Mel Gorman
0 siblings, 1 reply; 253+ messages in thread
From: Nick Piggin @ 2005-10-31 23:54 UTC (permalink / raw)
To: Mel Gorman; +Cc: Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel
Mel Gorman wrote:
> I recall Rohit's patch from an earlier -mm. Without knowing anything about
> his test, I am guessing he is getting cheap page colouring by preloading
> the per-cpu cache with contiguous pages and his workload is faulting in
> the batch of pages immediately by doing something like linearly reading a
> large array. Hence, the mappings of his workload are getting the right
> colour pages. This makes his workload a "lucky" workload. The general
> benefit of preloading the percpu magazines is that there is a chance the
> allocator only has to be called once, not pcp->batch times.
>
Or we could introduce a new allocation mechanism for anon pages that
passes the vaddr to the allocator, and tries to get an odd/even page
according to the vaddr.
> An odd/even allocation scheme could be provided by having two free_lists
> in a free_area. One list for the "left buddy" and the other list for the
> "right buddy". However, at best, that would provide two colours. I'm not
> sure how much benefit it would give for the cost of more linked lists.
>
2 colours should be a good first order improvement because you will
no longer have adjacent pages of the same colour.
It would definitely be cheaper than fragmentation avoidance + higher
order batch loading.
> To replicate the functionality of these patches with zones would require
> two additional zones for NormalEasy and HighmemEasy (I suck at naming
> things). The plus side is that once the zone fallback lists are updated,
> the page allocator remains more or less the same as it is today. Then the
> headaches start.
>
> Problem 1: Zone fallback lists are "one-way" and per-node. Let's assume a
> fallback list of HighMemEasy, HighMem, NormalEasy, Normal, DMA. Assuming
> we are allocating PTEs from high memory, we could fallback to the Normal
> zone even if highmem pages are available because the HighMem zone was out
> of pages. It will require very different fallback logic to say that
> HighMem allocations can also use HighMemEasy rather than falling back to
> Normal.
>
That would just be a different set of GFP flags. Your patches obviously also have
some ordering imposed.... pagecache would want HighMemEasy, HighMem,
NormalEasy, Normal, DMA; ptes will want HighMem, Normal, DMA.
Note that if you do need to make some changes to the zone allocator, then
IMO that is far preferable to adding a new layer of
things-that-are-blocks-of-memory-but-not-zones, complete with their own
balancing and other heuristics.
> Problem 2: Setting the zone size will be a very difficult tunable to get
> right. Right off, we are introducing a tunable which will make
> foreheads furrow. If the tunable is set wrong, system performance will
> suffer and we could see situations where kernel allocations fail because
> its zone got depleted.
>
But even so, when you do automatic resizing, you seem to be adding a
fundamental weak point in fragmentation avoidance.
> Problem 3: To get rid of the tunable, we could try resizing the zones
> dynamically but that will be hard. Obviously, the zones are going to be
> physically adjacent to each other. To resize the zone, the pages at one
> end of the zone will need to be free. Shrinking the NormalEasy zone would
> be easy enough, but shrinking the Normal zone with kernel pages in it
> would be considerably harder, if not outright impossible. One page in the
> wrong place will mean the zone cannot be resized
>
OK, maybe it is hard ;) Do they really need to be resized, then?
Isn't the big memory hotunplug push aimed at virtual machines and
hypervisors anyway? In which case one would presumably have some
memory that "must" be reclaimable, in which case we can't expand
non-Easy zones into that memory anyway.
> Problem 4: Page reclaim would have two new zones to deal with bringing
> with it a new set of zone balancing problems. That brings its own special
> brand of fun.
>
> There may be more problems but these 4 are fairly important. This patchset
> does not suffer from the same problems.
>
If page reclaim can't deal with 5 zones then it is going to have problems
somewhere at 3 and needs to be fixed. I don't see how your patches get
around this fun by simply introducing their own balancing and fallback
heuristics.
> Problem 1: This patchset has a fallback list for each allocation type. So
> EasyRclm allocations can just as easily use an area reserved for kernel
> allocations and vice versa. Obviously we don't like it when this happens, but
> when it does, things start fragmenting rather than breaking.
>
> Problem 2: The number of pages that get reserved for each type grows and
> shrinks on demand. There is no tunable and no need for one.
>
> Problem 3: Problem doesn't exist for this patchset
>
> Problem 4: Problem doesn't exist for this patchset.
>
> Bottom line, using zones will be more complex than this set of patches and
> bring a lot of tricky issues with it.
>
Maybe zones don't do exactly what you need, but I think they're better
than you think ;)
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-10-31 23:54 ` Nick Piggin
@ 2005-11-01 1:28 ` Mel Gorman
2005-11-01 1:42 ` Nick Piggin
0 siblings, 1 reply; 253+ messages in thread
From: Mel Gorman @ 2005-11-01 1:28 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel
On Tue, 1 Nov 2005, Nick Piggin wrote:
> Mel Gorman wrote:
>
> > I recall Rohit's patch from an earlier -mm. Without knowing anything about
> > his test, I am guessing he is getting cheap page colouring by preloading
> > the per-cpu cache with contiguous pages and his workload is faulting in
> > the batch of pages immediately by doing something like linearly reading a
> > large array. Hence, the mappings of his workload are getting the right
> > colour pages. This makes his workload a "lucky" workload. The general
> > benefit of preloading the percpu magazines is that there is a chance the
> > allocator only has to be called once, not pcp->batch times.
> >
>
> Or we could introduce a new allocation mechanism for anon pages that
> passes the vaddr to the allocator, and tries to get an odd/even page
> according to the vaddr.
>
We could, but it is a different problem from the one this set of patches
is trying to address. I'll add page colouring to the end of the todo list in
case I get stuck for something to do.
> > An odd/even allocation scheme could be provided by having two free_lists
> > in a free_area. One list for the "left buddy" and the other list for the
> > "right buddy". However, at best, that would provide two colours. I'm not
> > sure how much benefit it would give for the cost of more linked lists.
> >
>
> 2 colours should be a good first order improvement because you will
> no longer have adjacent pages of the same colour.
>
> It would definitely be cheaper than fragmentation avoidance + higher
> order batch loading.
>
Ok, but the page colours would also need to be in the per-cpu lists, or this
new API that supplies vaddrs would always take the spinlock for the free lists.
I don't believe it would be cheaper and any benefit would only show up on
benchmarks that are cache sensitive. Judging by previous discussions on
page colouring in the mail archives, Linus will happily kick the approach
full of holes.
As for current performance, the Aim9 benchmarks show that the
fragmentation avoidance does not have a major performance penalty. A run
of the patches in the -mm tree should find out if there are performance
regressions on other machine types.
>
> > To replicate the functionality of these patches with zones would require
> > two additional zones for NormalEasy and HighmemEasy (I suck at naming
> > things). The plus side is that once the zone fallback lists are updated,
> > the page allocator remains more or less the same as it is today. Then the
> > headaches start.
> >
> > Problem 1: Zone fallback lists are "one-way" and per-node. Let's assume a
> > fallback list of HighMemEasy, HighMem, NormalEasy, Normal, DMA. Assuming
> > we are allocating PTEs from high memory, we could fallback to the Normal
> > zone even if highmem pages are available because the HighMem zone was out
> > of pages. It will require very different fallback logic to say that
> > HighMem allocations can also use HighMemEasy rather than falling back to
> > Normal.
> >
>
> That would just be a different set of GFP flags. Your patches obviously also have
> some ordering imposed.... pagecache would want HighMemEasy, HighMem,
> NormalEasy, Normal, DMA; ptes will want HighMem, Normal, DMA.
>
As well as a different set of GFP flags, we would also need new zone
fallback logic which will hit the __alloc_pages() path. It will be adding
more complexity to the allocator and we're replacing one type of
complexity with another.
> Note that if you do need to make some changes to the zone allocator, then
> IMO that is far preferable to adding a new layer of
> things-that-are-blocks-of-memory-but-not-zones, complete with their own
> balancing and other heuristics.
>
Thing is, with my approach, the very worst that happens is that it
fragments just as bad as the normal allocator. With a zone-based approach,
the worst that happens is that the kernel zone is too small, kernel caches
do not grow to a suitable size and overall system performance degrades.
> > Problem 2: Setting the zone size will be a very difficult tunable to get
> > right. Right off, we are introducing a tunable which will make
> > foreheads furrow. If the tunable is set wrong, system performance will
> > suffer and we could see situations where kernel allocations fail because
> > its zone got depleted.
> >
>
> But even so, when you do automatic resizing, you seem to be adding a
> fundamental weak point in fragmentation avoidance.
>
The sizing I do is when a large block is split. Then the region is just
marked for a particular allocation type. This is very simple. The second
resizing that occurs is when a kernel allocation "steals" easyrclm pages. I
do not like the fact that we steal in this fashion but the alternative is
to teach kswapd how to reclaim easyrclm pages from other areas. I view
this as "future work" but if it was done, the "steal" mechanism would go
away.
> > Problem 3: To get rid of the tunable, we could try resizing the zones
> > dynamically but that will be hard. Obviously, the zones are going to be
> > physically adjacent to each other. To resize the zone, the pages at one
> > end of the zone will need to be free. Shrinking the NormalEasy zone would
> > be easy enough, but shrinking the Normal zone with kernel pages in it
> > would be considerably harder, if not outright impossible. One page in the
> > wrong place will mean the zone cannot be resized
> >
>
> OK, maybe it is hard ;) Do they really need to be resized, then?
>
I think we would need to, yes. If the size of the region is wrong, bad
things are likely to happen. If the kernel page zone is too small, it'll
be under pressure even though there is memory available elsewhere. If it's
too large, then it will get fragmented and high order allocations will
fail.
> Isn't the big memory hotunplug push aimed at virtual machines and
> hypervisors anyway? In which case one would presumably have some
> memory that "must" be reclaimable, in which case we can't expand
> non-Easy zones into that memory anyway.
>
I believe that is the case for hotplug all right, but not the case where
we just want to satisfy high order allocations in a reasonably reliable
fashion. In that case, it would be nice to reclaim an easyrclm region.
It has already been reported by Mike Kravetz that memory remove works a
whole lot better on PPC64 with this patch than without it. Memory hotplug
remove was not the problem I was trying to solve, but I consider the fact
that it is helped to be a big plus. So, even though it is possible that
this approach still gets fragmented under some workloads, we know that, in
general, it does a pretty good job.
> > Problem 4: Page reclaim would have two new zones to deal with bringing
> > with it a new set of zone balancing problems. That brings its own special
> > brand of fun.
> >
> > There may be more problems but these 4 are fairly important. This patchset
> > does not suffer from the same problems.
> >
>
> If page reclaim can't deal with 5 zones then it is going to have problems
> somewhere at 3 and needs to be fixed. I don't see how your patches get
> around this fun by simply introducing their own balancing and fallback
> heuristics.
>
If my approach gets the sizes of areas all wrong, it will fragment. If the
zone-based approach gets the sizes of areas wrong, system performance
degrades. I prefer the failure scenario of my approach :).
> > Problem 1: This patchset has a fallback list for each allocation type. So
> > EasyRclm allocations can just as easily use an area reserved for kernel
> > allocations and vice versa. Obviously we don't like it when this happens, but
> > when it does, things start fragmenting rather than breaking.
> >
> > Problem 2: The number of pages that get reserved for each type grows and
> > shrinks on demand. There is no tunable and no need for one.
> >
> > Problem 3: Problem doesn't exist for this patchset
> >
> > Problem 4: Problem doesn't exist for this patchset.
> >
> > Bottom line, using zones will be more complex than this set of patches and
> > bring a lot of tricky issues with it.
> >
>
> Maybe zones don't do exactly what you need, but I think they're better
> than you think ;)
>
You may be right, but I still think that my approach is simpler and less
likely to introduce horrible balancing problems.
--
Mel Gorman
Part-time Phd Student Java Applications Developer
University of Limerick IBM Dublin Software Lab
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 1:28 ` Mel Gorman
@ 2005-11-01 1:42 ` Nick Piggin
0 siblings, 0 replies; 253+ messages in thread
From: Nick Piggin @ 2005-11-01 1:42 UTC (permalink / raw)
To: Mel Gorman; +Cc: Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel
Mel Gorman wrote:
> On Tue, 1 Nov 2005, Nick Piggin wrote:
> Ok, but the page colours would also need to be in the per-cpu lists, as this
> new API that supplies vaddrs always takes the spinlock for the free lists.
> I don't believe it would be cheaper and any benefit would only show up on
> benchmarks that are cache sensitive. Judging by previous discussions on
> page colouring in the mail archives, Linus will happily kick the approach
> full of holes.
>
OK, but I'm just pointing out that improving page colouring doesn't
require contiguous pages.
> As for current performance, the Aim9 benchmarks show that the
> fragmentation avoidance does not have a major performance penalty. A run
> of the patches in the -mm tree should find out if there are performance
> regressions on other machine types.
>
But I can see that there will be penalties: cache misses, branches,
etc. Any new feature or more sophisticated behaviour is going to
incur those, but it needs good justification.
>>Just be a different set of GFP flags. Your patches obviously also have
>>some ordering imposed.... pagecache would want HighMemEasy, HighMem,
>>NormalEasy, Normal, DMA; ptes will want HighMem, Normal, DMA.
>>
>
>
> As well as a different set of GFP flags, we would also need new zone
> fallback logic which will hit the __alloc_pages() path. It will be adding
> more complexity to the allocator and we're replacing one type of
> complexity with another.
>
It is complexity that is mostly already handled for us with the zones
logic. Picking out a couple of small points that zones don't get exactly
right isn't a good basis for coming up with a completely new zoning layer.
>
>>Note that if you do need to make some changes to the zone allocator, then
>>IMO that is far preferable to add a new layer of things-that-are-blocks-of-
>>-memory-but-not-zones, complete with their own balancing and other heuristics.
>>
>
>
> Thing is, with my approach, the very worst that happens is that it
> fragments just as bad as the normal allocator. With a zone-based approach,
> the worst that happens is that the kernel zone is too small, kernel caches
> do not grow to a suitable size and overall system performance degrades.
>
If you don't need to guarantee higher order allocations, then there is
no problem with our current approach. If you do then you simply need to
make a sacrifice.
>
>>>Problem 2: Setting the zone size will be a very difficult tunable to get
>>>right. Right off, we are introducing a tunable which will make
>>>foreheads furrow. If the tunable is set wrong, system performance will
>>>suffer and we could see situations where kernel allocations fail because
>>>its zone got depleted.
>>>
>>
>>But even so, when you do automatic resizing, you seem to be adding a
>>fundamental weak point in fragmentation avoidance.
>>
>
>
> The first resizing I do happens when a large block is split. Then the region
> is just marked for a particular allocation type. This is very simple. The
> second resizing occurs when a kernel allocation "steals" easyrclm pages. I
> do not like the fact that we steal in this fashion but the alternative is
> to teach kswapd how to reclaim easyrclm pages from other areas. I view
> this as "future work" but if it was done, the "steal" mechanism would go
> away.
>
Weak point, as in: gets fragmented.
>
>>>Problem 3: To get rid of the tunable, we could try resizing the zones
>>>dynamically but that will be hard. Obviously, the zones are going to be
>>>physically adjacent to each other. To resize the zone, the pages at one
>>>end of the zone will need to be free. Shrinking the NormalEasy zone would
>>>be easy enough, but shrinking the Normal zone with kernel pages in it
>>>would be considerably harder, if not outright impossible. One page in the
>>>wrong place will mean the zone cannot be resized.
>>>
>>
>>OK, maybe it is hard ;) Do they really need to be resized, then?
>>
>
>
> I think we would need to, yes. If the size of the region is wrong, bad
> things are likely to happen. If the kernel page zone is too small, it'll
> be under pressure even though there is memory available elsewhere. If it's
> too large, then it will get fragmented and high order allocations will
> fail.
>
But people will just have to get it right then. If they want to be able
to hot unplug 10G of memory, or allocate 4G of hugepages on demand, then
they simply need to specify their requirements. Not too difficult? It is
really nice to be able to place some burden on huge servers and mainframes,
because they have people administering and tuning them full-time. It
allows us to not penalise small servers and desktops.
>
>>Isn't the big memory hotunplug push aimed at virtual machines and
>>hypervisors anyway? In which case one would presumably have some
>>memory that "must" be reclaimable, in which case we can't expand
>>non-Easy zones into that memory anyway.
>>
>
>
> I believe that is the case for hotplug all right, but not the case where
> we just want to satisfy high order allocations in a reasonably reliable
> fashion. In that case, it would be nice to reclaim an easyrclm region.
>
As I've said before, I think this is a false hope and we need to
move away from higher order allocations.
> It has already been reported by Mike Kravetz that memory remove works a
> whole lot better on PPC64 with this patch than without it. Memory hotplug
> remove was not the problem I was trying to solve, but I consider the fact
> that it is helped to be a big plus. So, even though it is possible that
> this approach still gets fragmented under some workloads, we know that, in
> general, it does a pretty good job.
>
Sure, but using zones would work too, and on the plus side you would
be able to specify exactly how much memory should be removable.
>>
>>Maybe zones don't do exactly what you need, but I think they're better
>>than you think ;)
>>
>
>
> You may be right, but I still think that my approach is simpler and less
> likely to introduce horrible balancing problems.
>
Simpler? We already have zones though. They are a complexity we need to
deal with already. I really can't see how you can use the simpler argument
in favour of your patches ;)
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-10-31 7:54 ` Andrew Morton
2005-10-31 7:11 ` Nick Piggin
@ 2005-10-31 14:34 ` Martin J. Bligh
2005-10-31 19:24 ` Andrew Morton
` (2 more replies)
1 sibling, 3 replies; 253+ messages in thread
From: Martin J. Bligh @ 2005-10-31 14:34 UTC (permalink / raw)
To: Andrew Morton, Nick Piggin
Cc: kravetz, mel, linux-mm, linux-kernel, lhms-devel
>> Despite what people were trying to tell me at Ottawa, this patch
>> set really does add quite a lot of complexity to the page
>> allocator, and it seems to be increasingly only of benefit to
>> dynamically allocating hugepages and memory hot unplug.
>
> Remember that Rohit is seeing ~10% variation between runs of scientific
> software, and that his patch to use higher-order pages to preload the
> percpu-pages magazines fixed that up. I assume this means that it provided
> up to 10% speedup, which is a lot.
I, for one, would like to see some harder numbers on that, together with
which architectures they actually affect.
> But the patch caused page allocator fragmentation and several reports of
> gigE Tx buffer allocation failures, so I dropped it.
Yes, it made that condition worse but ...
> We think that Mel's patches will allow us to reintroduce Rohit's
> optimisation.
... frankly, it happens without Rohit's patch as well (under more stress).
If we want an OS that is robust and supports higher order allocations,
we need to start caring about fragmentation. Not just for large pages,
and hotplug, but also for more common things like jumbo GigE frames,
CIFS, various device drivers, kernel stacks > 4K etc.
To me, the question is "do we support higher order allocations, or not?".
Pretending we do, making a half-assed job of it, and then it not working
well under pressure is not helping anyone. I'm told, for instance, that
AMD64 requires > 4K stacks - that's pretty fundamental, as just one
instance. I'd rather make Linux pretty bulletproof - the added feature
stuff is just a bonus that comes for free with that.
We don't do a good job of long-term stress testing, which is where
fragmentation occurs. Unfortunately, customers do ;-(. I've become
fairly convinced we need something like this.
M.
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-10-31 14:34 ` Martin J. Bligh
@ 2005-10-31 19:24 ` Andrew Morton
2005-10-31 19:40 ` Martin J. Bligh
2005-10-31 23:29 ` Nick Piggin
2005-11-01 14:40 ` Avi Kivity
2 siblings, 1 reply; 253+ messages in thread
From: Andrew Morton @ 2005-10-31 19:24 UTC (permalink / raw)
To: Martin J. Bligh
Cc: nickpiggin, kravetz, mel, linux-mm, linux-kernel, lhms-devel
"Martin J. Bligh" <mbligh@mbligh.org> wrote:
>
> To me, the question is "do we support higher order allocations, or not?".
> Pretending we do, making a half-assed job of it, and then it not working
> well under pressure is not helping anyone. I'm told, for instance, that
> AMD64 requires > 4K stacks - that's pretty fundamental, as just one
> instance. I'd rather make Linux pretty bulletproof - the added feature
> stuff is just a bonus that comes for free with that.
Well... stacks are allocated with GFP_KERNEL, so we're reliable there.
It's the GFP_ATOMIC higher-order allocations which fail, and networking
copes with that.
I suspect this would all be a non-issue if the net drivers were using
__GFP_NOWARN ;)
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-10-31 19:24 ` Andrew Morton
@ 2005-10-31 19:40 ` Martin J. Bligh
2005-10-31 23:59 ` Nick Piggin
0 siblings, 1 reply; 253+ messages in thread
From: Martin J. Bligh @ 2005-10-31 19:40 UTC (permalink / raw)
To: Andrew Morton
Cc: nickpiggin, kravetz, mel, linux-mm, linux-kernel, lhms-devel
--On Monday, October 31, 2005 11:24:09 -0800 Andrew Morton <akpm@osdl.org> wrote:
> "Martin J. Bligh" <mbligh@mbligh.org> wrote:
>>
>> To me, the question is "do we support higher order allocations, or not?".
>> Pretending we do, making a half-assed job of it, and then it not working
>> well under pressure is not helping anyone. I'm told, for instance, that
>> AMD64 requires > 4K stacks - that's pretty fundamental, as just one
>> instance. I'd rather make Linux pretty bulletproof - the added feature
>> stuff is just a bonus that comes for free with that.
>
> Well... stacks are allocated with GFP_KERNEL, so we're reliable there.
>
> It's the GFP_ATOMIC higher-order allocations which fail, and networking
> copes with that.
>
> I suspect this would all be a non-issue if the net drivers were using
> __GFP_NOWARN ;)
We still need to allocate them, even if it's GFP_KERNEL. As memory gets
larger and larger, and we have no targetted reclaim, we'll have to blow
away more and more stuff at random before we happen to get contiguous
free areas. Just statistics aren't in your favour ... Getting 4 contig
pages on a 1GB desktop is much harder than on a 128MB machine.
Is not going to get better as time goes on ;-) Yeah, yeah, I know, you
want recreates, numbers, etc. Not the easiest thing to reproduce in a
short-term consistent manner though.
M.
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-10-31 19:40 ` Martin J. Bligh
@ 2005-10-31 23:59 ` Nick Piggin
2005-11-01 1:36 ` Mel Gorman
0 siblings, 1 reply; 253+ messages in thread
From: Nick Piggin @ 2005-10-31 23:59 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Andrew Morton, kravetz, mel, linux-mm, linux-kernel, lhms-devel
Martin J. Bligh wrote:
> --On Monday, October 31, 2005 11:24:09 -0800 Andrew Morton <akpm@osdl.org> wrote:
>>I suspect this would all be a non-issue if the net drivers were using
>>__GFP_NOWARN ;)
>
>
> We still need to allocate them, even if it's GFP_KERNEL. As memory gets
> larger and larger, and we have no targetted reclaim, we'll have to blow
> away more and more stuff at random before we happen to get contiguous
> free areas. Just statistics aren't in your favour ... Getting 4 contig
> pages on a 1GB desktop is much harder than on a 128MB machine.
>
However, these allocations are not of the "easy to reclaim" type, in
which case they just use the regular fragmented-to-shit areas. If no
contiguous pages are available from there, then an easy-reclaim area
needs to be stolen, right?
In which case I don't see why these patches don't have similar long
term failure cases if there is strong demand for higher order
allocations. Prolong things a bit, perhaps, but...
> Is not going to get better as time goes on ;-) Yeah, yeah, I know, you
> want recreates, numbers, etc. Not the easiest thing to reproduce in a
> short-term consistent manner though.
>
Regardless, I think we need to continue our steady move away from
higher order allocation requirements.
--
SUSE Labs, Novell Inc.
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-10-31 23:59 ` Nick Piggin
@ 2005-11-01 1:36 ` Mel Gorman
0 siblings, 0 replies; 253+ messages in thread
From: Mel Gorman @ 2005-11-01 1:36 UTC (permalink / raw)
To: Nick Piggin
Cc: Martin J. Bligh, Andrew Morton, kravetz, linux-mm, linux-kernel,
lhms-devel
On Tue, 1 Nov 2005, Nick Piggin wrote:
> Martin J. Bligh wrote:
> > --On Monday, October 31, 2005 11:24:09 -0800 Andrew Morton <akpm@osdl.org>
> > wrote:
>
> > > I suspect this would all be a non-issue if the net drivers were using
> > > __GFP_NOWARN ;)
> >
> >
> > We still need to allocate them, even if it's GFP_KERNEL. As memory gets
> > larger and larger, and we have no targetted reclaim, we'll have to blow
> > away more and more stuff at random before we happen to get contiguous
> > free areas. Just statistics aren't in your favour ... Getting 4 contig
> > pages on a 1GB desktop is much harder than on a 128MB machine.
>
> However, these allocations are not of the "easy to reclaim" type, in
> which case they just use the regular fragmented-to-shit areas. If no
> contiguous pages are available from there, then an easy-reclaim area
> needs to be stolen, right?
>
Right.
> In which case I don't see why these patches don't have similar long
> term failure cases if there is strong demand for higher order
> allocations. Prolong things a bit, perhaps, but...
>
It all hinges on how long-lived the high order kernel allocation is. If it's
short-lived, it will get freed back to the easyrclm free lists and we
don't fragment. If it turns out to be long lived, then we are in trouble.
If this turns out to be the case, a possibility would be to use the
__GFP_KERNRCLM flag for high order, short lived allocations. This would
tend to group large free areas in the same place. It would only be worth
investigating if we found that memory still got fragmented over very long
periods of time.
> > Is not going to get better as time goes on ;-) Yeah, yeah, I know, you
> > want recreates, numbers, etc. Not the easiest thing to reproduce in a
> > short-term consistent manner though.
> >
>
> Regardless, I think we need to continue our steady move away from
> higher order allocation requirements.
>
No argument with you there. My actual aim is to guarantee HugeTLB
allocations for userspace which we currently have to reserve at boot time.
Stuff like memory hotplug remove and high order kernel allocations are
benefits that would be nice to pick up on the way.
--
Mel Gorman
Part-time Phd Student Java Applications Developer
University of Limerick IBM Dublin Software Lab
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-10-31 14:34 ` Martin J. Bligh
2005-10-31 19:24 ` Andrew Morton
@ 2005-10-31 23:29 ` Nick Piggin
2005-11-01 0:59 ` Mel Gorman
2005-11-01 14:40 ` Avi Kivity
2 siblings, 1 reply; 253+ messages in thread
From: Nick Piggin @ 2005-10-31 23:29 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Andrew Morton, kravetz, mel, linux-mm, linux-kernel, lhms-devel
Martin J. Bligh wrote:
>>We think that Mel's patches will allow us to reintroduce Rohit's
>>optimisation.
>
>
> ... frankly, it happens without Rohit's patch as well (under more stress).
> If we want an OS that is robust and supports higher order allocations,
> we need to start caring about fragmentation. Not just for large pages,
> and hotplug, but also for more common things like jumbo GigE frames,
> CIFS, various device drivers, kernel stacks > 4K etc.
>
But it doesn't seem to be a great problem right now, apart from hotplug
and hugepages. Some jumbo GigE drivers use higher order allocations, but
I think there are moves to get away from that (e1000, for example).
> To me, the question is "do we support higher order allocations, or not?".
> Pretending we do, making a half-assed job of it, and then it not working
> well under pressure is not helping anyone. I'm told, for instance, that
> AMD64 requires > 4K stacks - that's pretty fundamental, as just one
And i386 had required 8K stacks for a long long time too.
> instance. I'd rather make Linux pretty bulletproof - the added feature
> stuff is just a bonus that comes for free with that.
>
But this doesn't exactly make Linux bulletproof, AFAIKS it doesn't work
well on small memory systems, and it can still get fragmented and not work.
IMO in order to make Linux bulletproof, just have fallbacks for anything
greater than about order 2 allocations.
From what I have seen, by far our biggest problems in the mm are due to
page reclaim, and these patches will make our reclaim behaviour more
complex I think.
--
SUSE Labs, Novell Inc.
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-10-31 23:29 ` Nick Piggin
@ 2005-11-01 0:59 ` Mel Gorman
2005-11-01 1:31 ` Nick Piggin
0 siblings, 1 reply; 253+ messages in thread
From: Mel Gorman @ 2005-11-01 0:59 UTC (permalink / raw)
To: Nick Piggin
Cc: Martin J. Bligh, Andrew Morton, kravetz, linux-mm, linux-kernel,
lhms-devel
On Tue, 1 Nov 2005, Nick Piggin wrote:
> Martin J. Bligh wrote:
>
> > > We think that Mel's patches will allow us to reintroduce Rohit's
> > > optimisation.
> >
> >
> > ... frankly, it happens without Rohit's patch as well (under more stress).
> > If we want an OS that is robust and supports higher order allocations,
> > we need to start caring about fragmentation. Not just for large pages,
> > and hotplug, but also for more common things like jumbo GigE frames,
> > CIFS, various device drivers, kernel stacks > 4K etc.
>
> But it doesn't seem to be a great problem right now, apart from hotplug
> and hugepages. Some jumbo GigE drivers use higher order allocations, but
> I think there are moves to get away from that (e1000, for example).
>
GigE drivers and any other subsystem will not use higher order allocations
if they know the underlying allocator is not going to satisfy the
request. These patches are the starting point for properly supporting
large allocations. I will admit that this set of patches is not going to
solve the whole problem, but it is a start that can be built upon.
> > To me, the question is "do we support higher order allocations, or not?".
> > Pretending we do, making a half-assed job of it, and then it not working
> > well under pressure is not helping anyone. I'm told, for instance, that
> > AMD64 requires > 4K stacks - that's pretty fundamental, as just one
>
> And i386 had required 8K stacks for a long long time too.
>
> > instance. I'd rather make Linux pretty bulletproof - the added feature
> > stuff is just a bonus that comes for free with that.
> >
>
> But this doesn't exactly make Linux bulletproof, AFAIKS it doesn't work
> well on small memory systems, and it can still get fragmented and not work.
Small memory systems are unlikely to care about satisfying large
allocations. These patches should not be adversely affecting small memory
systems but it is likely that a smaller value of MAX_ORDER would have to
be used to help with fragmentation.
You are right that we can still get fragmented. To prevent all
fragmentation would require more work but these patches would still be the
starting point. It makes sense to start with this patchset now and move on
to the more complex stuff later. If these patches are in, we could later
do stuff like;
o Configurable option that controls how strict fallback is. In a situation
where we absolutely do not want to fragment, do not allow kernel
allocations to fallback to EasyRclm zones. Instead, teach kswapd to
reclaim pages from the Fallback and KernNoRclm areas.
o Configurable option that gets kswapd to keep the KernNoRclm, KernRclm
and Fallback areas free of EasyRclm pages. This would prevent awkward
kernel pages ending up in the wrong areas at the cost of more work for
kswapd
o Linear scan memory to remove contiguous groups of large pages to satisfy
larger allocations. The usemap gives hints to what regions are worth
trying to reclaim. I have a set of patches that do something like this
and it was able to satisfy large allocations reliably, but they are slow
right now and need a lot of work.
All these ideas need a mechanism like this set of patches to group related
pages together. This set of patches still helps with fragmentation now, although
not in a 100% reliable fashion. My desktop which is running a kernel
patched with these patches has been running for 33 hours and managed to
allocate 80 order-10 blocks from ZONE_NORMAL which is about 42% of the
zone while xmms, X, konqueror and a pile of terminals were running. That
is pretty decent, even if it's not perfect.
With this approach, an easyrclm region can be reclaimed to help satisfy a
large allocation. This would be harder to do with a zone-based approach.
Obviously, stealing easyrclm pages to satisfy a high order allocation
could end up fragmenting the system given enough time. This worst-case
scenario would occur if high order allocations were in heavy demand, they
were kernel allocations *and* they were long lived.
If we find in the future that this worst-case scenario occurs frequently,
the easiest solution would be to use __GFP_KERNRCLM for high order short
lived allocations rather than its current usage for caches like icaches.
This would set up areas that tend to have high order free blocks in them.
> IMO in order to make Linux bulletproof, just have fallbacks for anything
> greater than about order 2 allocations.
>
What sort of fallbacks? Private pools of pages of the larger order for
subsystems that need large pages are hardly desirable.
> From what I have seen, by far our biggest problems in the mm are due to
> page reclaim, and these patches will make our reclaim behaviour more
> complex I think.
>
This patchset does not touch reclaim at all. The lists that this patch
really affects are the zone freelists, not the LRU lists that page reclaim
is dealing with. It is only later, when we want to try and guarantee
large-order allocations that we will have to change page reclaim.
--
Mel Gorman
Part-time Phd Student Java Applications Developer
University of Limerick IBM Dublin Software Lab
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 0:59 ` Mel Gorman
@ 2005-11-01 1:31 ` Nick Piggin
2005-11-01 2:07 ` Mel Gorman
2005-11-01 15:25 ` Martin J. Bligh
0 siblings, 2 replies; 253+ messages in thread
From: Nick Piggin @ 2005-11-01 1:31 UTC (permalink / raw)
To: Mel Gorman
Cc: Martin J. Bligh, Andrew Morton, kravetz, linux-mm, linux-kernel,
lhms-devel, Ingo Molnar
Mel Gorman wrote:
> On Tue, 1 Nov 2005, Nick Piggin wrote:
>>But it doesn't seem to be a great problem right now, apart from hotplug
>>and hugepages. Some jumbo GigE drivers use higher order allocations, but
>>I think there are moves to get away from that (e1000, for example).
>>
>
>
> GigE drivers and any other subsystem will not use higher order allocations
> if they know the underlying allocator is not going to satisfy the
> request. These patches are the starting point for properly supporting
> large allocations. I will admit that this set of patches is not going to
> solve the whole problem, but it is a start that can be built upon.
>
I really don't think we *want* to say we support higher order allocations
absolutely robustly, nor do we want people using them if possible. Because
we don't. Even with your patches.
Ingo also brought up this point at Ottawa.
>>But this doesn't exactly make Linux bulletproof, AFAIKS it doesn't work
>>well on small memory systems, and it can still get fragmented and not work.
>
>
> Small memory systems are unlikely to care about satisfying large
> allocations. These patches should not be adversely affecting small memory
> systems but it is likely that a smaller value of MAX_ORDER would have to
> be used to help with fragmentation.
>
But complexity. More bugs, code harder to understand and maintain, more
cache and memory footprint, more branches and instructions.
> You are right that we can still get fragmented. To prevent all
> fragmentation would require more work but these patches would still be the
> starting point. It makes sense to start with this patchset now and move on
> to the more complex stuff later. If these patches are in, we could later
> do stuff like;
>
> o Configurable option that controls how strict fallback is. In a situation
> where we absolutely do not want to fragment, do not allow kernel
> allocations to fallback to EasyRclm zones. Instead, teach kswapd to
> reclaim pages from the Fallback and KernNoRclm areas.
>
In which case someone like GigE is not going to be able to access unfragmented
memory anyway. This is my point. The patch still has the same long term failure
cases that we appear to only be able to sanely solve by avoiding higher order
allocations.
The easy-to-reclaim stuff doesn't need higher order allocations anyway, so
there is no point in being happy about large contiguous regions for these
guys.
The only thing that seems to need it is memory hot unplug, which should rather
use another zone.
> All these ideas need a mechanism like this set of patches to group related
> pages together. This set of patches still help fragmentation now, although
> not in a 100% reliable fashion. My desktop which is running a kernel
> patched with these patches has been running for 33 hours and managed to
> allocate 80 order-10 blocks from ZONE_NORMAL which is about 42% of the
> zone while xmms, X, konqueror and a pile of terminals were running. That
> is pretty decent, even if it's not perfect.
>
But nobody does that. Why should we care? And in the case you *really* need
to do that, your system is likely to fail at some point anyway.
OK, for hot unplug you may want that, or for hugepages. However, in those
cases it should be done with zones AFAIKS.
>>IMO in order to make Linux bulletproof, just have fallbacks for anything
>>greater than about order 2 allocations.
>>
>
>
> What sort of fallbacks? Private pools of pages of the larger order for
> subsystems that need large pages are hardly desirable.
>
Mechanisms to continue to run without contiguous memory would be best.
Small private pools aren't particularly undesirable - we do that everywhere
anyway. Your fragmentation patches essentially do that.
>
>>From what I have seen, by far our biggest problems in the mm are due to
>>page reclaim, and these patches will make our reclaim behaviour more
>>complex I think.
>>
>
>
> This patchset does not touch reclaim at all. The lists that this patch
> really affects are the zone freelists, not the LRU lists that page reclaim
> is dealing with. It is only later, when we want to try and guarantee
> large-order allocations that we will have to change page reclaim.
>
But it affects things in the allocation path which in turn affects the
reclaim path. You're doing various balancing and fallbacks and it is
simply complicated behaviour in terms of trying to analyse a working
system.
--
SUSE Labs, Novell Inc.
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 1:31 ` Nick Piggin
@ 2005-11-01 2:07 ` Mel Gorman
2005-11-01 2:35 ` Nick Piggin
2005-11-01 15:25 ` Martin J. Bligh
1 sibling, 1 reply; 253+ messages in thread
From: Mel Gorman @ 2005-11-01 2:07 UTC (permalink / raw)
To: Nick Piggin
Cc: Martin J. Bligh, Andrew Morton, kravetz, linux-mm, linux-kernel,
lhms-devel, Ingo Molnar
On Tue, 1 Nov 2005, Nick Piggin wrote:
> Mel Gorman wrote:
> > On Tue, 1 Nov 2005, Nick Piggin wrote:
>
> > > But it doesn't seem to be a great problem right now, apart from hotplug
> > > and hugepages. Some jumbo GigE drivers use higher order allocations, but
> > > I think there are moves to get away from that (e1000, for example).
> > >
> >
> >
> > GigE drivers and any other subsystem will not use higher order allocations
> > if they know the underlying allocator is not going to satisfy the
> > request. These patches are the starting point for properly supporting
> > large allocations. I will admit that this set of patches is not going to
> > solve the whole problem, but it is a start that can be built upon.
> >
>
> I really don't think we *want* to say we support higher order allocations
> absolutely robustly, nor do we want people using them if possible. Because
> we don't. Even with your patches.
>
I accept that. We should not be encouraging subsystems to use high order
allocations but keeping the system in a fragmented state to force the
issue is hardly the correct thing to do either.
> Ingo also brought up this point at Ottawa.
>
> > > But this doesn't exactly make Linux bulletproof, AFAIKS it doesn't work
> > > well on small memory systems, and it can still get fragmented and not
> > > work.
> >
> >
> > Small memory systems are unlikely to care about satisfying large
> > allocations. These patches should not be adversely affecting small memory
> > systems but it is likely that a smaller value of MAX_ORDER would have to
> > be used to help with fragmentation.
> >
>
> But complexity. More bugs, code harder to understand and maintain, more
> cache and memory footprint, more branches and instructions.
>
The patches have gone through a large number of revisions, have been
heavily tested and reviewed by a few people. The memory footprint of this
approach is smaller than introducing new zones. If the cache footprint,
increased branches and instructions were a problem, I would expect them to
show up in the aim9 benchmark or the benchmark that ran ghostscript
multiple times on a large file.
> > You are right that we can still get fragmented. To prevent all
> > fragmentation would require more work but these patches would still be the
> > starting point. It makes sense to start with this patchset now and move on
> > the the more complex stuff later. If these patches are in, we could later
> > do stuff like;
> >
> > o Configurable option that controls how strict fallback is. In a situation
> > where we absolutely do not want to fragment, do not allow kernel
> > allocations to fallback to EasyRclm zones. Instead, teach kswapd to
> > reclaim pages from the Fallback and KernNoRclm areas.
> >
>
> In which case someone like GigE is not going to be able to access unfragmented
> memory anyway. This is my point. The patch still has the same long term
> failure
> cases that we appear to only be able to sanely solve by avoiding higher order
> allocations.
>
> The easy-to-reclaim stuff doesn't need higher order allocations anyway, so
> there is no point in being happy about large contiguous regions for these
> guys.
>
They will need high-order allocations if we want to provide HugeTLB pages
to userspace on-demand rather than reserving at boot-time. This is a
future problem, but it's one that is not worth tackling until the
fragmentation problem is fixed first.
> The only thing that seems to need it is memory hot unplug, which should rather
> use another zone.
>
Work from 2004 in memory hotplug was trying to use additional zones. I am
hoping that someone more involved with memory hotplug will tell us what
problems they ran into. If they ran into no problems, they might explain
why it was never included in the mainline.
>
> > All these ideas need a mechanism like this set of patches to group related
> > pages together. This set of patches still helps fragmentation now, although
> > not in a 100% reliable fashion. My desktop which is running a kernel
> > patched with these patches has been running for 33 hours and managed to
> > allocate 80 order-10 blocks from ZONE_NORMAL which is about 42% of the
> > zone while xmms, X, konqueror and a pile of terminals were running. That
> > is pretty decent, even if it's not perfect.
> >
>
> But nobody does that. Why should we care? And in the case you *really* need
> to do that, your system likely to fail at some point anyway.
>
> OK, for hot unplug you may want that, or for hugepages. However, in those
> cases it should be done with zones AFAIKS.
>
And then we are back to what size to make the zones. This set of patches
will largely manage themselves without requiring a sysadmin to intervene.
> > > IMO in order to make Linux bulletproof, just have fallbacks for anything
> > > greater than about order 2 allocations.
> > >
> >
> >
> > What sort of fallbacks? Private pools of pages of the larger order for
> > subsystems that need large pages is hardly desirable.
> >
>
> Mechanisms to continue to run without contiguous memory would be best.
> Small private pools aren't particularly undesirable - we do that everywhere
> anyway. Your fragmentation patches essentially do that.
>
The main difference being that when a subsystem has small private pools, it
is not possible for anyone else to use them and shrinking mechanisms are
required. My fragmentation patches have subpools, but they are always
available.
> >
> > > From what I have seen, by far our biggest problems in the mm are due to
> > > page reclaim, and these patches will make our reclaim behaviour more
> > > complex I think.
> > >
> >
> >
> > This patchset does not touch reclaim at all. The lists that this patch
> > really affects are the zone freelists, not the LRU lists that page reclaim
> > deals with. It is only later when we want to try and guarantee
> > large-order allocations that we will have to change page reclaim.
> >
>
> But it affects things in the allocation path which in turn affects the
> reclaim path.
Maybe it's because it's late, but I don't see how these patches currently
hit the reclaim path. The reclaim path deals with LRU lists, this set of
patches deals with the freelists.
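The split being argued here can be sketched in miniature. The following toy userspace model shows the core idea of the patchset's freelists: each free area keeps separate free counts per reclaim type, and an allocation dips into the other types only when its own list is empty. The RCLM_* names follow the patchset; everything else (the struct, the function, the counts) is invented for illustration and is not the patch's actual code.

```c
/* Toy model of per-reclaim-type freelists with fallback.
 * Illustrative only; the real patch splits the buddy allocator's
 * per-order free_area lists, not simple counters like these. */
#include <assert.h>

enum { RCLM_NORCLM, RCLM_EASY, RCLM_KERN, RCLM_FALLBACK, RCLM_TYPES };

struct area {
    int free[RCLM_TYPES];   /* free blocks available per reclaim type */
};

/* Take a block for 'type'; if that list is empty, fall back to any
 * other type's list, fragmenting the grouping.  Returns the type the
 * block actually came from, or -1 if the area is exhausted. */
static int alloc_block(struct area *a, int type)
{
    if (a->free[type] > 0) {
        a->free[type]--;
        return type;
    }
    for (int t = 0; t < RCLM_TYPES; t++) {
        if (a->free[t] > 0) {
            a->free[t]--;
            return t;       /* fallback: block taken from a foreign list */
        }
    }
    return -1;              /* nothing free anywhere in this area */
}
```

Note that reclaim never appears here: the LRU lists feed pages *into* these freelists on free, but the choosing between lists happens entirely inside the allocator.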
> You're doing various balancing and fallbacks and it is
> simply complicated behaviour in terms of trying to analyse a working
> system.
>
Someone performing such an analysis of the system will only hit problems
with these patches if they are performing a deep analysis of the page
allocator. Other analysis such as the page reclaim should not even notice
that the page allocator has changed.
--
Mel Gorman
Part-time PhD Student Java Applications Developer
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 253+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 2:07 ` Mel Gorman
@ 2005-11-01 2:35 ` Nick Piggin
2005-11-01 11:57 ` Mel Gorman
2005-11-01 20:59 ` Joel Schopp
0 siblings, 2 replies; 253+ messages in thread
From: Nick Piggin @ 2005-11-01 2:35 UTC (permalink / raw)
To: Mel Gorman
Cc: Martin J. Bligh, Andrew Morton, kravetz, linux-mm, linux-kernel,
lhms-devel, Ingo Molnar
OK I'm starting to repeat myself a bit so after this I'll be
quiet for a bit and let others speak :)
Mel Gorman wrote:
> On Tue, 1 Nov 2005, Nick Piggin wrote:
> I accept that. We should not be encouraging subsystems to use high order
> allocations but keeping the system in a fragmented state to force the
> issue is hardly the correct thing to do either.
>
But you don't seem to actually "fix" anything. It is slightly improved,
but for cases where higher order GFP_ATOMIC and GFP_KERNEL allocations
fail (ie. anything other than memory hotplug or hugepages) you still
seem to have all the same failure cases.
Transient higher order allocations mean we don't fragment much, you say?
Well that is true, but it is true for how the system currently works.
My desktop has been up for a day or two, and it has 4444K free, and it
has 295 order-3 pages available - it can run a GigE and all its transient
allocations no problem.
In the cases where we *do* actually get those failures from e.g. networking,
I'd say your patch probably will end up having problems too. The way to
fix it is to not use higher order allocations.
>>But complexity. More bugs, code harder to understand and maintain, more
>>cache and memory footprint, more branches and instructions.
>>
>
>
> The patches have gone through a large number of revisions, have been
> heavily tested and reviewed by a few people. The memory footprint of this
> approach is smaller than introducing new zones. If the cache footprint,
> increased branches and instructions were a problem, I would expect them to
> show up in the aim9 benchmark or the benchmark that ran ghostscript
> multiple times on a large file.
>
I appreciate that a lot of work has gone into them. You must appreciate
that they add a reasonable amount of complexity and a non-zero performance
cost to the page allocator.
However I think something must be broken if the footprint of adding a new
zone is higher?
>>The easy-to-reclaim stuff doesn't need higher order allocations anyway, so
>>there is no point in being happy about large contiguous regions for these
>>guys.
>>
>
>
> They will need high-order allocations if we want to provide HugeTLB pages
> to userspace on-demand rather than reserving at boot-time. This is a
> future problem, but it's one that is not worth tackling until the
> fragmentation problem is fixed first.
>
Sure. In what form, we haven't agreed. I vote zones! :)
>
>>The only thing that seems to need it is memory hot unplug, which should rather
>>use another zone.
>>
>
>
> Work from 2004 in memory hotplug was trying to use additional zones. I am
> hoping that someone more involved with memory hotplug will tell us what
> problems they ran into. If they ran into no problems, they might explain
> why it was never included in the mainline.
>
That would be good.
>>OK, for hot unplug you may want that, or for hugepages. However, in those
>>cases it should be done with zones AFAIKS.
>>
>
>
> And then we are back to what size to make the zones. This set of patches
> will largely manage themselves without requiring a sysadmin to intervene.
>
Either you need to guarantee some hugepage allocation / hot unplug
capability or you don't. Placing a bit of burden on admins of these
huge servers or mainframes sounds like a fine idea to me.
Seriously nobody else will want this, no embedded, no desktops, no
small servers.
>
>>>>IMO in order to make Linux bulletproof, just have fallbacks for anything
>>>>greater than about order 2 allocations.
>>>>
>>>
>>>
>>>What sort of fallbacks? Private pools of pages of the larger order for
>>>subsystems that need large pages is hardly desirable.
>>>
>>
>>Mechanisms to continue to run without contiguous memory would be best.
>>Small private pools aren't particularly undesirable - we do that everywhere
>>anyway. Your fragmentation patches essentially do that.
>>
>
>
> The main difference being that when a subsystem has small private pools, it
> is not possible for anyone else to use them and shrinking mechanisms are
> required. My fragmentation patches have subpools, but they are always
> available.
>
True, but we're talking about the need to guarantee an allocation. In
that case, mempools are required anyway and neither the current nor your
modified page allocator will help.
In the case where there is no need for a guarantee, there is presumably
some other fallback.
>
>>>>From what I have seen, by far our biggest problems in the mm are due to
>>>>page reclaim, and these patches will make our reclaim behaviour more
>>>>complex I think.
>>>>
>>>
>>>
>>>This patchset does not touch reclaim at all. The lists that this patch
>>>really affects are the zone freelists, not the LRU lists that page reclaim
>>>deals with. It is only later when we want to try and guarantee
>>>large-order allocations that we will have to change page reclaim.
>>>
>>
>>But it affects things in the allocation path which in turn affects the
>>reclaim path.
>
>
> Maybe it's because it's late, but I don't see how these patches currently
> hit the reclaim path. The reclaim path deals with LRU lists, this set of
> patches deals with the freelists.
>
You don't "hit" the reclaim path, but by making the allocation path
more complex makes reclaim behaviour harder to analyse.
>
>>You're doing various balancing and fallbacks and it is
>>simply complicated behaviour in terms of trying to analyse a working
>>system.
>>
>
>
> Someone performing such an analysis of the system will only hit problems
> with these patches if they are performing a deep analysis of the page
> allocator. Other analysis such as the page reclaim should not even notice
> that the page allocator has changed.
>
Let me think what a nasty one we had was? Oh yeah, the reclaim
priority would "wind up" because concurrent allocations were keeping
free pages below watermarks.
I don't know, that's just an example but there are others. The two
are fundamentally tied together.
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 253+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 2:35 ` Nick Piggin
@ 2005-11-01 11:57 ` Mel Gorman
2005-11-01 13:56 ` Ingo Molnar
2005-11-01 20:59 ` Joel Schopp
1 sibling, 1 reply; 253+ messages in thread
From: Mel Gorman @ 2005-11-01 11:57 UTC (permalink / raw)
To: Nick Piggin
Cc: Martin J. Bligh, Andrew Morton, kravetz, linux-mm, linux-kernel,
lhms-devel, Ingo Molnar
On Tue, 1 Nov 2005, Nick Piggin wrote:
> OK I'm starting to repeat myself a bit so after this I'll be
> quiet for a bit and let others speak :)
>
> Mel Gorman wrote:
> > On Tue, 1 Nov 2005, Nick Piggin wrote:
>
> > I accept that. We should not be encouraging subsystems to use high order
> > allocations but keeping the system in a fragmented state to force the
> > issue is hardly the correct thing to do either.
> >
>
> But you don't seem to actually "fix" anything. It is slightly improved,
> but for cases where higher order GFP_ATOMIC and GFP_KERNEL allocations
> fail (ie. anything other than memory hotplug or hugepages) you still
> seem to have all the same failure cases.
>
The set of patches does fix a lot and makes a strong start at addressing the
fragmentation problem, just not 100% of the way. The stress tests I've
been running with kernel compiles show that relatively few kernel pages
fall back to undesirable areas. The failure cases I hit are dependent on
workload rather than almost guaranteed to happen as we have with the
current approach.
For example, this is the fallback statistics for the Normal zone after a
kernel compile stress-test (5 simultaneous -j2 kernel compiles);
KernNoRclm Allocs: 35461 Reserve: 24 Fallbacks: 2
EasyRclm Allocs: 971798 Reserve: 152 Fallbacks: 1223
KernRclm Allocs: 34816 Reserve: 16 Fallbacks: 0
Fallback Allocs: 0 Reserve: 28 Fallbacks: 0
2 really awkward kernel pages out of 35,461 allocations ended up in the
wrong place. Reducing the fallbacks to 0 for all workloads would require
page reclaim that knew what the areas in the usemap meant. If a mechanism
like linear page reclaiming was built upon these patches, we would find
that at least 152 2^(MAX_ORDER-1) pages could be allocated on demand if we
wanted to. We don't do anything like this today because it simply isn't
possible.
GFP_ATOMIC allocations still suffer. If the order size they need is not
available, they can't fix up the situation. Fixing that up would require
something like active defragmentation, kswapd to work on keeping high
order free pages or the reliable and reasonable "don't do high order
GFP_ATOMIC allocations".
So, with this set of patches, how fragmented you get is dependent on the
workload and it may still break down and high order allocations will fail.
But the current situation is that it will definitely break down. The fact
is that it has been reported that memory hotplug remove works with these
patches and doesn't without them. Granted, this is just one feature on a
high-end machine, but it is one solid operation we can perform with the
patches and cannot without them. The second possibility is that this patch
may allow the preloading of per-cpu magazines, which will improve some
workloads and make no difference to others. Preloading in one allocation is
less work than loading with pcp->batch allocations.
> Transient higher order allocations mean we don't fragment much, you say?
> Well that is true, but it is true for how the system currently works.
> My desktop has been up for a day or two, and it has 4444K free, and it
> has 295 order-3 pages available - it can run a GigE and all its transient
> allocations no problem.
>
> In the cases where we *do* actually get those failures from e.g. networking,
> I'd say your patch probably will end up having problems too. The way to
> fix it is to not use higher order allocations.
>
> > > But complexity. More bugs, code harder to understand and maintain, more
> > > cache and memory footprint, more branches and instructions.
> > >
> >
> >
> > The patches have gone through a large number of revisions, have been
> > heavily tested and reviewed by a few people. The memory footprint of this
> > approach is smaller than introducing new zones. If the cache footprint,
> > increased branches and instructions were a problem, I would expect them to
> > show up in the aim9 benchmark or the benchmark that ran ghostscript
> > multiple times on a large file.
> >
>
> I appreciate that a lot of work has gone into them. You must appreciate
> that they add a reasonable amount of complexity and a non-zero performance
> cost to the page allocator.
>
I do appreciate that there is a reasonable amount of complexity. Hence the
patches are fairly well commented and the introduction mail and changelog
are detailed to help explain the mechanism. If that is not good enough,
people googling may find the lwn.net article
(http://lwn.net/Articles/120960/) that explains the guts of the mechanism
complete with comment at the end on how hard it is to get high-order
allocation patches merged :). The complexity of this scheme was the main
reason why an early version was released long before it was ready so
people would have a chance to look through it.
The performance cost is something that has to be determined by benchmarks.
With each version of this patch, I released aim9 benchmarks of the clean
kernel and with my benchmarks. If there were performance regressions on my
test machine, it didn't get released until I had figured out what I was
doing wrong. Here is the difference between 2.6.14-rc5-mm1 and
2.6.14-rc5-mm1-mbuddy-v19
1 creat-clo 16006.00 15889.41 -116.59 -0.73% File Creations and Closes/second
2 page_test 117515.83 117082.15 -433.68 -0.37% System Allocations & Pages/second
3 brk_test 440289.81 437887.37 -2402.44 -0.55% System Memory Allocations/second
4 jmp_test 4179466.67 4179950.00 483.33 0.01% Non-local gotos/second
5 signal_test 80803.20 85335.78 4532.58 5.61% Signal Traps/second
6 exec_test 61.75 61.92 0.17 0.28% Program Loads/second
7 fork_test 1327.01 1342.21 15.20 1.15% Task Creations/second
8 link_test 5531.53 5555.55 24.02 0.43% Link/Unlink Pairs/second
I'll admit right now there is a 0.37% drop in raw page allocation
performance on this test run but these figures always vary by a few
percent. I could run this aim9 test a few more times until I got a figure
that showed the set of patches giving a performance gain. fork_test and
signal_test show a nice performance improvement.
A trip through the -mm tree would discover whether the performance figures
are real or just an artifact of my test machine.
> However I think something must be broken if the footprint of adding a new
> zone is higher?
>
Here is sizeof(struct zone) in three kernels:
2.6.14-rc5-mm1: 768
2.6.14-rc5-mm1-mbuddy-nostats: 1408
2.6.14-rc5-mm1-mbuddy-withallocstats: 1536
The main increases in size are one additional per-cpu list for
every CPU in the system and the addition of the new free lists. The usemap
is 2 bits per 2^(MAX_ORDER-1) pages in the system. On my system with
1.5GiB of RAM, that's 94 bytes. So the memory overhead is about 734 bytes
in all.
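The usemap arithmetic can be checked with a small standalone calculation. This sketch assumes 4KiB pages and MAX_ORDER of 11 (2^(MAX_ORDER-1) = 1024-page areas), as on i386 at the time; for exactly 1.5GiB it gives 96 bytes, so the 94 quoted presumably reflects the machine's exact usable page count rather than the nominal RAM size.

```c
/* Back-of-envelope check of the usemap overhead: 2 bits of state per
 * 2^(MAX_ORDER-1)-page area.  PAGE_SIZE and MAX_ORDER values are
 * assumptions matching a typical i386 config of that era. */
#include <assert.h>

#define PAGE_SIZE     4096UL
#define MAX_ORDER     11
#define BITS_PER_AREA 2

static unsigned long usemap_bytes(unsigned long mem_bytes)
{
    unsigned long pages = mem_bytes / PAGE_SIZE;
    unsigned long areas = pages >> (MAX_ORDER - 1);  /* 1024-page areas */
    return (areas * BITS_PER_AREA + 7) / 8;          /* round up to bytes */
}
```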
If a very small memory system was worried about this, they could get rid
of this whole scheme by defining __GFP_KERNRCLM and __GFP_EASYRCLM to 0
and change RCLM_TYPES from 4 to 1. Actually... now that I think of it, this
whole anti-fragmentation scheme could be made configurable by doing
something like;
#ifdef CONFIG_ANTIDEFRAG
#define RCLM_NORCLM 0
#define RCLM_EASY 1
#define RCLM_KERN 2
#define RCLM_FALLBACK 3
#define RCLM_TYPES 4
#define __GFP_EASYRCLM 0x80000u /* User and other easily reclaimed pages */
#define __GFP_KERNRCLM 0x100000u /* Kernel page that is reclaimable */
#else
#define RCLM_NORCLM 0
#define RCLM_EASY 0
#define RCLM_KERN 0
#define RCLM_FALLBACK 0
#define RCLM_TYPES 1
#define __GFP_EASYRCLM 0u
#define __GFP_KERNRCLM 0u
#endif
This would need more work obviously, but essentially, the above would make
anti-defragmentation a configurable option for small memory systems.
However, I would be wary of changing the behavior of the allocator as a
configurable option for anything other than debugging. Such an option
should only be provided if we really want those 734 bytes back.
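As a rough sketch of how an allocation would be steered under the flags above: the __GFP_* values are taken from the patch, but the helper function itself is illustrative rather than the patch's actual code.

```c
/* Map a caller's gfp flags to a freelist index.  Flag values copied
 * from the patch; gfp_to_rclm_type() is a hypothetical helper that
 * only illustrates the intended classification. */
#include <assert.h>

#define __GFP_EASYRCLM 0x80000u   /* user and other easily reclaimed pages */
#define __GFP_KERNRCLM 0x100000u  /* kernel page that is reclaimable */

enum { RCLM_NORCLM, RCLM_EASY, RCLM_KERN, RCLM_FALLBACK, RCLM_TYPES };

static int gfp_to_rclm_type(unsigned int gfp_flags)
{
    if (gfp_flags & __GFP_EASYRCLM)
        return RCLM_EASY;    /* sync with backing store or discard */
    if (gfp_flags & __GFP_KERNRCLM)
        return RCLM_KERN;    /* shrinkable slab caches */
    return RCLM_NORCLM;      /* everything else: hard to move */
}
```

This matches the usage in the patch's diff, where grow_dev_page() passes GFP_NOFS|__GFP_EASYRCLM and alloc_buffer_head() adds __GFP_KERNRCLM.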
We can't measure the difference in code complexity as we don't have a
zone-based approach to compare against. If it did exist, any loop
that depends on MAX_NR_ZONES would be increased by 1, maybe 2, depending on
how many of these easyrclm zones were created. This would impact
both the allocator and kswapd. It would add at least 768 bytes, or 1536 for
two zones, in comparison to the 734 bytes my approach adds.
> > > The easy-to-reclaim stuff doesn't need higher order allocations anyway, so
> > > there is no point in being happy about large contiguous regions for these
> > > guys.
> > >
> >
> >
> > They will need high-order allocations if we want to provide HugeTLB pages
> > to userspace on-demand rather than reserving at boot-time. This is a
> > future problem, but it's one that is not worth tackling until the
> > fragmentation problem is fixed first.
> >
>
> Sure. In what form, we haven't agreed. I vote zones! :)
>
We'll agree to disagree for the moment ;). I would expect others to choke
on the idea of more zones being introduced. Some architectures, like power
and sparc64 (I think), only use one zone, ZONE_DMA. Hopefully some of the
architecture maintainers will express an opinion on the addition of new
zones.
> >
> > > The only thing that seems to need it is memory hot unplug, which should
> > > rather
> > > use another zone.
> > >
> >
> >
> > Work from 2004 in memory hotplug was trying to use additional zones. I am
> > hoping that someone more involved with memory hotplug will tell us what
> > problems they ran into. If they ran into no problems, they might explain
> > why it was never included in the mainline.
> >
>
> That would be good.
>
> > > OK, for hot unplug you may want that, or for hugepages. However, in those
> > > cases it should be done with zones AFAIKS.
> > >
> >
> >
> > And then we are back to what size to make the zones. This set of patches
> > will largely manage themselves without requiring a sysadmin to intervene.
> >
>
> Either you need to guarantee some hugepage allocation / hot unplug
> capability or you don't. Placing a bit of burden on admins of these
> huge servers or mainframes sounds like a fine idea to me.
>
I'd rather avoid hitting people with tunables if at all possible. I'd
rather my face didn't end up on the dart board of some NOC because their
high-end server failed at 4 in the morning because they misconfigured the
size of the kernel zone.
> Seriously nobody else will want this, no embedded, no desktops, no
> small servers.
>
Not yet anyway. My long-term plan is to have HugeTLB pages supplied on
demand for applications that want them. This would require fragmentation
to be addressed first. Desktop applications like openoffice or anything
using a sparse address space like Java applications should benefit if they
could use HugeTLB pages. Small database servers should see a benefit as
well. Of course, this would not happen today, because right now, we cannot
give HugeTLB pages on demand to anyone, userspace or kernel space and
there is no point even trying.
Solaris is able to supply large pages for applications on demand but it
eventually falls back to using small pages because they get fragmented. I
think Windows has an API for large pages as well, but it also hits
fragmentation problems.
> >
> > > > > IMO in order to make Linux bulletproof, just have fallbacks for
> > > > > anything
> > > > > greater than about order 2 allocations.
> > > > >
> > > >
> > > >
> > > > What sort of fallbacks? Private pools of pages of the larger order for
> > > > subsystems that need large pages is hardly desirable.
> > > >
> > >
> > > Mechanisms to continue to run without contiguous memory would be best.
> > > Small private pools aren't particularly undesirable - we do that
> > > everywhere
> > > anyway. Your fragmentation patches essentially do that.
> > >
> >
> >
> > The main difference being that when a subsystem has small private pools, it
> > is not possible for anyone else to use them and shrinking mechanisms are
> > required. My fragmentation patches have subpools, but they are always
> > available.
> >
>
> True, but we're talking about the need to guarantee an allocation. In
> that case, mempools are required anyway and neither the current nor your
> modified page allocator will help.
>
The modified allocator will help when refilling the mempools as long as
the caller is not GFP_ATOMIC. If using GFP_KERNEL, kswapd will page out
enough pages to get the contiguous blocks. Teaching kswapd to be smarter
about freeing contiguous pages is future work.
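The mempool interaction being discussed can be modelled in miniature. The kernel's real interface is mempool_create()/mempool_alloc(); the standalone toy below only mirrors the semantics: normal requests go to the (here simulated) page allocator, and the preallocated reserve is dipped into only when that path fails. All names and sizes are invented for illustration.

```c
/* Toy model of mempool fallback semantics.  'allocator_ok' stands in
 * for whether the underlying page allocator can currently satisfy the
 * request (e.g. whether contiguous blocks are available). */
#include <assert.h>
#include <stdlib.h>

struct toy_mempool {
    void *elem[4];      /* preallocated reserve elements */
    int   nr;           /* elements currently held in reserve */
    int   allocator_ok; /* simulated state of the page allocator */
};

static void *toy_mempool_alloc(struct toy_mempool *p)
{
    if (p->allocator_ok)
        return malloc(64);       /* normal path: fresh allocation */
    if (p->nr > 0)
        return p->elem[--p->nr]; /* emergency path: use the reserve */
    return NULL;                 /* reserve exhausted: caller must wait */
}
```

The point in the text is the refill side: with GFP_KERNEL, the modified allocator plus kswapd can replenish the reserve with well-placed pages, whereas GFP_ATOMIC callers get no such help.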
> In the case where there is no need for a guarantee, there is presumably
> some other fallback.
>
> >
> > > > > From what I have seen, by far our biggest problems in the mm are due
> > > > > to
> > > > > page reclaim, and these patches will make our reclaim behaviour more
> > > > > complex I think.
> > > > >
> > > >
> > > >
> > > > This patchset does not touch reclaim at all. The lists that this patch
> > > > really affects are the zone freelists, not the LRU lists that page
> > > > reclaim
> > > > deals with. It is only later when we want to try and guarantee
> > > > large-order allocations that we will have to change page reclaim.
> > > >
> > >
> > > But it affects things in the allocation path which in turn affects the
> > > reclaim path.
> >
> >
> > Maybe it's because it's late, but I don't see how these patches currently
> > hit the reclaim path. The reclaim path deals with LRU lists, this set of
> > patches deals with the freelists.
> >
>
> You don't "hit" the reclaim path, but by making the allocation path
> more complex makes reclaim behaviour harder to analyse.
>
I still don't see why. With the normal allocator, free pages are on a list
that get allocated. With the modified allocator, free pages are on lists
that get allocated.
> >
> > > You're doing various balancing and fallbacks and it is
> > > simply complicated behaviour in terms of trying to analyse a working
> > > system.
> > >
> >
> >
> > Someone performing such an analysis of the system will only hit problems
> > with these patches if they are performing a deep analysis of the page
> > allocator. Other analysis such as the page reclaim should not even notice
> > that the page allocator has changed.
> >
>
> Let me think what a nasty one we had was? Oh yeah, the reclaim
> priority would "wind up" because concurrent allocations were keeping
> free pages below watermarks.
>
That sounds like kswapd could not free pages fast enough for the storm of
allocators coming in. In that case, they would all enter direct reclaim
leading to a storm of kswapd-like processes.
I still cannot see any impact these patches would have on reclaim but
maybe that is a lack of imagination. I'll accept that, potentially, these
patches affect reclaim in some-currently-undefined-fashion. If that is the
case, I would assert that any zone-based approach would also have an
impact because kswapd has a new zone to manage the watermarks for.
> I don't know, that's just an example but there are others. The two
> are fundamentally tied together.
>
>
--
Mel Gorman
Part-time PhD Student Java Applications Developer
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 253+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 11:57 ` Mel Gorman
@ 2005-11-01 13:56 ` Ingo Molnar
2005-11-01 14:10 ` Dave Hansen
` (2 more replies)
0 siblings, 3 replies; 253+ messages in thread
From: Ingo Molnar @ 2005-11-01 13:56 UTC (permalink / raw)
To: Mel Gorman
Cc: Nick Piggin, Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
linux-kernel, lhms-devel
* Mel Gorman <mel@csn.ul.ie> wrote:
> The set of patches does fix a lot and makes a strong start at addressing
> the fragmentation problem, just not 100% of the way. [...]
do you have an expectation to be able to solve the 'fragmentation
problem', all the time, in a 100% way, now or in the future?
> So, with this set of patches, how fragmented you get is dependent on
> the workload and it may still break down and high order allocations
> will fail. But the current situation is that it will definitely break
> down. The fact is that it has been reported that memory hotplug remove
> works with these patches and doesn't without them. Granted, this is
> just one feature on a high-end machine, but it is one solid operation
> we can perform with the patches and cannot without them. [...]
can you always, under any circumstance hot unplug RAM with these patches
applied? If not, do you have any expectation to reach 100%?
Ingo
^ permalink raw reply [flat|nested] 253+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 13:56 ` Ingo Molnar
@ 2005-11-01 14:10 ` Dave Hansen
2005-11-01 14:29 ` Ingo Molnar
2005-11-01 14:41 ` [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 Mel Gorman
2005-11-01 18:23 ` Rob Landley
2 siblings, 1 reply; 253+ messages in thread
From: Dave Hansen @ 2005-11-01 14:10 UTC (permalink / raw)
To: Ingo Molnar
Cc: Mel Gorman, Nick Piggin, Martin J. Bligh, Andrew Morton, kravetz,
linux-mm, Linux Kernel Mailing List, lhms
On Tue, 2005-11-01 at 14:56 +0100, Ingo Molnar wrote:
> * Mel Gorman <mel@csn.ul.ie> wrote:
>
> > The set of patches does fix a lot and makes a strong start at addressing
> > the fragmentation problem, just not 100% of the way. [...]
>
> do you have an expectation to be able to solve the 'fragmentation
> problem', all the time, in a 100% way, now or in the future?
In a word, yes.
The current allocator has no design for measuring or reducing
fragmentation. These patches provide the framework for at least
measuring fragmentation.
The patches can not do anything magical and there will be a point where
the system has to make a choice: fragment, or fail an allocation when
there _is_ free memory.
These patches take us in a direction where we are capable of making such
a decision.
> > So, with this set of patches, how fragmented you get is dependent on
> > the workload and it may still break down and high order allocations
> > will fail. But the current situation is that it will definitely break
> > down. The fact is that it has been reported that memory hotplug remove
> > works with these patches and doesn't without them. Granted, this is
> > just one feature on a high-end machine, but it is one solid operation
> > we can perform with the patches and cannot without them. [...]
>
> can you always, under any circumstance hot unplug RAM with these patches
> applied? If not, do you have any expectation to reach 100%?
With these patches, no. There are currently some very nice,
pathological workloads which will still cause fragmentation. But, in
the interest of incremental feature introduction, I think they're a fine
first step. We can effectively reach toward a more comprehensive
solution on top of these patches.
Reaching truly 100% will require some other changes such as being able
to virtually remap things like kernel text.
-- Dave
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 14:10 ` Dave Hansen
@ 2005-11-01 14:29 ` Ingo Molnar
2005-11-01 14:49 ` Dave Hansen
0 siblings, 1 reply; 253+ messages in thread
From: Ingo Molnar @ 2005-11-01 14:29 UTC (permalink / raw)
To: Dave Hansen
Cc: Mel Gorman, Nick Piggin, Martin J. Bligh, Andrew Morton, kravetz,
linux-mm, Linux Kernel Mailing List, lhms
* Dave Hansen <haveblue@us.ibm.com> wrote:
> > can you always, under any circumstance hot unplug RAM with these patches
> > applied? If not, do you have any expectation to reach 100%?
>
> With these patches, no. There are currently some very nice,
> pathological workloads which will still cause fragmentation. But, in
> the interest of incremental feature introduction, I think they're a
> fine first step. We can effectively reach toward a more comprehensive
> solution on top of these patches.
>
> Reaching truly 100% will require some other changes such as being able
> to virtually remap things like kernel text.
then we need to see that 100% solution first - at least in terms of
conceptual steps. Not being able to hot-unplug RAM in a 100% way won't
satisfy customers. Whatever solution we choose, it must work 100%. Just
to give a comparison: would you be content with your computer failing to
start up apps 1 time out of 100, saying that 99% is good enough? Or
would you call it what it is: buggy and unreliable?
to stress it: hot unplug is a _feature_ that must work 100%, _not_ some
optimization where 99% is good enough. This is a feature that people
will be depending on if we promise it, and 1% failure rate is not
acceptable. Your 'pathological workload' might be customer X's daily
workload. Unless there is a clear definition of what is possible and
what is not (which definition can be relied upon by users), having a 99%
solution is much worse than the current 0% solution!
worse than that, this is a known _hard_ problem to solve in a 100% way,
and saying 'this patch is a good first step' just lures us (and
customers) into believing that we are only 1% away from the desired 100%
solution, while nothing could be further from the truth. They will
demand the remaining 1%, but can we offer it? Unless you can provide a
clear, accepted-upon path towards the 100% solution, we have nothing
right now.
I have no problems with using higher-order pages for performance
purposes [*], as long as 'failed' allocation (and freeing) actions are
user-invisible. But the moment you make it user-visible, it _must_ work
in a deterministic way!
Ingo
[*] in which case any slowdown in the page allocator must be offset by
the gains.
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 14:29 ` Ingo Molnar
@ 2005-11-01 14:49 ` Dave Hansen
2005-11-01 15:01 ` Ingo Molnar
2005-11-02 0:51 ` Nick Piggin
0 siblings, 2 replies; 253+ messages in thread
From: Dave Hansen @ 2005-11-01 14:49 UTC (permalink / raw)
To: Ingo Molnar
Cc: Mel Gorman, Nick Piggin, Martin J. Bligh, Andrew Morton, kravetz,
linux-mm, Linux Kernel Mailing List, lhms
On Tue, 2005-11-01 at 15:29 +0100, Ingo Molnar wrote:
> * Dave Hansen <haveblue@us.ibm.com> wrote:
> > > can you always, under any circumstance hot unplug RAM with these patches
> > > applied? If not, do you have any expectation to reach 100%?
> >
> > With these patches, no. There are currently some very nice,
> > pathological workloads which will still cause fragmentation. But, in
> > the interest of incremental feature introduction, I think they're a
> > fine first step. We can effectively reach toward a more comprehensive
> > solution on top of these patches.
> >
> > Reaching truly 100% will require some other changes such as being able
> > to virtually remap things like kernel text.
>
> then we need to see that 100% solution first - at least in terms of
> conceptual steps.
I don't think saying "truly 100%" really even makes sense. There will
always be restrictions of some kind. For instance, with a 10MB kernel
image, should you be able to shrink the memory in the system below
10MB? ;)
There is also no precedent in existing UNIXes for a 100% solution. From
http://publib.boulder.ibm.com/infocenter/pseries/index.jsp?topic=/com.ibm.aix.doc/aixbman/prftungd/dlpar.htm , a seemingly arbitrary restriction:
A memory region that contains a large page cannot be removed.
What the fragmentation patches _can_ give us is the ability to have 100%
success in removing certain areas: the "user-reclaimable" areas
referenced in the patch. This gives a customer at least the ability to
plan for how dynamically reconfigurable a system should be.
After these patches, the next logical steps are to increase the
knowledge that the slabs have about fragmentation, and to teach some of
the shrinkers about fragmentation.
After that, we'll need some kind of virtual remapping, breaking the 1:1
kernel virtual mapping, so that the most problematic pages can be
remapped. These pages would retain their virtual addresses but get new
physical ones. However, this is quite far down the road and will require
some serious evaluation because it impacts how normal devices are able
to do DMA. The ppc64 proprietary hypervisor has features to work around
these issues, and any new hypervisors wishing to support partition
memory hotplug would likely have to follow suit.
-- Dave
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 14:49 ` Dave Hansen
@ 2005-11-01 15:01 ` Ingo Molnar
2005-11-01 15:22 ` Dave Hansen
2005-11-01 16:48 ` Kamezawa Hiroyuki
2005-11-02 0:51 ` Nick Piggin
1 sibling, 2 replies; 253+ messages in thread
From: Ingo Molnar @ 2005-11-01 15:01 UTC (permalink / raw)
To: Dave Hansen
Cc: Mel Gorman, Nick Piggin, Martin J. Bligh, Andrew Morton, kravetz,
linux-mm, Linux Kernel Mailing List, lhms
* Dave Hansen <haveblue@us.ibm.com> wrote:
> > then we need to see that 100% solution first - at least in terms of
> > conceptual steps.
>
> I don't think saying "truly 100%" really even makes sense. There will
> always be restrictions of some kind. For instance, with a 10MB kernel
> image, should you be able to shrink the memory in the system below
> 10MB? ;)
think of it in terms of filesystem shrinking: yes, obviously you cannot
shrink to below the allocated size, but no user expects to be able to do
it. But users would not accept filesystem shrinking failing for certain
file layouts. In that case we are better off with no ability to shrink:
it makes it clear that we have not solved the problem, yet.
so it's all about expectations: _could_ you reasonably remove a piece of
RAM? Customer will say: "I have stopped all nonessential services, and
free RAM is at 90%, still I cannot remove that piece of faulty RAM, fix
the kernel!". No reasonable customer will say: "True, I have all RAM
used up in mlock()ed sections, but i want to remove some RAM
nevertheless".
> There is also no precedent in existing UNIXes for a 100% solution.
does this have any relevance to the point, other than to prove that it's
a hard problem that we should not pretend to be able to solve, without
seeing a clear path towards a solution?
Ingo
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 15:01 ` Ingo Molnar
@ 2005-11-01 15:22 ` Dave Hansen
2005-11-02 8:49 ` Ingo Molnar
2005-11-01 16:48 ` Kamezawa Hiroyuki
1 sibling, 1 reply; 253+ messages in thread
From: Dave Hansen @ 2005-11-01 15:22 UTC (permalink / raw)
To: Ingo Molnar
Cc: Mel Gorman, Nick Piggin, Martin J. Bligh, Andrew Morton, kravetz,
linux-mm, Linux Kernel Mailing List, lhms
On Tue, 2005-11-01 at 16:01 +0100, Ingo Molnar wrote:
> so it's all about expectations: _could_ you reasonably remove a piece of
> RAM? Customer will say: "I have stopped all nonessential services, and
> free RAM is at 90%, still I cannot remove that piece of faulty RAM, fix
> the kernel!".
That's an excellent example. Until we have some kind of kernel
remapping, breaking the 1:1 kernel virtual mapping, these pages will
always exist. The easiest example of this kind of memory is kernel
text.
Another example might be a somewhat errant device driver which has
allocated some large buffers and is doing DMA to or from them. In this
case, we need to have APIs to require devices to give up and reacquire
any dynamically allocated structures. If the device driver does not
implement these APIs it is not compatible with memory hotplug.
> > There is also no precedent in existing UNIXes for a 100% solution.
>
> does this have any relevance to the point, other than to prove that it's
> a hard problem that we should not pretend to be able to solve, without
> seeing a clear path towards a solution?
Agreed. It is a hard problem. One that some other UNIXes have not
fully solved.
Here are the steps that I think we need to take. Do you see any holes
in their coverage? Anything that seems infeasible?
1. Fragmentation avoidance
* by itself, increases likelihood of having an area of memory
which might be easily removed
* very small (if any) performance overhead
* other potential in-kernel users
* creates infrastructure to enforce the "hotpluggability" of any
particular area of memory.
2. Driver APIs
* Require that drivers specifically request for areas which must
retain constant physical addresses
* Driver must relinquish control of such areas upon request
* Can be worked around by hypervisors
3. Break 1:1 Kernel Virtual/Physical Mapping
* In any large area of physical memory we wish to remove, there will
likely be very, very few straggler pages, which can not easily be
freed.
* Kernel will transparently move the contents of these physical pages
to new pages, keeping constant virtual addresses.
* Negative TLB overhead, as in-kernel large page mappings are broken
down into smaller pages.
* __{p,v}a() become more expensive, likely a table lookup
I've already done (3) on a limited basis, in the early days of memory
hotplug. Not the remapping, just breaking the 1:1 assumptions. It
wasn't too horribly painful.
We'll also need to make some decisions along the way about what to do
about things like large pages. Is it better to just punt like AIX and
refuse to remove their areas, or break them down into small pages and
degrade performance?
-- Dave
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 15:22 ` Dave Hansen
@ 2005-11-02 8:49 ` Ingo Molnar
2005-11-02 9:02 ` Nick Piggin
2005-11-02 14:57 ` Martin J. Bligh
0 siblings, 2 replies; 253+ messages in thread
From: Ingo Molnar @ 2005-11-02 8:49 UTC (permalink / raw)
To: Dave Hansen
Cc: Mel Gorman, Nick Piggin, Martin J. Bligh, Andrew Morton,
Linus Torvalds, kravetz, linux-mm, Linux Kernel Mailing List,
lhms, Arjan van de Ven
* Dave Hansen <haveblue@us.ibm.com> wrote:
> On Tue, 2005-11-01 at 16:01 +0100, Ingo Molnar wrote:
> > so it's all about expectations: _could_ you reasonably remove a piece of
> > RAM? Customer will say: "I have stopped all nonessential services, and
> > free RAM is at 90%, still I cannot remove that piece of faulty RAM, fix
> > the kernel!".
>
> That's an excellent example. Until we have some kind of kernel
> remapping, breaking the 1:1 kernel virtual mapping, these pages will
> always exist. The easiest example of this kind of memory is kernel
> text.
another example is open files, dentries, inodes, kernel stacks and
various other kernel objects, which can become embedded in a generic
kernel memory zone anywhere, and can become referenced to from other
objects.
The C language we use for the kernel has no notion to automatically
track these links between objects, which makes general purpose memory
unmapping very hard: each and every pointer would have to be tracked
explicitly.
Such an 'explicit pointer tracking' approach is not only error-prone
[the C language offers us no way to _avoid_ direct dereferencing], it's
also clearly a maintenance nightmare. Code like:
obj->ptr = obj2;
would have to become something like:
obj_set(obj_deref(obj, ptr), obj2);
this is only a theoretical thing, it is very clear that such an approach
is unreadable, unmaintainable and unworkable.
fixing 1:1 mapping assumptions is a cakewalk in comparison ...
the only sane solution to make generic kernel RAM hot-removable, from a
conceptual angle, is to use a language for the kernel that supports
pointer-rewriting, garbage-collection and hence VM-shrinking. I.e. to
rewrite the kernel in Java, C# or whatever other type-safe language that
can track pointers. [But possibly not even current Java implementations
can do this right now, because they currently use faulting methods for
GC and do not track every pointer, which method is not suitable for
hot-remove.]
[ C++ might work too, but that needs extensive other changes and a
kernel-pointer type that all other pointer types have to inherit
from. No quick & easy void * pointers allowed. Such a restriction is
possibly unenforceable and thus the solution is unmaintainable. ]
just to state the obvious: while using another programming language for
the Linux kernel might make sense in the future, the likelihood for that
to happen anytime soon seems quite low =B-)
so i strongly believe that it's plain impossible to do memory hot-unplug
of generic kernel RAM in a reliable and guaranteed way.
there are other 'hot-' features that might be doable, though:
memory hot-add and memory hot-replace:
- hot-add is relatively easy (still nontrivial) and with discontigmem we
have it supported in essence.
- hot-replace becomes possible with the breaking of 1:1 kernel mapping,
because the totality of kernel RAM does not shrink, so it has no
impact on the virtual side of kernel memory, it's "just" a replacement
act on the physical side. It's still not trivial though: if the new
memory area has a different physical offset (which is likely under
most hw designs), all physical pointers need tracking and fixups.
I.e. DMA has to be tracked (iommu-alike approach) or silenced, and
pagetables may need fixups. Also, if the swapped module involves the
kernel image itself then "interesting" per-arch things have to be
done. But in any case, this is a much more limited change than what
hot-remove of generic kernel RAM necessitates. Hot-replace is what
fault tolerant systems would need.
reliable hot-remove of generic kernel RAM is plain impossible even in a
fully virtualized solution. It's impossible even with maximum hardware
help. We simply don't have the means to fix up live kernel pointers still
linked into the removed region, under the C programming model.
the hurdles towards a reliable solution are so incredibly high, that
other solutions _have_ to be considered: restrict the type of RAM that
can be removed, and put it into a separate zone. That solves things
easily: no kernel pointers will be allowed in those zones. It becomes
similar to highmem: various kernel caches can opt-in to be included in
that type of RAM, and the complexity (and maintenance impact) of the
approach can thus be nicely scaled.
> > > There is also no precedent in existing UNIXes for a 100% solution.
> >
> > does this have any relevance to the point, other than to prove that it's
> > a hard problem that we should not pretend to be able to solve, without
> > seeing a clear path towards a solution?
>
> Agreed. It is a hard problem. One that some other UNIXes have not
> fully solved.
>
> Here are the steps that I think we need to take. Do you see any holes
> in their coverage? Anything that seems infeasible?
>
> 1. Fragmentation avoidance
> * by itself, increases likelihood of having an area of memory
> which might be easily removed
> * very small (if any) performance overhead
> * other potential in-kernel users
> * creates infrastructure to enforce the "hotpluggability" of any
> particular area of memory.
> 2. Driver APIs
> * Require that drivers specifically request for areas which must
> retain constant physical addresses
> * Driver must relinquish control of such areas upon request
> * Can be worked around by hypervisors
> 3. Break 1:1 Kernel Virtual/Physical Mapping
> * In any large area of physical memory we wish to remove, there will
> likely be very, very few straggler pages, which can not easily be
> freed.
> * Kernel will transparently move the contents of these physical pages
> to new pages, keeping constant virtual addresses.
> * Negative TLB overhead, as in-kernel large page mappings are broken
> down into smaller pages.
> * __{p,v}a() become more expensive, likely a table lookup
>
> I've already done (3) on a limited basis, in the early days of memory
> hotplug. Not the remapping, just breaking the 1:1 assumptions. It
> wasn't too horribly painful.
i don't see the most fundamental problem listed: live kernel pointers
pointing into a generic kernel RAM zone. Removing the 1:1 mapping and
making the kernel VM space fully virtual will not solve that problem!
lets face it: removal of generic kernel RAM is a hard, and essentially
unsolvable problem under the current Linux kernel model. It's not just
the VM itself and 1:1 mappings (which is a nontrivial problem but which
we can and probably should solve), it boils down to the fundamental
choice of using C as the language of the kernel!
really, once you accept that, the path out of this mess becomes 'easy':
we _have to_ compromise on the feature side! And the moment we give up
the notion of 'generic kernel RAM' and focus on the hot-removability of
a limited-functionality zone, the complexity of the solution becomes
three orders of magnitude smaller. No fragmentation avoidance necessary.
No 'have to handle dozens of very hard problems to become 99%
functional' issues. Once you make that zone an opt-in thing, it becomes
much better from a development dynamics point of view as well.
i believe that it's also easier from an emotional point of view: our
choice to use the C language forces us to abandon the idea of
hot-removable generic kernel RAM. This is not some borderline decision
where different people have different judgement - this is a hard,
almost-mathematical fact that is forced upon us by the laws of physics
(and/or whatever deity you might believe in). The same laws that make
faster-than-O(N log N) comparison sorting impossible. No amount of
hacking will get us past that wall.
Ingo
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 8:49 ` Ingo Molnar
@ 2005-11-02 9:02 ` Nick Piggin
2005-11-02 9:17 ` Ingo Molnar
2005-11-02 9:32 ` Dave Hansen
2005-11-02 14:57 ` Martin J. Bligh
1 sibling, 2 replies; 253+ messages in thread
From: Nick Piggin @ 2005-11-02 9:02 UTC (permalink / raw)
To: Ingo Molnar
Cc: Dave Hansen, Mel Gorman, Martin J. Bligh, Andrew Morton,
Linus Torvalds, kravetz, linux-mm, Linux Kernel Mailing List,
lhms, Arjan van de Ven
Ingo Molnar wrote:
> really, once you accept that, the path out of this mess becomes 'easy':
> we _have to_ compromise on the feature side! And the moment we give up
> the notion of 'generic kernel RAM' and focus on the hot-removability of
> a limited-functionality zone, the complexity of the solution becomes
> three orders of magnitude smaller. No fragmentation avoidance necessary.
> No 'have to handle dozens of very hard problems to become 99%
> functional' issues. Once you make that zone an opt-in thing, it becomes
> much better from a development dynamics point of view as well.
>
I agree. Especially considering that all this memory hotplug usage for
hypervisors etc. is a relatively new thing with few of our userbase
actually using it. I think a simple zones solution is the right way to
go for now.
In future, if we have a large proportion of users who want it, and
their requirements are better understood, and there is still no
hardware / hypervisor support for handling this for us, *then* it is
time to re-evaluate our compromise.
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 9:02 ` Nick Piggin
@ 2005-11-02 9:17 ` Ingo Molnar
2005-11-02 9:32 ` Dave Hansen
1 sibling, 0 replies; 253+ messages in thread
From: Ingo Molnar @ 2005-11-02 9:17 UTC (permalink / raw)
To: Nick Piggin
Cc: Dave Hansen, Mel Gorman, Martin J. Bligh, Andrew Morton,
Linus Torvalds, kravetz, linux-mm, Linux Kernel Mailing List,
lhms, Arjan van de Ven
* Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> Ingo Molnar wrote:
>
> >really, once you accept that, the path out of this mess becomes 'easy':
> >we _have to_ compromise on the feature side! And the moment we give up
> >the notion of 'generic kernel RAM' and focus on the hot-removability of
> >a limited-functionality zone, the complexity of the solution becomes
> >three orders of magnitude smaller. No fragmentation avoidance necessary.
> >No 'have to handle dozens of very hard problems to become 99%
> >functional' issues. Once you make that zone an opt-in thing, it becomes
> >much better from a development dynamics point of view as well.
> >
>
> I agree. Especially considering that all this memory hotplug usage for
> hypervisors etc. is a relatively new thing with few of our userbase
> actually using it. I think a simple zones solution is the right way to
> go for now.
btw., virtualization is pretty much a red herring here. Xen already has
a 'ballooning driver', where a guest OS can give back unused RAM on a
page-granular basis. This is an advantage of having a fully virtualized
guest OS. That covers 99% of the 'remove RAM' needs. So i believe the
real target audience of hot-unplug is mostly limited to hardware-level
RAM unplug.
[ Xen also offers other features like migration of live images to
another piece of hardware, which further dampens the cost of
virtualization (mapping overhead, etc.). With hot-remove you don't get
such compound benefits of a conceptually more robust and thus more
pervasive approach. ]
Ingo
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 9:02 ` Nick Piggin
2005-11-02 9:17 ` Ingo Molnar
@ 2005-11-02 9:32 ` Dave Hansen
2005-11-02 9:48 ` Nick Piggin
1 sibling, 1 reply; 253+ messages in thread
From: Dave Hansen @ 2005-11-02 9:32 UTC (permalink / raw)
To: Nick Piggin
Cc: Ingo Molnar, Mel Gorman, Martin J. Bligh, Andrew Morton,
Linus Torvalds, kravetz, linux-mm, Linux Kernel Mailing List,
lhms, Arjan van de Ven
On Wed, 2005-11-02 at 20:02 +1100, Nick Piggin wrote:
> I agree. Especially considering that all this memory hotplug usage for
> hypervisors etc. is a relatively new thing with few of our userbase
> actually using it. I think a simple zones solution is the right way to
> go for now.
I agree enough on concept that I think we can go implement at least a
demonstration of how easy it is to perform.
There are a couple of implementation details that will require some
changes to the current zone model, however. Perhaps you have some
suggestions on those.
In which zone do we place hot-added RAM? I don't think the answer can
simply be the HOTPLUGGABLE zone. If you start with a sufficiently small
machine, you'll degrade into the same horrible HIGHMEM behavior
that a 64GB ia32 machine has today, despite your architecture. Think of
a machine that starts out with a size of 256MB and grows to 1TB.
So, if you have to add to NORMAL/DMA on the fly, how do you handle a
case where the new NORMAL/DMA ram is physically above
HIGHMEM/HOTPLUGGABLE? Is there any other course than to make a zone
required to be able to span other zones, and be noncontiguous? Would
that represent too much of a change to the current model?
From where do we perform reclaim when we run out of a particular zone?
Getting reclaim rates of the HIGHMEM and NORMAL zones balanced has been
hard, and I worry that we never quite got it right. Introducing yet
another zone makes this harder.
Should we allow allocations for NORMAL to fall back into HOTPLUGGABLE in
any case?
-- Dave
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 9:32 ` Dave Hansen
@ 2005-11-02 9:48 ` Nick Piggin
2005-11-02 10:54 ` Dave Hansen
2005-11-02 15:02 ` Martin J. Bligh
0 siblings, 2 replies; 253+ messages in thread
From: Nick Piggin @ 2005-11-02 9:48 UTC (permalink / raw)
To: Dave Hansen
Cc: Ingo Molnar, Mel Gorman, Martin J. Bligh, Andrew Morton,
Linus Torvalds, kravetz, linux-mm, Linux Kernel Mailing List,
lhms, Arjan van de Ven
Dave Hansen wrote:
> On Wed, 2005-11-02 at 20:02 +1100, Nick Piggin wrote:
>
>>I agree. Especially considering that all this memory hotplug usage for
>>hypervisors etc. is a relatively new thing with few of our userbase
>>actually using it. I think a simple zones solution is the right way to
>>go for now.
>
>
> I agree enough on concept that I think we can go implement at least a
> demonstration of how easy it is to perform.
>
> There are a couple of implementation details that will require some
> changes to the current zone model, however. Perhaps you have some
> suggestions on those.
>
> In which zone do we place hot-added RAM? I don't think answer can
> simply be the HOTPLUGGABLE zone. If you start with sufficiently small
> of a machine, you'll degrade into the same horrible HIGHMEM behavior
> that a 64GB ia32 machine has today, despite your architecture. Think of
> a machine that starts out with a size of 256MB and grows to 1TB.
>
What can we do reasonably sanely? I think we can drive about 16GB of
highmem per 1GB of normal fairly well. So on your 1TB system, you
should be able to unplug 960GB RAM.
Lower the ratio to taste if you happen to be doing something
particularly zone normal intensive - remember in that case the frag
patches won't buy you anything more because a zone normal intensive
workload is going to cause unreclaimable regions by definition.
> So, if you have to add to NORMAL/DMA on the fly, how do you handle a
> case where the new NORMAL/DMA ram is physically above
> HIGHMEM/HOTPLUGGABLE? Is there any other course than to make a zone
> required to be able to span other zones, and be noncontiguous? Would
> that represent too much of a change to the current model?
>
Perhaps. Perhaps it wouldn't be required to get a solution that is
"good enough" though.
But if you can reclaim your ZONE_RECLAIMABLE, then you could reclaim
it all and expand your normal zones into it, bottom up.
> From where do we perform reclaim when we run out of a particular zone?
> Getting reclaim rates of the HIGHMEM and NORMAL zones balanced has been
> hard, and I worry that we never got it quite. Introducing yet another
> zone makes this harder.
>
We didn't get it right, but there are fairly simple things we can do
(http://marc.theaimsgroup.com/?l=linux-kernel&m=113082256231168&w=2)
to improve things remarkably, and having yet more users should result
in even more improvements.
We still have ZONE_DMA and ZONE_DMA32, so we can't afford to just
abandon zones because they're crap ;)
> Should we allow allocations for NORMAL to fall back into HOTPLUGGABLE in
> any case?
>
I think this would defeat the purpose if we really want to set limits,
but we could have a sysctl perhaps to turn it on or off, or say, only
allow it if the alternative is going OOM.
--
SUSE Labs, Novell Inc.
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 9:48 ` Nick Piggin
@ 2005-11-02 10:54 ` Dave Hansen
2005-11-02 15:02 ` Martin J. Bligh
1 sibling, 0 replies; 253+ messages in thread
From: Dave Hansen @ 2005-11-02 10:54 UTC (permalink / raw)
To: Nick Piggin
Cc: Ingo Molnar, Mel Gorman, Martin J. Bligh, Andrew Morton,
Linus Torvalds, kravetz, linux-mm, Linux Kernel Mailing List,
lhms, Arjan van de Ven
On Wed, 2005-11-02 at 20:48 +1100, Nick Piggin wrote:
> > So, if you have to add to NORMAL/DMA on the fly, how do you handle a
> > case where the new NORMAL/DMA ram is physically above
> > HIGHMEM/HOTPLUGGABLE? Is there any other course than to make a zone
> > required to be able to span other zones, and be noncontiguous? Would
> > that represent too much of a change to the current model?
> >
>
> Perhaps. Perhaps it wouldn't be required to get a solution that is
> "good enough" though.
>
> But if you can reclaim your ZONE_RECLAIMABLE, then you could reclaim
> it all and expand your normal zones into it, bottom up.
That's a good point. It would be slow, because you have to wait on page
reclaim, but it would work. I do worry a bit that this might make
adding memory too slow an operation to be useful for short periods,
but we'll see how it actually behaves.
-- Dave
^ permalink raw reply [flat|nested] 253+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 9:48 ` Nick Piggin
2005-11-02 10:54 ` Dave Hansen
@ 2005-11-02 15:02 ` Martin J. Bligh
2005-11-03 3:21 ` Nick Piggin
1 sibling, 1 reply; 253+ messages in thread
From: Martin J. Bligh @ 2005-11-02 15:02 UTC (permalink / raw)
To: Nick Piggin, Dave Hansen
Cc: Ingo Molnar, Mel Gorman, Andrew Morton, Linus Torvalds, kravetz,
linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven
>> I agree enough on concept that I think we can go implement at least a
>> demonstration of how easy it is to perform.
>>
>> There are a couple of implementation details that will require some
>> changes to the current zone model, however. Perhaps you have some
>> suggestions on those.
>>
>> In which zone do we place hot-added RAM? I don't think the answer can
>> simply be the HOTPLUGGABLE zone. If you start with a sufficiently small
>> machine, you'll degrade into the same horrible HIGHMEM behavior
>> that a 64GB ia32 machine has today, despite your architecture. Think of
>> a machine that starts out with a size of 256MB and grows to 1TB.
>>
>
> What can we do reasonably sanely? I think we can drive about 16GB of
> highmem per 1GB of normal fairly well. So on your 1TB system, you
> should be able to unplug 960GB RAM.
I think you need to talk to some more users trying to run 16GB ia32
systems. Feel the pain.
> Lower the ratio to taste if you happen to be doing something
> particularly zone normal intensive - remember in that case the frag
> patches won't buy you anything more because a zone normal intensive
> workload is going to cause unreclaimable regions by definition.
>
>> So, if you have to add to NORMAL/DMA on the fly, how do you handle a
>> case where the new NORMAL/DMA ram is physically above
>> HIGHMEM/HOTPLUGGABLE? Is there any other course than to make a zone
>> required to be able to span other zones, and be noncontiguous? Would
>> that represent too much of a change to the current model?
>>
>
> Perhaps. Perhaps it wouldn't be required to get a solution that is
> "good enough" though.
>
> But if you can reclaim your ZONE_RECLAIMABLE, then you could reclaim
> it all and expand your normal zones into it, bottom up.
Can we quit coming up with specialist hacks for hotplug, and try to solve
the generic problem please? hotplug is NOT the only issue here. Fragmentation
in general is.
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 15:02 ` Martin J. Bligh
@ 2005-11-03 3:21 ` Nick Piggin
2005-11-03 15:36 ` Martin J. Bligh
0 siblings, 1 reply; 253+ messages in thread
From: Nick Piggin @ 2005-11-03 3:21 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Dave Hansen, Ingo Molnar, Mel Gorman, Andrew Morton,
Linus Torvalds, kravetz, linux-mm, Linux Kernel Mailing List,
lhms, Arjan van de Ven
Martin J. Bligh wrote:
>>What can we do reasonably sanely? I think we can drive about 16GB of
>>highmem per 1GB of normal fairly well. So on your 1TB system, you
>>should be able to unplug 960GB RAM.
>
>
> I think you need to talk to some more users trying to run 16GB ia32
> systems. Feel the pain.
>
OK, make it 8GB then.
And as a bonus we get all you IBM guys back on the case again
to finish the job that was started on highmem :)
And as another bonus, you actually *have* the capability to unplug
memory or use hugepages exactly the size you require, which is not the
case with the frag patches.
>>But if you can reclaim your ZONE_RECLAIMABLE, then you could reclaim
>>it all and expand your normal zones into it, bottom up.
>
>
> Can we quit coming up with specialist hacks for hotplug, and try to solve
> the generic problem please? hotplug is NOT the only issue here. Fragmentation
> in general is.
>
Not really it isn't. There have been a few cases (e1000 being the main
one, and is fixed upstream) where fragmentation in general is a problem.
But mostly it is not.
Anyone who thinks they can start using higher order allocations
willy-nilly after Mel's patch is, I'm fairly sure, wrong, because they are
just going to be using up the contiguous regions.
Trust me, if the frag patches were a general solution that solved the
generic fragmentation problem I would be a lot less concerned about the
complexity they introduce. But even then it only seems to be a problem
that a very small number of users care about.
Anyway I keep saying the same things (sorry) so I'll stop now.
--
SUSE Labs, Novell Inc.
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 3:21 ` Nick Piggin
@ 2005-11-03 15:36 ` Martin J. Bligh
2005-11-03 15:40 ` Arjan van de Ven
0 siblings, 1 reply; 253+ messages in thread
From: Martin J. Bligh @ 2005-11-03 15:36 UTC (permalink / raw)
To: Nick Piggin
Cc: Dave Hansen, Ingo Molnar, Mel Gorman, Andrew Morton,
Linus Torvalds, kravetz, linux-mm, Linux Kernel Mailing List,
lhms, Arjan van de Ven
>> Can we quit coming up with specialist hacks for hotplug, and try to solve
>> the generic problem please? hotplug is NOT the only issue here. Fragmentation
>> in general is.
>>
>
> Not really it isn't. There have been a few cases (e1000 being the main
> one, and is fixed upstream) where fragmentation in general is a problem.
> But mostly it is not.
Sigh. OK, tell me how you're going to fix kernel stacks > 4K please. And
devices that don't support scatter-gather DMA.
M.
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 15:36 ` Martin J. Bligh
@ 2005-11-03 15:40 ` Arjan van de Ven
2005-11-03 15:51 ` Linus Torvalds
2005-11-03 15:53 ` Martin J. Bligh
0 siblings, 2 replies; 253+ messages in thread
From: Arjan van de Ven @ 2005-11-03 15:40 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Nick Piggin, Dave Hansen, Ingo Molnar, Mel Gorman, Andrew Morton,
Linus Torvalds, kravetz, linux-mm, Linux Kernel Mailing List,
lhms, Arjan van de Ven
On Thu, 2005-11-03 at 07:36 -0800, Martin J. Bligh wrote:
> >> Can we quit coming up with specialist hacks for hotplug, and try to solve
> >> the generic problem please? hotplug is NOT the only issue here. Fragmentation
> >> in general is.
> >>
> >
> > Not really it isn't. There have been a few cases (e1000 being the main
> > one, and is fixed upstream) where fragmentation in general is a problem.
> > But mostly it is not.
>
> Sigh. OK, tell me how you're going to fix kernel stacks > 4K please.
with CONFIG_4KSTACKS :)
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 15:40 ` Arjan van de Ven
@ 2005-11-03 15:51 ` Linus Torvalds
2005-11-03 15:57 ` Martin J. Bligh
` (2 more replies)
2005-11-03 15:53 ` Martin J. Bligh
1 sibling, 3 replies; 253+ messages in thread
From: Linus Torvalds @ 2005-11-03 15:51 UTC (permalink / raw)
To: Arjan van de Ven
Cc: Martin J. Bligh, Nick Piggin, Dave Hansen, Ingo Molnar,
Mel Gorman, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms, Arjan van de Ven
On Thu, 3 Nov 2005, Arjan van de Ven wrote:
> On Thu, 2005-11-03 at 07:36 -0800, Martin J. Bligh wrote:
> > >> Can we quit coming up with specialist hacks for hotplug, and try to solve
> > >> the generic problem please? hotplug is NOT the only issue here. Fragmentation
> > >> in general is.
> > >>
> > >
> > > Not really it isn't. There have been a few cases (e1000 being the main
> > > one, and is fixed upstream) where fragmentation in general is a problem.
> > > But mostly it is not.
> >
> > Sigh. OK, tell me how you're going to fix kernel stacks > 4K please.
>
> with CONFIG_4KSTACKS :)
2-page allocations are _not_ a problem.
Especially not for fork()/clone(). If you don't even have 2-page
contiguous areas, you are doing something _wrong_, or you're so low on
memory that there's no point in forking any more.
Don't confuse "fragmentation" with "perfectly spread out page
allocations".
Fragmentation means that it gets _exponentially_ more unlikely that you
can allocate big contiguous areas. But contiguous areas of order 1 are
very very likely indeed. It's only the _big_ areas that aren't going to
happen.
This is why fragmentation avoidance has always been totally useless. It is
- only useful for big areas
- very hard for big areas
(Corollary: when it's easy and possible, it's not useful).
Don't do it. We've never done it, and we've been fine. Claiming that
fork() is a reason to do fragmentation avoidance is invalid.
Linus
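[Editorial sketch] The exponential claim above can be made concrete with a deliberately crude model (mine, not anything from the thread): if each page were independently free with probability f, an aligned order-k block is entirely free with probability f^(2^k), so the expected number of free order-k blocks collapses as k grows:

```python
# Crude independence model, for illustration only: real allocators are much
# less random than this, so treat these numbers as a pessimistic sketch.
def expected_free_blocks(total_pages, frac_free, order):
    blocks = total_pages >> order              # aligned order-k blocks
    p_block_free = frac_free ** (1 << order)   # all 2**order pages free at once
    return blocks * p_block_free

total = 1 << 25   # 128GB worth of 4K pages
for order in (0, 1, 2, 4, 10):
    print(order, expected_free_blocks(total, 0.01, order))
```

Even with only 1% of memory free, order-1 blocks remain plentiful (thousands expected) while order-2 is already marginal and order-10 is effectively impossible: the "easy for small areas, hopeless for big areas" shape of the argument.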
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 15:51 ` Linus Torvalds
@ 2005-11-03 15:57 ` Martin J. Bligh
2005-11-03 16:20 ` Arjan van de Ven
2005-11-03 16:27 ` Mel Gorman
2 siblings, 0 replies; 253+ messages in thread
From: Martin J. Bligh @ 2005-11-03 15:57 UTC (permalink / raw)
To: Linus Torvalds, Arjan van de Ven
Cc: Nick Piggin, Dave Hansen, Ingo Molnar, Mel Gorman, Andrew Morton,
kravetz, linux-mm, Linux Kernel Mailing List, lhms,
Arjan van de Ven
>> with CONFIG_4KSTACKS :)
>
> 2-page allocations are _not_ a problem.
>
> Especially not for fork()/clone(). If you don't even have 2-page
> contiguous areas, you are doing something _wrong_, or you're so low on
> memory that there's no point in forking any more.
64 bit platforms need kernel stacks > 8K, it seems.
> Don't confuse "fragmentation" with "perfectly spread out page
> allocations".
>
> Fragmentation means that it gets _exponentially_ more unlikely that you
> can allocate big contiguous areas. But contiguous areas of order 1 are
> very very likely indeed. It's only the _big_ areas that aren't going to
> happen.
>
> This is why fragmentation avoidance has always been totally useless. It is
> - only useful for big areas
> - very hard for big areas
>
> (Corollary: when it's easy and possible, it's not useful).
>
> Don't do it. We've never done it, and we've been fine. Claiming that
> fork() is a reason to do fragmentation avoidance is invalid.
With respect, we have not been fine. We see problems fairly regularly
with higher order allocations that have nothing to do with large pages
or hotplug: drivers, CIFS, kernel stacks, etc.
The larger memory gets, the worse the problem is, just because the
statistics make it less likely to free up multiple contiguous pages.
M.
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 15:51 ` Linus Torvalds
2005-11-03 15:57 ` Martin J. Bligh
@ 2005-11-03 16:20 ` Arjan van de Ven
2005-11-03 16:27 ` Mel Gorman
2 siblings, 0 replies; 253+ messages in thread
From: Arjan van de Ven @ 2005-11-03 16:20 UTC (permalink / raw)
To: Linus Torvalds
Cc: Martin J. Bligh, Nick Piggin, Dave Hansen, Ingo Molnar,
Mel Gorman, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms, Arjan van de Ven
On Thu, 2005-11-03 at 07:51 -0800, Linus Torvalds wrote:
>
> On Thu, 3 Nov 2005, Arjan van de Ven wrote:
>
> > On Thu, 2005-11-03 at 07:36 -0800, Martin J. Bligh wrote:
> > > >> Can we quit coming up with specialist hacks for hotplug, and try to solve
> > > >> the generic problem please? hotplug is NOT the only issue here. Fragmentation
> > > >> in general is.
> > > >>
> > > >
> > > > Not really it isn't. There have been a few cases (e1000 being the main
> > > > one, and is fixed upstream) where fragmentation in general is a problem.
> > > > But mostly it is not.
> > >
> > > Sigh. OK, tell me how you're going to fix kernel stacks > 4K please.
> >
> > with CONFIG_4KSTACKS :)
>
> 2-page allocations are _not_ a problem.
agreed for the general case. There are some corner cases that you can
trigger deliberately in an artificial setting with lots of Java threads
(esp. on x86 on a 32GB box; the lowmem zone works as a lever here, leading
to "hyperfragmentation"; otoh on x86 you can do 4k stacks and it's mostly
gone)
> Fragmentation means that it gets _exponentially_ more unlikely that you
> can allocate big contiguous areas. But contiguous areas of order 1 are
> very very likely indeed. It's only the _big_ areas that aren't going to
> happen.
yup. only possible exception is the leveraged scenario .. thank god for
64 bit x86-64.
(and in the leveraged scenario I don't think active defragmentation will
buy you much over the long term at all)
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 15:51 ` Linus Torvalds
2005-11-03 15:57 ` Martin J. Bligh
2005-11-03 16:20 ` Arjan van de Ven
@ 2005-11-03 16:27 ` Mel Gorman
2005-11-03 16:46 ` Linus Torvalds
2 siblings, 1 reply; 253+ messages in thread
From: Mel Gorman @ 2005-11-03 16:27 UTC (permalink / raw)
To: Linus Torvalds
Cc: Arjan van de Ven, Martin J. Bligh, Nick Piggin, Dave Hansen,
Ingo Molnar, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms, Arjan van de Ven
On Thu, 3 Nov 2005, Linus Torvalds wrote:
>
>
> On Thu, 3 Nov 2005, Arjan van de Ven wrote:
>
> > On Thu, 2005-11-03 at 07:36 -0800, Martin J. Bligh wrote:
> > > >> Can we quit coming up with specialist hacks for hotplug, and try to solve
> > > >> the generic problem please? hotplug is NOT the only issue here. Fragmentation
> > > >> in general is.
> > > >>
> > > >
> > > > Not really it isn't. There have been a few cases (e1000 being the main
> > > > one, and is fixed upstream) where fragmentation in general is a problem.
> > > > But mostly it is not.
> > >
> > > Sigh. OK, tell me how you're going to fix kernel stacks > 4K please.
> >
> > with CONFIG_4KSTACKS :)
>
> 2-page allocations are _not_ a problem.
>
> Especially not for fork()/clone(). If you don't even have 2-page
> contiguous areas, you are doing something _wrong_, or you're so low on
> memory that there's no point in forking any more.
>
> Don't confuse "fragmentation" with "perfectly spread out page
> allocations".
>
> Fragmentation means that it gets _exponentially_ more unlikely that you
> can allocate big contiguous areas. But contiguous areas of order 1 are
> very very likely indeed. It's only the _big_ areas that aren't going to
> happen.
>
For me, it's the big areas that I am interested in, especially if we want
to give HugeTLB pages to a user when they are asking for them. The obvious
ones here are database and HPC loads, particularly the HPC loads which may
not have had a chance to reserve what they needed at boot time. These
loads need 1024 contiguous pages on x86 at least, not 2. We can free
all we want on today's kernels and you're not going to get more than one or
two blocks this large unless you are very lucky.
Hotplug is, for me, an additional benefit. For others, it is the main
benefit. Others, of course, don't care, but they don't care about
scalability to 64 processors either, and we provide it anyway at a low cost
to smaller machines.
> This is why fragmentation avoidance has always been totally useless. It is
> - only useful for big areas
> - very hard for big areas
>
> (Corollary: when it's easy and possible, it's not useful).
>
Unless you are a user that wants a large area when it suddenly is useful.
> Don't do it. We've never done it, and we've been fine. Claiming that
> fork() is a reason to do fragmentation avoidance is invalid.
>
We've never done it, but then we've also only ever supported HugeTLB pages
reserved at boot time and nothing else.
I'm going to set up a kbuild environment, hopefully this evening, and see
whether these patches adversely impact a load that kernel developers care
about. If I am impacting it, oops, I'm in some trouble. If I'm not, then
why not try and help out the people who care about the big areas.
--
Mel Gorman
Part-time PhD Student, University of Limerick
Java Applications Developer, IBM Dublin Software Lab
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 16:27 ` Mel Gorman
@ 2005-11-03 16:46 ` Linus Torvalds
2005-11-03 16:52 ` Martin J. Bligh
0 siblings, 1 reply; 253+ messages in thread
From: Linus Torvalds @ 2005-11-03 16:46 UTC (permalink / raw)
To: Mel Gorman
Cc: Arjan van de Ven, Martin J. Bligh, Nick Piggin, Dave Hansen,
Ingo Molnar, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms, Arjan van de Ven
On Thu, 3 Nov 2005, Mel Gorman wrote:
> On Thu, 3 Nov 2005, Linus Torvalds wrote:
>
> > This is why fragmentation avoidance has always been totally useless. It is
> > - only useful for big areas
> > - very hard for big areas
> >
> > (Corollary: when it's easy and possible, it's not useful).
> >
>
> Unless you are a user that wants a large area when it suddenly is useful.
No. It's _not_ suddenly useful.
It might be something you _want_, but that's a totally different issue.
My point is that regardless of what you _want_, defragmentation is
_useless_. It's useless simply because for big areas it is so expensive as
to be impractical.
Put another way: you may _want_ the moon to be made of cheese, but a moon
made out of cheese is _useless_ because it is impractical.
The only way to support big areas is to have special zones for them.
(Then, we may be able to use the special zones for small things too, but
under special rules, like "only used for anonymous mappings" where we
can just always remove them by paging them out. But it would still be a
special area meant for big pages, just temporarily "on loan").
Linus
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 16:46 ` Linus Torvalds
@ 2005-11-03 16:52 ` Martin J. Bligh
2005-11-03 17:19 ` Linus Torvalds
0 siblings, 1 reply; 253+ messages in thread
From: Martin J. Bligh @ 2005-11-03 16:52 UTC (permalink / raw)
To: Linus Torvalds, Mel Gorman
Cc: Arjan van de Ven, Nick Piggin, Dave Hansen, Ingo Molnar,
Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List,
lhms, Arjan van de Ven
> The only way to support big areas is to have special zones for them.
>
> (Then, we may be able to use the special zones for small things too, but
> under special rules, like "only used for anonymous mappings" where we
> can just always remove them by paging them out. But it would still be a
> special area meant for big pages, just temporarily "on loan").
The problem is how these zones get resized. Can we hotplug memory between
them, with some sparsemem like indirection layer?
Real customers have shown us that their workloads shift, and they have
different needs at different parts of the day. We can't just pick one
size and call it good. It's the same argument as the traditional VM
balancing act between pagecache, user pages, and kernel pages (which
incidentally, we don't use zones for). We want the system to be able
to use memory wherever it's most needed.
M.
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 16:52 ` Martin J. Bligh
@ 2005-11-03 17:19 ` Linus Torvalds
2005-11-03 17:48 ` Dave Hansen
2005-11-03 17:51 ` Martin J. Bligh
0 siblings, 2 replies; 253+ messages in thread
From: Linus Torvalds @ 2005-11-03 17:19 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Mel Gorman, Arjan van de Ven, Nick Piggin, Dave Hansen,
Ingo Molnar, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms, Arjan van de Ven
On Thu, 3 Nov 2005, Martin J. Bligh wrote:
>
> The problem is how these zones get resized. Can we hotplug memory between
> them, with some sparsemem like indirection layer?
I think you should be able to add them. You can remove them. But you can't
resize them.
And I suspect that by default, there should be zero of them. Ie you'd have
to set them up the same way you now set up a hugetlb area.
Linus
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 17:19 ` Linus Torvalds
@ 2005-11-03 17:48 ` Dave Hansen
2005-11-03 17:51 ` Martin J. Bligh
1 sibling, 0 replies; 253+ messages in thread
From: Dave Hansen @ 2005-11-03 17:48 UTC (permalink / raw)
To: Linus Torvalds
Cc: Martin J. Bligh, Mel Gorman, Arjan van de Ven, Nick Piggin,
Ingo Molnar, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms, Arjan van de Ven
On Thu, 2005-11-03 at 09:19 -0800, Linus Torvalds wrote:
> On Thu, 3 Nov 2005, Martin J. Bligh wrote:
> >
> > The problem is how these zones get resized. Can we hotplug memory between
> > them, with some sparsemem like indirection layer?
>
> I think you should be able to add them. You can remove them. But you can't
> resize them.
Any particular reasons you think we can't resize them? I know shrinking
the non-reclaim (DMA,NORMAL) zones will be practically impossible, but
it should be quite possible to shrink the reclaim zone, and grow DMA or
NORMAL into it.
This will likely be necessary as memory is added to a system, and the
ratio of reclaim to non-reclaim zones gets out of whack and away from
the magic 16:1 or 8:1 highmem:normal ratio that seems popular.
-- Dave
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 17:19 ` Linus Torvalds
2005-11-03 17:48 ` Dave Hansen
@ 2005-11-03 17:51 ` Martin J. Bligh
2005-11-03 17:59 ` Arjan van de Ven
` (2 more replies)
1 sibling, 3 replies; 253+ messages in thread
From: Martin J. Bligh @ 2005-11-03 17:51 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mel Gorman, Arjan van de Ven, Nick Piggin, Dave Hansen,
Ingo Molnar, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms, Arjan van de Ven
--Linus Torvalds <torvalds@osdl.org> wrote (on Thursday, November 03, 2005 09:19:35 -0800):
>
>
> On Thu, 3 Nov 2005, Martin J. Bligh wrote:
>>
>> The problem is how these zones get resized. Can we hotplug memory between
>> them, with some sparsemem like indirection layer?
>
> I think you should be able to add them. You can remove them. But you can't
> resize them.
>
> And I suspect that by default, there should be zero of them. Ie you'd have
> to set them up the same way you now set up a hugetlb area.
So ... if there are 0 by default, and I run for a while and dirty up
memory, how do I free any pages up to put into them? Not sure how that
works.
Going back to finding contig pages for a sec ... I don't disagree with
your assertion that order 1 is doable (however, we do need to make one
fix ...see below). It's > 1 that's a problem.
For amusement, let me put in some tritely oversimplified math. For the
sake of argument, assume the free watermarks are 8MB or so. Let's assume
a clean 64-bit system with no zone issues, etc (ie all one zone), 4K pages.
I'm going to assume random distribution of free pages, which is
oversimplified, but I'm trying to demonstrate a general premise, not get
accurate numbers.
8MB = 2048 pages.
On a 64MB system, we have 16384 pages, 2048 free. Very roughly speaking, for
each free page, the chance of its buddy being free is 2048/16384. So in
grossly-oversimplified stats-land, if I can remember anything at all, the
chance of finding one page with a free buddy is 1-(1-2048/16384)^2048,
which is, for all intents and purposes ... 1.
1GB system: 262144 pages. 1-(1-2048/262144)^2048 = 0.9999999
128GB system: 33554432 pages. 1-(1-2048/33554432)^2048 = 0.1175
yes, yes, my math sucks and I'm a simpleton. The point is that as memory
gets bigger, the odds suck for getting contiguous pages. And would also
explain why you think there's no problem, and I do ;-) And bear in mind
that's just for order 1 allocs. For bigger stuff, it REALLY sucks - I'll
spare you more wild attempts at foully-approximated math.
Hmmm. If we keep 128MB free, that totally kills off the above calculation.
I think I'll just tweak it so the limit is not so hard on really big
systems. Will send you a patch. However ... larger allocs will still
suck ... I guess I'd better gross you out with more incorrect math after
all ...
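[Editorial sketch] The figures above are easy to check; the snippet below (parameter names are mine) reproduces them under the same random-placement assumption:

```python
# Probability that at least one of `free_pages` randomly placed free pages
# has its buddy free, per the oversimplified model in the mail above.
def p_order1(total_pages, free_pages=2048):
    return 1 - (1 - free_pages / total_pages) ** free_pages

print(p_order1(16384))      # 64MB system:  ~1.0
print(p_order1(262144))     # 1GB system:   ~0.9999999
print(p_order1(33554432))   # 128GB system: ~0.1175
```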
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 17:51 ` Martin J. Bligh
@ 2005-11-03 17:59 ` Arjan van de Ven
2005-11-03 18:08 ` Linus Torvalds
2005-11-03 18:03 ` Linus Torvalds
2005-11-03 18:48 ` Martin J. Bligh
2 siblings, 1 reply; 253+ messages in thread
From: Arjan van de Ven @ 2005-11-03 17:59 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Linus Torvalds, Mel Gorman, Nick Piggin, Dave Hansen,
Ingo Molnar, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms, Arjan van de Ven
On Thu, 2005-11-03 at 09:51 -0800, Martin J. Bligh wrote:
> For amusement, let me put in some tritely oversimplified math. For the
> sake of arguement, assume the free watermarks are 8MB or so. Let's assume
> a clean 64-bit system with no zone issues, etc (ie all one zone). 4K pages.
> I'm going to assume random distribution of free pages, which is
> oversimplified, but I'm trying to demonstrate a general premise, not get
> accurate numbers.
that is VERY oversimplified though, given the anti-fragmentation
property of the buddy algorithm
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 17:59 ` Arjan van de Ven
@ 2005-11-03 18:08 ` Linus Torvalds
2005-11-03 18:17 ` Martin J. Bligh
2005-11-03 21:11 ` Mel Gorman
0 siblings, 2 replies; 253+ messages in thread
From: Linus Torvalds @ 2005-11-03 18:08 UTC (permalink / raw)
To: Arjan van de Ven
Cc: Martin J. Bligh, Mel Gorman, Nick Piggin, Dave Hansen,
Ingo Molnar, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms, Arjan van de Ven
On Thu, 3 Nov 2005, Arjan van de Ven wrote:
> On Thu, 2005-11-03 at 09:51 -0800, Martin J. Bligh wrote:
>
> > For amusement, let me put in some tritely oversimplified math. For the
> > sake of arguement, assume the free watermarks are 8MB or so. Let's assume
> > a clean 64-bit system with no zone issues, etc (ie all one zone). 4K pages.
> > I'm going to assume random distribution of free pages, which is
> > oversimplified, but I'm trying to demonstrate a general premise, not get
> > accurate numbers.
>
> that is VERY over simplified though, given the anti-fragmentation
> property of buddy algorithm
Indeed. I wrote a program at one time doing random allocation and
de-allocation and looking at what the output was, and buddy is very good
at avoiding fragmentation.
These days we have things like per-cpu lists in front of the buddy
allocator that will make fragmentation somewhat higher, but it's still
absolutely true that the page allocation layout is _not_ random.
Linus
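[Editorial sketch] A toy version of the kind of experiment Linus describes is easy to write; the sketch below is entirely my own reconstruction (not his program). It drives a minimal buddy allocator with random order-0 allocations and frees, then reports how the free memory has grouped into higher-order blocks:

```python
import random

ORDERS = 11          # orders 0..10, roughly MAX_ORDER
PAGES = 1 << 14      # 16384 "pages" of memory

# free[k] holds the start index of every free order-k block
free = [set() for _ in range(ORDERS)]
free[ORDERS - 1] = set(range(0, PAGES, 1 << (ORDERS - 1)))

def alloc():
    """Allocate one order-0 page, splitting larger blocks buddy-style."""
    for k in range(ORDERS):
        if free[k]:
            blk = free[k].pop()
            while k > 0:                 # split, keeping the lower half
                k -= 1
                free[k].add(blk + (1 << k))
            return blk
    return None

def free_page(blk):
    """Free one order-0 page, coalescing with its buddy where possible."""
    k = 0
    while k < ORDERS - 1:
        buddy = blk ^ (1 << k)
        if buddy in free[k]:
            free[k].remove(buddy)
            blk = min(blk, buddy)
            k += 1
        else:
            break
    free[k].add(blk)

random.seed(1)
allocated = []
for _ in range(200000):
    # keep the system about 7/8 full, with random churn
    if len(allocated) < PAGES * 7 // 8 and random.random() < 0.6:
        page = alloc()
        if page is not None:
            allocated.append(page)
    elif allocated:
        i = random.randrange(len(allocated))
        allocated[i], allocated[-1] = allocated[-1], allocated[i]
        free_page(allocated.pop())

for k in range(ORDERS):
    print("order", k, "free blocks:", len(free[k]))
```

Even after heavy churn, coalescing keeps the free lists biased toward merged higher-order blocks rather than scattered single pages, which is the non-randomness being pointed at here.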
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 18:08 ` Linus Torvalds
@ 2005-11-03 18:17 ` Martin J. Bligh
2005-11-03 18:44 ` Linus Torvalds
2005-11-04 0:58 ` Nick Piggin
2005-11-03 21:11 ` Mel Gorman
1 sibling, 2 replies; 253+ messages in thread
From: Martin J. Bligh @ 2005-11-03 18:17 UTC (permalink / raw)
To: Linus Torvalds, Arjan van de Ven
Cc: Mel Gorman, Nick Piggin, Dave Hansen, Ingo Molnar, Andrew Morton,
kravetz, linux-mm, Linux Kernel Mailing List, lhms,
Arjan van de Ven
>> > For amusement, let me put in some tritely oversimplified math. For the
>> > sake of arguement, assume the free watermarks are 8MB or so. Let's assume
>> > a clean 64-bit system with no zone issues, etc (ie all one zone). 4K pages.
>> > I'm going to assume random distribution of free pages, which is
>> > oversimplified, but I'm trying to demonstrate a general premise, not get
>> > accurate numbers.
>>
>> that is VERY over simplified though, given the anti-fragmentation
>> property of buddy algorithm
>
> Indeed. I write a program at one time doing random allocation and
> de-allocation and looking at what the output was, and buddy is very good
> at avoiding fragmentation.
>
> These days we have things like per-cpu lists in front of the buddy
> allocator that will make fragmentation somewhat higher, but it's still
> absolutely true that the page allocation layout is _not_ random.
OK, well I'll quit torturing you with incorrect math if you'll concede
that the situation gets much much worse as memory sizes get larger ;-)
For order 1 allocs, I think it's fixable. For order > 1, I think we
basically don't have a prayer on a largish system under pressure.
M.
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 18:17 ` Martin J. Bligh
@ 2005-11-03 18:44 ` Linus Torvalds
2005-11-03 18:51 ` Martin J. Bligh
2005-11-04 0:58 ` Nick Piggin
1 sibling, 1 reply; 253+ messages in thread
From: Linus Torvalds @ 2005-11-03 18:44 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Arjan van de Ven, Mel Gorman, Nick Piggin, Dave Hansen,
Ingo Molnar, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms, Arjan van de Ven
On Thu, 3 Nov 2005, Martin J. Bligh wrote:
> >
> > These days we have things like per-cpu lists in front of the buddy
> > allocator that will make fragmentation somewhat higher, but it's still
> > absolutely true that the page allocation layout is _not_ random.
>
> OK, well I'll quit torturing you with incorrect math if you'll concede
> that the situation gets much much worse as memory sizes get larger ;-)
I don't remember the specifics (I did the stats several years ago), but if
I recall correctly, the low-order allocations actually got _better_ with
more memory, assuming you kept a fixed percentage of memory free. So you
actually needed _less_ memory free (in percentages) to get low-order
allocations reliably.
But the higher orders didn't much matter. Basically, it gets exponentially
more difficult to keep higher-order allocations, and it doesn't help one
whit if there's a linear improvement from having more memory available or
something like that.
So it doesn't get _harder_ with lots of memory, but
- you need to keep the "minimum free" watermarks growing at the same rate
the memory sizes grow (and on x86, I don't think we do: at least at
some point, the HIGHMEM zone had a much lower low-water-mark because it
made the balancing behaviour much nicer. But I didn't check that).
- with lots of memory, you tend to want to get higher-order pages, and
that gets harder much much faster than your memory size grows. So
_effectively_, the kinds of allocations you care about are much harder
to get.
If you look at get_free_pages(), you will note that we actually
_guarantee_ memory allocations up to order-3:
...
if (!(gfp_mask & __GFP_NORETRY)) {
if ((order <= 3) || (gfp_mask & __GFP_REPEAT))
do_retry = 1;
...
and nobody has ever even noticed. In other words, low-order allocations
really _are_ dependable. It's just that the kinds of orders you want for
memory hotplug or hugetlb (ie not orders <=3, but >=10) are not, and never
will be.
(Btw, my statistics did depend on the fact that the _usage_ was an even
higher exponential, ie you had many many more order-0 allocations than you
had order-1.) You can always run out of order-n (n != 0) pages if you just
allocate enough of them. The buddy thing works well statistically, but it
obviously can't do wonders.
Linus
^ permalink raw reply [flat|nested] 253+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 18:44 ` Linus Torvalds
@ 2005-11-03 18:51 ` Martin J. Bligh
2005-11-03 19:35 ` Linus Torvalds
0 siblings, 1 reply; 253+ messages in thread
From: Martin J. Bligh @ 2005-11-03 18:51 UTC (permalink / raw)
To: Linus Torvalds
Cc: Arjan van de Ven, Mel Gorman, Nick Piggin, Dave Hansen,
Ingo Molnar, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms, Arjan van de Ven
--Linus Torvalds <torvalds@osdl.org> wrote (on Thursday, November 03, 2005 10:44:14 -0800):
>
>
> On Thu, 3 Nov 2005, Martin J. Bligh wrote:
>> >
>> > These days we have things like per-cpu lists in front of the buddy
>> > allocator that will make fragmentation somewhat higher, but it's still
>> > absolutely true that the page allocation layout is _not_ random.
>>
>> OK, well I'll quit torturing you with incorrect math if you'll concede
>> that the situation gets much much worse as memory sizes get larger ;-)
>
> I don't remember the specifics (I did the stats several years ago), but if
> I recall correctly, the low-order allocations actually got _better_ with
> more memory, assuming you kept a fixed percentage of memory free. So you
> actually needed _less_ memory free (in percentages) to get low-order
> allocations reliably.
Possibly, I can redo the calculations easily enough (have to go for now,
but I just sent the other ones). But we don't keep a fixed percentage of
memory free - we cap it ... perhaps we should though?
M.
^ permalink raw reply [flat|nested] 253+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 18:51 ` Martin J. Bligh
@ 2005-11-03 19:35 ` Linus Torvalds
2005-11-03 22:40 ` Martin J. Bligh
0 siblings, 1 reply; 253+ messages in thread
From: Linus Torvalds @ 2005-11-03 19:35 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Arjan van de Ven, Mel Gorman, Nick Piggin, Dave Hansen,
Ingo Molnar, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms, Arjan van de Ven
On Thu, 3 Nov 2005, Martin J. Bligh wrote:
>
> Possibly, I can redo the calculations easily enough (have to go for now,
> but I just sent the other ones). But we don't keep a fixed percentage of
> memory free - we cap it ... perhaps we should though?
I suspect the capping may well be from some old HIGHMEM interaction on x86
(ie "don't keep half a gig free in the normal zone just because we have
16GB in the high-zone"). We used to have serious balancing issues, and I
wouldn't be surprised at all if there are remnants from that. Stuff that
simply hasn't been visible, because not a lot of people had many many GB
of memory even on machines that didn't need HIGHMEM.
Linus
^ permalink raw reply [flat|nested] 253+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 19:35 ` Linus Torvalds
@ 2005-11-03 22:40 ` Martin J. Bligh
2005-11-03 22:56 ` Linus Torvalds
0 siblings, 1 reply; 253+ messages in thread
From: Martin J. Bligh @ 2005-11-03 22:40 UTC (permalink / raw)
To: Linus Torvalds
Cc: Arjan van de Ven, Mel Gorman, Nick Piggin, Dave Hansen,
Ingo Molnar, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms, Arjan van de Ven
--On Thursday, November 03, 2005 11:35:28 -0800 Linus Torvalds <torvalds@osdl.org> wrote:
>
>
> On Thu, 3 Nov 2005, Martin J. Bligh wrote:
>>
>> Possibly, I can redo the calculations easily enough (have to go for now,
>> but I just sent the other ones). But we don't keep a fixed percentage of
>> memory free - we cap it ... perhaps we should though?
>
> I suspect the capping may well be from some old HIGHMEM interaction on x86
> (ie "don't keep half a gig free in the normal zone just because we have
> 16GB in the high-zone". We used to have serious balancing issues, and I
> wouldn't be surprised at all if there are remnants from that. Stuff that
> simply hasn't been visible, because not a lot of people had many many GB
> of memory even on machines that didn't need HIGHMEM.
But pages_min is based on the zone size, not the system size. And we
still cap it. Maybe that's just a mistake?
M.
^ permalink raw reply [flat|nested] 253+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 22:40 ` Martin J. Bligh
@ 2005-11-03 22:56 ` Linus Torvalds
2005-11-03 23:01 ` Martin J. Bligh
0 siblings, 1 reply; 253+ messages in thread
From: Linus Torvalds @ 2005-11-03 22:56 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Arjan van de Ven, Mel Gorman, Nick Piggin, Dave Hansen,
Ingo Molnar, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms, Arjan van de Ven
On Thu, 3 Nov 2005, Martin J. Bligh wrote:
>
> But pages_min is based on the zone size, not the system size. And we
> still cap it. Maybe that's just a mistake?
The per-zone watermarking is actually the "modern" and "working" approach.
We didn't always do it that way. I would not be at all surprised if the
capping was from the global watermarking days.
Of course, I would _also_ not be at all surprised if it wasn't just out of
habit. Most of the things where we try to scale things up by memory size,
we cap for various reasons. Ie we tend to try to scale things like hash
sizes for core data structures by memory size, but then we tend to cap
them to "sane" versions.
So quite frankly, it's entirely possible that the capping is there not
because it _ever_ was a good idea, but just because it's what we almost
always do ;)
Mental inertia is definitely alive and well.
Linus
^ permalink raw reply [flat|nested] 253+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 22:56 ` Linus Torvalds
@ 2005-11-03 23:01 ` Martin J. Bligh
0 siblings, 0 replies; 253+ messages in thread
From: Martin J. Bligh @ 2005-11-03 23:01 UTC (permalink / raw)
To: Linus Torvalds
Cc: Arjan van de Ven, Mel Gorman, Nick Piggin, Dave Hansen,
Ingo Molnar, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms, Arjan van de Ven
>> But pages_min is based on the zone size, not the system size. And we
>> still cap it. Maybe that's just a mistake?
>
> The per-zone watermarking is actually the "modern" and "working" approach.
>
> We didn't always do it that way. I would not be at all surprised if the
> capping was from the global watermarking days.
>
> Of course, I would _also_ not be at all surprised if it wasn't just out of
> habit. Most of the things where we try to scale things up by memory size,
> we cap for various reasons. Ie we tend to try to scale things like hash
> sizes for core data structures by memory size, but then we tend to cap
> them to "sane" versions.
>
> So quite frankly, it's entirely possible that the capping is there not
> because it _ever_ was a good idea, but just because it's what we almost
> always do ;)
>
> Mental inertia is definitely alive and well.
Ha ;-) Well thanks for the explanation. I would suggest the patch I sent
you makes some semblance of sense then ...
M.
^ permalink raw reply [flat|nested] 253+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 18:17 ` Martin J. Bligh
2005-11-03 18:44 ` Linus Torvalds
@ 2005-11-04 0:58 ` Nick Piggin
2005-11-04 1:06 ` Linus Torvalds
1 sibling, 1 reply; 253+ messages in thread
From: Nick Piggin @ 2005-11-04 0:58 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Linus Torvalds, Arjan van de Ven, Mel Gorman, Dave Hansen,
Ingo Molnar, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms, Arjan van de Ven
Martin J. Bligh wrote:
>>These days we have things like per-cpu lists in front of the buddy
>>allocator that will make fragmentation somewhat higher, but it's still
>>absolutely true that the page allocation layout is _not_ random.
>
>
> OK, well I'll quit torturing you with incorrect math if you'll concede
> that the situation gets much much worse as memory sizes get larger ;-)
>
Let me add that as memory sizes get larger, people are also looking
for more TLB coverage and less per-page overhead.
Looks like ppc64 is getting 64K page support, at which point higher
order allocations (eg. for stacks) basically disappear, don't they?
x86-64 I thought was also getting 64K page support but I can't
find a reference to it right now - at the very least I know Andi
wants to support larger soft pages for it.
ia64 is obviously already well covered.
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
^ permalink raw reply [flat|nested] 253+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 0:58 ` Nick Piggin
@ 2005-11-04 1:06 ` Linus Torvalds
2005-11-04 1:20 ` Paul Mackerras
` (2 more replies)
0 siblings, 3 replies; 253+ messages in thread
From: Linus Torvalds @ 2005-11-04 1:06 UTC (permalink / raw)
To: Nick Piggin
Cc: Martin J. Bligh, Arjan van de Ven, Mel Gorman, Dave Hansen,
Ingo Molnar, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms, Arjan van de Ven
On Fri, 4 Nov 2005, Nick Piggin wrote:
>
> Looks like ppc64 is getting 64K page support, at which point higher
> order allocations (eg. for stacks) basically disappear don't they?
Yes and no, HOWEVER, nobody sane will ever use 64kB pages on a
general-purpose machine.
64kB pages are _only_ usable for databases, nothing else.
Why? Do the math. Try to cache the whole kernel source tree in 4kB pages
vs 64kB pages. See how the memory usage goes up by a factor of _four_.
Linus
^ permalink raw reply [flat|nested] 253+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 1:06 ` Linus Torvalds
@ 2005-11-04 1:20 ` Paul Mackerras
2005-11-04 1:22 ` Nick Piggin
2005-11-04 1:26 ` Mel Gorman
2 siblings, 0 replies; 253+ messages in thread
From: Paul Mackerras @ 2005-11-04 1:20 UTC (permalink / raw)
To: Linus Torvalds
Cc: Nick Piggin, Martin J. Bligh, Arjan van de Ven, Mel Gorman,
Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms, Arjan van de Ven
Linus Torvalds writes:
> 64kB pages are _only_ usable for databases, nothing else.
Actually people running HPC apps also like 64kB pages since their TLB
misses go down significantly, and their data files tend to be large.
Fileserving for windows boxes should also benefit, since both the
executables and the data files that typical office applications on
windows use are largish. I got a distribution of file sizes for a
government department office and concluded that 64k pages would only
bloat the page cache by a few percent for that case.
Paul.
^ permalink raw reply [flat|nested] 253+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 1:06 ` Linus Torvalds
2005-11-04 1:20 ` Paul Mackerras
@ 2005-11-04 1:22 ` Nick Piggin
2005-11-04 1:48 ` Mel Gorman
2005-11-04 1:26 ` Mel Gorman
2 siblings, 1 reply; 253+ messages in thread
From: Nick Piggin @ 2005-11-04 1:22 UTC (permalink / raw)
To: Linus Torvalds
Cc: Martin J. Bligh, Arjan van de Ven, Mel Gorman, Dave Hansen,
Ingo Molnar, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms, Arjan van de Ven
Linus Torvalds wrote:
>
> On Fri, 4 Nov 2005, Nick Piggin wrote:
>
>>Looks like ppc64 is getting 64K page support, at which point higher
>>order allocations (eg. for stacks) basically disappear don't they?
>
>
> Yes and no, HOWEVER, nobody sane will ever use 64kB pages on a
> general-purpose machine.
>
> 64kB pages are _only_ usable for databases, nothing else.
>
> Why? Do the math. Try to cache the whole kernel source tree in 4kB pages
> vs 64kB pages. See how the memory usage goes up by a factor of _four_.
>
Yeah that's true. But Martin's worried about future machines
with massive memories - so maybe it is safe to assume those will
be using big pages, I don't know.
Maybe the solution is to bloat the kernel sources enough to make
64KB pages worthwhile?
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 253+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 1:22 ` Nick Piggin
@ 2005-11-04 1:48 ` Mel Gorman
2005-11-04 1:59 ` Nick Piggin
0 siblings, 1 reply; 253+ messages in thread
From: Mel Gorman @ 2005-11-04 1:48 UTC (permalink / raw)
To: Nick Piggin
Cc: Linus Torvalds, Martin J. Bligh, Arjan van de Ven, Dave Hansen,
Ingo Molnar, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms, Arjan van de Ven
On Fri, 4 Nov 2005, Nick Piggin wrote:
> Linus Torvalds wrote:
> >
> > On Fri, 4 Nov 2005, Nick Piggin wrote:
> >
> > > Looks like ppc64 is getting 64K page support, at which point higher
> > > order allocations (eg. for stacks) basically disappear don't they?
> >
> >
> > Yes and no, HOWEVER, nobody sane will ever use 64kB pages on a
> > general-purpose machine.
> >
> > 64kB pages are _only_ usable for databases, nothing else.
> >
> > Why? Do the math. Try to cache the whole kernel source tree in 4kB pages vs
> > 64kB pages. See how the memory usage goes up by a factor of _four_.
> >
>
> Yeah that's true. But Martin's worried about future machines
> with massive memories - so maybe it is safe to assume those will
> be using big pages, I don't know.
>
Today's massive machines are tomorrow's desktop. Weak comment, I know, but
it's happened before.
> Maybe the solution is to bloat the kernel sources enough to make
> 64KB pages worthwhile?
>
root@monocle:/boot# ls -l vmlinuz-2.6.14-rc5-mm1-clean
-rw-r--r-- 1 root root 1718063 2005-11-01 16:17
vmlinuz-2.6.14-rc5-mm1-clean
root@monocle:/boot# ls -l vmlinuz-2.6.14-rc5-mm1-mbuddy-v19
-rw-r--r-- 1 root root 1722102 2005-11-02 14:56
vmlinuz-2.6.14-rc5-mm1-mbuddy-v19
root@monocle:/boot# dc
1722102
1718063
- p
4039
root@monocle:/boot# ls -l vmlinux-2.6.14-rc5-mm1-clean
-rwxr-xr-x 1 root root 31518866 2005-11-01 16:17
vmlinux-2.6.14-rc5-mm1-clean
root@monocle:/boot# ls -l vmlinux-2.6.14-rc5-mm1-mbuddy-v19
-rwxr-xr-x 1 root root 31585714 2005-11-02 14:56
vmlinux-2.6.14-rc5-mm1-mbuddy-v19
mel@joshua:/usr/src/patchset-0.5/kernels/linux-2.6.14-rc5-mm1-nooom$ wc -l mm/page_alloc.c
2689 mm/page_alloc.c
mel@joshua:/usr/src/patchset-0.5/kernels/linux-2.6.14-rc5-mm1-mbuddy-v19-withdefrag$ wc -l mm/page_alloc.c
3188 mm/page_alloc.c
0.23% increase in size of bzImage, 0.21% increase in the size of vmlinux,
and the major increase in code size is in one file, *one* file, all of
which does its best not to impact the flow of the well-understood code.
We're seeing bigger differences in performance than we are in the size of
the kernel. I'd understand if I were the first person to ever introduce
complexity to the VM.
If the size of the image for really small systems is the issue, what if I
say I'll add in another patch that optionally compiles away as much of
anti-defrag as possible without making the code a mess of #defines? Are
we still going to hear "no, I don't like looking at this"? The current
patch to compile it away deliberately chose the smallest part to take
away to restore the allocator to today's behavior.
--
Mel Gorman
Part-time Phd Student Java Applications Developer
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 253+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 1:48 ` Mel Gorman
@ 2005-11-04 1:59 ` Nick Piggin
2005-11-04 2:35 ` Mel Gorman
0 siblings, 1 reply; 253+ messages in thread
From: Nick Piggin @ 2005-11-04 1:59 UTC (permalink / raw)
To: Mel Gorman
Cc: Linus Torvalds, Martin J. Bligh, Arjan van de Ven, Dave Hansen,
Ingo Molnar, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms, Arjan van de Ven
Mel Gorman wrote:
> On Fri, 4 Nov 2005, Nick Piggin wrote:
>
> Today's massive machines are tomorrow's desktop. Weak comment, I know, but
> it's happened before.
>
Oh I wouldn't bet against it. And if desktops of the future are using
100s of GB then they probably would be happy to use 64K pages as well.
>
>>Maybe the solution is to bloat the kernel sources enough to make
>>64KB pages worthwhile?
>>
>
Sorry this wasn't meant to be a dig at your patches - I guess it turned
out that way though :\
But yes, if anybody is adding complexity or size to core code it
obviously does need to be justified -- and by no means does this only
apply to you.
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 253+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 1:59 ` Nick Piggin
@ 2005-11-04 2:35 ` Mel Gorman
0 siblings, 0 replies; 253+ messages in thread
From: Mel Gorman @ 2005-11-04 2:35 UTC (permalink / raw)
To: Nick Piggin
Cc: Linus Torvalds, Martin J. Bligh, Arjan van de Ven, Dave Hansen,
Ingo Molnar, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms, Arjan van de Ven
On Fri, 4 Nov 2005, Nick Piggin wrote:
> Mel Gorman wrote:
> > On Fri, 4 Nov 2005, Nick Piggin wrote:
> >
> > Today's massive machines are tomorrow's desktop. Weak comment, I know, but
> > it's happened before.
> >
>
> Oh I wouldn't bet against it. And if desktops of the future are using
> 100s of GB then they probably would be happy to use 64K pages as well.
>
And would it not be nice to be ready when it happens, before it happens
even?
> >
> > > Maybe the solution is to bloat the kernel sources enough to make
> > > 64KB pages worthwhile?
> > >
> >
>
> Sorry this wasn't meant to be a dig at your patches - I guess it turned
> out that way though :\
>
Oh, I'll live. If I was going to take it personally and go into a big
sulk, I wouldn't be here. This is linux-kernel, not the super-friends
club.
> But yes, if anybody is adding complexity or size to core code it
> obviously does need to be justified -- and by no means does this only
> apply to you.
>
I've tried to justify it with benchmarks that came with each release and
code reviews, particularly by Dave Hansen, showed that earlier versions
had significant problems that needed to be ironed out. I don't want to
hurt the normal case, because the fact of the matter is, my desktop
machine (which runs with these patches to see if there are any bugs)
runs the normal case and it will until we get much further because I'm not
configuring my machine for HugeTLB when it boots. If I'm hurting the
normal case, that's more time switching windows to see if the next test
kernel has built yet.
If we can do this and not regress in the standard case, then what is
wrong? I'm still waiting for figures that say this approach is slow and I
can only assume someone is trying, considering the length of this thread.
If and when those figures show up, I'll put on the thinking hat and see
where I went wrong, because regressing performance is wrong. There is a
win-win solution somewhere, how hard could it possibly be :) ?
I'm looking at the zone approach. I want to see if it can work in a nice
fashion, not in a "if the sysadm can see the future and configure
correctly, it'll work just fine" fashion. I'm not confident, but it might
be bias.
--
Mel Gorman
Part-time Phd Student Java Applications Developer
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 253+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 1:06 ` Linus Torvalds
2005-11-04 1:20 ` Paul Mackerras
2005-11-04 1:22 ` Nick Piggin
@ 2005-11-04 1:26 ` Mel Gorman
2 siblings, 0 replies; 253+ messages in thread
From: Mel Gorman @ 2005-11-04 1:26 UTC (permalink / raw)
To: Linus Torvalds
Cc: Nick Piggin, Martin J. Bligh, Arjan van de Ven, Dave Hansen,
Ingo Molnar, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms, Arjan van de Ven
On Thu, 3 Nov 2005, Linus Torvalds wrote:
>
>
> On Fri, 4 Nov 2005, Nick Piggin wrote:
> >
> > Looks like ppc64 is getting 64K page support, at which point higher
> > order allocations (eg. for stacks) basically disappear don't they?
>
> Yes and no, HOWEVER, nobody sane will ever use 64kB pages on a
> general-purpose machine.
>
> 64kB pages are _only_ usable for databases, nothing else.
>
Very well, but if the infrastructure required to help get 64kB pages
performs the same as, or better than, the current infrastructure that
gives 4kB pages, then why not? I am biased obviously and probably
optimistic but I am hoping we have a case here where we get our cake and
eat it twice.
> Why? Do the math. Try to cache the whole kernel source tree in 4kB pages
> vs 64kB pages. See how the memory usage goes up by a factor of _four_.
>
I don't know, but I doubt they would use 64kB pages as the default size
unless it is a specialised machine. I could be wrong, I don't have a ppc64
machine, I don't work on a ppc64 machine, I haven't read the architecture's
documentation and I didn't write this code for a ppc64 machine. If the
machine in question is a specialised machine, its users go into the
0.01% category of people, but it's a group that we can still help without
introducing static zones they have to configure.
I'm still waiting on figures that say the approach proposed here is
actually really slow, rather than makes people unhappy slow. If this is
proved to be slow, then I'll admit there is a problem and put more effort
into the plans to use zones instead. I just haven't found a problem on the
machines I have available to me, be it aim9, bench-stresshighalloc or
building kernels (which I think is important considering how often I build
test kernels). If it's a documentation problem with these patches, I'll
write up VM docs on the allocator and submit it as a patch, complete with
downsides and caveats to be fair.
--
Mel Gorman
Part-time Phd Student Java Applications Developer
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 253+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 18:08 ` Linus Torvalds
2005-11-03 18:17 ` Martin J. Bligh
@ 2005-11-03 21:11 ` Mel Gorman
1 sibling, 0 replies; 253+ messages in thread
From: Mel Gorman @ 2005-11-03 21:11 UTC (permalink / raw)
To: Linus Torvalds
Cc: Arjan van de Ven, Martin J. Bligh, Nick Piggin, Dave Hansen,
Ingo Molnar, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms, Arjan van de Ven
On Thu, 3 Nov 2005, Linus Torvalds wrote:
> On Thu, 3 Nov 2005, Arjan van de Ven wrote:
>
> > On Thu, 2005-11-03 at 09:51 -0800, Martin J. Bligh wrote:
> >
> > > For amusement, let me put in some tritely oversimplified math. For the
> > > sake of arguement, assume the free watermarks are 8MB or so. Let's assume
> > > a clean 64-bit system with no zone issues, etc (ie all one zone). 4K pages.
> > > I'm going to assume random distribution of free pages, which is
> > > oversimplified, but I'm trying to demonstrate a general premise, not get
> > > accurate numbers.
> >
> > that is VERY over simplified though, given the anti-fragmentation
> > property of buddy algorithm
>
The statistical properties of the buddy system are a nightmare. There is a
paper called "Statistical Properties of the Buddy System" which is a whole
pile of no fun to read. It's because of the difficulty of analysing
fragmentation offline that bench-stresshighalloc was written, to see how
well anti-defrag would do.
> Indeed. I wrote a program at one time doing random allocation and
> de-allocation and looked at what the output was, and buddy is very good
> at avoiding fragmentation.
>
The worst cause of fragmentation I found was kernel caches that were long
lived. How fragmenting the workload was depended heavily on whether things
like updatedb happened, which is why bench-stresshighalloc deliberately ran
it. It's also why anti-defrag tries to group inodes and buffer_heads into
the same areas in memory, separate from other
presumed-to-be-even-longer-lived kernel allocations. The assumption is if
the buffer, inode and dcaches are all shrunk, contiguous blocks will
appear.
You're also right on the size of the watermarks for zones and how it
affects fragmentation. A serious problem I had with anti-defrag was when
87.5% of memory is in use. At this point, a "fallback" area is used by any
allocation type that has no pages of its own. When it is depleted, real
fragmentation starts happening and it's also about here that the high
watermark for reclaiming starts. I wanted to increase the watermarks to
start reclaiming pages when the "fallback" area started getting used but
didn't think I would get away with adjusting those figures. I could have
cheated and set it via /proc before benchmarks but didn't, to avoid "magic
test system" syndrome.
> These days we have things like per-cpu lists in front of the buddy
> allocator that will make fragmentation somewhat higher, but it's still
> absolutely true that the page allocation layout is _not_ random.
>
It's worse than somewhat higher for the per-cpu pages. Using another set
of patches on top of an earlier version of anti-defrag, I was able to
allocate about 75% of physical memory in pinned 4MiB chunks of memory
under loads of 15-20 (kernel builds). To get there, per-cpu pages had to
be drained using an IPI call because, for some perverse reason, there were
always 2 or 3 free per-cpu pages in the middle of a 1024-page block.
Basically, I don't think we have to live with fragmentation in the page
allocator. I think it can be pushed down a whole lot without taking a
performance hit for the 99.99% of users that don't care about this sort of
thing.
--
Mel Gorman
Part-time Phd Student Java Applications Developer
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 253+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 17:51 ` Martin J. Bligh
2005-11-03 17:59 ` Arjan van de Ven
@ 2005-11-03 18:03 ` Linus Torvalds
2005-11-03 20:00 ` Paul Jackson
2005-11-03 20:46 ` Mel Gorman
2005-11-03 18:48 ` Martin J. Bligh
2 siblings, 2 replies; 253+ messages in thread
From: Linus Torvalds @ 2005-11-03 18:03 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Mel Gorman, Arjan van de Ven, Nick Piggin, Dave Hansen,
Ingo Molnar, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms, Arjan van de Ven
On Thu, 3 Nov 2005, Martin J. Bligh wrote:
> > And I suspect that by default, there should be zero of them. Ie you'd have
> > to set them up the same way you now set up a hugetlb area.
>
> So ... if there are 0 by default, and I run for a while and dirty up
> memory, how do I free any pages up to put into them? Not sure how that
> works.
You don't.
Just face it - people who want memory hotplug had better know that
beforehand (and let's be honest - in practice it's only going to work in
virtualized environments or in environments where you can insert the new
bank of memory and copy it over and remove the old one with hw support).
Same as hugetlb.
Nobody sane _cares_. Nobody sane is asking for these things. Only people
with special needs are asking for it, and they know their needs.
You have to realize that the first rule of engineering is to work out the
balances. The undeniable fact is, that 99.99% of all users will never care
one whit, and memory management is complex and fragile. End result: the
0.01% of users will have to do some manual configuration to keep things
simpler for the cases that really matter.
Because the case that really matters is the sane case. The one where we
- don't change memory (normal)
- only add memory (easy)
- only switch out memory with hardware support (ie the _hardware_
supports parallel memory, and you can switch out a DIMM without
software ever really even noticing)
- have system maintainers that do strange things, but _know_ that.
We simply DO NOT CARE about some theoretical "general case", because the
general case is (a) insane and (b) impossible to cater to without
excessive complexity.
Guys, a kernel developer needs to know when to say NO.
And we say NO, HELL NO!! to generic software-only memory hotplug.
If you are running a DB that needs to benchmark well, you damn well KNOW
IT IN ADVANCE, AND YOU TUNE FOR IT.
Nobody takes a random machine and says "ok, we'll now put our most
performance-critical database on this machine, and oh, btw, you can't
reboot it and tune for it beforehand". And if you have such a person, you
need to learn to IGNORE THE CRAZY PEOPLE.
When you hear voices in your head that tell you to shoot the pope, do you
do what they say? Same thing goes for customers and managers. They are the
crazy voices in your head, and you need to set them right, not just
blindly do what they ask for.
Linus
^ permalink raw reply [flat|nested] 253+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 18:03 ` Linus Torvalds
@ 2005-11-03 20:00 ` Paul Jackson
2005-11-03 20:46 ` Mel Gorman
1 sibling, 0 replies; 253+ messages in thread
From: Paul Jackson @ 2005-11-03 20:00 UTC (permalink / raw)
To: Linus Torvalds
Cc: mbligh, mel, arjan, nickpiggin, haveblue, mingo, akpm, kravetz,
linux-mm, linux-kernel, lhms-devel, arjanv
> We simply DO NOT CARE about some theoretical "general case", because the
> general case is (a) insane and (b) impossible to cater to without
> excessive complexity.
The lawyers have a phrase for this:
Hard cases make bad law.
For us, that's bad code.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
^ permalink raw reply [flat|nested] 253+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 18:03 ` Linus Torvalds
2005-11-03 20:00 ` Paul Jackson
@ 2005-11-03 20:46 ` Mel Gorman
1 sibling, 0 replies; 253+ messages in thread
From: Mel Gorman @ 2005-11-03 20:46 UTC (permalink / raw)
To: Linus Torvalds
Cc: Martin J. Bligh, Arjan van de Ven, Nick Piggin, Dave Hansen,
Ingo Molnar, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms, Arjan van de Ven
On Thu, 3 Nov 2005, Linus Torvalds wrote:
>
>
> On Thu, 3 Nov 2005, Martin J. Bligh wrote:
> > > And I suspect that by default, there should be zero of them. Ie you'd have
> > > to set them up the same way you now set up a hugetlb area.
> >
> > So ... if there are 0 by default, and I run for a while and dirty up
> > memory, how do I free any pages up to put into them? Not sure how that
> > works.
>
> You don't.
>
> Just face it - people who want memory hotplug had better know that
> beforehand (and let's be honest - in practice it's only going to work in
> virtualized environments or in environments where you can insert the new
> bank of memory and copy it over and remove the old one with hw support).
>
> Same as hugetlb.
>
For HugeTLB, there are cases where the sysadmin won't configure the server
because it's a tunable that can badly affect the machine if they get it
wrong. In those cases, the users just get small pages and the associated
performance penalty, and are told to like it.
> Nobody sane _cares_. Nobody sane is asking for these things. Only people
> with special needs are asking for it, and they know their needs.
>
> You have to realize that the first rule of engineering is to work out the
> balances. The undeniable fact is, that 99.99% of all users will never care
> one whit, and memory management is complex and fragile. End result: the
> 0.01% of users will have to do some manual configuration to keep things
> simpler for the cases that really matter.
>
Ok, so let's consider the 99.99% of users then. On two machines, aim9
benchmarks posted during this thread show some improvements on page_test,
fork_test and brk_test, the paths you would expect to be hit by these
patches. They are very minor improvements, but 99.99% of users benefit from
them. Aim9 might be considered artificial, so note that somewhere in that
99.99% of users are kernel developers who care about kbuild; here are the
timings of "kernel untar ; make defconfig ; make":
2.6.14-rc5-mm1:                           1093 seconds
2.6.14-rc5-mm1-mbuddy-v19-withoutdefrag:  1089 seconds
2.6.14-rc5-mm1-mbuddy-v19-withdefrag:     1086 seconds
The withoutdefrag mark is with the core of anti-defrag disabled via a
configure option. The option to disable was a separate patch produced
during this thread. To be really honest, I don't think a configurable page
allocator is a great idea.
Building kernels is faster with this set of patches which a few people on
this list care about. aim9 shows very minor improvements which benefit a
very large number of people and 0.01% of people who care about
fragmentation get lower fragmentation.
Of course, maybe there is something magic about my test machines (or maybe
I am willing them faster), so figures from other people wouldn't hurt,
whether they show gains or regressions. On my machines at least, 99.99% of
people are still benefiting.
I am going to wait to see if people post figures that show regressions
before asking "are you still saying no?" to this set of patches.
> Because the case that really matters is the sane case. The one where we
> - don't change memory (normal)
> - only add memory (easy)
> - only switch out memory with hardware support (ie the _hardware_
> supports parallel memory, and you can switch out a DIMM without
> software ever really even noticing)
> - have system maintainers that do strange things, but _know_ that.
>
> We simply DO NOT CARE about some theoretical "general case", because the
> general case is (a) insane and (b) impossible to cater to without
> excessive complexity.
>
> Guys, a kernel developer needs to know when to say NO.
>
> And we say NO, HELL NO!! to generic software-only memory hotplug.
>
> If you are running a DB that needs to benchmark well, you damn well KNOW
> IT IN ADVANCE, AND YOU TUNE FOR IT.
>
> Nobody takes a random machine and says "ok, we'll now put our most
> performance-critical database on this machine, and oh, btw, you can't
> reboot it and tune for it beforehand". And if you have such a person, you
> need to learn to IGNORE THE CRAZY PEOPLE.
>
> When you hear voices in your head that tell you to shoot the pope, do you
> do what they say? Same thing goes for customers and managers. They are the
> crazy voices in your head, and you need to set them right, not just
> blindly do what they ask for.
>
> Linus
>
--
Mel Gorman
Part-time Phd Student Java Applications Developer
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 253+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 17:51 ` Martin J. Bligh
2005-11-03 17:59 ` Arjan van de Ven
2005-11-03 18:03 ` Linus Torvalds
@ 2005-11-03 18:48 ` Martin J. Bligh
2005-11-03 19:08 ` Linus Torvalds
2 siblings, 1 reply; 253+ messages in thread
From: Martin J. Bligh @ 2005-11-03 18:48 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mel Gorman, Arjan van de Ven, Nick Piggin, Dave Hansen,
Ingo Molnar, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms, Arjan van de Ven
> For amusement, let me put in some tritely oversimplified math. For the
> sake of arguement, assume the free watermarks are 8MB or so. Let's assume
> a clean 64-bit system with no zone issues, etc (ie all one zone). 4K pages.
> I'm going to assume random distribution of free pages, which is
> oversimplified, but I'm trying to demonstrate a general premise, not get
> accurate numbers.
>
> 8MB = 2048 pages.
>
> On a 64MB system, we have 16384 pages, 2048 free. Very rougly speaking, for
> each free page, chance of it's buddy being free is 2048/16384. So in
> grossly-oversimplified stats-land, if I can remember anything at all,
> chance of finding one page with a free buddy is 1-(1-2048/16384)^2048,
> which is, for all intents and purposes ... 1.
>
> 1 GB. system, 262144 pages 1-(1-2048/16384)^2048 = 0.9999989
>
> 128GB system. 33554432 pages. 0.1175 probability
>
> yes, yes, my math sucks and I'm a simpleton. The point is that as memory
> gets bigger, the odds suck for getting contiguous pages. And would also
> explain why you think there's no problem, and I do ;-) And bear in mind
> that's just for order 1 allocs. For bigger stuff, it REALLY sucks - I'll
> spare you more wild attempts at foully-approximated math.
>
> Hmmm. If we keep 128MB free, that totally kills off the above calculation
> I think I'll just tweak it so the limit is not so hard on really big
> systems. Will send you a patch. However ... larger allocs will still
> suck ... I guess I'd better gross you out with more incorrect math after
> all ...
Ha. Just because I don't think I made you puke hard enough already with
foul approximations ... for order 2, I think it's
1-(1-(free_pool/total)^3)^free_pool
because all 3 of its buddies have to be free as well.
(and generically, the exponent is 2^order - 1)
ORDER: 1
1024MB system, 8MB pool = 1.000000
131072MB system, 8MB pool = 0.117506
1024MB system, 128MB pool = 1.000000
131072MB system, 128MB pool = 1.000000
ORDER: 2
1024MB system, 8MB pool = 0.000976
131072MB system, 8MB pool = 0.000000
1024MB system, 128MB pool = 1.000000
131072MB system, 128MB pool = 0.000031
ORDER: 3
1024MB system, 8MB pool = 0.000000
131072MB system, 8MB pool = 0.000000
1024MB system, 128MB pool = 0.015504
131072MB system, 128MB pool = 0.000000
ORDER: 4
1024MB system, 8MB pool = 0.000000
131072MB system, 8MB pool = 0.000000
1024MB system, 128MB pool = 0.000000
131072MB system, 128MB pool = 0.000000
------------------------
I really should learn not to post my rusty math in such public places ...
but I still think the point is correct. Anyway, I'm sure somewhere in
the resultant flamewar, someone will come up with some better approx ;-)
And yes, I appreciate the random distribution thing is wrong. But it's
still not going to work for bigger allocs. Fixing the free watermarks
will help us a bit though.
^ permalink raw reply [flat|nested] 253+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 18:48 ` Martin J. Bligh
@ 2005-11-03 19:08 ` Linus Torvalds
2005-11-03 22:37 ` Martin J. Bligh
2005-11-04 16:22 ` Mel Gorman
0 siblings, 2 replies; 253+ messages in thread
From: Linus Torvalds @ 2005-11-03 19:08 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Mel Gorman, Arjan van de Ven, Nick Piggin, Dave Hansen,
Ingo Molnar, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms, Arjan van de Ven
On Thu, 3 Nov 2005, Martin J. Bligh wrote:
>
> Ha. Just because I don't think I made you puke hard enough already with
> foul approximations ... for order 2, I think it's
Your basic fault is in believing that the free watermark would stay
constant.
That's insane.
Would you keep 8MB free on a 64MB system?
Would you keep 8MB free on a 8GB system?
The point being, that if you start with insane assumptions, you'll get
insane answers.
The _correct_ assumption is that you aim to keep some fixed percentage of
memory free. With that assumption and your math, finding higher-order
pages is equally hard regardless of amount of memory.
Now, your math then doesn't allow for the fact that buddy automatically
coalesces for you, so in fact things get _easier_ with more memory, but
hey, that needs more math than I can come up with (I never did it as math,
only as simulations with allocation patterns - "smart people use math,
plodding people just try to simulate an estimate" ;)
Linus
^ permalink raw reply [flat|nested] 253+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 19:08 ` Linus Torvalds
@ 2005-11-03 22:37 ` Martin J. Bligh
2005-11-03 23:16 ` Linus Torvalds
2005-11-04 16:22 ` Mel Gorman
1 sibling, 1 reply; 253+ messages in thread
From: Martin J. Bligh @ 2005-11-03 22:37 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mel Gorman, Arjan van de Ven, Nick Piggin, Dave Hansen,
Ingo Molnar, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms, Arjan van de Ven
>> Ha. Just because I don't think I made you puke hard enough already with
>> foul approximations ... for order 2, I think it's
>
> Your basic fault is in believing that the free watermark would stay
> constant.
>
> That's insane.
>
> Would you keep 8MB free on a 64MB system?
>
> Would you keep 8MB free on a 8GB system?
>
> The point being, that if you start with insane assumptions, you'll get
> insane answers.
Ummm. I was basing it on what we actually do now in the code, unless I
misread it, which is perfectly possible. Do you want this patch?
diff -purN -X /home/mbligh/.diff.exclude linux-2.6.14/mm/page_alloc.c 2.6.14-no_water_cap/mm/page_alloc.c
--- linux-2.6.14/mm/page_alloc.c 2005-10-27 18:52:20.000000000 -0700
+++ 2.6.14-no_water_cap/mm/page_alloc.c 2005-11-03 14:36:06.000000000 -0800
@@ -2387,8 +2387,6 @@ static void setup_per_zone_pages_min(voi
min_pages = zone->present_pages / 1024;
if (min_pages < SWAP_CLUSTER_MAX)
min_pages = SWAP_CLUSTER_MAX;
- if (min_pages > 128)
- min_pages = 128;
zone->pages_min = min_pages;
} else {
/* if it's a lowmem zone, reserve a number of pages
> The _correct_ assumption is that you aim to keep some fixed percentage of
> memory free. With that assumption and your math, finding higher-order
> pages is equally hard regardless of amount of memory.
That would, indeed, make more sense.
> Now, your math then doesn't allow for the fact that buddy automatically
> coalesces for you, so in fact things get _easier_ with more memory, but
> hey, that needs more math than I can come up with (I never did it as math,
> only as simulations with allocation patterns - "smart people use math,
> plodding people just try to simulate an estimate" ;)
Not sure what people who do math, but wrongly, are called, but I'm sure
it's not polite, and I'm sure I'm one of those ;-)
M.
^ permalink raw reply [flat|nested] 253+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 22:37 ` Martin J. Bligh
@ 2005-11-03 23:16 ` Linus Torvalds
2005-11-03 23:39 ` Martin J. Bligh
2005-11-04 4:39 ` Andrew Morton
0 siblings, 2 replies; 253+ messages in thread
From: Linus Torvalds @ 2005-11-03 23:16 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Mel Gorman, Arjan van de Ven, Nick Piggin, Dave Hansen,
Ingo Molnar, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms, Arjan van de Ven
On Thu, 3 Nov 2005, Martin J. Bligh wrote:
>
> Ummm. I was basing it on what we actually do now in the code, unless I
> misread it, which is perfectly possible. Do you want this patch?
>
> diff -purN -X /home/mbligh/.diff.exclude linux-2.6.14/mm/page_alloc.c 2.6.14-no_water_cap/mm/page_alloc.c
> --- linux-2.6.14/mm/page_alloc.c 2005-10-27 18:52:20.000000000 -0700
> +++ 2.6.14-no_water_cap/mm/page_alloc.c 2005-11-03 14:36:06.000000000 -0800
> @@ -2387,8 +2387,6 @@ static void setup_per_zone_pages_min(voi
> min_pages = zone->present_pages / 1024;
> if (min_pages < SWAP_CLUSTER_MAX)
> min_pages = SWAP_CLUSTER_MAX;
> - if (min_pages > 128)
> - min_pages = 128;
> zone->pages_min = min_pages;
> } else {
> /* if it's a lowmem zone, reserve a number of pages
Ahh, you're right, there's a totally separate watermark for highmem.
I think I even remember this. I may even be responsible. I know some of
our less successful highmem balancing efforts in the 2.4.x timeframe had
serious trouble when they ran out of highmem, and started pruning lowmem
very very aggressively. Limiting the highmem water marks meant that it
wouldn't do that very often.
I think your patch may in fact be fine, but quite frankly, it needs
testing under real load with highmem.
In general, I don't _think_ we should do anything different for highmem at
all, and we should just in general try to keep a percentage of pages
available. Now, the percentage probably does depend on the zone: we should
be more aggressive about more "limited" zones, ie the old 16MB DMA zone
should probably try to keep a higher percentage of free pages around than
the normal zone, and that in turn should probably keep a higher percentage
of pages around than the highmem zones.
And that's not because of fragmentation so much, but simply because the
lower zones tend to have more "desperate" users. Running out of the normal
zone is thus a "worse" situation than running out of highmem. And we
effectively never want to allocate from the 16MB DMA zone at all, unless
it is our only choice.
We actually do try to do that with that "lowmem_reserve[]" logic, which
reserves more pages in the lower zones the bigger the upper zones are (ie
if we _only_ have memory in the low 16MB, then we don't reserve any of it,
but if we have _tons_ of memory in the high zones, then we reserve more
memory for the low zones and thus make the watermarks higher for them).
So the watermarking interacts with that lowmem_reserve logic, and I think
that on HIGHMEM, you'd be screwed _twice_: first because the "pages_min"
is limited, and second because HIGHMEM has no lowmem_reserve.
Does that make sense?
Linus
^ permalink raw reply [flat|nested] 253+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 23:16 ` Linus Torvalds
@ 2005-11-03 23:39 ` Martin J. Bligh
2005-11-04 0:42 ` Nick Piggin
2005-11-04 4:39 ` Andrew Morton
1 sibling, 1 reply; 253+ messages in thread
From: Martin J. Bligh @ 2005-11-03 23:39 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mel Gorman, Arjan van de Ven, Nick Piggin, Dave Hansen,
Ingo Molnar, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms, Arjan van de Ven
> Ahh, you're right, there's a totally separate watermark for highmem.
>
> I think I even remember this. I may even be responsible. I know some of
> our less successful highmem balancing efforts in the 2.4.x timeframe had
> serious trouble when they ran out of highmem, and started pruning lowmem
> very very aggressively. Limiting the highmem water marks meant that it
> wouldn't do that very often.
>
> I think your patch may in fact be fine, but quite frankly, it needs
> testing under real load with highmem.
>
> In general, I don't _think_ we should do anything different for highmem at
> all, and we should just in general try to keep a percentage of pages
> available. Now, the percentage probably does depend on the zone: we should
> be more aggressive about more "limited" zones, ie the old 16MB DMA zone
> should probably try to keep a higher percentage of free pages around than
> the normal zone, and that in turn should probably keep a higher percentage
> of pages around than the highmem zones.
Hmm, it strikes me that there will be few (if any?) such allocations out of
highmem. PPC64 et al dump everything into ZONE_DMA though, so those zones
should be uncapped already.
> And that's not because of fragmentation so much, but simply because the
> lower zones tend to have more "desperate" users. Running out of the normal
> zone is thus a "worse" situation than running out of highmem. And we
> effectively never want to allocate from the 16MB DMA zone at all, unless
> it is our only choice.
Well it's not 16MB on the other platforms, but ...
> We actually do try to do that with that "lowmem_reserve[]" logic, which
> reserves more pages in the lower zones the bigger the upper zones are (ie
> if we _only_ have memory in the low 16MB, then we don't reserve any of it,
> but if we have _tons_ of memory in the high zones, then we reserve more
> memory for the low zones and thus make the watermarks higher for them).
>
> So the watermarking interacts with that lowmem_reserve logic, and I think
> that on HIGHMEM, you'd be screwed _twice_: first because the "pages_min"
> is limited, and second because HIGHMEM has no lowmem_reserve.
>
> Does that make sense?
Yes. So we were only capping highmem before, now that I squint at it
closer. I was going off a simplification I'd written for a paper, which is
not generally correct. I doubt fragmentation is a problem in highmem, so
maybe the code is correct as-is. We only want contiguous allocations for
virtual addresses when they are mapped 1-1 to physical (ie the kernel
mapping) or for real physical uses.
I suppose I could write something to trawl the source tree to check
that assumption, but it feels right ...
M.
^ permalink raw reply [flat|nested] 253+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 23:39 ` Martin J. Bligh
@ 2005-11-04 0:42 ` Nick Piggin
0 siblings, 0 replies; 253+ messages in thread
From: Nick Piggin @ 2005-11-04 0:42 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Linus Torvalds, Mel Gorman, Arjan van de Ven, Dave Hansen,
Ingo Molnar, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms, Arjan van de Ven
Martin J. Bligh wrote:
>>Ahh, you're right, there's a totally separate watermark for highmem.
>>
>>I think I even remember this. I may even be responsible. I know some of
>>our less successful highmem balancing efforts in the 2.4.x timeframe had
>>serious trouble when they ran out of highmem, and started pruning lowmem
>>very very aggressively. Limiting the highmem water marks meant that it
>>wouldn't do that very often.
>>
>>I think your patch may in fact be fine, but quite frankly, it needs
>>testing under real load with highmem.
>>
I'd prefer not. The reason is that it increases the "min"
watermark, which only gets used basically by GFP_ATOMIC and
PF_MEMALLOC allocators - neither of which are likely to want
highmem.
Also, I don't think anybody cares about higher order highmem
allocations. At the least, the patches in this thread:
http://marc.theaimsgroup.com/?l=linux-kernel&m=113082256231168&w=2
should be applied before this. However, they also need more
testing, so I'll be sending them to Andrew first.
Patch 2 does basically the same thing as your patch, without
increasing the min watermark.
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 253+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 23:16 ` Linus Torvalds
2005-11-03 23:39 ` Martin J. Bligh
@ 2005-11-04 4:39 ` Andrew Morton
1 sibling, 0 replies; 253+ messages in thread
From: Andrew Morton @ 2005-11-04 4:39 UTC (permalink / raw)
To: Linus Torvalds
Cc: mbligh, mel, arjan, nickpiggin, haveblue, mingo, kravetz,
linux-mm, linux-kernel, lhms-devel, arjanv
Linus Torvalds <torvalds@osdl.org> wrote:
>
> On Thu, 3 Nov 2005, Martin J. Bligh wrote:
> >
> > Ummm. I was basing it on what we actually do now in the code, unless I
> > misread it, which is perfectly possible. Do you want this patch?
> >
> > diff -purN -X /home/mbligh/.diff.exclude linux-2.6.14/mm/page_alloc.c 2.6.14-no_water_cap/mm/page_alloc.c
> > --- linux-2.6.14/mm/page_alloc.c 2005-10-27 18:52:20.000000000 -0700
> > +++ 2.6.14-no_water_cap/mm/page_alloc.c 2005-11-03 14:36:06.000000000 -0800
> > @@ -2387,8 +2387,6 @@ static void setup_per_zone_pages_min(voi
> > min_pages = zone->present_pages / 1024;
> > if (min_pages < SWAP_CLUSTER_MAX)
> > min_pages = SWAP_CLUSTER_MAX;
> > - if (min_pages > 128)
> > - min_pages = 128;
> > zone->pages_min = min_pages;
> > } else {
> > /* if it's a lowmem zone, reserve a number of pages
>
> Ahh, you're right, there's a totally separate watermark for highmem.
>
> I think I even remember this. I may even be responsible. I know some of
> our less successful highmem balancing efforts in the 2.4.x timeframe had
> serious trouble when they ran out of highmem, and started pruning lowmem
> very very aggressively. Limiting the highmem water marks meant that it
> wouldn't do that very often.
No, that was me and Matthew Dobson, circa 2.5.71. The thinking was that
highmem is just for userspace pages and we don't need to keep the free
memory pool around for things like atomic allocations. Especially as a
proportionally-sized highmem emergency pool would be potentially hundreds of
(wasted) megabytes.
iirc, things worked ok with a highmem min_pages threshold of zero pages. Back
in 2.5.70, before everyone else broke it ;)
^ permalink raw reply [flat|nested] 253+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 19:08 ` Linus Torvalds
2005-11-03 22:37 ` Martin J. Bligh
@ 2005-11-04 16:22 ` Mel Gorman
1 sibling, 0 replies; 253+ messages in thread
From: Mel Gorman @ 2005-11-04 16:22 UTC (permalink / raw)
To: Linus Torvalds
Cc: Martin J. Bligh, Arjan van de Ven, Nick Piggin, Dave Hansen,
Ingo Molnar, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms, Arjan van de Ven
On Thu, 3 Nov 2005, Linus Torvalds wrote:
>
>
> On Thu, 3 Nov 2005, Martin J. Bligh wrote:
> >
> > Ha. Just because I don't think I made you puke hard enough already with
> > foul approximations ... for order 2, I think it's
>
> Your basic fault is in believing that the free watermark would stay
> constant.
>
> That's insane.
>
> Would you keep 8MB free on a 64MB system?
>
> Would you keep 8MB free on a 8GB system?
>
> The point being, that if you start with insane assumptions, you'll get
> insane answers.
>
> The _correct_ assumption is that you aim to keep some fixed percentage of
> memory free. With that assumption and your math, finding higher-order
> pages is equally hard regardless of amount of memory.
>
> Now, your math then doesn't allow for the fact that buddy automatically
> coalesces for you, so in fact things get _easier_ with more memory, but
> hey, that needs more math than I can come up with (I never did it as math,
> only as simulations with allocation patterns - "smart people use math,
> plodding people just try to simulate an estimate" ;)
>
My math is not that great either, so here is a simulation.
Setup: Reboot the machine which is a quad xeon xSeries 350 with 1.5GiB of
RAM. Configure /proc/sys/vm/min_free_kbytes to try and keep 1/8th of
physical memory free. This is to keep in line with your suggestion that
fragmentation is low when there is a higher percentage of memory free.
Load: Run a load - 7 kernels compiling simultaneously at -j2 which gives
loads between 10-14. Try and get 50% worth of physical memory in 4MiB
pages (1024 contiguous pages) while compiling. When the test ends and the
system is quiet, try again. 4MiB in this case is a single HugeTLB page.
Here are the results;
2.6.14-rc5-mm1-clean (OOM killer disabled) Allocating Under Load
Order: 10
Allocation type: HighMem
Attempted allocations: 160
Success allocs: 24
Failed allocs: 136
DMA zone allocs: 0
Normal zone allocs: 16
HighMem zone allocs: 8
% Success: 15
2.6.14-rc5-mm1-mbuddy-v19 Allocating Under Load
Order: 10
Allocation type: HighMem
Attempted allocations: 160
Success allocs: 24
Failed allocs: 136
DMA zone allocs: 0
Normal zone allocs: 11
HighMem zone allocs: 13
% Success: 15
Not a lot of difference there and the success rate is not great.
mbuddy-v19 is a bit better at the normal zone and that's about it. These
results are not surprising as kswapd is making no effort to get contiguous
pages. Under a load of 7 kernel compiles, kswapd will not free pages fast
enough.
When the test ends and the system is quiet, try and get 80% of physical
memory in large pages. 4 attempts are made to satisfy the requests to give
kswapd lots of time.
2.6.14-rc5-mm1-clean (OOM killer disabled) Allocating while rested
Order: 10
Allocation type: HighMem
Attempted allocations: 300
Success allocs: 159
Failed allocs: 141
DMA zone allocs: 0
Normal zone allocs: 46
HighMem zone allocs: 113
% Success: 53
Mainly highmem there.
2.6.14-rc5-mm1-mbuddy-v19 Allocating while rested
Order: 10
Allocation type: HighMem
Attempted allocations: 300
Success allocs: 212
Failed allocs: 88
DMA zone allocs: 0
Normal zone allocs: 102
HighMem zone allocs: 110
% Success: 70
Look at the big difference in the number of successful allocations in
ZONE_NORMAL because the kernel allocations were kept together. Experience
has shown me that failure to get higher success rates depends on per-cpu
pages and on the number of kernel pages that leak to other areas (56 over
the course of this test). Kernel pages leaking was helped a lot by setting
min_free_kbytes higher than the default.
I then ported forward the linear scanner and ran the tests again. The
linear scanner does two things - it finds linearly reclaimable pages using
information provided by anti-defrag, and it drains the per-cpu caches. I'll
post the linear scanner code if people want to look at it but it's really
crap. Being slow, working too hard and not trying to hold on to the pages
for the process reclaiming them are just some of its problems. I need to
rewrite it almost from scratch and avoid all the mistakes, but it's a path
that is hit *only* if you are allocating high orders.
2.6.14-rc5-mm1-mbuddy-v19-lnscan Allocating under load
Order: 10
Allocation type: HighMem
Attempted allocations: 160
Success allocs: 155
Failed allocs: 0
DMA zone allocs: 0
Normal zone allocs: 12
HighMem zone allocs: 143
% Success: 96
It mainly got its pages back from highmem, which is always easier as long
as PTE pages are not in the way.
2.6.14-rc5-mm1-mbuddy-v19-lnscan Allocating while rested
Order: 10
Allocation type: HighMem
Attempted allocations: 300
Success allocs: 275
Failed allocs: 0
DMA zone allocs: 0
Normal zone allocs: 133
HighMem zone allocs: 142
% Success: 91
That is 71% of physical memory available in contiguous blocks with the
linear scanner, but that code is not ready. Anti-defrag on its own, as it
is today, was able to get 55% of physical memory in 4MiB chunks.
This is provided without performance regressions in the normal case
everyone cares about. In my tests, there are minor improvements on aim9,
which is artificial, and a few seconds gained on kernel build tests, which
people do care about.
Do these patches still make no sense to you? Lower fragmentation that
does not impact the cases everyone cares about - if so, why?
To get the best possible results, a zone approach could still be built on
top of this, and it seems worth developing. At the cost of some
configuration, the zone would give *hard* guarantees on the available
number of large pages, and anti-defrag would give best effort everywhere
else. By default, without configuration, you would get best effort.
--
Mel Gorman
Part-time Phd Student Java Applications Developer
University of Limerick IBM Dublin Software Lab
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 253+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 15:40 ` Arjan van de Ven
2005-11-03 15:51 ` Linus Torvalds
@ 2005-11-03 15:53 ` Martin J. Bligh
1 sibling, 0 replies; 253+ messages in thread
From: Martin J. Bligh @ 2005-11-03 15:53 UTC (permalink / raw)
To: Arjan van de Ven
Cc: Nick Piggin, Dave Hansen, Ingo Molnar, Mel Gorman, Andrew Morton,
Linus Torvalds, kravetz, linux-mm, Linux Kernel Mailing List,
lhms, Arjan van de Ven
--Arjan van de Ven <arjan@infradead.org> wrote (on Thursday, November 03, 2005 16:40:21 +0100):
> On Thu, 2005-11-03 at 07:36 -0800, Martin J. Bligh wrote:
>> >> Can we quit coming up with specialist hacks for hotplug, and try to solve
>> >> the generic problem please? hotplug is NOT the only issue here. Fragmentation
>> >> in general is.
>> >>
>> >
>> > Not really it isn't. There have been a few cases (e1000 being the main
>> > one, and is fixed upstream) where fragmentation in general is a problem.
>> > But mostly it is not.
>>
>> Sigh. OK, tell me how you're going to fix kernel stacks > 4K please.
>
> with CONFIG_4KSTACKS :)
I've been told previously that doesn't work for x86_64, and other 64 bit
platforms. Is that incorrect?
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 8:49 ` Ingo Molnar
2005-11-02 9:02 ` Nick Piggin
@ 2005-11-02 14:57 ` Martin J. Bligh
1 sibling, 0 replies; 253+ messages in thread
From: Martin J. Bligh @ 2005-11-02 14:57 UTC (permalink / raw)
To: Ingo Molnar, Dave Hansen
Cc: Mel Gorman, Nick Piggin, Andrew Morton, Linus Torvalds, kravetz,
linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven
> reliable hot-remove of generic kernel RAM is plain impossible even in a
> fully virtualized solution. It's impossible even with maximum hardware
> help. We simply dont have the means to fix up live kernel pointers still
> linked into the removed region, under the C programming model.
>
> the hurdles towards a reliable solution are so incredibly high, that
> other solutions _have_ to be considered: restrict the type of RAM that
> can be removed, and put it into a separate zone. That solves things
> easily: no kernel pointers will be allowed in those zones. It becomes
> similar to highmem: various kernel caches can opt-in to be included in
> that type of RAM, and the complexity (and maintainance impact) of the
> approach can thus be nicely scaled.
Forget about freeing up arbitrary regions of RAM. I don't think anyone has
strong enough drugs to seriously believe that works in the generic case.
What we need is to free a small region of RAM, that's contiguous. Larger
than one page, likely a few MB. Exactly WHICH region is pretty irrelevant,
since mostly we just want one contiguous chunk to use.
Hypervisors do not remap at page-level granularity - that seems to be
the source of some of the confusion here.
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 15:01 ` Ingo Molnar
2005-11-01 15:22 ` Dave Hansen
@ 2005-11-01 16:48 ` Kamezawa Hiroyuki
2005-11-01 16:59 ` Kamezawa Hiroyuki
` (3 more replies)
1 sibling, 4 replies; 253+ messages in thread
From: Kamezawa Hiroyuki @ 2005-11-01 16:48 UTC (permalink / raw)
To: Ingo Molnar
Cc: Dave Hansen, Mel Gorman, Nick Piggin, Martin J. Bligh,
Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List,
lhms
Ingo Molnar wrote:
> so it's all about expectations: _could_ you reasonably remove a piece of
> RAM? Customer will say: "I have stopped all nonessential services, and
> free RAM is at 90%, still I cannot remove that piece of faulty RAM, fix
> the kernel!". No reasonable customer will say: "True, I have all RAM
> used up in mlock()ed sections, but i want to remove some RAM
> nevertheless".
>
Hi, I'm one of the people on -lhms.
In my understanding...
- The memory hot-remove on IBM's LPAR approach is
[remove some amount of memory from somewhere.]
For this approach, Mel's patch will work well.
But this will not guarantee that a user can remove a specified range of
memory at any time, because how a memory range is used is defined not by an
admin but by the kernel automatically. But to extract some amount of memory,
Mel's patch is very important and they need this.
My own target is NUMA node hotplug. What NUMA node hotplug wants is
- [remove the range of memory]. For this approach, the admin should define
*core* nodes and removable nodes. Memory on a removable node is removable.
Dividing areas into removable and not-removable is needed, because
we cannot allocate any kernel objects in a removable area.
A removable area should be 100% removable. Customers can know the limitation before using it.
What I'm considering now is this:
- a removable area is a hot-added area
- a not-removable area is memory which is visible to the kernel at boot time.
(I'd like to achieve this with the limitation that a hot-added node goes only into ZONE_HIGHMEM.)
A customer can hot-add their extra memory after boot. This is very easy to understand.
The performance problem is a trade-off. (I'm afraid of this ;)
If a customer wants to guarantee that some memory areas are hot-removable,
they will hot-add them.
I don't think adding memory for the kernel by hot-add is something a customer wants.
-- Kame
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 16:48 ` Kamezawa Hiroyuki
@ 2005-11-01 16:59 ` Kamezawa Hiroyuki
2005-11-01 17:19 ` Mel Gorman
` (2 subsequent siblings)
3 siblings, 0 replies; 253+ messages in thread
From: Kamezawa Hiroyuki @ 2005-11-01 16:59 UTC (permalink / raw)
To: Kamezawa Hiroyuki
Cc: Ingo Molnar, Dave Hansen, Mel Gorman, Nick Piggin,
Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms
Kamezawa Hiroyuki wrote:
> Ingo Molnar wrote:
>
>> so it's all about expectations: _could_ you reasonably remove a piece
>> of RAM? Customer will say: "I have stopped all nonessential services,
>> and free RAM is at 90%, still I cannot remove that piece of faulty
>> RAM, fix the kernel!". No reasonable customer will say: "True, I have
>> all RAM used up in mlock()ed sections, but i want to remove some RAM
>> nevertheless".
>>
> Hi, I'm one of men in -lhms
>
> In my understanding...
> - Memory Hotremove on IBM's LPAR? approach is
> [remove some amount of memory from somewhere.]
> For this approach, Mel's patch will work well.
> But this will not guaranntee a user can remove specified range of
> memory at any time because how memory range is used is not defined by
> an admin
> but by the kernel automatically. But to extract some amount of memory,
> Mel's patch is very important and they need this.
>
One more consideration...
CPUs which support virtualization will be shipped by some vendors in the near future.
If someone runs a virtualized OS, the only problem is *resizing*.
A hypervisor will be able to remap semi-physical pages anywhere with hardware
assistance, but system resizing needs operating system assistance.
In this direction, [remove some amount of memory from somewhere.] is an important approach.
-- Kame
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 16:48 ` Kamezawa Hiroyuki
2005-11-01 16:59 ` Kamezawa Hiroyuki
@ 2005-11-01 17:19 ` Mel Gorman
2005-11-02 0:32 ` KAMEZAWA Hiroyuki
2005-11-01 18:06 ` linux-os (Dick Johnson)
2005-11-02 7:19 ` Ingo Molnar
3 siblings, 1 reply; 253+ messages in thread
From: Mel Gorman @ 2005-11-01 17:19 UTC (permalink / raw)
To: Kamezawa Hiroyuki
Cc: Ingo Molnar, Dave Hansen, Nick Piggin, Martin J. Bligh,
Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List,
lhms
On Wed, 2 Nov 2005, Kamezawa Hiroyuki wrote:
> Ingo Molnar wrote:
> > so it's all about expectations: _could_ you reasonably remove a piece of
> > RAM? Customer will say: "I have stopped all nonessential services, and free
> > RAM is at 90%, still I cannot remove that piece of faulty RAM, fix the
> > kernel!". No reasonable customer will say: "True, I have all RAM used up in
> > mlock()ed sections, but i want to remove some RAM nevertheless".
> >
> Hi, I'm one of men in -lhms
>
> In my understanding...
> - Memory Hotremove on IBM's LPAR? approach is
> [remove some amount of memory from somewhere.]
> For this approach, Mel's patch will work well.
> But this will not guaranntee a user can remove specified range of
> memory at any time because how memory range is used is not defined by an
> admin
> but by the kernel automatically. But to extract some amount of memory,
> Mel's patch is very important and they need this.
>
> My own target is NUMA node hotplug, what NUMA node hotplug want is
> - [remove the range of memory] For this approach, admin should define
> *core* node and removable node. Memory on removable node is removable.
> Dividing area into removable and not-removable is needed, because
> we cannot allocate any kernel's object on removable area.
> Removable area should be 100% removable. Customer can know the limitation
> before using.
>
In this case, we would want some mechanism that says "don't put awkward
pages in this NUMA node" in a clear way. One way we could do this is:
1. Move fallback_allocs to be per-node. fallback_allocs is currently
defined as
int fallback_allocs[RCLM_TYPES-1][RCLM_TYPES+1] = {
{RCLM_NORCLM, RCLM_FALLBACK, RCLM_KERN, RCLM_EASY, RCLM_TYPES},
{RCLM_EASY, RCLM_FALLBACK, RCLM_NORCLM, RCLM_KERN, RCLM_TYPES},
{RCLM_KERN, RCLM_FALLBACK, RCLM_NORCLM, RCLM_EASY, RCLM_TYPES}
};
The effect is that a RCLM_NORCLM allocation falls back to
RCLM_FALLBACK, RCLM_KERN, RCLM_EASY and then gives up.
2. Architectures would need to provide a function that allocates and
populates a fallback_allocs[][] array. If they do not provide one, a
generic function uses an array like the one above
3. When adding a node that must be removable, make the array look like
this
int fallback_allocs[RCLM_TYPES-1][RCLM_TYPES+1] = {
{RCLM_NORCLM, RCLM_TYPES, RCLM_TYPES, RCLM_TYPES, RCLM_TYPES},
{RCLM_EASY, RCLM_FALLBACK, RCLM_NORCLM, RCLM_KERN, RCLM_TYPES},
{RCLM_KERN, RCLM_TYPES, RCLM_TYPES, RCLM_TYPES, RCLM_TYPES},
};
The effect of this is that only allocations that are easily reclaimable
will end up in this node. This would be a straightforward addition to build
upon this set of patches. The difference would only be visible to
architectures that cared.
> What I'm considering now is this:
> - removable area is hot-added area
> - not-removable area is memory which is visible to kernel at boot time.
> (I'd like to achieve this by the limitation : hot-added node goes into only
> ZONE_HIGHMEM)
ZONE_HIGHMEM can still end up with PTE pages if allocating PTE pages from
highmem is configured. This is bad. With the above approach, nodes that
are not hot-added and have a ZONE_HIGHMEM will be able to use it for PTEs
as well. But when a node is hot-added, it will have a ZONE_HIGHMEM that is
not used for PTE allocations, because they are not RCLM_EASY allocations.
--
Mel Gorman
Part-time Phd Student Java Applications Developer
University of Limerick IBM Dublin Software Lab
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 17:19 ` Mel Gorman
@ 2005-11-02 0:32 ` KAMEZAWA Hiroyuki
2005-11-02 11:22 ` Mel Gorman
0 siblings, 1 reply; 253+ messages in thread
From: KAMEZAWA Hiroyuki @ 2005-11-02 0:32 UTC (permalink / raw)
To: Mel Gorman
Cc: Ingo Molnar, Dave Hansen, Nick Piggin, Martin J. Bligh,
Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List,
lhms
Mel Gorman wrote:
> 3. When adding a node that must be removable, make the array look like
> this
>
> int fallback_allocs[RCLM_TYPES-1][RCLM_TYPES+1] = {
> {RCLM_NORCLM, RCLM_TYPES, RCLM_TYPES, RCLM_TYPES, RCLM_TYPES},
> {RCLM_EASY, RCLM_FALLBACK, RCLM_NORCLM, RCLM_KERN, RCLM_TYPES},
> {RCLM_KERN, RCLM_TYPES, RCLM_TYPES, RCLM_TYPES, RCLM_TYPES},
> };
>
> The effect of this is only allocations that are easily reclaimable will
> end up in this node. This would be a straight-forward addition to build
> upon this set of patches. The difference would only be visible to
> architectures that cared.
>
Thank you for the illustration.
Maybe a fallback list per pgdat/zone is what I need with your patch, right?
-- Kame
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 0:32 ` KAMEZAWA Hiroyuki
@ 2005-11-02 11:22 ` Mel Gorman
0 siblings, 0 replies; 253+ messages in thread
From: Mel Gorman @ 2005-11-02 11:22 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Ingo Molnar, Dave Hansen, Nick Piggin, Martin J. Bligh,
Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List,
lhms
On Wed, 2 Nov 2005, KAMEZAWA Hiroyuki wrote:
> Mel Gorman wrote:
> > 3. When adding a node that must be removable, make the array look like
> > this
> >
> > int fallback_allocs[RCLM_TYPES-1][RCLM_TYPES+1] = {
> > {RCLM_NORCLM, RCLM_TYPES, RCLM_TYPES, RCLM_TYPES, RCLM_TYPES},
> > {RCLM_EASY, RCLM_FALLBACK, RCLM_NORCLM, RCLM_KERN, RCLM_TYPES},
> > {RCLM_KERN, RCLM_TYPES, RCLM_TYPES, RCLM_TYPES, RCLM_TYPES},
> > };
> >
> > The effect of this is only allocations that are easily reclaimable will
> > end up in this node. This would be a straight-forward addition to build
> > upon this set of patches. The difference would only be visible to
> > architectures that cared.
> >
> Thank you for illustration.
> maybe fallback_list per pgdat/zone is what I need with your patch. right ?
>
With my patch, yes. With zones, you need to change how zonelists are built
for each node.
--
Mel Gorman
Part-time Phd Student Java Applications Developer
University of Limerick IBM Dublin Software Lab
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 16:48 ` Kamezawa Hiroyuki
2005-11-01 16:59 ` Kamezawa Hiroyuki
2005-11-01 17:19 ` Mel Gorman
@ 2005-11-01 18:06 ` linux-os (Dick Johnson)
2005-11-02 7:19 ` Ingo Molnar
3 siblings, 0 replies; 253+ messages in thread
From: linux-os (Dick Johnson) @ 2005-11-01 18:06 UTC (permalink / raw)
To: Kamezawa Hiroyuki
Cc: Ingo Molnar, Dave Hansen, Mel Gorman, Nick Piggin,
Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms
On Tue, 1 Nov 2005, Kamezawa Hiroyuki wrote:
> Ingo Molnar wrote:
>> so it's all about expectations: _could_ you reasonably remove a piece of
>> RAM? Customer will say: "I have stopped all nonessential services, and
>> free RAM is at 90%, still I cannot remove that piece of faulty RAM, fix
>> the kernel!". No reasonable customer will say: "True, I have all RAM
>> used up in mlock()ed sections, but i want to remove some RAM
>> nevertheless".
>>
> Hi, I'm one of men in -lhms
>
> In my understanding...
> - Memory Hotremove on IBM's LPAR? approach is
> [remove some amount of memory from somewhere.]
> For this approach, Mel's patch will work well.
> But this will not guaranntee a user can remove specified range of
> memory at any time because how memory range is used is not defined by an admin
> but by the kernel automatically. But to extract some amount of memory,
> Mel's patch is very important and they need this.
>
> My own target is NUMA node hotplug, what NUMA node hotplug want is
> - [remove the range of memory] For this approach, admin should define
> *core* node and removable node. Memory on removable node is removable.
> Dividing area into removable and not-removable is needed, because
> we cannot allocate any kernel's object on removable area.
> Removable area should be 100% removable. Customer can know the limitation before using.
>
> What I'm considering now is this:
> - removable area is hot-added area
> - not-removable area is memory which is visible to kernel at boot time.
> (I'd like to achieve this by the limitation : hot-added node goes into only ZONE_HIGHMEM)
> A customer can hot add their extra memory after boot. This is very easy to understand.
> Peformance problem is trade-off.(I'm afraid of this ;)
>
> If a cutomer wants to guarantee some memory areas should be hot-removable,
> he will hot-add them.
> I don't think adding memory for the kernel by hot-add is wanted by a customer.
>
> -- Kame
With ix86 machines, the page directory pointed to by CR3 needs
to always be present in physical memory. This means that there
must always be some RAM that can't be hot-swapped (you can't
put back the contents of the page-directory without using
the CPU which needs the page directory).
This is explained on page 5-21 of the i486 reference manual.
This happens because there is no "present" bit in CR3 as there
are in the page tables themselves.
This problem means that "surprise" swaps are impossible. However,
given forewarning, it is possible to build a new table somewhere
in existing RAM within the physical constraints required, call
some code there (which needs a 1:1 translation), disable paging,
then proceed. The problem is that of writing the contents of the
RAM to be replaced out to storage media, so the new page table
needs to be loaded from the new location. This may not work
if the LDT and the GDT are not accessible from their current
locations. If they are in the RAM to be replaced, you are
in a world of hurt, taking the "world" apart and putting it
back together again.
Cheers,
Dick Johnson
Penguin : Linux version 2.6.13.4 on an i686 machine (5589.55 BogoMips).
Warning : 98.36% of all statistics are fiction.
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 16:48 ` Kamezawa Hiroyuki
` (2 preceding siblings ...)
2005-11-01 18:06 ` linux-os (Dick Johnson)
@ 2005-11-02 7:19 ` Ingo Molnar
2005-11-02 7:46 ` Gerrit Huizenga
2005-11-02 7:57 ` Nick Piggin
3 siblings, 2 replies; 253+ messages in thread
From: Ingo Molnar @ 2005-11-02 7:19 UTC (permalink / raw)
To: Kamezawa Hiroyuki
Cc: Dave Hansen, Mel Gorman, Nick Piggin, Martin J. Bligh,
Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List,
lhms
* Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> My own target is NUMA node hotplug, what NUMA node hotplug want is
> - [remove the range of memory] For this approach, admin should define
> *core* node and removable node. Memory on removable node is removable.
> Dividing area into removable and not-removable is needed, because
> we cannot allocate any kernel's object on removable area.
> Removable area should be 100% removable. Customer can know the limitation
> before using.
that's a perfectly fine method, and is quite similar to the 'separate
zone' approach Nick mentioned too. It is also easily understandable for
users/customers.
under such an approach, things become easier as well: if you have zones
you can restrict (no kernel pinned-down allocations, no mlock-ed
pages, etc.), there's no need for any 'fragmentation avoidance' patches!
Basically all of that RAM becomes instantly removable (with some small
complications). That's the beauty of the separate-zones approach. It is
also a limitation: no kernel allocations, so all the highmem-alike
restrictions apply to it too.
but what is a dangerous fallacy is that we will be able to support hot
memory unplug of generic kernel RAM in any reliable way!
you really have to look at this from the conceptual angle: 'can an
approach ever lead to a satisfactory result'? If the answer is 'no',
then we _must not_ add a 90% solution that we _know_ will never be a
100% solution.
for the separate-removable-zones approach we see the end of the tunnel.
Separate zones are well-understood.
generic unpluggable kernel RAM _will not work_.
Ingo
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 7:19 ` Ingo Molnar
@ 2005-11-02 7:46 ` Gerrit Huizenga
2005-11-02 8:50 ` Nick Piggin
2005-11-02 10:41 ` [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 Ingo Molnar
2005-11-02 7:57 ` Nick Piggin
1 sibling, 2 replies; 253+ messages in thread
From: Gerrit Huizenga @ 2005-11-02 7:46 UTC (permalink / raw)
To: Ingo Molnar
Cc: Kamezawa Hiroyuki, Dave Hansen, Mel Gorman, Nick Piggin,
Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms
On Wed, 02 Nov 2005 08:19:43 +0100, Ingo Molnar wrote:
>
> * Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> > My own target is NUMA node hotplug, what NUMA node hotplug want is
> > - [remove the range of memory] For this approach, admin should define
> > *core* node and removable node. Memory on removable node is removable.
> > Dividing area into removable and not-removable is needed, because
> > we cannot allocate any kernel's object on removable area.
> > Removable area should be 100% removable. Customer can know the limitation
> > before using.
>
> that's a perfectly fine method, and is quite similar to the 'separate
> zone' approach Nick mentioned too. It is also easily understandable for
> users/customers.
>
> under such an approach, things become easier as well: if you have zones
> you can to restrict (no kernel pinned-down allocations, no mlock-ed
> pages, etc.), there's no need for any 'fragmentation avoidance' patches!
> Basically all of that RAM becomes instantly removable (with some small
> complications). That's the beauty of the separate-zones approach. It is
> also a limitation: no kernel allocations, so all the highmem-alike
> restrictions apply to it too.
>
> but what is a dangerous fallacy is that we will be able to support hot
> memory unplug of generic kernel RAM in any reliable way!
>
> you really have to look at this from the conceptual angle: 'can an
> approach ever lead to a satisfactory result'? If the answer is 'no',
> then we _must not_ add a 90% solution that we _know_ will never be a
> 100% solution.
>
> for the separate-removable-zones approach we see the end of the tunnel.
> Separate zones are well-understood.
>
> generic unpluggable kernel RAM _will not work_.
Actually, it will. Well, depending on terminology.
There are two usage models here - those which intend to remove physical
elements, and those where the kernel returns management of its virtualized
"physical" memory to a hypervisor. In the latter case, a hypervisor
already maintains a virtual map of the memory and the OS needs to release
virtualized "physical" memory. I think you are referring to RAM here as
the physical component; however, these same defrag patches help where a
hypervisor is maintaining the real physical memory below the operating
system and the OS is managing a virtualized "physical" memory.
On pSeries hardware or with Xen, a client OS can return chunks of memory
to the hypervisor. That memory needs to be returned in chunks of the
size that the hypervisor normally manages/maintains. But long ranges
of physical contiguity are not required. Just shorter ranges, depending
on what the hypervisor maintains, need to be returned from the OS to
the hypervisor.
In other words, if we can return 1 MB chunks, the hypervisor can hand
out those 1 MB chunks to other domains/partitions. So, if we can return
500 1 MB chunks from a 2 GB OS instance, we can add 500 MB dynamically
to another OS image.
This happens to be a *very* satisfactory answer for virtualized environments.
The other answer, which is harder, is to return (free) entire large physical
chunks, e.g. the size of the full memory of a node, allowing a node to be
dynamically removed (or a DIMM/SIMM/etc.).
So, people are working towards two distinct solutions, both of which
require us to do a better job of defragmenting memory (or avoiding
fragmentation in the first place).
gerrit
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 7:46 ` Gerrit Huizenga
@ 2005-11-02 8:50 ` Nick Piggin
2005-11-02 9:12 ` Gerrit Huizenga
2005-11-02 10:41 ` [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 Ingo Molnar
1 sibling, 1 reply; 253+ messages in thread
From: Nick Piggin @ 2005-11-02 8:50 UTC (permalink / raw)
To: Gerrit Huizenga
Cc: Ingo Molnar, Kamezawa Hiroyuki, Dave Hansen, Mel Gorman,
Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms
Gerrit Huizenga wrote:
> So, people are working towards two distinct solutions, both of which
> require us to do a better job of defragmenting memory (or avoiding
> fragementation in the first place).
>
This is just going around in circles. Even with your fragmentation
avoidance and memory defragmentation, there are still going to be
cases where memory does get fragmented and can't be defragmented.
This is Ingo's point, I believe.
Isn't the solution for your hypervisor problem to dish out pages of
the same size that are used by the virtual machines? Doesn't this
provide you with a nice, 100% solution that doesn't add complexity
where it isn't needed?
--
SUSE Labs, Novell Inc.
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 8:50 ` Nick Piggin
@ 2005-11-02 9:12 ` Gerrit Huizenga
2005-11-02 9:37 ` Nick Piggin
0 siblings, 1 reply; 253+ messages in thread
From: Gerrit Huizenga @ 2005-11-02 9:12 UTC (permalink / raw)
To: Nick Piggin
Cc: Ingo Molnar, Kamezawa Hiroyuki, Dave Hansen, Mel Gorman,
Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms
On Wed, 02 Nov 2005 19:50:15 +1100, Nick Piggin wrote:
> Gerrit Huizenga wrote:
>
> > So, people are working towards two distinct solutions, both of which
> > require us to do a better job of defragmenting memory (or avoiding
> > fragementation in the first place).
> >
>
> This is just going around in circles. Even with your fragmentation
> avoidance and memory defragmentation, there are still going to be
> cases where memory does get fragmented and can't be defragmented.
> This is Ingo's point, I believe.
>
> Isn't the solution for your hypervisor problem to dish out pages of
> the same size that are used by the virtual machines. Doesn't this
> provide you with a nice, 100% solution that doesn't add complexity
> where it isn't needed?
So do you see the problem with fragmentation if the hypervisor is
handing out, say, 1 MB pages? Or, more likely, something like 64 MB
pages? What are the chances that an entire 64 MB page can be freed
on a large system that has been up a while?
And, if you create zones, you run into all of the zone rebalancing
problems of ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM. In that case, on
any long running system, ZONE_HOTPLUGGABLE has been overwhelmed with
random allocations, making almost none of it available.
However, with reasonable defragmentation or fragmentation avoidance,
we have some potential to make large chunks available for return to
the hypervisor. And, that same capability continues to help those
who want to remove fixed ranges of physical memory.
gerrit
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 9:12 ` Gerrit Huizenga
@ 2005-11-02 9:37 ` Nick Piggin
2005-11-02 10:17 ` Gerrit Huizenga
2005-11-02 23:47 ` Rob Landley
0 siblings, 2 replies; 253+ messages in thread
From: Nick Piggin @ 2005-11-02 9:37 UTC (permalink / raw)
To: Gerrit Huizenga
Cc: Ingo Molnar, Kamezawa Hiroyuki, Dave Hansen, Mel Gorman,
Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms
Gerrit Huizenga wrote:
> On Wed, 02 Nov 2005 19:50:15 +1100, Nick Piggin wrote:
>>Isn't the solution for your hypervisor problem to dish out pages of
>>the same size that are used by the virtual machines. Doesn't this
>>provide you with a nice, 100% solution that doesn't add complexity
>>where it isn't needed?
>
>
> So do you see the problem with fragmentation if the hypervisor is
> handing out, say, 1 MB pages? Or, more likely, something like 64 MB
> pages? What are the chances that an entire 64 MB page can be freed
> on a large system that has been up a while?
>
I see the problem, but if you want to be able to shrink memory to a
given size, then you must either introduce a hard limit somewhere, or
have the hypervisor hand out guest sized pages. Use zones, or Xen?
> And, if you create zones, you run into all of the zone rebalancing
> problems of ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM. In that case, on
> any long running system, ZONE_HOTPLUGGABLE has been overwhelmed with
> random allocations, making almost none of it available.
>
If there are zone rebalancing problems[*], then it would be great to
have more users of zones because then they will be more likely to get
fixed.
[*] and there are, sadly enough - see the recent patches I posted to
lkml for example. But I'm fairly confident that once the particularly
silly ones have been fixed, zone balancing will no longer be a
derogatory term as has been thrown around (maybe rightly) in this
thread!
--
SUSE Labs, Novell Inc.
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 9:37 ` Nick Piggin
@ 2005-11-02 10:17 ` Gerrit Huizenga
2005-11-02 23:47 ` Rob Landley
1 sibling, 0 replies; 253+ messages in thread
From: Gerrit Huizenga @ 2005-11-02 10:17 UTC (permalink / raw)
To: Nick Piggin
Cc: Ingo Molnar, Kamezawa Hiroyuki, Dave Hansen, Mel Gorman,
Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms
On Wed, 02 Nov 2005 20:37:43 +1100, Nick Piggin wrote:
> Gerrit Huizenga wrote:
> > On Wed, 02 Nov 2005 19:50:15 +1100, Nick Piggin wrote:
>
> >>Isn't the solution for your hypervisor problem to dish out pages of
> >>the same size that are used by the virtual machines. Doesn't this
> >>provide you with a nice, 100% solution that doesn't add complexity
> >>where it isn't needed?
> >
> >
> > So do you see the problem with fragmentation if the hypervisor is
> > handing out, say, 1 MB pages? Or, more likely, something like 64 MB
> > pages? What are the chances that an entire 64 MB page can be freed
> > on a large system that has been up a while?
>
> I see the problem, but if you want to be able to shrink memory to a
> given size, then you must either introduce a hard limit somewhere, or
> have the hypervisor hand out guest sized pages. Use zones, or Xen?
So why do you believe there must be a hard limit?
Any reduction in memory usage is going to be workload related.
If the workload is consuming less memory than is available, memory
reclaim is easy (e.g. handle fragmentation, find nice sized chunks).
The workload determines how much the administrator can free. If
the workload is using all of the resources available (e.g. lots of
associated kernel memory locked down, locked user pages, etc.)
then the administrator will logically be able to remove less memory
from the machine.
The amount of memory to be freed up is not determined by some pre-defined
machine constraints but based on the actual workload's use of the machine.
In other words, who really cares if there is some hard limit? The
only limit should be the number of pages not currently needed by
a given workload, not some arbitrary zone size.
> > And, if you create zones, you run into all of the zone rebalancing
> > problems of ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM. In that case, on
> > any long running system, ZONE_HOTPLUGGABLE has been overwhelmed with
> > random allocations, making almost none of it available.
>
> If there are zone rebalancing problems[*], then it would be great to
> have more users of zones because then they will be more likely to get
> fixed.
>
> [*] and there are, sadly enough - see the recent patches I posted to
> lkml for example. But I'm fairly confident that once the particularly
> silly ones have been fixed, zone balancing will no longer be a
> derogatory term as has been thrown around (maybe rightly) in this
> thread!
You are more optimistic here than I. You might have improved the
problem but I think that any zone rebalancing problem is intrinsically
hard given the way those zones are used and the fact that we sort
of want them to be dynamic and yet physically contiguous. Those two
core constraints seem to be relatively at odds with each other.
I'm not a huge fan of dividing memory up into different types which
are all special purposed. Everything that becomes special purposed
over time limits its use and brings up questions on what special purpose
bucket each allocation should use (e.g. ZONE_NORMAL or ZONE_HIGHMEM
or ZONE_DMA or ZONE_HOTPLUGGABLE). And then, when you run out of
ZONE_HIGHMEM and have to reach into ZONE_HOTPLUGGABLE for some pinned
memory allocation, it seems the whole concept leads to a messy
train wreck.
gerrit
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 9:37 ` Nick Piggin
2005-11-02 10:17 ` Gerrit Huizenga
@ 2005-11-02 23:47 ` Rob Landley
2005-11-03 4:43 ` Nick Piggin
1 sibling, 1 reply; 253+ messages in thread
From: Rob Landley @ 2005-11-02 23:47 UTC (permalink / raw)
To: Nick Piggin
Cc: Gerrit Huizenga, Ingo Molnar, Kamezawa Hiroyuki, Dave Hansen,
Mel Gorman, Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms
On Wednesday 02 November 2005 03:37, Nick Piggin wrote:
> > So do you see the problem with fragmentation if the hypervisor is
> > handing out, say, 1 MB pages? Or, more likely, something like 64 MB
> > pages? What are the chances that an entire 64 MB page can be freed
> > on a large system that has been up a while?
>
> I see the problem, but if you want to be able to shrink memory to a
> given size, then you must either introduce a hard limit somewhere, or
> have the hypervisor hand out guest sized pages. Use zones, or Xen?
In the UML case, I want the system to automatically be able to hand back any
sufficiently large chunks of memory it currently isn't using.
What does this have to do with specifying hard limits of anything? What's to
specify? Workloads vary. Deal with it.
> If there are zone rebalancing problems[*], then it would be great to
> have more users of zones because then they will be more likely to get
> fixed.
Ok, so you want to artificially turn this into a zone balancing issue in hopes
of giving that area of the code more testing when, if zones weren't involved,
there would be no need for balancing at all?
How does that make sense?
> [*] and there are, sadly enough - see the recent patches I posted to
> lkml for example.
I was under the impression that zone balancing is, conceptually speaking, a
difficult problem.
> But I'm fairly confident that once the particularly
> silly ones have been fixed,
Great, you're advocating migrating the fragmentation patches to an area of
code that has known problems you yourself describe as "particularly silly".
A ringing endorsement, that.
The fact that the migrated version wouldn't even address fragmentation
avoidance at all (the topic of this thread!) is apparently a side issue.
> zone balancing will no longer be a
> derogatory term as has been thrown around (maybe rightly) in this
> thread!
If I'm not mistaken, you introduced zones into this thread, you are the
primary (possibly only) proponent of them. Yes, zones are a way of
categorizing memory. They're not a way of defragmenting it.
Rob
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 23:47 ` Rob Landley
@ 2005-11-03 4:43 ` Nick Piggin
2005-11-03 6:07 ` Rob Landley
0 siblings, 1 reply; 253+ messages in thread
From: Nick Piggin @ 2005-11-03 4:43 UTC (permalink / raw)
To: Rob Landley
Cc: Gerrit Huizenga, Ingo Molnar, Kamezawa Hiroyuki, Dave Hansen,
Mel Gorman, Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms
Rob Landley wrote:
> In the UML case, I want the system to automatically be able to hand back any
> sufficiently large chunks of memory it currently isn't using.
>
I'd just be happy with UML handing back page sized chunks of memory that
it isn't currently using. How does contiguous memory (in either the host
or the guest) help this?
> What does this have to do with specifying hard limits of anything? What's to
> specify? Workloads vary. Deal with it.
>
Umm, if you hadn't bothered to read the thread then I won't go through
it all again. The short of it is that if you want guaranteed unfragmented
memory you have to specify a limit.
>
>>If there are zone rebalancing problems[*], then it would be great to
>>have more users of zones because then they will be more likely to get
>>fixed.
>
>
> Ok, so you want to artificially turn this into a zone balancing issue in hopes
> of giving that area of the code more testing when, if zones weren't involved,
> there would be no need for balancing at all?
>
> How does that make sense?
>
Have you looked at the frag patches? Do you realise that they have to
balance between the different types of memory blocks? Duplicating the
same or similar infrastructure (in this case, a memory zoning facility)
is a bad thing in general.
>
>>[*] and there are, sadly enough - see the recent patches I posted to
>> lkml for example.
>
>
> I was under the impression that zone balancing is, conceptually speaking, a
> difficult problem.
>
I am under the impression that you think proper fragmentation avoidance
is easier.
>
>> But I'm fairly confident that once the particularly
>> silly ones have been fixed,
>
>
> Great, you're advocating migrating the fragmentation patches to an area of
> code that has known problems you yourself describe as "particularly silly".
> A ringing endorsement, that.
>
Err, the point is so we don't now have 2 layers doing very similar things,
at least one of which has "particularly silly" bugs in it.
> The fact that the migrated version wouldn't even address fragmentation
> avoidance at all (the topic of this thread!) is apparently a side issue.
>
Zones can be used to guarantee physically contiguous regions with exactly
the same effectiveness as the frag patches.
>
>> zone balancing will no longer be a
>> derogatory term as has been thrown around (maybe rightly) in this
>> thread!
>
>
> If I'm not mistaken, you introduced zones into this thread, you are the
> primary (possibly only) proponent of them.
So you didn't look at Yasunori Goto's patch from last year that implements
exactly what I described, then?
> Yes, zones are a way of categorizing memory.
Yes, have you read Mel's patches? Guess what they do?
> They're not a way of defragmenting it.
Guess what they don't?
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 4:43 ` Nick Piggin
@ 2005-11-03 6:07 ` Rob Landley
2005-11-03 7:34 ` Nick Piggin
2005-11-03 16:35 ` Jeff Dike
0 siblings, 2 replies; 253+ messages in thread
From: Rob Landley @ 2005-11-03 6:07 UTC (permalink / raw)
To: Nick Piggin
Cc: Gerrit Huizenga, Ingo Molnar, Kamezawa Hiroyuki, Dave Hansen,
Mel Gorman, Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms
On Wednesday 02 November 2005 22:43, Nick Piggin wrote:
> Rob Landley wrote:
> > In the UML case, I want the system to automatically be able to hand back
> > any sufficiently large chunks of memory it currently isn't using.
>
> I'd just be happy with UML handing back page sized chunks of memory that
> it isn't currently using. How does contiguous memory (in either the host
> or the guest) help this?
Smaller chunks of memory are likely to be reclaimed really soon, and adding in
the syscall overhead working with individual pages of memory is almost
guaranteed to slow us down. Plus with punch, we'd be fragmenting the heck
out of the underlying file.
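Rob's batching argument can be made concrete with a little counting: releasing guest memory page by page costs one host call per free page, while coalescing contiguous free pages into runs first costs one call per run. The functions below are only an illustration of that counting argument, not code from UML or from the patches under discussion.

```c
/* Count how many host calls are needed to release the pages marked
 * free in free_map: one per page when unbatched, versus one per
 * contiguous run when batched.  Illustrative sketch only. */
#include <assert.h>
#include <stddef.h>

static size_t calls_unbatched(const int *free_map, size_t npages)
{
	size_t calls = 0;
	for (size_t i = 0; i < npages; i++)
		if (free_map[i])
			calls++;	/* one syscall per free page */
	return calls;
}

static size_t calls_batched(const int *free_map, size_t npages)
{
	size_t calls = 0;
	for (size_t i = 0; i < npages; i++)
		if (free_map[i] && (i == 0 || !free_map[i - 1]))
			calls++;	/* one syscall per contiguous run */
	return calls;
}
```

The bigger the contiguous runs that defragmentation (or avoidance) produces, the further the batched count drops below the unbatched one.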
> > What does this have to do with specifying hard limits of anything?
> > What's to specify? Workloads vary. Deal with it.
>
> Umm, if you hadn't bothered to read the thread then I won't go through
> it all again. The short of it is that if you want guaranteed unfragmented
> memory you have to specify a limit.
I read it. It just didn't contain an answer to the question. I want UML to
be able to hand back however much memory it's not using, but handing back
individual pages as we free them and inserting a syscall overhead for every
page freed and allocated is just nuts. (Plus, at page size, the OS isn't
likely to zero them much faster than we can ourselves even without the
syscall overhead.) Defragmentation means we can batch this into a
granularity that makes it worth it.
This has nothing to do with hard limits on anything.
> Have you looked at the frag patches?
I've read Mel's various descriptions, and tried to stay more or less up to
date ever since LWN brought it to my attention. But I can't say I'm a linux
VM system expert. (The last time I felt I had a really firm grasp on it was
before Andrea and Rik started arguing circa 2.4 and Andrea spent six months
just assuming everybody already knew what a classzone was. I've had other
things to do since then...)
> Do you realise that they have to
> balance between the different types of memory blocks?
I realise they merge them back together into larger chunks as they free up
space, and split larger chunks when they haven't got a smaller one.
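The split/merge behaviour described here is the classic buddy computation. As a sketch (not the kernel's actual implementation): a free block of order k at page-frame number pfn has exactly one buddy, found by flipping bit k, and if both are free they merge into an order-(k+1) block aligned below them.

```c
#include <assert.h>

/* The buddy of an order-k block at page-frame number pfn: flip bit k. */
static unsigned long buddy_pfn(unsigned long pfn, unsigned int order)
{
	return pfn ^ (1UL << order);
}

/* Start of the order-(order+1) block formed when a block and its
 * buddy merge: clear bit `order` to align downwards. */
static unsigned long merged_pfn(unsigned long pfn, unsigned int order)
{
	return pfn & ~(1UL << order);
}
```

So freeing the order-3 block at pfn 8 while its buddy at pfn 0 is free yields an order-4 block at pfn 0; allocation runs the same step in reverse, splitting a larger block when no smaller one is available.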
> Duplicating the
> same or similar infrastructure (in this case, a memory zoning facility)
> is a bad thing in general.
Even when they keep track of very different things? The memory zoning thing
is about where stuff is in physical memory, and it exists because various
hardware that wants to access memory (24 bit DMA, 32 bit DMA, and PAE) is
evil and crippled and we have to humor it by not asking it to do stuff it
can't.
The fragmentation stuff is about what long contiguous runs of free memory we
can arrange, and it's also nice to be able to categorize them as "zeroed" or
"not zeroed" to make new allocations faster. Where they actually are in
memory is not at issue here.
You can have prezeroed memory in 32 bit DMA space, and prezeroed memory in
highmem, but there's memory in both that isn't prezeroed. I thought there
was a hierarchy of zones. You want overlapping, interlaced, randomly laid
out zones.
> >>[*] and there are, sadly enough - see the recent patches I posted to
> >> lkml for example.
> >
> > I was under the impression that zone balancing is, conceptually speaking,
> > a difficult problem.
>
> I am under the impression that you think proper fragmentation avoidance
> is easier.
I was under the impression it was orthogonal to figuring out whether or not a
given bank of physical memory is accessible to your sound blaster without an
IOMMU.
> >> But I'm fairly confident that once the particularly
> >> silly ones have been fixed,
> >
> > Great, you're advocating migrating the fragmentation patches to an area
> > of code that has known problems you yourself describe as "particularly
> > silly". A ringing endorsement, that.
>
> Err, the point is so we don't now have 2 layers doing very similar things,
> at least one of which has "particularly silly" bugs in it.
Similar is not identical. You seem to be implying that the IO elevator and
the network stack queueing should be merged because they do similar things.
> > The fact that the migrated version wouldn't even address fragmentation
> > avoidance at all (the topic of this thread!) is apparently a side issue.
>
> Zones can be used to guaranteee physically contiguous regions with exactly
> the same effectiveness as the frag patches.
If you'd like to write a counter-patch to Mel's to prove it...
> >> zone balancing will no longer be a
> >> derogatory term as has been thrown around (maybe rightly) in this
> >> thread!
> >
> > If I'm not mistaken, you introduced zones into this thread, you are the
> > primary (possibly only) proponent of them.
>
> So you didn't look at Yasunori Goto's patch from last year that implements
> exactly what I described, then?
I saw the patch he just posted, if that's what you mean. By his own
admission, it doesn't address fragmentation at all.
> > Yes, zones are a way of categorizing memory.
>
> Yes, have you read Mel's patches? Guess what they do?
The swap file is a way of storing data on disk. So is ext3. Obviously, one
is a trivial extension of the other and there's no reason to have both.
> > They're not a way of defragmenting it.
>
> Guess what they don't?
I have no idea what you intended to mean by that. Mel posted a set of patches
in a thread titled "fragmentation avoidance", and you've been arguing about
hotplug, and pointing to a set of patches from Goto that do not address
fragmentation at all. This confuses me.
Rob
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 6:07 ` Rob Landley
@ 2005-11-03 7:34 ` Nick Piggin
2005-11-03 17:54 ` Rob Landley
2005-11-03 16:35 ` Jeff Dike
1 sibling, 1 reply; 253+ messages in thread
From: Nick Piggin @ 2005-11-03 7:34 UTC (permalink / raw)
To: Rob Landley
Cc: Gerrit Huizenga, Ingo Molnar, Kamezawa Hiroyuki, Dave Hansen,
Mel Gorman, Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms
Rob Landley wrote:
> On Wednesday 02 November 2005 22:43, Nick Piggin wrote:
>
>>I'd just be happy with UML handing back page sized chunks of memory that
>>it isn't currently using. How does contiguous memory (in either the host
>>or the guest) help this?
>
>
> Smaller chunks of memory are likely to be reclaimed really soon, and adding in
> the syscall overhead working with individual pages of memory is almost
> guaranteed to slow us down.
Because UML doesn't already make a syscall per individual page of
memory freed? (If I read correctly)
> Plus with punch, we'd be fragmenting the heck
> out of the underlying file.
>
Why? No you wouldn't.
>
>>>What does this have to do with specifying hard limits of anything?
>>>What's to specify? Workloads vary. Deal with it.
>>
>>Umm, if you hadn't bothered to read the thread then I won't go through
>>it all again. The short of it is that if you want guaranteed unfragmented
>>memory you have to specify a limit.
>
>
> I read it. It just didn't contain an answer to the question. I want UML to
> be able to hand back however much memory it's not using, but handing back
> individual pages as we free them and inserting a syscall overhead for every
> page freed and allocated is just nuts. (Plus, at page size, the OS isn't
> likely to zero them much faster than we can ourselves even without the
> syscall overhead.) Defragmentation means we can batch this into a
> granularity that makes it worth it.
>
Oh you have measured it and found out that "defragmentation" makes
it worthwhile?
> This has nothing to do with hard limits on anything.
>
You said:
"What does this have to do with specifying hard limits of
anything? What's to specify? Workloads vary. Deal with it."
And I was answering your very polite questions.
>
>>Have you looked at the frag patches?
>
>
> I've read Mel's various descriptions, and tried to stay more or less up to
> date ever since LWN brought it to my attention. But I can't say I'm a linux
> VM system expert. (The last time I felt I had a really firm grasp on it was
> before Andrea and Rik started arguing circa 2.4 and Andrea spent six months
> just assuming everybody already knew what a classzone was. I've had other
> things to do since then...)
>
Maybe you have better things to do now as well?
>>Duplicating the
>>same or similar infrastructure (in this case, a memory zoning facility)
>>is a bad thing in general.
>
>
> Even when they keep track of very different things? The memory zoning thing
> is about where stuff is in physical memory, and it exists because various
> hardware that wants to access memory (24 bit DMA, 32 bit DMA, and PAE) is
> evil and crippled and we have to humor it by not asking it to do stuff it
> can't.
>
No, the buddy allocator is and always has been what tracks the "long
contiguous runs of free memory". Both zones and Mels patches classify
blocks of memory according to some criteria. They're not exactly the
same obviously, but they're equivalent in terms of capability to
guarantee contiguous freeable regions.
>
> I was under the impression it was orthogonal to figuring out whether or not a
> given bank of physical memory is accessible to your sound blaster without an
> IOMMU.
>
Huh?
>>Err, the point is so we don't now have 2 layers doing very similar things,
>>at least one of which has "particularly silly" bugs in it.
>
>
> Similar is not identical. You seem to be implying that the IO elevator and
> the network stack queueing should be merged because they do similar things.
>
No I don't.
>
> If you'd like to write a counter-patch to Mel's to prove it...
>
It has already been written as you have been told numerous times.
Now if you'd like to actually learn about what you're commenting on,
that would be really good too.
>>So you didn't look at Yasunori Goto's patch from last year that implements
>>exactly what I described, then?
>
>
> I saw the patch he just posted, if that's what you mean. By his own
> admission, it doesn't address fragmentation at all.
>
It seems to me that it provides exactly the same (actually stronger)
guarantees as the current frag patches do. Or were you going to point
out a bug in the implementation?
>
>>>Yes, zones are a way of categorizing memory.
>>
>>Yes, have you read Mel's patches? Guess what they do?
>
>
> The swap file is a way of storing data on disk. So is ext3. Obviously, one
> is a trivial extension of the other and there's no reason to have both.
>
Don't try to bullshit your way around with stupid analogies please, it
is an utter waste of time.
>
>>>They're not a way of defragmenting it.
>>
>>Guess what they don't?
>
>
> I have no idea what you intended to mean by that. Mel posted a set of patches
What I mean is that Mel's patches aren't a way of defragmenting memory either.
They fit exactly the description you gave for zones (ie. a way of categorizing,
not defragmenting).
> in a thread titled "fragmentation avoidance", and you've been arguing about
> hotplug, and pointing to a set of patches from Goto that do not address
> fragmentation at all. This confuses me.
>
Yeah it does seem like you are confused.
Now let's finish up this subthread and try to keep the SN ratio up, please?
I'm sure Jeff or someone knowledgeable in the area can chime in if there are
concerns about UML.
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 7:34 ` Nick Piggin
@ 2005-11-03 17:54 ` Rob Landley
2005-11-03 20:13 ` Jeff Dike
0 siblings, 1 reply; 253+ messages in thread
From: Rob Landley @ 2005-11-03 17:54 UTC (permalink / raw)
To: Nick Piggin
Cc: Gerrit Huizenga, Ingo Molnar, Kamezawa Hiroyuki, Dave Hansen,
Mel Gorman, Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms
On Thursday 03 November 2005 01:34, Nick Piggin wrote:
> Rob Landley wrote:
> > On Wednesday 02 November 2005 22:43, Nick Piggin wrote:
> >>I'd just be happy with UML handing back page sized chunks of memory that
> >>it isn't currently using. How does contiguous memory (in either the host
> >>or the guest) help this?
> >
> > Smaller chunks of memory are likely to be reclaimed really soon, and
> > adding in the syscall overhead working with individual pages of memory is
> > almost guaranteed to slow us down.
>
> Because UML doesn't already make a syscall per individual page of
> memory freed? (If I read correctly)
UML does a big mmap to get "physical" memory, and then manages itself using
the normal Linux kernel mechanisms for doing so. We even have page tables,
although I'm still somewhat unclear on quite how that works.
> > Plus with punch, we'd be fragmenting the heck
> > out of the underlying file.
>
> Why? No you wouldn't.
Creating holes in the file and freeing up the underlying blocks on disk? 4k
at a time? Randomly scattered?
> > I read it. It just didn't contain an answer to the question. I want
> > UML to be able to hand back however much memory it's not using, but
> > handing back individual pages as we free them and inserting a syscall
> > overhead for every page freed and allocated is just nuts. (Plus, at page
> > size, the OS isn't likely to zero them much faster than we can ourselves
> > even without the syscall overhead.) Defragmentation means we can batch
> > this into a granularity that makes it worth it.
>
> Oh you have measured it and found out that "defragmentation" makes
> it worthwhile?
Lots of work has gone into batching up syscalls and making as few of them as
possible because they are a performance bottleneck. You want to introduce a
syscall for every single individual page of memory allocated or freed.
That's stupid.
> > This has nothing to do with hard limits on anything.
>
> You said:
>
> "What does this have to do with specifying hard limits of
> anything? What's to specify? Workloads vary. Deal with it."
>
> And I was answering your very polite questions.
You didn't answer. You keep saying you've already answered, but there
continues to be no answer. Maybe you think you've answered, but I haven't
seen it yet. You brought up hard limits, I asked what that had to do with
anything, and in response you quote my question back at me.
> >>Have you looked at the frag patches?
> >
> > I've read Mel's various descriptions, and tried to stay more or less up
> > to date ever since LWN brought it to my attention. But I can't say I'm a
> > linux VM system expert. (The last time I felt I had a really firm grasp
> > on it was before Andrea and Rik started arguing circa 2.4 and Andrea
> > spent six months just assuming everybody already knew what a classzone
> > was. I've had other things to do since then...)
>
> Maybe you have better things to do now as well?
Yeah, thanks for reminding me. I need to test Mel's newest round of
fragmentation avoidance patches in my UML build system...
> >>Duplicating the
> >>same or similar infrastructure (in this case, a memory zoning facility)
> >>is a bad thing in general.
> >
> > Even when they keep track of very different things? The memory zoning
> > thing is about where stuff is in physical memory, and it exists because
> > various hardware that wants to access memory (24 bit DMA, 32 bit DMA, and
> > PAE) is evil and crippled and we have to humor it by not asking it to do
> > stuff it can't.
>
> No, the buddy allocator is and always has been what tracks the "long
> contiguous runs of free memory".
We are still discussing fragmentation avoidance, right? (I know _I'm_ trying
to...)
> Both zones and Mels patches classify blocks of memory according to some
> criteria. They're not exactly the same obviously, but they're equivalent in
> terms of capability to guarantee contiguous freeable regions.
Back up.
I don't care _where_ the freeable regions are. I just want them coalesced.
Zones are all about _where_ the memory is.
I'm pretty sure we're arguing past each other.
> > I was under the impression it was orthogonal to figuring out whether or
> > not a given bank of physical memory is accessible to your sound blaster
> > without an IOMMU.
>
> Huh?
Fragmentation avoidance is what is orthogonal to...
> >>Err, the point is so we don't now have 2 layers doing very similar
> >> things, at least one of which has "particularly silly" bugs in it.
> >
> > Similar is not identical. You seem to be implying that the IO elevator
> > and the network stack queueing should be merged because they do similar
> > things.
>
> No I don't.
They're similar though, aren't they? Why should we have different code in
there to do both? (I know why, but that's what your argument sounds like to
me.)
> > If you'd like to write a counter-patch to Mel's to prove it...
>
> It has already been written as you have been told numerous times.
Quoting Yasunori Goto, Yesterday at 2:33 pm,
Message-Id: <20051102172729.9E7C.Y-GOTO@jp.fujitsu.com>
> Hmmm. I don't see at this point.
> Why do you think ZONE_REMOVABLE can satisfy for hugepage.
> At least, my ZONE_REMOVABLE patch doesn't have any concern about
> fragmentation.
He's NOT ADDRESSING FRAGMENTATION.
So unless you're talking about some OTHER patch, we're talking past each other
again.
> Now if you'd like to actually learn about what you're commenting on,
> that would be really good too.
The feeling is mutual.
> >>So you didn't look at Yasunori Goto's patch from last year that
> >> implements exactly what I described, then?
> >
> > I saw the patch he just posted, if that's what you mean. By his own
> > admission, it doesn't address fragmentation at all.
>
> It seems to me that it provides exactly the same (actually stronger)
> guarantees as the current frag patches do. Or were you going to point
> out a bug in the implementation?
No, I'm going to point out that the author of the patch contradicts you.
> >>>Yes, zones are a way of categorizing memory.
> >>
> >>Yes, have you read Mel's patches? Guess what they do?
> >
> > The swap file is a way of storing data on disk. So is ext3. Obviously,
> > one is a trivial extension of the other and there's no reason to have
> > both.
>
> Don't try to bullshit your way around with stupid analogies please, it
> is an utter waste of time.
I agree that this conversation is a waste of time, and will stop trying to
reason with you now.
Rob
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 253+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 17:54 ` Rob Landley
@ 2005-11-03 20:13 ` Jeff Dike
0 siblings, 0 replies; 253+ messages in thread
From: Jeff Dike @ 2005-11-03 20:13 UTC (permalink / raw)
To: Rob Landley
Cc: Nick Piggin, Gerrit Huizenga, Ingo Molnar, Kamezawa Hiroyuki,
Dave Hansen, Mel Gorman, Martin J. Bligh, Andrew Morton, kravetz,
linux-mm, Linux Kernel Mailing List, lhms
On Thu, Nov 03, 2005 at 11:54:10AM -0600, Rob Landley wrote:
> Lots of work has gone into batching up syscalls and making as few of them as
> possible because they are a performance bottleneck. You want to introduce a
> syscall for every single individual page of memory allocated or freed.
>
> That's stupid.
I think what I'm optimizing is TLB flushes, not system calls. With
mmap et al, they are effectively the same thing though.
Jeff
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 6:07 ` Rob Landley
2005-11-03 7:34 ` Nick Piggin
@ 2005-11-03 16:35 ` Jeff Dike
2005-11-03 16:23 ` Badari Pulavarty
1 sibling, 1 reply; 253+ messages in thread
From: Jeff Dike @ 2005-11-03 16:35 UTC (permalink / raw)
To: Rob Landley
Cc: Nick Piggin, Gerrit Huizenga, Ingo Molnar, Kamezawa Hiroyuki,
Dave Hansen, Mel Gorman, Martin J. Bligh, Andrew Morton, kravetz,
linux-mm, Linux Kernel Mailing List, lhms
On Thu, Nov 03, 2005 at 12:07:33AM -0600, Rob Landley wrote:
> I want UML to
> be able to hand back however much memory it's not using, but handing back
> individual pages as we free them and inserting a syscall overhead for every
> page freed and allocated is just nuts. (Plus, at page size, the OS isn't
> likely to zero them much faster than we can ourselves even without the
> syscall overhead.) Defragmentation means we can batch this into a
> granularity that makes it worth it.
I don't think that freeing pages back to the host in free_pages is the
way to go. The normal behavior for a Linux system, virtual or
physical, is to use all the memory it has. So, any memory that's
freed is pretty likely to be reused for something else, wasting any
effort that's made to free pages back to the host.
The one counter-example I can think of is when a large process with a
lot of data exits. Then its data pages will be freed and they may
stay free for a while until the system finds other data to fill them
with.
Also, it's not the virtual machine's job to know how to make the host
perform optimally. It doesn't have the information to do it. It's
perfectly OK for a UML to hang on to memory if the host has plenty
free. So, it's the host's job to make sure that its memory pressure
is reflected to the UMLs.
My current thinking is that you'll have a daemon on the host keeping
track of memory pressure on the host and the UMLs, plugging and
unplugging memory in order to keep the busy machines, including the
host, supplied with memory, and periodically pushing down the memory
of idle UMLs in order to force them to GC their page caches.
With Badari's patch and UML memory hotplug, the infrastructure is
there to make this work. The one thing I'm puzzling over right now is
how to measure memory pressure.
Jeff
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 16:35 ` Jeff Dike
@ 2005-11-03 16:23 ` Badari Pulavarty
2005-11-03 18:27 ` Jeff Dike
` (2 more replies)
0 siblings, 3 replies; 253+ messages in thread
From: Badari Pulavarty @ 2005-11-03 16:23 UTC (permalink / raw)
To: Jeff Dike
Cc: Rob Landley, Nick Piggin, Gerrit Huizenga, Ingo Molnar,
Kamezawa Hiroyuki, Dave Hansen, Mel Gorman, Martin J. Bligh,
Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List,
lhms
On Thu, 2005-11-03 at 11:35 -0500, Jeff Dike wrote:
> On Thu, Nov 03, 2005 at 12:07:33AM -0600, Rob Landley wrote:
> > I want UML to
> > be able to hand back however much memory it's not using, but handing back
> > individual pages as we free them and inserting a syscall overhead for every
> > page freed and allocated is just nuts. (Plus, at page size, the OS isn't
> > likely to zero them much faster than we can ourselves even without the
> > syscall overhead.) Defragmentation means we can batch this into a
> > granularity that makes it worth it.
>
> I don't think that freeing pages back to the host in free_pages is the
> way to go. The normal behavior for a Linux system, virtual or
> physical, is to use all the memory it has. So, any memory that's
> freed is pretty likely to be reused for something else, wasting any
> effort that's made to free pages back to the host.
>
> The one counter-example I can think of is when a large process with a
> lot of data exits. Then its data pages will be freed and they may
> stay free for a while until the system finds other data to fill them
> with.
>
> Also, it's not the virtual machine's job to know how to make the host
> perform optimally. It doesn't have the information to do it. It's
> perfectly OK for a UML to hang on to memory if the host has plenty
> free. So, it's the host's job to make sure that its memory pressure
> is reflected to the UMLs.
>
> My current thinking is that you'll have a daemon on the host keeping
> track of memory pressure on the host and the UMLs, plugging and
> unplugging memory in order to keep the busy machines, including the
> host, supplied with memory, and periodically pushing down the memory
> of idle UMLs in order to force them to GC their page caches.
>
> With Badari's patch and UML memory hotplug, the infrastructure is
> there to make this work. The one thing I'm puzzling over right now is
> how to measure memory pressure.
Yep. This is exactly the issue other product groups normally raise
on Linux. How do we measure memory pressure in Linux? Some of our
software products want to grow or shrink their memory usage depending
on the memory pressure in the system. Since most memory is used for
cache, "free" really doesn't indicate anything - they are monitoring
info in /proc/meminfo and swapping rates to "guess" at the memory
pressure. They want a clear way of finding out how badly the system
is under memory pressure. (As a starting point, they want to find out
how much of "cached" memory is really easily "reclaimable" under
memory pressure - without swapping.) I know this is kind of crazy,
but interesting to think about :)
Thanks,
Badari
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 16:23 ` Badari Pulavarty
@ 2005-11-03 18:27 ` Jeff Dike
2005-11-03 18:49 ` Rob Landley
2005-11-04 4:52 ` Andrew Morton
2 siblings, 0 replies; 253+ messages in thread
From: Jeff Dike @ 2005-11-03 18:27 UTC (permalink / raw)
To: Badari Pulavarty
Cc: Rob Landley, Nick Piggin, Gerrit Huizenga, Ingo Molnar,
Kamezawa Hiroyuki, Dave Hansen, Mel Gorman, Martin J. Bligh,
Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List,
lhms
On Thu, Nov 03, 2005 at 08:23:20AM -0800, Badari Pulavarty wrote:
> Yep. This is exactly the issue other product groups normally raise
> on Linux. How do we measure memory pressure in Linux? Some of our
> software products want to grow or shrink their memory usage depending
> on the memory pressure in the system.
I think this is wrong. Applications shouldn't be measuring host
memory pressure and trying to react to it.
This gives you no way to implement a global memory use policy - you
can't say "App X is the most important thing on the system and must
have all the memory it needs in order to run as quickly as possible".
You can't establish any sort of priority between apps when it comes to
memory use, or change those priorities.
And how does this work when the system can change the amount of memory
that it has, such as when the app is inside a UML?
I think the right way to go is for willing apps to have an interface
through which they can be told "change your memory consumption by +-X"
and have a single daemon on the host tracking memory use and memory
pressure, and shuffling memory between the apps.
This allows the admin to set memory use priorities between the apps
and to exempt important ones from having memory pulled.
Measuring at the bottom and pushing memory pressure upwards also works
naturally for virtual machines and the apps running inside them. The
host will push memory pressure at the virtual machines, which in turn
will push that pressure at their apps.
With UML, I have an interface where a daemon on the host can add or
remove memory from an instance. I think the apps that are willing to
adjust should implement something similar.
Jeff
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 16:23 ` Badari Pulavarty
2005-11-03 18:27 ` Jeff Dike
@ 2005-11-03 18:49 ` Rob Landley
2005-11-04 4:52 ` Andrew Morton
2 siblings, 0 replies; 253+ messages in thread
From: Rob Landley @ 2005-11-03 18:49 UTC (permalink / raw)
To: Badari Pulavarty
Cc: Jeff Dike, Nick Piggin, Gerrit Huizenga, Ingo Molnar,
Kamezawa Hiroyuki, Dave Hansen, Mel Gorman, Martin J. Bligh,
Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List,
lhms
On Thursday 03 November 2005 10:23, Badari Pulavarty wrote:
> Yep. This is exactly the issue other product groups normally raise
> on Linux. How do we measure memory pressure in Linux? Some of our
> software products want to grow or shrink their memory usage depending
> on the memory pressure in the system. Since most memory is used for
> cache, "free" really doesn't indicate anything - they are monitoring
> info in /proc/meminfo and swapping rates to "guess" at the memory
> pressure. They want a clear way of finding out how badly the system
> is under memory pressure. (As a starting point, they want to find out
> how much of "cached" memory is really easily "reclaimable" under
> memory pressure - without swapping.) I know this is kind of crazy,
> but interesting to think about :)
If we do ever get prezeroing, we'd want a tuneable to say how much memory
should be spent on random page cache and how much should be prezeroed. And
large chunks of prezeroed memory lying around are what you'd think about
handing back to the host OS...
Rob
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 16:23 ` Badari Pulavarty
2005-11-03 18:27 ` Jeff Dike
2005-11-03 18:49 ` Rob Landley
@ 2005-11-04 4:52 ` Andrew Morton
2005-11-04 5:35 ` Paul Jackson
2005-11-04 7:26 ` [patch] swapin rlimit Ingo Molnar
2 siblings, 2 replies; 253+ messages in thread
From: Andrew Morton @ 2005-11-04 4:52 UTC (permalink / raw)
To: Badari Pulavarty
Cc: jdike, rob, nickpiggin, gh, mingo, kamezawa.hiroyu, haveblue,
mel, mbligh, kravetz, linux-mm, linux-kernel, lhms-devel
Badari Pulavarty <pbadari@gmail.com> wrote:
>
> > With Badari's patch and UML memory hotplug, the infrastructure is
> > there to make this work. The one thing I'm puzzling over right now is
> > how to measure memory pressure.
>
> Yep. This is exactly the issue other product groups normally raise
> on Linux. How do we measure memory pressure in Linux? Some of our
> software products want to grow or shrink their memory usage depending
> on the memory pressure in the system. Since most memory is used for
> cache, "free" really doesn't indicate anything - they are monitoring
> info in /proc/meminfo and swapping rates to "guess" at the memory
> pressure. They want a clear way of finding out how badly the system
> is under memory pressure. (As a starting point, they want to find out
> how much of "cached" memory is really easily "reclaimable" under
> memory pressure - without swapping.) I know this is kind of crazy,
> but interesting to think about :)
Similarly, that SGI patch which was rejected 6-12 months ago to kill off
processes once they started swapping. We thought that it could be done
from userspace, but we need a way for userspace to detect when a task is
being swapped on a per-task basis.
I'm thinking a few numbers in the mm_struct, incremented in the pageout
code, reported via /proc/stat.
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 4:52 ` Andrew Morton
@ 2005-11-04 5:35 ` Paul Jackson
2005-11-04 5:48 ` Andrew Morton
2005-11-04 6:16 ` Bron Nelson
2005-11-04 7:26 ` [patch] swapin rlimit Ingo Molnar
1 sibling, 2 replies; 253+ messages in thread
From: Paul Jackson @ 2005-11-04 5:35 UTC (permalink / raw)
To: Andrew Morton
Cc: pbadari, jdike, rob, nickpiggin, gh, mingo, kamezawa.hiroyu,
haveblue, mel, mbligh, kravetz, linux-mm, linux-kernel,
lhms-devel
> Similarly, that SGI patch which was rejected 6-12 months ago to kill off
> processes once they started swapping. We thought that it could be done
> from userspace, but we need a way for userspace to detect when a task is
> being swapped on a per-task basis.
>
> I'm thinking a few numbers in the mm_struct, incremented in the pageout
> code, reported via /proc/stat.
I just sent in a proposed patch for this - one more per-cpuset
number, tracking the recent rate of calls into the synchronous
(direct) page reclaim by tasks in the cpuset.
See the message sent a few minutes ago, with subject:
[PATCH 5/5] cpuset: memory reclaim rate meter
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 5:35 ` Paul Jackson
@ 2005-11-04 5:48 ` Andrew Morton
2005-11-04 6:42 ` Paul Jackson
2005-11-04 6:16 ` Bron Nelson
1 sibling, 1 reply; 253+ messages in thread
From: Andrew Morton @ 2005-11-04 5:48 UTC (permalink / raw)
To: Paul Jackson, Bron Nelson
Cc: pbadari, jdike, rob, nickpiggin, gh, mingo, kamezawa.hiroyu,
haveblue, mel, mbligh, kravetz, linux-mm, linux-kernel,
lhms-devel
Paul Jackson <pj@sgi.com> wrote:
>
> > Similarly, that SGI patch which was rejected 6-12 months ago to kill off
> > processes once they started swapping. We thought that it could be done
> > from userspace, but we need a way for userspace to detect when a task is
> > being swapped on a per-task basis.
> >
> > I'm thinking a few numbers in the mm_struct, incremented in the pageout
> > code, reported via /proc/stat.
>
> I just sent in a proposed patch for this - one more per-cpuset
> number, tracking the recent rate of calls into the synchronous
> (direct) page reclaim by tasks in the cpuset.
>
> See the message sent a few minutes ago, with subject:
>
> [PATCH 5/5] cpuset: memory reclaim rate meter
>
uh, OK. If that patch is merged, does that make Bron happy, so I don't
have to reply to his plaintive email?
I was kind of thinking that the stats should be per-process (actually
per-mm) rather than bound to cpusets. /proc/<pid>/pageout-stats or something.
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 5:48 ` Andrew Morton
@ 2005-11-04 6:42 ` Paul Jackson
2005-11-04 7:10 ` Andrew Morton
0 siblings, 1 reply; 253+ messages in thread
From: Paul Jackson @ 2005-11-04 6:42 UTC (permalink / raw)
To: Andrew Morton
Cc: bron, pbadari, jdike, rob, nickpiggin, gh, mingo,
kamezawa.hiroyu, haveblue, mel, mbligh, kravetz, linux-mm,
linux-kernel, lhms-devel
Andrew wrote:
> uh, OK. If that patch is merged, does that make Bron happy, so I don't
> have to reply to his plaintive email?
In theory yes, that should do it. I will ack again, by early next
week, after I have verified this further.
And it should also handle some other folks who have plaintive emails
in my inbox, that haven't gotten bold enough to pester you, yet.
It really is, for the users who know my email address (*), job based
memory pressure, not task based, that matters. Sticking it in a
cpuset, which is the natural job container, is easier, more natural,
and more efficient for all concerned.
It's jobs that are being run in cpusets with dedicated (not shared)
CPUs and Memory Nodes that care about this, so far as I know.
When running a system in a more typical sharing mode, with multiple
jobs and applications competing for the same resources, then the kernel
needs to be master of processor scheduling and memory allocation.
When running jobs in cpusets with dedicated CPUs and Memory Nodes,
then less is being asked of the kernel, and some per-job controls
from userspace make more sense. This is where a simple hook like
this reclaim rate meter comes into play - passing up to user space
another clue to help it do its job.
> I was kind of thinking that the stats should be per-process (actually
> per-mm) rather than bound to cpusets. /proc/<pid>/pageout-stats or something.
There may well be a market for these too. But such stats sound like
more work, and the market isn't one that's paying my salary.
So I will leave that challenge on the table for someone else.
(*) Of course, there is some self selection going on here.
Folks not doing cpuset-based jobs are far less likely
to know my email address ;).
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 6:42 ` Paul Jackson
@ 2005-11-04 7:10 ` Andrew Morton
2005-11-04 7:45 ` Paul Jackson
2005-11-04 15:19 ` Martin J. Bligh
0 siblings, 2 replies; 253+ messages in thread
From: Andrew Morton @ 2005-11-04 7:10 UTC (permalink / raw)
To: Paul Jackson
Cc: bron, pbadari, jdike, rob, nickpiggin, gh, mingo,
kamezawa.hiroyu, haveblue, mel, mbligh, kravetz, linux-mm,
linux-kernel, lhms-devel
Paul Jackson <pj@sgi.com> wrote:
>
> > I was kind of thinking that the stats should be per-process (actually
> > per-mm) rather than bound to cpusets. /proc/<pid>/pageout-stats or something.
>
> There may well be a market for these too. But such stats sound like
> more work, and the market isn't one that's paying my salary.
But I have to care for all users.
> So I will leave that challenge on the table for someone else.
And I won't merge your patch ;)
Seriously, it does appear that doing it per-task is adequate for your
needs, and it is certainly more general.
I cannot understand why you decided to count only the number of
direct-reclaim events, via a "digitally filtered, constant time based,
event frequency meter".
a) It loses information. If we were to export the number of pages
reclaimed from the mm, filtering can be done in userspace.
b) It omits reclaim performed by kswapd and by other tasks (ok, it's
very cpuset-specific).
c) It only counts synchronous try_to_free_pages() attempts. What if an
attempt only freed pagecache, or didn't manage to free anything?
d) It doesn't notice if kswapd is swapping the heck out of your
not-allocating-any-memory-now process.
I think all the above can be addressed by exporting per-task (actually
per-mm) reclaim info. (I haven't put much thought into what info that
should be - page reclaim attempts, mmapped reclaims, swapcache reclaims,
etc)
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 7:10 ` Andrew Morton
@ 2005-11-04 7:45 ` Paul Jackson
2005-11-04 8:02 ` Andrew Morton
2005-11-04 15:19 ` Martin J. Bligh
1 sibling, 1 reply; 253+ messages in thread
From: Paul Jackson @ 2005-11-04 7:45 UTC (permalink / raw)
To: Andrew Morton
Cc: bron, pbadari, jdike, rob, nickpiggin, gh, mingo,
kamezawa.hiroyu, haveblue, mel, mbligh, kravetz, linux-mm,
linux-kernel, lhms-devel
Andrew wrote:
> > So I will leave that challenge on the table for someone else.
>
> And I won't merge your patch ;)
Be that way ;).
> Seriously, it does appear that doing it per-task is adequate for your
> needs, and it is certainly more general.
My motivations for the per-cpuset, digitally filtered rate, as opposed
to the per-task raw counter mostly have to do with minimizing total
cost (user + kernel) of collecting this information. I have this phobia,
perhaps not well founded, that moving critical scheduling/allocation
decisions like this into user space will fail in some cases because
the cost of gathering the critical information will be too intrusive
on system performance and scalability.
A per-task stat requires walking the tasklist, to build a list of the
tasks to query.
A raw counter requires repeated polling to determine the recent rate of
activity.
The filtered per-cpuset rate avoids any need to repeatedly access
global resources such as the tasklist, and minimizes the total cpu
cycles required to get the interesting stat.
> But I have to care for all users.
Well you should, and well you do.
If you have good reason, or just good instincts, to think that there
are uses for per-task raw counters, then your choice is clear.
As indeed it was clear.
I don't recall hearing of any desire for per-task memory pressure data,
until tonight.
I will miss this patch. It had provided exactly what I thought was
needed, with an extremely small impact on system (kern+user) performance.
Oh well.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 7:45 ` Paul Jackson
@ 2005-11-04 8:02 ` Andrew Morton
2005-11-04 9:52 ` Paul Jackson
0 siblings, 1 reply; 253+ messages in thread
From: Andrew Morton @ 2005-11-04 8:02 UTC (permalink / raw)
To: Paul Jackson
Cc: bron, pbadari, jdike, rob, nickpiggin, gh, mingo,
kamezawa.hiroyu, haveblue, mel, mbligh, kravetz, linux-mm,
linux-kernel, lhms-devel
Paul Jackson <pj@sgi.com> wrote:
>
> A per-task stat requires walking the tasklist, to build a list of the
> tasks to query.
Nope, just task->mm->whatever.
> A raw counter requires repeated polling to determine the recent rate of
> activity.
True.
> The filtered per-cpuset rate avoids any need to repeatedly access
> global resources such as the tasklist, and minimizes the total cpu
> cycles required to get the interesting stat.
>
Well no. Because the filtered-whatsit takes two spinlocks and does a bunch
of arith for each and every task, each time it calls try_to_free_pages().
The frequency of that could be very high indeed, even when nobody is
interested in the metric which is being maintained(!).
And I'd suggest that only a minority of workloads would be interested in
this metric?
ergo, polling the thing once per five seconds in those situations where we
actually want to poll the thing may well be cheaper, in global terms?
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 8:02 ` Andrew Morton
@ 2005-11-04 9:52 ` Paul Jackson
2005-11-04 15:27 ` Martin J. Bligh
0 siblings, 1 reply; 253+ messages in thread
From: Paul Jackson @ 2005-11-04 9:52 UTC (permalink / raw)
To: Andrew Morton
Cc: bron, pbadari, jdike, rob, nickpiggin, gh, mingo,
kamezawa.hiroyu, haveblue, mel, mbligh, kravetz, linux-mm,
linux-kernel, lhms-devel
> > A per-task stat requires walking the tasklist, to build a list of the
> > tasks to query.
>
> Nope, just task->mm->whatever.
Nope.
Agreed - once you have the task, then sure, that's enough.
However - a batch scheduler will end up having to figure out what tasks
there are to inquire, by either listing the tasks in a cpuset, or
by listing /proc. Either way, that's a tasklist scan. And it will
have to do that pretty much every iteration of polling, since it has
no a priori knowledge of what tasks a job is firing up.
> Well no. Because the filtered-whatsit takes two spinlocks and does a bunch
> of arith for each and every task, each time it calls try_to_free_pages().
Neither spinlock is global - the task and a lock in its cpuset.
I see a fair number of existing locks and semaphores, some global
and some in loops, that look to be in the code invoked by
try_to_free_pages(). And far more arithmetic than in that little
filter.
Granted, its cost seen by all, for the benefit of few. But other sorts
of per-task or per-mm stats are not going to be free either. I would
have figured that doing something per-page, even the most trivial
"counter++" (better have that mm locked) will likely cost more than
doing something per try_to_free_pages() call.
> The frequency of that could be very high indeed, even when nobody is
> interested in the metric which is being maintained(!)
When I have a task start allocating memory as fast as it can, it is only
able to call try_to_free_pages() about 10 times a second on an idle
ia64 SN2 system, with a single thread, or about 20 times a second
running several threads at once allocating memory.
That's not "very high" in my book.
What sort of load would hit this much more often?
If more folks need these detailed stats, then that's how it should be.
But I am no fan of exposing more than the minimum kernel vm details for
use by production software.
We agree that my per-cpuset memory_reclaim_rate meter certainly hides
more detail than the sorts of stats you are suggesting. I thought that
was good, so long as what was needed was still present.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 9:52 ` Paul Jackson
@ 2005-11-04 15:27 ` Martin J. Bligh
0 siblings, 0 replies; 253+ messages in thread
From: Martin J. Bligh @ 2005-11-04 15:27 UTC (permalink / raw)
To: Paul Jackson, Andrew Morton
Cc: bron, pbadari, jdike, rob, nickpiggin, gh, mingo,
kamezawa.hiroyu, haveblue, mel, kravetz, linux-mm, linux-kernel,
lhms-devel
> We agree that my per-cpuset memory_reclaim_rate meter certainly hides
> more detail than the sorts of stats you are suggesting. I thought that
> was good, so long as what was needed was still present.
But it's horribly specific to cpusets. If you want something multi-task,
would be better if it worked by more generic task groupings.
M.
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 7:10 ` Andrew Morton
2005-11-04 7:45 ` Paul Jackson
@ 2005-11-04 15:19 ` Martin J. Bligh
2005-11-04 17:38 ` Andrew Morton
1 sibling, 1 reply; 253+ messages in thread
From: Martin J. Bligh @ 2005-11-04 15:19 UTC (permalink / raw)
To: Andrew Morton, Paul Jackson
Cc: bron, pbadari, jdike, rob, nickpiggin, gh, mingo,
kamezawa.hiroyu, haveblue, mel, kravetz, linux-mm, linux-kernel,
lhms-devel
> Seriously, it does appear that doing it per-task is adequate for your
> needs, and it is certainly more general.
>
>
>
> I cannot understand why you decided to count only the number of
> direct-reclaim events, via a "digitally filtered, constant time based,
> event frequency meter".
>
> a) It loses information. If we were to export the number of pages
> reclaimed from the mm, filtering can be done in userspace.
>
> b) It omits reclaim performed by kswapd and by other tasks (ok, it's
> very cpuset-specific).
>
> c) It only counts synchronous try_to_free_pages() attempts. What if an
> > attempt only freed pagecache, or didn't manage to free anything?
>
> d) It doesn't notice if kswapd is swapping the heck out of your
> not-allocating-any-memory-now process.
>
>
> I think all the above can be addressed by exporting per-task (actually
> > per-mm) reclaim info. (I haven't put much thought into what info that
> should be - page reclaim attempts, mmapped reclaims, swapcache reclaims,
> etc)
I've been looking at similar things. When we page out / free something from
a shared library that 10 tasks have mapped, who does that count against
for pressure?
M.
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 15:19 ` Martin J. Bligh
@ 2005-11-04 17:38 ` Andrew Morton
0 siblings, 0 replies; 253+ messages in thread
From: Andrew Morton @ 2005-11-04 17:38 UTC (permalink / raw)
To: Martin J. Bligh
Cc: pj, bron, pbadari, jdike, rob, nickpiggin, gh, mingo,
kamezawa.hiroyu, haveblue, mel, kravetz, linux-mm, linux-kernel,
lhms-devel
"Martin J. Bligh" <mbligh@mbligh.org> wrote:
>
> > Seriously, it does appear that doing it per-task is adequate for your
> > needs, and it is certainly more general.
> >
> >
> >
> > I cannot understand why you decided to count only the number of
> > direct-reclaim events, via a "digitally filtered, constant time based,
> > event frequency meter".
> >
> > a) It loses information. If we were to export the number of pages
> > reclaimed from the mm, filtering can be done in userspace.
> >
> > b) It omits reclaim performed by kswapd and by other tasks (ok, it's
> > very cpuset-specific).
> >
> > c) It only counts synchronous try_to_free_pages() attempts. What if an
> > attempt only freed pagecache, or didn't manage to free anything?
> >
> > d) It doesn't notice if kswapd is swapping the heck out of your
> > not-allocating-any-memory-now process.
> >
> >
> > I think all the above can be addressed by exporting per-task (actually
> > per-mm) reclaim info. (I haven't put much thought into what info that
> > should be - page reclaim attempts, mmapped reclaims, swapcache reclaims,
> > etc)
>
> I've been looking at similar things. When we page out / free something from
> a shared library that 10 tasks have mapped, who does that count against
> for pressure?
Count pte unmappings and minor faults and account them against the
mm_struct, I guess.
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 5:35 ` Paul Jackson
2005-11-04 5:48 ` Andrew Morton
@ 2005-11-04 6:16 ` Bron Nelson
1 sibling, 0 replies; 253+ messages in thread
From: Bron Nelson @ 2005-11-04 6:16 UTC (permalink / raw)
To: Paul Jackson, Andrew Morton
Cc: lhms-devel, linux-kernel, linux-mm, kravetz, mbligh, mel,
haveblue, kamezawa.hiroyu, mingo, gh, nickpiggin, rob, jdike,
pbadari
> I was kind of thinking that the stats should be per-process (actually
> per-mm) rather than bound to cpusets. /proc/<pid>/pageout-stats or something.
The particular people that I deal with care about constraining things
on a per-cpuset basis, so that is the information that I personally am
looking for. But it is simple enough to map tasks to cpusets and vice-versa,
so this is not really a serious consideration. I would generically be in
favor of the per-process stats (even though the application at hand is
actually interested in the cpuset aggregate stats), because we can always
produce an aggregate from the detailed, but not vice-versa. And no doubt
some future as-yet-unimagined application will want per-process info.
--
Bron Campbell Nelson bron@sgi.com
These statements are my own, not those of Silicon Graphics.
* [patch] swapin rlimit
2005-11-04 4:52 ` Andrew Morton
2005-11-04 5:35 ` Paul Jackson
@ 2005-11-04 7:26 ` Ingo Molnar
2005-11-04 7:36 ` Andrew Morton
2005-11-04 10:14 ` Bernd Petrovitsch
1 sibling, 2 replies; 253+ messages in thread
From: Ingo Molnar @ 2005-11-04 7:26 UTC (permalink / raw)
To: Andrew Morton
Cc: Badari Pulavarty, Linus Torvalds, jdike, rob, nickpiggin, gh,
kamezawa.hiroyu, haveblue, mel, mbligh, kravetz, linux-mm,
linux-kernel, lhms-devel
* Andrew Morton <akpm@osdl.org> wrote:
> Similarly, that SGI patch which was rejected 6-12 months ago to kill
> off processes once they started swapping. We thought that it could be
> done from userspace, but we need a way for userspace to detect when a
> task is being swapped on a per-task basis.
wouldn't the clean solution here be a "swap ulimit"?
I.e. something like the 2-minute quick-hack below (against Linus-curr).
Ingo
---
implement a swap ulimit: RLIMIT_SWAP.
setting the ulimit to 0 causes any swapin activity to kill the task.
Setting the rlimit to 0 is allowed for unprivileged users too, since it
is a decrease of the default RLIM_INFINITY value. I.e. users could run
known-memory-intensive jobs with such an ulimit set, and get a guarantee
that they won't put the system into a swap-storm.
Note: it's just swapin that causes the SIGKILL, because at swapout time
it's hard to identify the originating task. Pure swapouts and a buildup
in the swap-cache are not punished, only actual hard swapins. I didn't try
too hard to make the rlimit particularly fine-grained - i.e. right now we
only know 'zero' and 'infinity' ...
Signed-off-by: Ingo Molnar <mingo@elte.hu>
include/asm-generic/resource.h | 4 +++-
mm/memory.c | 13 +++++++++++++
2 files changed, 16 insertions(+), 1 deletion(-)
Index: linux/include/asm-generic/resource.h
===================================================================
--- linux.orig/include/asm-generic/resource.h
+++ linux/include/asm-generic/resource.h
@@ -44,8 +44,9 @@
#define RLIMIT_NICE 13 /* max nice prio allowed to raise to
0-39 for nice level 19 .. -20 */
#define RLIMIT_RTPRIO 14 /* maximum realtime priority */
+#define RLIMIT_SWAP 15 /* maximum swapspace for task */
-#define RLIM_NLIMITS 15
+#define RLIM_NLIMITS 16
/*
* SuS says limits have to be unsigned.
@@ -86,6 +87,7 @@
[RLIMIT_MSGQUEUE] = { MQ_BYTES_MAX, MQ_BYTES_MAX }, \
[RLIMIT_NICE] = { 0, 0 }, \
[RLIMIT_RTPRIO] = { 0, 0 }, \
+ [RLIMIT_SWAP] = { RLIM_INFINITY, RLIM_INFINITY }, \
}
#endif /* __KERNEL__ */
Index: linux/mm/memory.c
===================================================================
--- linux.orig/mm/memory.c
+++ linux/mm/memory.c
@@ -1647,6 +1647,18 @@ void swapin_readahead(swp_entry_t entry,
}
/*
+ * Crude first-approximation swapin-avoidance: if there is a zero swap
+ * rlimit then kill the task.
+ */
+static inline void check_swap_rlimit(void)
+{
+ unsigned long limit = current->signal->rlim[RLIMIT_SWAP].rlim_cur;
+
+ if (limit != RLIM_INFINITY)
+ force_sig(SIGKILL, current);
+}
+
+/*
* We enter with non-exclusive mmap_sem (to exclude vma changes,
* but allow concurrent faults), and pte mapped but not yet locked.
* We return with mmap_sem still held, but pte unmapped and unlocked.
@@ -1667,6 +1679,7 @@ static int do_swap_page(struct mm_struct
entry = pte_to_swp_entry(orig_pte);
page = lookup_swap_cache(entry);
if (!page) {
+ check_swap_rlimit();
swapin_readahead(entry, address, vma);
page = read_swap_cache_async(entry, vma, address);
if (!page) {
* Re: [patch] swapin rlimit
2005-11-04 7:26 ` [patch] swapin rlimit Ingo Molnar
@ 2005-11-04 7:36 ` Andrew Morton
2005-11-04 8:07 ` Ingo Molnar
` (2 more replies)
2005-11-04 10:14 ` Bernd Petrovitsch
1 sibling, 3 replies; 253+ messages in thread
From: Andrew Morton @ 2005-11-04 7:36 UTC (permalink / raw)
To: Ingo Molnar
Cc: pbadari, torvalds, jdike, rob, nickpiggin, gh, kamezawa.hiroyu,
haveblue, mel, mbligh, kravetz, linux-mm, linux-kernel,
lhms-devel
Ingo Molnar <mingo@elte.hu> wrote:
>
> * Andrew Morton <akpm@osdl.org> wrote:
>
> > Similarly, that SGI patch which was rejected 6-12 months ago to kill
> > off processes once they started swapping. We thought that it could be
> > done from userspace, but we need a way for userspace to detect when a
> > task is being swapped on a per-task basis.
>
> wouldn't the clean solution here be a "swap ulimit"?
Well it's _a_ solution, but it's terribly specific.
How hard is it to read /proc/<pid>/nr_swapped_in_pages and if that's
non-zero, kill <pid>?
* Re: [patch] swapin rlimit
2005-11-04 7:36 ` Andrew Morton
@ 2005-11-04 8:07 ` Ingo Molnar
2005-11-04 10:06 ` Paul Jackson
2005-11-04 15:24 ` Martin J. Bligh
2005-11-04 8:18 ` Arjan van de Ven
2005-11-04 15:14 ` Rob Landley
2 siblings, 2 replies; 253+ messages in thread
From: Ingo Molnar @ 2005-11-04 8:07 UTC (permalink / raw)
To: Andrew Morton
Cc: pbadari, torvalds, jdike, rob, nickpiggin, gh, kamezawa.hiroyu,
haveblue, mel, mbligh, kravetz, linux-mm, linux-kernel,
lhms-devel
* Andrew Morton <akpm@osdl.org> wrote:
> Ingo Molnar <mingo@elte.hu> wrote:
> >
> > * Andrew Morton <akpm@osdl.org> wrote:
> >
> > > Similarly, that SGI patch which was rejected 6-12 months ago to kill
> > > off processes once they started swapping. We thought that it could be
> > > done from userspace, but we need a way for userspace to detect when a
> > > task is being swapped on a per-task basis.
> >
> > wouldn't the clean solution here be a "swap ulimit"?
>
> Well it's _a_ solution, but it's terribly specific.
>
> How hard is it to read /proc/<pid>/nr_swapped_in_pages and if that's
> non-zero, kill <pid>?
on a system with possibly thousands of tasks, over /proc, on a
high-performance node where for a 0.5% improvement they are willing to
sacrifice maidens? :)
Seriously, while nr_swapped_in_pages ought to be OK, I think there is a
generic problem with /proc based stats.
System instrumentation people are already complaining about how costly
/proc parsing is. If you have to get some nontrivial stat from all
threads in the system, and if Linux doesn't offer that counter or summary
by default, it gets pretty expensive.
One solution I can think of would be to make a binary representation of
/proc/<pid>/stats readonly-mmap-able. This would add a 4K page to every
task tracked that way, and stats updates would have to update this page
too - but it would make instrumentation of running apps really
unintrusive and scalable.
Another addition would be some mechanism for a monitoring app to capture
events in the PID space: so that they can mmap() new tasks [if they are
interested] on a non-polling basis, i.e. not like readdir on /proc. This
capability probably has to be a system-call though, as /proc seems too
quirky for it. The system does not wait on the monitoring app(s) to
catch up - if it's too slow in reacting and the event buffer overflows
then tough luck - monitoring apps will have no impact on the runtime
characteristics of other tasks. In theory this is somewhat similar to
auditing, but the purpose would be quite different, and it only cares
about PID-space events like 'fork/clone', 'exec' and 'exit'.
Ingo
* Re: [patch] swapin rlimit
2005-11-04 8:07 ` Ingo Molnar
@ 2005-11-04 10:06 ` Paul Jackson
2005-11-04 15:24 ` Martin J. Bligh
1 sibling, 0 replies; 253+ messages in thread
From: Paul Jackson @ 2005-11-04 10:06 UTC (permalink / raw)
To: Ingo Molnar
Cc: akpm, pbadari, torvalds, jdike, rob, nickpiggin, gh,
kamezawa.hiroyu, haveblue, mel, mbligh, kravetz, linux-mm,
linux-kernel, lhms-devel
Ingo wrote:
> Seriously, while nr_swapped_in_pages ought to be OK, I think there is a
> generic problem with /proc based stats.
>
> System instrumentation people are already complaining about how costly
> /proc parsing is. If you have to get some nontrivial stat from all
> threads in the system, and if Linux doesn't offer that counter or summary
> by default, it gets pretty expensive.
Agreed.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [patch] swapin rlimit
2005-11-04 8:07 ` Ingo Molnar
2005-11-04 10:06 ` Paul Jackson
@ 2005-11-04 15:24 ` Martin J. Bligh
1 sibling, 0 replies; 253+ messages in thread
From: Martin J. Bligh @ 2005-11-04 15:24 UTC (permalink / raw)
To: Ingo Molnar, Andrew Morton
Cc: pbadari, torvalds, jdike, rob, nickpiggin, gh, kamezawa.hiroyu,
haveblue, mel, kravetz, linux-mm, linux-kernel, lhms-devel
> System instrumentation people are already complaining about how costly
> /proc parsing is. If you have to get some nontrivial stat from all
> threads in the system, and if Linux doesn't offer that counter or summary
> by default, it gets pretty expensive.
>
> One solution I can think of would be to make a binary representation of
> /proc/<pid>/stats readonly-mmap-able. This would add a 4K page to every
> task tracked that way, and stats updates would have to update this page
> too - but it would make instrumentation of running apps really
> unintrusive and scalable.
That would be awesome - the current methods we have are mostly crap. There
are some atomicity issues though. Plus when I suggested this 2 years ago,
everyone told me to piss off, but I'm not bitter ;-) Seriously, we do
need a fast communication mechanism.
M.
* Re: [patch] swapin rlimit
2005-11-04 7:36 ` Andrew Morton
2005-11-04 8:07 ` Ingo Molnar
@ 2005-11-04 8:18 ` Arjan van de Ven
2005-11-04 10:04 ` Paul Jackson
2005-11-04 15:14 ` Rob Landley
2 siblings, 1 reply; 253+ messages in thread
From: Arjan van de Ven @ 2005-11-04 8:18 UTC (permalink / raw)
To: Andrew Morton
Cc: Ingo Molnar, pbadari, torvalds, jdike, rob, nickpiggin, gh,
kamezawa.hiroyu, haveblue, mel, mbligh, kravetz, linux-mm,
linux-kernel, lhms-devel
On Thu, 2005-11-03 at 23:36 -0800, Andrew Morton wrote:
> Ingo Molnar <mingo@elte.hu> wrote:
> >
> > * Andrew Morton <akpm@osdl.org> wrote:
> >
> > > Similarly, that SGI patch which was rejected 6-12 months ago to kill
> > > off processes once they started swapping. We thought that it could be
> > > done from userspace, but we need a way for userspace to detect when a
> > > task is being swapped on a per-task basis.
> >
> > wouldn't the clean solution here be a "swap ulimit"?
>
> Well it's _a_ solution, but it's terribly specific.
>
> How hard is it to read /proc/<pid>/nr_swapped_in_pages and if that's
> non-zero, kill <pid>?
Well, or do it the other way around: write a counter to such a thing,
and kill when it hits zero
(similar to the CPU perf counter stuff on x86).
Doing this from userspace is tricky: what if the task dies of natural
causes and the pid gets reused between the time the userspace app reads
the value and the time it decides the time is up and issues the kill?
(And on a busy server that can be quite a bit of time.)
* Re: [patch] swapin rlimit
2005-11-04 8:18 ` Arjan van de Ven
@ 2005-11-04 10:04 ` Paul Jackson
0 siblings, 0 replies; 253+ messages in thread
From: Paul Jackson @ 2005-11-04 10:04 UTC (permalink / raw)
To: Arjan van de Ven
Cc: akpm, mingo, pbadari, torvalds, jdike, rob, nickpiggin, gh,
kamezawa.hiroyu, haveblue, mel, mbligh, kravetz, linux-mm,
linux-kernel, lhms-devel
Arjan wrote:
> doing this from userspace is tricky; what if the task dies of natural
> causes and the pid gets reused, between the time the userspace app reads
> the value and the time it decides the time is up and time for a kill....
> (and on a busy server that can be quite a bit of time)
If pids are being reused within seconds of their being freed up,
then the batch managers running on the big HPC systems I care
about are so screwed it isn't even funny. They depend heavily
on being able to identify the task pids in a job and then doing
something to those tasks (suspend, kill, gather stats, ...).
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [patch] swapin rlimit
2005-11-04 7:36 ` Andrew Morton
2005-11-04 8:07 ` Ingo Molnar
2005-11-04 8:18 ` Arjan van de Ven
@ 2005-11-04 15:14 ` Rob Landley
2 siblings, 0 replies; 253+ messages in thread
From: Rob Landley @ 2005-11-04 15:14 UTC (permalink / raw)
To: Andrew Morton
Cc: Ingo Molnar, pbadari, torvalds, jdike, nickpiggin, gh,
kamezawa.hiroyu, haveblue, mel, mbligh, kravetz, linux-mm,
linux-kernel, lhms-devel
On Friday 04 November 2005 01:36, Andrew Morton wrote:
> > wouldn't the clean solution here be a "swap ulimit"?
>
> Well it's _a_ solution, but it's terribly specific.
>
> How hard is it to read /proc/<pid>/nr_swapped_in_pages and if that's
> non-zero, kill <pid>?
Things like make fork lots of short-lived child processes, and some of those
can be quite memory-intensive. (The gcc 4.0.2 build causes an outright swap
storm for me about halfway through, doing genattrtab and then again compiling
the result.)
Is there any way for parents to collect their child process's statistics when
the children exit? Or by the time the actual swapper exits, do we not care
anymore?
Rob
* Re: [patch] swapin rlimit
2005-11-04 7:26 ` [patch] swapin rlimit Ingo Molnar
2005-11-04 7:36 ` Andrew Morton
@ 2005-11-04 10:14 ` Bernd Petrovitsch
2005-11-04 10:21 ` Ingo Molnar
1 sibling, 1 reply; 253+ messages in thread
From: Bernd Petrovitsch @ 2005-11-04 10:14 UTC (permalink / raw)
To: Ingo Molnar
Cc: Andrew Morton, Badari Pulavarty, Linus Torvalds, jdike, rob,
nickpiggin, gh, kamezawa.hiroyu, haveblue, mel, mbligh, kravetz,
linux-mm, linux-kernel, lhms-devel
On Fri, 2005-11-04 at 08:26 +0100, Ingo Molnar wrote:
> * Andrew Morton <akpm@osdl.org> wrote:
>
> > Similarly, that SGI patch which was rejected 6-12 months ago to kill
> > off processes once they started swapping. We thought that it could be
> > done from userspace, but we need a way for userspace to detect when a
> > task is being swapped on a per-task basis.
>
> wouldn't the clean solution here be a "swap ulimit"?
Hmm, what is the difference from "mlockall(MCL_CURRENT|MCL_FUTURE);"?
OK, mlockall() can only be done by root processes.
Bernd
--
Firmix Software GmbH http://www.firmix.at/
mobil: +43 664 4416156 fax: +43 1 7890849-55
Embedded Linux Development and Services
* Re: [patch] swapin rlimit
2005-11-04 10:14 ` Bernd Petrovitsch
@ 2005-11-04 10:21 ` Ingo Molnar
2005-11-04 11:17 ` Bernd Petrovitsch
0 siblings, 1 reply; 253+ messages in thread
From: Ingo Molnar @ 2005-11-04 10:21 UTC (permalink / raw)
To: Bernd Petrovitsch
Cc: Andrew Morton, Badari Pulavarty, Linus Torvalds, jdike, rob,
nickpiggin, gh, kamezawa.hiroyu, haveblue, mel, mbligh, kravetz,
linux-mm, linux-kernel, lhms-devel
* Bernd Petrovitsch <bernd@firmix.at> wrote:
> On Fri, 2005-11-04 at 08:26 +0100, Ingo Molnar wrote:
> > * Andrew Morton <akpm@osdl.org> wrote:
> >
> > > Similarly, that SGI patch which was rejected 6-12 months ago to kill
> > > off processes once they started swapping. We thought that it could be
> > > done from userspace, but we need a way for userspace to detect when a
> > > task is being swapped on a per-task basis.
> >
> > wouldn't the clean solution here be a "swap ulimit"?
>
> Hmm, what is the difference from "mlockall(MCL_CURRENT|MCL_FUTURE);"?
> OK, mlockall() can only be done by root processes.
What do you mean? mlockall() pins down all pages. The swapin ulimit kills
the task (and thus frees all the RAM it had) when it touches swap for the
first time. These two solutions almost oppose each other!
Ingo
* Re: [patch] swapin rlimit
2005-11-04 10:21 ` Ingo Molnar
@ 2005-11-04 11:17 ` Bernd Petrovitsch
0 siblings, 0 replies; 253+ messages in thread
From: Bernd Petrovitsch @ 2005-11-04 11:17 UTC (permalink / raw)
To: Ingo Molnar
Cc: Andrew Morton, Badari Pulavarty, Linus Torvalds, jdike, rob,
nickpiggin, gh, kamezawa.hiroyu, haveblue, mel, mbligh, kravetz,
linux-mm, linux-kernel, lhms-devel
On Fri, 2005-11-04 at 11:21 +0100, Ingo Molnar wrote:
> * Bernd Petrovitsch <bernd@firmix.at> wrote:
> > On Fri, 2005-11-04 at 08:26 +0100, Ingo Molnar wrote:
> > > * Andrew Morton <akpm@osdl.org> wrote:
> > > > Similarly, that SGI patch which was rejected 6-12 months ago to kill
> > > > off processes once they started swapping. We thought that it could be
> > > > done from userspace, but we need a way for userspace to detect when a
> > > > task is being swapped on a per-task basis.
> > >
> > > wouldn't the clean solution here be a "swap ulimit"?
> >
> > Hmm, what is the difference from "mlockall(MCL_CURRENT|MCL_FUTURE);"?
> > OK, mlockall() can only be done by root processes.
>
> what do you mean? mlockall pins down all pages. swapin ulimit kills the
in memory.
> task (and thus frees all the RAM it had) when it touches swap for the
> first time. These two solutions almost oppose each other!
Almost, IMHO, as locked pages in RAM avoid swapping totally. Probably
"complement each other" is more correct.
Given the limit for "max locked memory", it should behave pretty much the
same once the process hits its limits.
OK, the difference may be loaded executable and library pages.
Hmm, delivering a signal on the first swapped-out page might be another
simple solution, and the process might do something to avoid it.
The nice thing about the "swap ulimit" is that it is easy to understand
what it is (which is always a good thing).
Generating a similar effect with the combination of two other features is
probably somewhat more arcane.
Bernd
--
Firmix Software GmbH http://www.firmix.at/
mobil: +43 664 4416156 fax: +43 1 7890849-55
Embedded Linux Development and Services
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 7:46 ` Gerrit Huizenga
2005-11-02 8:50 ` Nick Piggin
@ 2005-11-02 10:41 ` Ingo Molnar
2005-11-02 11:04 ` Gerrit Huizenga
1 sibling, 1 reply; 253+ messages in thread
From: Ingo Molnar @ 2005-11-02 10:41 UTC (permalink / raw)
To: Gerrit Huizenga
Cc: Kamezawa Hiroyuki, Dave Hansen, Mel Gorman, Nick Piggin,
Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms
* Gerrit Huizenga <gh@us.ibm.com> wrote:
> > generic unpluggable kernel RAM _will not work_.
>
> Actually, it will. Well, depending on terminology.
'generic unpluggable kernel RAM' means what it says: any RAM seen by the
kernel can be unplugged, always. (as long as the unplug request is
reasonable and there is enough free space to migrate in-use pages to).
> There are two usage models here - those which intend to remove
> physical elements and those where the kernel returns management of
> its virtualized "physical" memory to a hypervisor. In the latter
> case, a hypervisor already maintains a virtual map of the memory and
> the OS needs to release virtualized "physical" memory. I think you
> are referring to RAM here as the physical component; however these
> same defrag patches help where a hypervisor is maintaining the real
> physical memory below the operating system and the OS is managing a
> virtualized "physical" memory.
reliable unmapping of "generic kernel RAM" is not possible even in a
virtualized environment. Think of the 'live pointers' problem I outlined
in an earlier mail in this thread today.
Ingo
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 10:41 ` [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 Ingo Molnar
@ 2005-11-02 11:04 ` Gerrit Huizenga
2005-11-02 12:00 ` Ingo Molnar
0 siblings, 1 reply; 253+ messages in thread
From: Gerrit Huizenga @ 2005-11-02 11:04 UTC (permalink / raw)
To: Ingo Molnar
Cc: Kamezawa Hiroyuki, Dave Hansen, Mel Gorman, Nick Piggin,
Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms
On Wed, 02 Nov 2005 11:41:31 +0100, Ingo Molnar wrote:
>
> * Gerrit Huizenga <gh@us.ibm.com> wrote:
>
> > > generic unpluggable kernel RAM _will not work_.
> >
> > Actually, it will. Well, depending on terminology.
>
> 'generic unpluggable kernel RAM' means what it says: any RAM seen by the
> kernel can be unplugged, always. (as long as the unplug request is
> reasonable and there is enough free space to migrate in-use pages to).
Okay, I understand your terminology. Yes, I cannot point to any
particular piece of memory and say "I want *that* one" and have that
request succeed. However, I can say "find me 50 chunks of memory
of your choosing" and have a very good chance of finding enough
memory to satisfy my request.
> > There are two usage models here - those which intend to remove
> > physical elements and those where the kernel returns management of
> > its virtualized "physical" memory to a hypervisor. In the latter
> > case, a hypervisor already maintains a virtual map of the memory and
> > the OS needs to release virtualized "physical" memory. I think you
> > are referring to RAM here as the physical component; however these
> > same defrag patches help where a hypervisor is maintaining the real
> > physical memory below the operating system and the OS is managing a
> > virtualized "physical" memory.
>
> reliable unmapping of "generic kernel RAM" is not possible even in a
> virtualized environment. Think of the 'live pointers' problem i outlined
> in an earlier mail in this thread today.
Yeah - and that isn't what is being proposed here. The goal is to ask
the kernel to identify some memory which can be legitimately freed and
hasten the freeing of that memory.
gerrit
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 11:04 ` Gerrit Huizenga
@ 2005-11-02 12:00 ` Ingo Molnar
2005-11-02 12:42 ` Dave Hansen
2005-11-02 15:02 ` Gerrit Huizenga
0 siblings, 2 replies; 253+ messages in thread
From: Ingo Molnar @ 2005-11-02 12:00 UTC (permalink / raw)
To: Gerrit Huizenga
Cc: Kamezawa Hiroyuki, Dave Hansen, Mel Gorman, Nick Piggin,
Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms
* Gerrit Huizenga <gh@us.ibm.com> wrote:
>
> On Wed, 02 Nov 2005 11:41:31 +0100, Ingo Molnar wrote:
> >
> > * Gerrit Huizenga <gh@us.ibm.com> wrote:
> >
> > > > generic unpluggable kernel RAM _will not work_.
> > >
> > > Actually, it will. Well, depending on terminology.
> >
> > 'generic unpluggable kernel RAM' means what it says: any RAM seen by the
> > kernel can be unplugged, always. (as long as the unplug request is
> > reasonable and there is enough free space to migrate in-use pages to).
>
> Okay, I understand your terminology. Yes, I cannot point to any
> particular piece of memory and say "I want *that* one" and have that
> request succeed. However, I can say "find me 50 chunks of memory
> of your choosing" and have a very good chance of finding enough
> memory to satisfy my request.
but that's obviously not 'generic unpluggable kernel RAM'. It's very
special RAM: RAM that is free or easily freeable. I never argued that
such RAM is not returnable to the hypervisor.
> > reliable unmapping of "generic kernel RAM" is not possible even in a
> > virtualized environment. Think of the 'live pointers' problem i outlined
> > in an earlier mail in this thread today.
>
> Yeah - and that isn't what is being proposed here. The goal is to
> ask the kernel to identify some memory which can be legitimately
> freed and hasten the freeing of that memory.
but that's very easy to identify: check the free list or the clean
list(s). No defragmentation necessary. [unless the unit of RAM mapping
between hypervisor and guest is too coarse (i.e. not 4K pages).]
Ingo
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 12:00 ` Ingo Molnar
@ 2005-11-02 12:42 ` Dave Hansen
2005-11-02 15:02 ` Gerrit Huizenga
1 sibling, 0 replies; 253+ messages in thread
From: Dave Hansen @ 2005-11-02 12:42 UTC (permalink / raw)
To: Ingo Molnar
Cc: Gerrit Huizenga, KAMEZAWA Hiroyuki, Mel Gorman, Nick Piggin,
Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms
On Wed, 2005-11-02 at 13:00 +0100, Ingo Molnar wrote:
>
> > Yeah - and that isn't what is being proposed here. The goal is to
> > ask the kernel to identify some memory which can be legitimately
> > freed and hasten the freeing of that memory.
>
> but that's very easy to identify: check the free list or the clean
> list(s). No defragmentation necessary. [unless the unit of RAM mapping
> between hypervisor and guest is too coarse (i.e. not 4K pages).]
It needs to be that coarse in cases where HugeTLB is desired for use.
I'm not sure I could convince the DB guys to give up large pages,
they're pretty hooked on them. ;)
-- Dave
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 12:00 ` Ingo Molnar
2005-11-02 12:42 ` Dave Hansen
@ 2005-11-02 15:02 ` Gerrit Huizenga
2005-11-03 0:10 ` Rob Landley
1 sibling, 1 reply; 253+ messages in thread
From: Gerrit Huizenga @ 2005-11-02 15:02 UTC (permalink / raw)
To: Ingo Molnar
Cc: Kamezawa Hiroyuki, Dave Hansen, Mel Gorman, Nick Piggin,
Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms
On Wed, 02 Nov 2005 13:00:48 +0100, Ingo Molnar wrote:
>
> * Gerrit Huizenga <gh@us.ibm.com> wrote:
>
> >
> > On Wed, 02 Nov 2005 11:41:31 +0100, Ingo Molnar wrote:
> > >
> > > * Gerrit Huizenga <gh@us.ibm.com> wrote:
> > >
> > > > > generic unpluggable kernel RAM _will not work_.
> > > >
> > > > Actually, it will. Well, depending on terminology.
> > >
> > > 'generic unpluggable kernel RAM' means what it says: any RAM seen by the
> > > kernel can be unplugged, always. (as long as the unplug request is
> > > reasonable and there is enough free space to migrate in-use pages to).
> >
> > Okay, I understand your terminology. Yes, I can not point to any
> > particular piece of memory and say "I want *that* one" and have that
> > request succeed. However, I can say "find me 50 chunks of memory
> > of your choosing" and have a very good chance of finding enough
> > memory to satisfy my request.
>
> but that's obviously not 'generic unpluggable kernel RAM'. It's very
> special RAM: RAM that is free or easily freeable. I never argued that
> such RAM is not returnable to the hypervisor.
Okay - and 'generic unpluggable kernel RAM' has not been a goal for
the hypervisor based environments. I believe it is closer to being
a goal for those machines which want to hot-remove DIMMs or physical
memory, e.g. IA64 machines wishing to remove entire nodes.
> > > reliable unmapping of "generic kernel RAM" is not possible even in a
> > > virtualized environment. Think of the 'live pointers' problem i outlined
> > > in an earlier mail in this thread today.
> >
> > Yeah - and that isn't what is being proposed here. The goal is to
> > ask the kernel to identify some memory which can be legitimately
> > freed and hasten the freeing of that memory.
>
> but that's very easy to identify: check the free list or the clean
> list(s). No defragmentation necessary. [unless the unit of RAM mapping
> between hypervisor and guest is too coarse (i.e. not 4K pages).]
Ah, but the hypervisor often manages large page sizes, e.g. 64 MB.
It doesn't manage page rights for each guest OS at the 4 K granularity.
Hypervisors are theoretically light in terms of memory needs and
general footprint. Picture the overhead of tracking rights/permissions
of each page of memory and its assignment to any of, say, 256 different
guest operating systems. For a machine of any size, that would be
a huge amount of state for a hypervisor to maintain. Would you
really want a hypervisor to keep that much state? Or is it more
reasonable for a hypervisor to track, say, 64 MB chunks and the
rights of that memory for a number of guest operating systems? Even
if the number of guests is small, the data structures for fast
memory management would grow quickly.
gerrit
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 15:02 ` Gerrit Huizenga
@ 2005-11-03 0:10 ` Rob Landley
0 siblings, 0 replies; 253+ messages in thread
From: Rob Landley @ 2005-11-03 0:10 UTC (permalink / raw)
To: Gerrit Huizenga
Cc: Ingo Molnar, Kamezawa Hiroyuki, Dave Hansen, Mel Gorman,
Nick Piggin, Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms
On Wednesday 02 November 2005 09:02, Gerrit Huizenga wrote:
> > but that's obviously not 'generic unpluggable kernel RAM'. It's very
> > special RAM: RAM that is free or easily freeable. I never argued that
> > such RAM is not returnable to the hypervisor.
>
> Okay - and 'generic unpluggable kernel RAM' has not been a goal for
> the hypervisor based environments. I believe it is closer to being
> a goal for those machines which want to hot-remove DIMMs or physical
> memory, e.g. IA64 machines wishing to remove entire nodes.
Keep in mind that just about any virtualized environment might benefit from
being able to tell the parent system "we're not using this ram". I mentioned
UML, and I can also imagine a Linux driver that signals qemu (or even vmware)
to say "this chunk of physical memory isn't currently in use", and even if
they don't actually _free_ it they can call madvise() on it.
Heck, if we have prezeroing of large blocks, telling your emulator to
madvise(ADV_DONTNEED) the pages for you should just plug right in to that
infrastructure...
Rob
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 7:19 ` Ingo Molnar
2005-11-02 7:46 ` Gerrit Huizenga
@ 2005-11-02 7:57 ` Nick Piggin
1 sibling, 0 replies; 253+ messages in thread
From: Nick Piggin @ 2005-11-02 7:57 UTC (permalink / raw)
To: Ingo Molnar
Cc: Kamezawa Hiroyuki, Dave Hansen, Mel Gorman, Martin J. Bligh,
Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List,
lhms
Ingo Molnar wrote:
> * Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
>
>>My own target is NUMA node hotplug, what NUMA node hotplug want is
>>- [remove the range of memory] For this approach, admin should define
>> *core* node and removable node. Memory on removable node is removable.
>> Dividing area into removable and not-removable is needed, because
>> we cannot allocate any kernel's object on removable area.
>> Removable area should be 100% removable. Customer can know the limitation
>> before using.
>
>
> that's a perfectly fine method, and is quite similar to the 'separate
> zone' approach Nick mentioned too. It is also easily understandable for
> users/customers.
>
I agree - and I think it should be easy to configure out of the
kernel for those that don't want the functionality, and should
add very little complexity to core code (all without looking at
the patches so I could be very wrong!).
>
> but what is a dangerous fallacy is that we will be able to support hot
> memory unplug of generic kernel RAM in any reliable way!
>
Very true.
> you really have to look at this from the conceptual angle: 'can an
> approach ever lead to a satisfactory result'? If the answer is 'no',
> then we _must not_ add a 90% solution that we _know_ will never be a
> 100% solution.
>
> for the separate-removable-zones approach we see the end of the tunnel.
> Separate zones are well-understood.
>
Yep, I don't see why this doesn't cover all the needs that the frag
patches attempt (hot unplug, hugepage dynamic reserves).
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 14:49 ` Dave Hansen
2005-11-01 15:01 ` Ingo Molnar
@ 2005-11-02 0:51 ` Nick Piggin
2005-11-02 7:42 ` Dave Hansen
2005-11-02 12:38 ` [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 - Summary Mel Gorman
1 sibling, 2 replies; 253+ messages in thread
From: Nick Piggin @ 2005-11-02 0:51 UTC (permalink / raw)
To: Dave Hansen
Cc: Ingo Molnar, Mel Gorman, Martin J. Bligh, Andrew Morton, kravetz,
linux-mm, Linux Kernel Mailing List, lhms
Dave Hansen wrote:
> What the fragmentation patches _can_ give us is the ability to have 100%
> success in removing certain areas: the "user-reclaimable" areas
> referenced in the patch. This gives a customer at least the ability to
> plan for how dynamically reconfigurable a system should be.
>
But the "user-reclaimable" areas can still be taken over by other
areas which become fragmented.
That's like saying we can already guarantee 100% success in removing
areas that are unfragmented and free, or freeable.
> After these patches, the next logical steps are to increase the
> knowledge that the slabs have about fragmentation, and to teach some of
> the shrinkers about fragmentation.
>
I don't like all this work and complexity and overheads going into a
partial solution.
Look: if you have to guarantee memory can be shrunk, set aside a zone
for it (that only fills with user reclaimable areas). This is better
than the current frag patches because it will give you the 100%
guarantee that you need (provided we have page migration to move mlocked
pages).
If you don't need a guarantee, then our current, simple system does the
job perfectly.
> After that, we'll need some kind of virtual remapping, breaking the 1:1
> kernel virtual mapping, so that the most problematic pages can be
> remapped. These pages would retain their virtual address, but getting a
> new physical. However, this is quite far down the road and will require
> some serious evaluation because it impacts how normal devices are able
> to to DMA. The ppc64 proprietary hypervisor has features to work around
> these issues, and any new hypervisors wishing to support partition
> memory hotplug would likely have to follow suit.
>
I would more like to see something like this happen (provided it was
nicely abstracted away and could be CONFIGed out for the 99.999% of
users who don't need the overhead or complexity).
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 0:51 ` Nick Piggin
@ 2005-11-02 7:42 ` Dave Hansen
2005-11-02 8:24 ` Nick Piggin
2005-11-02 12:38 ` [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 - Summary Mel Gorman
1 sibling, 1 reply; 253+ messages in thread
From: Dave Hansen @ 2005-11-02 7:42 UTC (permalink / raw)
To: Nick Piggin
Cc: Ingo Molnar, Mel Gorman, Martin J. Bligh, Andrew Morton, kravetz,
linux-mm, Linux Kernel Mailing List, lhms
On Wed, 2005-11-02 at 11:51 +1100, Nick Piggin wrote:
> Look: if you have to guarantee memory can be shrunk, set aside a zone
> for it (that only fills with user reclaimable areas). This is better
> than the current frag patches because it will give you the 100%
> guarantee that you need (provided we have page migration to move mlocked
> pages).
With Mel's patches, you can easily add the same guarantee. Look at the
code in fallback_alloc() (patch 5/8). It would be quite easy to modify
the fallback lists to disallow fallbacks into areas from which we would
like to remove memory. That was left out for simplicity. As you say,
they're quite complex as it is. Would you be interested in seeing a
patch to provide those kinds of guarantees?
We've had a bit of experience with a hotpluggable zone approach before.
Just like the current topic patches, you're right, that approach can
also provide strong guarantees. However, the issue comes if the system
ever needs to move memory between such zones, such as if a user ever
decides that they'd prefer to break hotplug guarantees rather than OOM.
Do you think changing what a particular area of memory is being used for
would ever be needed?
One other thing, if we decide to take the zones approach, it would have
no other side benefits for the kernel. It would be for hotplug only and
I don't think even the large page users would get much benefit.
-- Dave
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 7:42 ` Dave Hansen
@ 2005-11-02 8:24 ` Nick Piggin
2005-11-02 8:33 ` Yasunori Goto
0 siblings, 1 reply; 253+ messages in thread
From: Nick Piggin @ 2005-11-02 8:24 UTC (permalink / raw)
To: Dave Hansen
Cc: Ingo Molnar, Mel Gorman, Martin J. Bligh, Andrew Morton, kravetz,
linux-mm, Linux Kernel Mailing List, lhms
Dave Hansen wrote:
> On Wed, 2005-11-02 at 11:51 +1100, Nick Piggin wrote:
>
>>Look: if you have to guarantee memory can be shrunk, set aside a zone
>>for it (that only fills with user reclaimable areas). This is better
>>than the current frag patches because it will give you the 100%
>>guarantee that you need (provided we have page migration to move mlocked
>>pages).
>
>
> With Mel's patches, you can easily add the same guarantee. Look at the
> code in fallback_alloc() (patch 5/8). It would be quite easy to modify
> the fallback lists to disallow fallbacks into areas from which we would
> like to remove memory. That was left out for simplicity. As you say,
> they're quite complex as it is. Would you be interested in seeing a
> patch to provide those kinds of guarantees?
>
On top of Mel's patch? I think this is essential for any guarantees
that you might be interested in... but it would just mean that now you
have a redundant extra zoning layer.
I think ZONE_REMOVABLE is something that really needs to be looked at
again if you need a hotunplug solution in the kernel.
> We've had a bit of experience with a hotpluggable zone approach before.
> Just like the current topic patches, you're right, that approach can
> also provide strong guarantees. However, the issue comes if the system
> ever needs to move memory between such zones, such as if a user ever
> decides that they'd prefer to break hotplug guarantees rather than OOM.
>
I can imagine one could have a sysctl to allow/disallow non-easy-reclaim
allocations from ZONE_REMOVABLE.
As Ingo says, neither way is going to give a 100% solution - I wouldn't
like to see so much complexity added to bring us from a ZONE_REMOVABLE 80%
solution to a 90% solution. I believe this is where Linus' "perfect is
the enemy of good" quote applies.
> Do you think changing what a particular area of memory is being used for
> would ever be needed?
>
Perhaps, but Mel's patch only guarantees you can change once, same as
ZONE_REMOVABLE. Once you eat up those easy-to-reclaim areas, you can't
get them back.
> One other thing, if we decide to take the zones approach, it would have
> no other side benefits for the kernel. It would be for hotplug only and
> I don't think even the large page users would get much benefit.
>
Hugepage users? They can be satisfied with ZONE_REMOVABLE too. If you're
talking about other higher-order users, I still think we can't guarantee
past about order 1 or 2 with Mel's patch and they simply need to have
some other ways to do things.
But I think using zones would have advantages in that they would help
give zones and zone balancing more scrutiny and test coverage in the
kernel, which is sorely needed since everyone threw out their highmem
systems :P
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 8:24 ` Nick Piggin
@ 2005-11-02 8:33 ` Yasunori Goto
2005-11-02 8:43 ` Nick Piggin
0 siblings, 1 reply; 253+ messages in thread
From: Yasunori Goto @ 2005-11-02 8:33 UTC (permalink / raw)
To: Nick Piggin
Cc: Dave Hansen, Ingo Molnar, Mel Gorman, Martin J. Bligh,
Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List,
lhms
> > One other thing, if we decide to take the zones approach, it would have
> > no other side benefits for the kernel. It would be for hotplug only and
> > I don't think even the large page users would get much benefit.
> >
>
> Hugepage users? They can be satisfied with ZONE_REMOVABLE too. If you're
> talking about other higher-order users, I still think we can't guarantee
> past about order 1 or 2 with Mel's patch and they simply need to have
> some other ways to do things.
Hmmm. I don't see it at this point.
Why do you think ZONE_REMOVABLE can satisfy hugepage needs?
At least, my ZONE_REMOVABLE patch doesn't have any concern about
fragmentation.
Bye.
--
Yasunori Goto
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 8:33 ` Yasunori Goto
@ 2005-11-02 8:43 ` Nick Piggin
2005-11-02 14:51 ` Martin J. Bligh
2005-11-02 23:28 ` Rob Landley
0 siblings, 2 replies; 253+ messages in thread
From: Nick Piggin @ 2005-11-02 8:43 UTC (permalink / raw)
To: Yasunori Goto
Cc: Dave Hansen, Ingo Molnar, Mel Gorman, Martin J. Bligh,
Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List,
lhms
Yasunori Goto wrote:
>>>One other thing, if we decide to take the zones approach, it would have
>>>no other side benefits for the kernel. It would be for hotplug only and
>>>I don't think even the large page users would get much benefit.
>>>
>>
>>Hugepage users? They can be satisfied with ZONE_REMOVABLE too. If you're
>>talking about other higher-order users, I still think we can't guarantee
>>past about order 1 or 2 with Mel's patch and they simply need to have
>>some other ways to do things.
>
>
> Hmmm. I don't see it at this point.
> Why do you think ZONE_REMOVABLE can satisfy hugepage needs?
> At least, my ZONE_REMOVABLE patch doesn't have any concern about
> fragmentation.
>
Well I think it can satisfy hugepage allocations simply because
we can be reasonably sure of being able to free contiguous regions.
Of course it will be memory no longer easily reclaimable, same as
the case for the frag patches. Nor would the name ZONE_REMOVABLE any
longer be the most appropriate!
But my point is, the basic mechanism is there and is workable.
Hugepages and memory unplug are the two main reasons for IBM to be
pushing this AFAIKS.
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 8:43 ` Nick Piggin
@ 2005-11-02 14:51 ` Martin J. Bligh
2005-11-02 23:28 ` Rob Landley
1 sibling, 0 replies; 253+ messages in thread
From: Martin J. Bligh @ 2005-11-02 14:51 UTC (permalink / raw)
To: Nick Piggin, Yasunori Goto
Cc: Dave Hansen, Ingo Molnar, Mel Gorman, Andrew Morton, kravetz,
linux-mm, Linux Kernel Mailing List, lhms
> Well I think it can satisfy hugepage allocations simply because
> we can be reasonably sure of being able to free contiguous regions.
> Of course it will be memory no longer easily reclaimable, same as
> the case for the frag patches. Nor would the name ZONE_REMOVABLE any
> longer be the most appropriate!
>
> But my point is, the basic mechanism is there and is workable.
> Hugepages and memory unplug are the two main reasons for IBM to be
> pushing this AFAIKS.
No, that's not true - those are just the "exciting" features that go
on the back of it. Look back in this email thread - there are lots of
other reasons to fix fragmentation. I don't believe you can eliminate
all the order > 0 allocations in the kernel.
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 8:43 ` Nick Piggin
2005-11-02 14:51 ` Martin J. Bligh
@ 2005-11-02 23:28 ` Rob Landley
2005-11-03 5:26 ` Jeff Dike
1 sibling, 1 reply; 253+ messages in thread
From: Rob Landley @ 2005-11-02 23:28 UTC (permalink / raw)
To: Nick Piggin, user-mode-linux-devel
Cc: Yasunori Goto, Dave Hansen, Ingo Molnar, Mel Gorman,
Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms
On Wednesday 02 November 2005 02:43, Nick Piggin wrote:
> > Hmmm. I don't see it at this point.
> > Why do you think ZONE_REMOVABLE can satisfy hugepage needs?
> > At least, my ZONE_REMOVABLE patch doesn't have any concern about
> > fragmentation.
>
> Well I think it can satisfy hugepage allocations simply because
> we can be reasonably sure of being able to free contiguous regions.
> Of course it will be memory no longer easily reclaimable, same as
> the case for the frag patches. Nor would the name ZONE_REMOVABLE any
> longer be the most appropriate!
>
> But my point is, the basic mechanism is there and is workable.
> Hugepages and memory unplug are the two main reasons for IBM to be
> pushing this AFAIKS.
Who cares what IBM is pushing? I'm interested in fragmentation avoidance for
User Mode Linux.
I use User Mode Linux to virtualize a system build, and one problem I
currently have is that some workloads temporarily use a lot of memory. For
example, I can run a complete system build in about 48 megs of ram: except
for building GCC. That spikes to a couple hundred megabytes. If I allocate
256 megabytes of memory to UML, that's half the memory on my laptop and UML
will just use it for redundant caching and such while desktop performance
gets a bit unhappy with the build going.
UML gets an instance's "physical memory" by allocating a temporary file,
mmapping it, and deleting it (which signals to the vfs that flushing this
data to backing store should only be done under memory pressure from the rest
of the OS, because the file's going away when it's closed, so there's no
point preserving its contents).
With fragmentation reduction and prezeroing, UML suddenly gains the option of
calling madvise(DONT_NEED) on sufficiently large blocks as A) a fast way of
prezeroing, B) a way of giving memory back to the host OS when it's not in
use.
This has _nothing_ to do with IBM. Or large systems. This is some random
developer trying to run a virtualized system build on his laptop.
(The reason I need to use UML is that I build uClibc with the newest 2.6
kernel headers I can, link apps against it, and then run many of those
apps during later stages of the build. If the kernel headers used to build
libc are sufficiently newer than the kernel the build is running under, I get
segfaults because the new libc tries use kernel features that aren't there on
the host system, but will be in the final system. I also get the ability to
mknod/chown/chroot without needing root access on the host system for
free...)
Rob
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 23:28 ` Rob Landley
@ 2005-11-03 5:26 ` Jeff Dike
2005-11-03 5:41 ` Rob Landley
0 siblings, 1 reply; 253+ messages in thread
From: Jeff Dike @ 2005-11-03 5:26 UTC (permalink / raw)
To: Rob Landley
Cc: Nick Piggin, user-mode-linux-devel, Yasunori Goto, Dave Hansen,
Ingo Molnar, Mel Gorman, Martin J. Bligh, Andrew Morton, kravetz,
linux-mm, Linux Kernel Mailing List, lhms
On Wed, Nov 02, 2005 at 05:28:35PM -0600, Rob Landley wrote:
> With fragmentation reduction and prezeroing, UML suddenly gains the option of
> calling madvise(DONT_NEED) on sufficiently large blocks as A) a fast way of
> prezeroing, B) a way of giving memory back to the host OS when it's not in
> use.
DONT_NEED is insufficient. It doesn't discard the data in dirty
file-backed pages.
Badari Pulavarty has a test patch (google for madvise(MADV_REMOVE))
which does do the trick, and I have a UML patch which adds memory
hotplug. This combination does free memory back to the host.
Jeff
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 5:26 ` Jeff Dike
@ 2005-11-03 5:41 ` Rob Landley
2005-11-04 3:26 ` [uml-devel] " Blaisorblade
0 siblings, 1 reply; 253+ messages in thread
From: Rob Landley @ 2005-11-03 5:41 UTC (permalink / raw)
To: Jeff Dike
Cc: Nick Piggin, user-mode-linux-devel, Yasunori Goto, Dave Hansen,
Ingo Molnar, Mel Gorman, Martin J. Bligh, Andrew Morton, kravetz,
linux-mm, Linux Kernel Mailing List, lhms
On Wednesday 02 November 2005 23:26, Jeff Dike wrote:
> On Wed, Nov 02, 2005 at 05:28:35PM -0600, Rob Landley wrote:
> > With fragmentation reduction and prezeroing, UML suddenly gains the
> > option of calling madvise(DONT_NEED) on sufficiently large blocks as A) a
> > fast way of prezeroing, B) a way of giving memory back to the host OS
> > when it's not in use.
>
> DONT_NEED is insufficient. It doesn't discard the data in dirty
> file-backed pages.
I thought DONT_NEED would discard the page cache, and punch was only needed to
free up the disk space.
I was hoping that since the file was deleted from disk and is already getting
_some_ special treatment (since it's a longstanding "poor man's shared
memory" hack), that madvise wouldn't flush the data to disk, but would just
zero it out. A bit optimistic on my part, I know. :)
> Badari Pulavarty has a test patch (google for madvise(MADV_REMOVE))
> which does do the trick, and I have a UML patch which adds memory
> hotplug. This combination does free memory back to the host.
I saw it wander by, and am all for it. If it goes in, it's obviously the
right thing to use. You may remember I asked about this two years ago:
http://seclists.org/lists/linux-kernel/2003/Dec/0919.html
And a reply indicated that SVr4 had it, but we don't. I assume the "naming
discussion" mentioned in the recent thread already scrubbed through this old
thread to determine that the SVr4 API was icky.
http://seclists.org/lists/linux-kernel/2003/Dec/0955.html
> Jeff
Rob
* Re: [uml-devel] Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 5:41 ` Rob Landley
@ 2005-11-04 3:26 ` Blaisorblade
2005-11-04 15:50 ` Rob Landley
0 siblings, 1 reply; 253+ messages in thread
From: Blaisorblade @ 2005-11-04 3:26 UTC (permalink / raw)
To: user-mode-linux-devel
Cc: Rob Landley, Jeff Dike, Nick Piggin, Yasunori Goto, Dave Hansen,
Ingo Molnar, Mel Gorman, Martin J. Bligh, Andrew Morton, kravetz,
linux-mm, Linux Kernel Mailing List, lhms
On Thursday 03 November 2005 06:41, Rob Landley wrote:
> On Wednesday 02 November 2005 23:26, Jeff Dike wrote:
> > On Wed, Nov 02, 2005 at 05:28:35PM -0600, Rob Landley wrote:
> > > With fragmentation reduction and prezeroing, UML suddenly gains the
> > > option of calling madvise(DONT_NEED) on sufficiently large blocks as A)
> > > a fast way of prezeroing, B) a way of giving memory back to the host OS
> > > when it's not in use.
> > DONT_NEED is insufficient. It doesn't discard the data in dirty
> > file-backed pages.
> I thought DONT_NEED would discard the page cache, and punch was only needed
> to free up the disk space.
This is correct, but...
> I was hoping that since the file was deleted from disk and is already
> getting _some_ special treatment (since it's a longstanding "poor man's
> shared memory" hack), that madvise wouldn't flush the data to disk, but
> would just zero it out. A bit optimistic on my part, I know. :)
I read some time ago that this optimization existed but was deemed obsolete
and removed.
Why obsolete? Because... we have tmpfs! And that's the point. With DONTNEED,
we detach references from page tables, but the content is still pinned: it
_is_ the "disk"! (And you have TMPDIR on tmpfs, right?)
> > Badari Pulavarty has a test patch (google for madvise(MADV_REMOVE))
> > which does do the trick, and I have a UML patch which adds memory
> > hotplug. This combination does free memory back to the host.
> I saw it wander by, and am all for it. If it goes in, it's obviously the
> right thing to use.
Btw, on this side of the picture, I think fragmentation avoidance is not
needed for that.
I guess you refer to using frag. avoidance on the guest (if it matters for the
host, let me know). When it will be present using it will be nice, but
currently we'd do madvise() on a page-per-page basis, and we'd do it on
non-consecutive pages (basically, free pages we either find or free on
purpose).
> You may remember I asked about this two years ago:
> http://seclists.org/lists/linux-kernel/2003/Dec/0919.html
> And a reply indicated that SVr4 had it, but we don't. I assume the "naming
> discussion" mentioned in the recent thread already scrubbed through this
> old thread to determine that the SVr4 API was icky.
> http://seclists.org/lists/linux-kernel/2003/Dec/0955.html
I assume not everybody did (even if somebody pointed out the existence of the
SVr4 API), but there was a need, in at least one usage, for a virtual
address-based API rather than a file-offset-based one like the SVr4 one -
that user would have had to implement backward mapping in userspace just for
this purpose, while we already have it in the kernel.
Anyway, the sys_punch() API will follow later - customers need mainly
madvise() for now.
--
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade
___________________________________
Yahoo! Mail: gratis 1GB per i messaggi e allegati da 10MB
http://mail.yahoo.it
* Re: [uml-devel] Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 3:26 ` [uml-devel] " Blaisorblade
@ 2005-11-04 15:50 ` Rob Landley
2005-11-04 17:18 ` Blaisorblade
0 siblings, 1 reply; 253+ messages in thread
From: Rob Landley @ 2005-11-04 15:50 UTC (permalink / raw)
To: user-mode-linux-devel
Cc: Blaisorblade, Jeff Dike, Nick Piggin, Yasunori Goto, Dave Hansen,
Ingo Molnar, Mel Gorman, Martin J. Bligh, Andrew Morton, kravetz,
linux-mm, Linux Kernel Mailing List, lhms
On Thursday 03 November 2005 21:26, Blaisorblade wrote:
> > I was hoping that since the file was deleted from disk and is already
> > getting _some_ special treatment (since it's a longstanding "poor man's
> > shared memory" hack), that madvise wouldn't flush the data to disk, but
> > would just zero it out. A bit optimistic on my part, I know. :)
>
> I read at some time that this optimization existed but was deemed obsolete
> and removed.
>
> Why obsolete? Because... we have tmpfs! And that's the point. With
> DONTNEED, we detach references from page tables, but the content is still
> pinned: it _is_ the "disk"! (And you have TMPDIR on tmpfs, right?)
If I had that kind of control over environment my build would always be
deployed in (including root access), I wouldn't need UML. :)
(P.S. The default for Ubuntu "Hoary Hedgehog" is no. The only tmpfs mount
is /dev/shm, and /tmp is on / which is ext3. Yeah, I need to upgrade my
laptop...)
> I guess you refer to using frag. avoidance on the guest
Yes. Moot point since Linus doesn't want it.
> (if it matters for
> the host, let me know). When it will be present using it will be nice, but
> currently we'd do madvise() on a page-per-page basis, and we'd do it on
> non-consecutive pages (basically, free pages we either find or free on
> purpose).
Might be a performance issue if that gets introduced with per-page
granularity, and how do you avoid giving back pages we're about to re-use?
Oh well, bench it when it happens. (And in any case, it needs a tunable to
beat the page cache into submission or there's no free memory to give back.
If there's already such a tuneable, I haven't found it yet.)
Rob
* Re: [uml-devel] Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 15:50 ` Rob Landley
@ 2005-11-04 17:18 ` Blaisorblade
2005-11-04 17:44 ` Rob Landley
0 siblings, 1 reply; 253+ messages in thread
From: Blaisorblade @ 2005-11-04 17:18 UTC (permalink / raw)
To: Rob Landley
Cc: user-mode-linux-devel, Jeff Dike, Nick Piggin, Yasunori Goto,
Dave Hansen, linux-mm, Linux Kernel Mailing List, lhms
(Note - I've removed a few CCs since there are too many of us; sorry for any
inconvenience.)
On Friday 04 November 2005 16:50, Rob Landley wrote:
> On Thursday 03 November 2005 21:26, Blaisorblade wrote:
> > > I was hoping that since the file was deleted from disk and is already
> > > getting _some_ special treatment (since it's a longstanding "poor man's
> > > shared memory" hack), that madvise wouldn't flush the data to disk, but
> > > would just zero it out. A bit optimistic on my part, I know. :)
> >
> > I read at some time that this optimization existed but was deemed
> > obsolete and removed.
> >
> > Why obsolete? Because... we have tmpfs! And that's the point. With
> > DONTNEED, we detach references from page tables, but the content is still
> > pinned: it _is_ the "disk"! (And you have TMPDIR on tmpfs, right?)
>
> If I had that kind of control over environment my build would always be
> deployed in (including root access), I wouldn't need UML. :)
Yep, right for your case... however currently the majority of users use tmpfs
(I hope for them)...
> > I guess you refer to using frag. avoidance on the guest
>
> Yes. Moot point since Linus doesn't want it.
See lwn.net's latest issue (when it becomes available) on this. In short,
however, the real point is that we need this kind of support.
> Might be a performance issue if that gets introduced with per-page
> granularity,
I'm aware of this possibility, and I did in fact say "frag. avoidance will be
nice to use". However, I'm not sure that the system call overhead is that big
compared to flushing the TLB entries...
But for now we don't have the issue - you don't do hot-unplug frequently. When
somebody writes the auto-hotunplug management daemon, we could have a
problem here...
> and how do you avoid giving back pages we're about to re-use?
Jeff's trick is to call the buddy allocator (__get_free_pages()) to get a full
page (it will do any work needed to free memory), so nobody else will use
it, and then madvise() it.
If a better API exists, that will be used.
> Oh well, bench it when it happens. (And in any case, it needs a tunable to
> beat the page cache into submission or there's no free memory to give back.
I couldn't parse your sentence. The allocation will free memory, just as
happens whenever memory is needed.
However, look at /proc/sys/vm/swappiness, or use Con Kolivas's patches to find
new tunables and policies.
> If there's already such a tuneable, I haven't found it yet.)
--
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade
* Re: [uml-devel] Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 17:18 ` Blaisorblade
@ 2005-11-04 17:44 ` Rob Landley
0 siblings, 0 replies; 253+ messages in thread
From: Rob Landley @ 2005-11-04 17:44 UTC (permalink / raw)
To: Blaisorblade
Cc: user-mode-linux-devel, Jeff Dike, Nick Piggin, Yasunori Goto,
Dave Hansen, linux-mm, Linux Kernel Mailing List, lhms
On Friday 04 November 2005 11:18, Blaisorblade wrote:
> > Oh well, bench it when it happens. (And in any case, it needs a tunable
> > to beat the page cache into submission or there's no free memory to give
> > back.
>
> I couldn't parse your sentence. The allocation will free memory like when
> memory is needed.
If you've got a daemon running in the virtual system to hand back memory to
the host, then you don't need a tuneable.
What I was thinking is that if we get prezeroing infrastructure that can use
various prezeroing accelerators (as has been discussed but I don't believe
merged), then a logical prezeroing accelerator for UML would be calling
madvise on the host system. This has the advantage of automatically giving
back to the host system any memory that's not in use, but would require some
way to tell kswapd or some such that keeping around lots of prezeroed memory
is preferable to keeping around lots of page cache.
In my case, I have a workload that can mostly work with 32-48 megs of RAM, but
it spikes up to 256 at one point. Right now, I'm telling UML mem=64 megs and
then feeding it a 256-meg swap file on ubd, but this is hideously inefficient
when it actually tries to use that swap file. (And since the host system is
running a 2.6.10 kernel, there's a five-minute period during each build where
things on my desktop actually freeze for 15-30 seconds at a time. And this
is on a laptop with 512 megs of RAM. I think it's because the disk is so
overwhelmed, and some things (like vim's .swp file, and something similar in
kmail's composer) do a gratuitous fsync...)
> However look at /proc/sys/vm/swappiness
Setting swappiness to 0 triggers the OOM killer on 2.6.14 for a load that
completes with swappiness at 60. I mentioned this on the list a little while
ago and some people asked for copies of my test script...
> or use Con Kolivas's patches to find new tunable and policies.
The daemon you mentioned is an alternative, but I'm not quite sure how rapid
the daemon's reaction is going to be to potential OOM situations when
something suddenly wants an extra 200 megs...
Rob
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 - Summary
2005-11-02 0:51 ` Nick Piggin
2005-11-02 7:42 ` Dave Hansen
@ 2005-11-02 12:38 ` Mel Gorman
2005-11-03 3:14 ` Nick Piggin
1 sibling, 1 reply; 253+ messages in thread
From: Mel Gorman @ 2005-11-02 12:38 UTC (permalink / raw)
To: Nick Piggin
Cc: Dave Hansen, Ingo Molnar, Martin J. Bligh, Andrew Morton,
kravetz, linux-mm, Linux Kernel Mailing List, lhms
On Wed, 2 Nov 2005, Nick Piggin wrote:
> Dave Hansen wrote:
>
> > What the fragmentation patches _can_ give us is the ability to have 100%
> > success in removing certain areas: the "user-reclaimable" areas
> > referenced in the patch. This gives a customer at least the ability to
> > plan for how dynamically reconfigurable a system should be.
> >
>
> But the "user-reclaimable" areas can still be taken over by other
> areas which become fragmented.
>
This is true; we have worst-case scenarios. With our patches though, our
assertion is that it takes a lot longer to degrade, and in good scenarios,
like where the workload is not using all of physical memory, we don't degrade
at all. Assuming we get page migration or active defragmentation in the
future, it will be a lot longer before they have to do any work. As we
only fragment when there is nothing else to do, page migration will also
have less work to do.
> That's like saying we can already guarantee 100% success in removing
> areas that are unfragmented and free, or freeable.
>
> > After these patches, the next logical steps are to increase the
> > knowledge that the slabs have about fragmentation, and to teach some of
> > the shrinkers about fragmentation.
> >
>
> I don't like all this work and complexity and overheads going into a
> partial solution.
>
> Look: if you have to guarantee memory can be shrunk, set aside a zone
> for it (that only fills with user reclaimable areas). This is better
> than the current frag patches because it will give you the 100%
> guarantee that you need (provided we have page migration to move mlocked
> pages).
>
> If you don't need a guarantee, then our current, simple system does the
> job perfectly.
>
Ok. To me, the rest of the thread is beating around the same points and
no one is giving ground. The points are made, so let's summarise. Apologies
if anything is missing.
Problem
=======
Memory gets fragmented meaning that contiguous blocks of memory are not
free and not freeable no matter how much kswapd works
Impact
======
A number of different users are hit, in different ways
Physical Hotplug remove: Hotplug remove needs to be able to free a large
region of memory that is then unplugged. Different architectures have
different ways of doing this
Virtualization hotplug remove: The requirements are lighter here.
Contiguous Regions from 1MiB to 64MiB (figure taken from thread)
must be freed to move the memory between virtual machines
High order allocations: With fragmentation, high order allocations fail.
Depending on the workload, kswapd could work forever and not free up a
4MiB chunk
Who cares
=========
Physical hotplug remove: Vendors of the hardware that support this -
Fujitsu, HP (I think), IBM etc
Virtualization hotplug remove: Sellers of virtualization software, some
hardware like any IBM machine that lists LPAR in its list of
features. Probably software solutions like Xen are also affected
if they want to be able to grow and shrink the virtual machines on
demand
High order allocations: Ultimately, hugepage users. Today, that is a
feature only big server users like Oracle care about. In the
future I reckon applications will be able to use them for things
like backing the heap with huge pages. Other users like GigE,
loopback devices with large MTUs, and some filesystems like CIFS are
all interested, although they have also been told to use smaller
pages.
Solutions
=========
Anti-defrag: This solution defines three groups of pages KernNoRclm,
KernRclm and EasyRclm. Small sub-zone regions of size
2^(MAX_ORDER-1) are reserved for each allocation type. If there
are no large blocks available and no reserved pages available, it
falls back and begins to fragment. This tries to delay
fragmentation for as long as possible
New Zone: Add a new zone for easyrclm-only allocations. This means that
all kernel pages go in one place and all easyrclm pages go in another.
This solution would allow us to reclaim contiguous blocks of memory.
(Note: This is basically what Solaris Kernel Cages are)
Note that I am leaving out Growing/Shrinking zone code for the moment.
While zones are currently able to get new pages with something like memory
hotadd, there is no mechanism available to move existing pages from one
zone into another. This will need planning and code. Code exists for page
migration so we can reasonable speculate about what it brings to the table
for both anti-defrag and New Zone approaches.
Pros/Cons of Solutions
======================
Anti-defrag Pros
o Aim9 shows no significant regressions (.37% on page_test). On some
tests, it shows performance gains (> 5% on fork_test)
o Stress tests show that it manages to keep fragmentation down to a far
lower level even without teaching kswapd how to linear reclaim
o Stress tests with a linear reclaim experimental patch shows that it
can successfully find large contiguous chunks of memory
o It is known to help hotplug on PPC64
o No tunables. The approach tries to manage itself as much as possible
o It exists, heavily tested, and synced against the latest -mm1
o Can be compiled away by redefining the RCLM_* macros and the
__GFP_*RCLM flags
Anti-defrag Cons
o More complexity within the page allocator
o Adds a new layer onto the allocator that effectively creates subzones
o Adding a new concept that maintainers have to work with
o Depending on the workload, it fragments anyway
New Zone Pros
o Zones are a well known and understood concept
o For people that do not care about hotplug, they can easily get rid of it
o Provides reliable areas of contiguous groups that can be freed for
HugeTLB pages going to userspace
o Uses existing zone infrastructure for balancing
New Zone Cons
o Zones historically have introduced balancing problems
o Been tried for hotplug and dropped because of being awkward to work with
o It only helps hotplug and potentially HugeTLB pages for userspace
o Tunable required. If you get it wrong, the system suffers a lot
o Needs to be planned for and developed
Scenarios
=========
Lets outline some situations then or workloads that can occur
1. Heavy job running that consumes 75% of physical memory. Like a kernel
build
Anti-defrag: It will not fragment as it will never have to fall back. High
order allocations will be possible in the remaining 25%.
Zone-based: After being tuned to a kernel build load, it will not
fragment. Get the tuning wrong and performance suffers or the workload
fails. High order allocations will be possible in the remaining 25%.
Future work for scenario 1
Anti-defrag: No problem.
Zone-based: Tune some more if problems occur.
2. Heavy job running that needs 110% of physical memory, swap is used.
Example would be too many simultaneous kernel builds
Anti-defrag: UserRclm regions are stolen to prevent too many fallbacks.
KernNoRclm starts stealing UserRclm regions to avoid excessive
fragmentation but some fragmentation occurs. Extent depends on the
duration and heaviness of the load. High order allocations will
work if kswapd runs for long enough as it will reclaim the
UserRclm reserved areas. Your chances depend on the intensity of
KernNoRclm allocations
Zone-based: After being tuned to the new kernel build load, it will not
fragment. Get it wrong and performance suffers. High order
allocations will work if you're lucky enough to have enough
reclaimable pages together. Your chances are not good
Future work for scenario 2
Anti-defrag: kswapd would need to know how to reclaim EasyRclm pages
from the KernNoRclm, KernRclm and Fallback areas.
Zone-based: Keep tuning
3. HighMem intensive workload with CONFIG_HIGHPTE set. Example would be a
scientific job that was performing a very large calculation on an
anonymous region of memory. Possible that some desktop
workloads are like this - i.e. use large amounts of anonymous
memory
Anti-defrag: For ZONE_HIGHMEM, PTEs are grouped into one area,
everything else into another, no fragmentation. HugeTLB
allocations in ZONE_HIGHMEM will work if kswapd works long enough
Zone-based: PTEs go to anywhere in ZONE_HIGHMEM. Easy-reclaimed goes to
ZONE_HIGHMEM and ZONE_HOTREMOVABLE. ZONE_HIGHMEM fragments,
ZONE_HOTREMOVABLE does not. HugeTLB pages will be available in
ZONE_HOTREMOVABLE, but probably not in ZONE_HIGHMEM.
Future work for scenario 3
Anti-defrag: No problem. On-demand HugeTLB allocation for userspace is
possible. Would work better with linear page reclaim.
Zone-based: Depends if we care that ZONE_HIGHMEM gets fragmented. We
would only care if trying to allocate HugeTLB pages on demand from
ZONE_HIGHMEM. ZONE_HOTREMOVABLE, depending on its size, would be
possible. Linear reclaim will help ZONE_HOTREMOVABLE, but not
ZONE_HIGHMEM
4. KBuild. The main concern here is performance
Anti-defrag: May cause problems because of the .37% drop on page_test.
May cause improvements because of the 5% increase on fork_test. No
figures on kbuild available
Zone-based: No figures available. Depends heavily on being configured
correctly
Future work for scenario 4
Anti-defrag: Try and optimise the paths affected. Alternatively make
anti-defrag a configurable option by altering the values of RCLM_*
and __GFP_*RCLM. (Note, would people be interested in a
compile-time option for anti-defrag or would it make the complexity
worse for people?)
Zone-based: Tune for performance or compile away the zone
5. Physically unplug memory 25% of physical memory
Anti-defrag: Memory in the region gets reclaimed if it's EasyRclm.
Possibly will encounter awkward pages. It is known that PPC64 has some
success. Fujitsu uses nodes for hotplug; they would need to
adjust the fallbacks to be fully reliable
Zone-based: If we are unplugging the right zone, reclaim the pages.
Possibly will encounter awkward pages (only mlock in this case)
Future work for scenario 5
Anti-defrag: fallback_allocs for each node for Fujitsu to be in any way
reliable. Ability to move awkward pages around. For 100% success,
ability to move kernel pages
Zone-based: Ability to move awkward pages around. There is no 100%
success scenario here. You remove the ZONE_HOTREMOVABLE area or
you turn the machine off.
6. Fsck a large filesystem (known to be a KernNoRclm heavy workload)
Anti-defrag: Memory fragments, but fsck is a short-lived kernel heavy
workload. It is also known that free blocks reappear through the
address space when it finishes. Contiguous blocks may appear in
the middle of the zone rather than either end.
Zone-based: If misconfigured, performance degrades. As a machine will
not be tuned for fsck, the chances of degrading are pretty high. On
the other hand, fsck is something people can wait for
Future work for scenario 6
Anti-defrag: Ideally, in case of fallbacks, page migration would move
awkward pages out of UserRclm areas
Zone-based: Keep tuning if you run into problems
Let's say we agree on a way that ZONE_HOTREMOVABLE can be shrunk in such a
way to give pages to ZONE_NORMAL and ZONE_HIGHMEM as necessary (and we
have to be able to handle both), situations 2 and 6 change. Note that this
changing of zones sizes brings all the problems from the anti-defrag
approach to the zone-based approach.
2a. Heavy job running that needs 110% of physical memory, swap is used.
Anti-defrag: UserRclm regions are stolen to prevent too many fallbacks.
KernNoRclm starts stealing UserRclm regions to avoid excessive
fragmentation but some fragmentation occurs. Extent depends on the
duration and heaviness of the load.
Zone-based: ZONE_NORMAL grows into ZONE_HOTREMOVABLE. The zone cannot be
shrunk so ZONE_NORMAL fragments as normal.
Future work for scenario 2a
Anti-defrag: kswapd would need to know how to clean EasyRclm pages
from the KernNoRclm, KernRclm and Fallback reserved areas. When
load drops off, regions will get reserved again for EasyRclm.
Contiguous blocks will become available whenever possible, be it at the
beginning, middle or end of the zone. Page migration would help
fix up single kernel pages left in EasyRclm areas.
Zone-based: Page migration would need to move pages from the end of
the zone so it could be shrunk.
6a. Fsck
Anti-defrag: Memory fragments, but fsck is a short-lived kernel heavy
workload. It is also known that free blocks reappear through the
address space when it finishes. Once the free blocks appear, they
get reserved for the different allocation types on demand and
business continues as usual
Zone-based: ZONE_NORMAL grows into ZONE_HOTREMOVABLE. No mechanism to
shrink it so it doesn't recover
Future work for scenario 6a
Anti-defrag: Same as for Situation 2. kswapd would need to know how to
clean UserRclm pages from the KernNoRclm, KernRclm and Fallback
reserved areas.
Zone-based: Same as for 2a. Page migration would need to move pages
from the end of the zone so it could be shrunk
I've tried to be as objective as possible with the summary.
From the points above though, I think that anti-defrag gets us a lot of
the way, with the complexity isolated in one place. Its downside is that
it can still break down and future work is needed to stop it degrading
(kswapd cleaning UserRclm areas and page migration when we get really
stuck). Zone-based is more reliable but only addresses a limited
situation, principally hotplug and it does not even go 100% of the way for
hotplug. It also depends on a tunable which is not cool and it is static.
If we make the zones growable+shrinkable, we run into all the same
problems that anti-defrag has today.
--
Mel Gorman
Part-time Phd Student Java Applications Developer
University of Limerick IBM Dublin Software Lab
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 - Summary
2005-11-02 12:38 ` [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 - Summary Mel Gorman
@ 2005-11-03 3:14 ` Nick Piggin
2005-11-03 12:19 ` Mel Gorman
2005-11-03 15:34 ` Martin J. Bligh
0 siblings, 2 replies; 253+ messages in thread
From: Nick Piggin @ 2005-11-03 3:14 UTC (permalink / raw)
To: Mel Gorman
Cc: Dave Hansen, Ingo Molnar, Martin J. Bligh, Andrew Morton,
kravetz, linux-mm, Linux Kernel Mailing List, lhms
Mel Gorman wrote:
>
> Ok. To me, the rest of the thread is beating around the same points and
> no one is giving ground. The points are made, so let's summarise. Apologies
> if anything is missing.
>
Thanks for attempting a summary of a difficult topic. I have a couple
of suggestions.
> Who cares
> =========
> Physical hotplug remove: Vendors of the hardware that support this -
> Fujitsu, HP (I think), IBM etc
>
> Virtualization hotplug remove: Sellers of virtualization software, some
> hardware like any IBM machine that lists LPAR in it's list of
> features. Probably software solutions like Xen are also affected
> if they want to be able to grow and shrink the virtual machines on
> demand
>
Ingo said that Xen is fine with per page granular freeing - this covers
embedded, desktop and small server users of VMs into the future I'd say.
> High order allocations: Ultimately, hugepage users. Today, that is a
> feature only big server users like Oracle care about. In the
> future I reckon applications will be able to use them for things
> like backing the heap by huge pages. Other users like GigE,
> loopback devices with large MTUs, some filesystem like CIFS are
> all interested although they are also been told use use smaller
> pages.
>
I think that saying it's now OK to use higher order allocations is wrong
because, as I said, even with your patches they are going to run into
problems.
Actually I think one reason your patches may perform so well is that
there aren't actually a lot of higher order allocations in the kernel.
I think that probably leaves us realistically with demand hugepages,
hot unplug memory, and IBM lpars?
> Pros/Cons of Solutions
> ======================
>
> Anti-defrag Pros
> o Aim9 shows no significant regressions (.37% on page_test). On some
> tests, it shows performance gains (> 5% on fork_test)
> o Stress tests show that it manages to keep fragmentation down to a far
> lower level even without teaching kswapd how to linear reclaim
This sounds like a kind of funny test to me if nobody is actually
using higher order allocations.
When a higher order allocation is attempted, either you will satisfy
it from the kernel region, in which case the vanilla kernel would
have done the same. Or you satisfy it from an easy-reclaim contiguous
region, in which case it is no longer an easy-reclaim contiguous
region.
> o Stress tests with a linear reclaim experimental patch shows that it
> can successfully find large contiguous chunks of memory
> o It is known to help hotplug on PPC64
> o No tunables. The approach tries to manage itself as much as possible
But it has more dreaded heuristics :P
> o It exists, heavily tested, and synced against the latest -mm1
> o Can be compiled away by redefining the RCLM_* macros and the
> __GFP_*RCLM flags
>
> Anti-defrag Cons
> o More complexity within the page allocator
> o Adds a new layer onto the allocator that effectively creates subzones
> o Adding a new concept that maintainers have to work with
> o Depending on the workload, it fragments anyway
>
> New Zone Pros
> o Zones are a well known and understood concept
> o For people that do not care about hotplug, they can easily get rid of it
> o Provides reliable areas of contiguous groups that can be freed for
> HugeTLB pages going to userspace
> o Uses existing zone infrastructure for balancing
>
> New Zone Cons
> o Zones historically have introduced balancing problems
> o Been tried for hotplug and dropped because of being awkward to work with
> o It only helps hotplug and potentially HugeTLB pages for userspace
> o Tunable required. If you get it wrong, the system suffers a lot
Pro: it keeps IBM mainframe and pseries sysadmins in a job ;) Let
them get it right.
> o Needs to be planned for and developed
>
Yasunori Goto had patches around from last year. Not sure what sort
of shape they're in now but I'd think most of the hard work is done.
> Scenarios
> =========
>
> Lets outline some situations then or workloads that can occur
>
> 1. Heavy job running that consumes 75% of physical memory. Like a kernel
> build
>
> Anti-defrag: It will not fragment as it will never have to fall back. High
> order allocations will be possible in the remaining 25%.
> Zone-based: After being tuned to a kernel build load, it will not
> fragment. Get the tuning wrong, performance suffers or workload
> fails. High order allocations will be possible in the remaining 25%.
>
You don't need to continually tune things for each and every possible
workload under the sun. It is like how we currently drive 16GB highmem
systems quite nicely under most workloads with 1GB of normal memory.
Make that an 8:1 ratio if you're worried.
[snip]
>
> I've tried to be as objective as possible with the summary.
>
> From the points above though, I think that anti-defrag gets us a lot of
> the way, with the complexity isolated in one place. Its downside is that
> it can still break down and future work is needed to stop it degrading
> (kswapd cleaning UserRclm areas and page migration when we get really
> stuck). Zone-based is more reliable but only addresses a limited
> situation, principally hotplug and it does not even go 100% of the way for
> hotplug.
To me it seems like it solves the hotplug, lpar hotplug, and hugepages
problems which seem to be the main ones.
> It also depends on a tunable which is not cool and it is static.
I think it is very cool because it means the tiny minority of Linux
users who want this can do so without impacting the rest of the code
or users. This is how Linux has been traditionally run and I still
have a tiny bit of faith left :)
> If we make the zones growable+shrinkable, we run into all the same
> problems that anti-defrag has today.
>
But we don't have the extra zones layer that anti defrag has today.
And anti defrag needs limits if it is to be reliable anyway.
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 - Summary
2005-11-03 3:14 ` Nick Piggin
@ 2005-11-03 12:19 ` Mel Gorman
2005-11-10 18:47 ` Steve Lord
2005-11-03 15:34 ` Martin J. Bligh
1 sibling, 1 reply; 253+ messages in thread
From: Mel Gorman @ 2005-11-03 12:19 UTC (permalink / raw)
To: Nick Piggin
Cc: Dave Hansen, Ingo Molnar, Martin J. Bligh, Andrew Morton,
kravetz, linux-mm, Linux Kernel Mailing List, lhms
On Thu, 3 Nov 2005, Nick Piggin wrote:
> Mel Gorman wrote:
>
> >
> > Ok. To me, the rest of the thread is beating around the same points and
> > no one is giving ground. The points are made, so let's summarise. Apologies
> > if anything is missing.
> >
>
> Thanks for attempting a summary of a difficult topic. I have a couple
> of suggestions.
>
> > Who cares
> > =========
> > Physical hotplug remove: Vendors of the hardware that support this -
> > Fujitsu, HP (I think), IBM etc
> >
> > Virtualization hotplug remove: Sellers of virtualization software, some
> > hardware like any IBM machine that lists LPAR in it's list of
> > features. Probably software solutions like Xen are also affected
> > if they want to be able to grow and shrink the virtual machines on
> > demand
> >
>
> Ingo said that Xen is fine with per page granular freeing - this covers
> embedded, desktop and small server users of VMs into the future I'd say.
>
Ok, hard to argue with that.
> > High order allocations: Ultimately, hugepage users. Today, that is a
> > feature only big server users like Oracle care about. In the
> > future I reckon applications will be able to use them for things
> > like backing the heap by huge pages. Other users like GigE,
> > loopback devices with large MTUs, some filesystem like CIFS are
> > all interested, although they have also been told to use smaller
> > pages.
> >
>
> I think that saying it's now OK to use higher order allocations is wrong
> because as I said even with your patches they are going to run into
> problems.
>
Ok, I have not denied that they will run into problems. I have asserted
that, with more work built upon these patches, we can grant large pages
with a good degree of reliability. Subsystems should still use small
orders whenever possible and at the very least, large orders should be
short-lived.
For userspace users, I would like to move towards better availability of
huge pages without the boot-time tunables that are required today.
Do we agree that this would be useful, at least for a few different users?
HugeTLB user 1: Todays users of hugetlbfs like big databases etc
HugeTLB user 2: HPC jobs that run with sparse data sets
HugeTLB user 3: Desktop applications that use large amounts of address space.
I got a mail from a user of category 2. He said I can quote his email, but
he didn't say I could quote his name, which is inconvenient, but I'm sure
he has good reasons.
To him, low fragmentation is "critical, at least in HPC environments".
Here is the core of his issue:
--- excerpt ---
Take the scenario that you have a large machine that is
used by multiple users, and the usage is regulated by a batch
scheduler. Loadleveler on ibm's for example. PBS on many
others. Both appear to be available in linux environments.
In the case of my codes, I find that having large pages is
extremely beneficial to my run times. As in factors of several,
modulo things that I've coded in by hand to try and avoid the
issues. I don't think my code is in any way unusual in this
magnitude of improvement.
--- excerpt ---
Ok, so we have two potential solutions, anti-defrag and zones. We don't
need to rehash the pros and cons. With zones, we just say "just reclaim
the easy reclaim zone, alloc your pages and away we go".
Now, his problem is that the server is not restarted between jobs, and
jobs take days and weeks to complete. The system administrators will not
restart the machine, so getting it to a pristine state is a difficulty.
The state he gets the system in is the state he works with and, with
fragmentation, he doesn't get large pages unless he is lucky enough to be
the first user of the machine.
With the zone approach, we would just be saying "tune it". Here is what he
says about that:
--- excerpt ---
I specifically *don't* want things that I have to beg sysadmins to
tune correctly. They won't get it right because there is no `right'
that is right for everyone. They won't want to change it and it
won't work besides. Been there, done that. My experience is that
with linux so far, and some other non-linux machines too, they
always turn all the page stuff off because it breaks the machine.
--- excerpt ---
This is an example of a real user for whom "tune the size of your zone
correctly" is just not good enough. He makes a novel suggestion on how
anti-defrag + hotplug could be used.
--- excerpt ---
In the context of hotplug stuff and fragmentation avoidance,
this sort of reset would be implemented by performing the
the first step in the hot unplug, to migrate everything off
of that memory, including whatever kernel pages that exist
there, but not the second step. Just leave that memory plugged
in and reset the memory to a sane initial state. Essentially
this would be some sort of pseudo hotunplug followed by a pseudo
hotplug of that memory.
--- excerpt ---
I'm pretty sure this is not what hotplug was aimed at, but it would get
him what he wants: at the least, large pages via echo BigNumber >
nr_hugepages. It also needs hotplug remove to be working for some banks
and regions of memory, although not the 100% case.
Ok, this is one example of a user of scientific workloads for whom "tune
the size of the zone" just is not good enough. The admins won't do it for
him because it'll just break for the next scheduled job.
> Actually I think one reason your patches may perform so well is because
> there aren't actually a lot of higher order allocations in the kernel.
>
> I think that probably leaves us realistically with demand hugepages,
> hot unplug memory, and IBM lpars?
>
>
> > Pros/Cons of Solutions
> > ======================
> >
> > Anti-defrag Pros
> > o Aim9 shows no significant regressions (.37% on page_test). On some
> > tests, it shows performance gains (> 5% on fork_test)
> > o Stress tests show that it manages to keep fragmentation down to a far
> > lower level even without teaching kswapd how to linear reclaim
>
> This sounds like a kind of funny test to me if nobody is actually
> using higher order allocations.
>
No one uses them because they always fail. This is a chicken and egg
problem.
> When a higher order allocation is attempted, either you will satisfy
> it from the kernel region, in which case the vanilla kernel would
> have done the same. Or you satisfy it from an easy-reclaim contiguous
> region, in which case it is no longer an easy-reclaim contiguous
> region.
>
Right, but right now, we say "don't use high order allocations ever". With
work, we'll be saying "ok, use high order allocations but they should be
short lived or you won't be allocating them for long"
> > o Stress tests with a linear reclaim experimental patch shows that it
> > can successfully find large contiguous chunks of memory
> > o It is known to help hotplug on PPC64
> > o No tunables. The approach tries to manage itself as much as possible
>
> But it has more dreaded heuristics :P
>
Yeah, but if it gets them wrong, the system chugs along anyway, just
fragmented like it is today. If the zone-based approach gets it wrong, the
system goes down the tubes.
At very worst, the patches give a kernel allocator that is as good as
todays. At very worst, the zone-based approach makes an unusable system.
The performance of the patches is another story. I've been posting aim9
figures based on my test machine. I'm trying to kick an ancient PowerPC
43P Model 150 machine into working. This machine is a different
architecture and ancient (I found it on the way to a skip) so should give
different figures.
> > o It exists, heavily tested, and synced against the latest -mm1
> > o Can be compiled away by redefining the RCLM_* macros and the
> > __GFP_*RCLM flags
> >
> > Anti-defrag Cons
> > o More complexity within the page allocator
> > o Adds a new layer onto the allocator that effectively creates subzones
> > o Adding a new concept that maintainers have to work with
> > o Depending on the workload, it fragments anyway
> >
> > New Zone Pros
> > o Zones are a well known and understood concept
> > o For people that do not care about hotplug, they can easily get rid of it
> > o Provides reliable areas of contiguous groups that can be freed for
> > HugeTLB pages going to userspace
> > o Uses existing zone infrastructure for balancing
> >
> > New Zone Cons
> > o Zones historically have introduced balancing problems
> > o Been tried for hotplug and dropped because of being awkward to work with
> > o It only helps hotplug and potentially HugeTLB pages for userspace
> > o Tunable required. If you get it wrong, the system suffers a lot
>
> Pro: it keeps IBM mainframe and pseries sysadmins in a job ;) Let
> them get it right.
>
Unless you work in a place where the sysadmins will tell you to go away,
such as the HPC user above. I'm not a sysadmin, but I'm pretty sure they
have better things to do than twiddle a tunable all day.
> > o Needs to be planned for and developed
> >
>
> Yasunori Goto had patches around from last year. Not sure what sort
> of shape they're in now but I'd think most of the hard work is done.
>
But Yasunori (thanks for sending the links) himself said when he posted:
--- excerpt ---
Another one was a bit similar than Mel-san's one.
One of motivation of this patch was to create orthogonal relationship
between Removable and DMA/Normal/Highmem. I thought it is desirable.
Because, ppc64 can treat that all of memory is same (DMA) zone.
I thought that new zone spoiled its good feature.
--- excerpt ---
He thought that the new zone removed the ability of some architectures to
treat all memory the same. My patches give some of the benefits of using
another zone while still preserving an architecture's ability to
treat all memory the same.
> > Scenarios
> > =========
> >
> > Lets outline some situations then or workloads that can occur
> >
> > 1. Heavy job running that consumes 75% of physical memory. Like a kernel
> > build
> >
> > Anti-defrag: It will not fragment as it will never have to fall back. High
> > order allocations will be possible in the remaining 25%.
> > Zone-based: After being tuned to a kernel build load, it will not
> > fragment. Get the tuning wrong and performance suffers or the workload
> > fails. High order allocations will be possible in the remaining 25%.
> >
>
> You don't need to continually tune things for each and every possible
> workload under the sun. It is like how we currently drive 16GB highmem
> systems quite nicely under most workloads with 1GB of normal memory.
> Make that an 8:1 ratio if you're worried.
>
> [snip]
>
> >
> > I've tried to be as objective as possible with the summary.
> >
> > From the points above though, I think that anti-defrag gets us a lot of
> > the way, with the complexity isolated in one place. Its downside is that
> > it can still break down and future work is needed to stop it degrading
> > (kswapd cleaning UserRclm areas and page migration when we get really
> > stuck). Zone-based is more reliable but only addresses a limited
> > situation, principally hotplug and it does not even go 100% of the way for
> > hotplug.
>
> To me it seems like it solves the hotplug, lpar hotplug, and hugepages
> problems which seem to be the main ones.
>
> > It also depends on a tunable which is not cool and it is static.
>
> I think it is very cool because it means the tiny minority of Linux
> users who want this can do so without impacting the rest of the code
> or users. This is how Linux has been traditionally run and I still
> have a tiny bit of faith left :)
>
The impact on the code and users will depend on benchmarks. I've posted
benchmarks that show either very small regressions or performance gains.
As I write this, some of the aim9 benchmarks have completed on the
PowerPC.
This is a comparison between 2.6.14-rc5-mm1 and
2.6.14-rc5-mm1-mbuddy-v19-defragDisabledViaConfig
1 creat-clo 73500.00 72504.58 -995.42 -1.35% File Creations and Closes/second
2 page_test 30806.13 31076.49 270.36 0.88% System Allocations & Pages/second
3 brk_test 335299.02 341926.35 6627.33 1.98% System Memory Allocations/second
4 jmp_test 1641733.33 1644566.67 2833.34 0.17% Non-local gotos/second
5 signal_test 100883.19 98900.18 -1983.01 -1.97% Signal Traps/second
6 exec_test 116.53 118.44 1.91 1.64% Program Loads/second
7 fork_test 751.70 746.84 -4.86 -0.65% Task Creations/second
8 link_test 30217.11 30463.82 246.71 0.82% Link/Unlink Pairs/second
Performance gains on page_test, brk_test and exec_test. Even with
variances between tests, we are looking at "more or less the same", not
regressions. No user impact there.
This is a comparison between 2.6.14-rc5-mm1 and
2.6.14-rc5-mm1-mbuddy-v19-withantidefrag
1 creat-clo 73500.00 71188.14 -2311.86 -3.15% File Creations and Closes/second
2 page_test 30806.13 31060.96 254.83 0.83% System Allocations & Pages/second
3 brk_test 335299.02 344361.15 9062.13 2.70% System Memory Allocations/second
4 jmp_test 1641733.33 1627228.80 -14504.53 -0.88% Non-local gotos/second
5 signal_test 100883.19 100233.33 -649.86 -0.64% Signal Traps/second
6 exec_test 116.53 117.63 1.10 0.94% Program Loads/second
7 fork_test 751.70 763.73 12.03 1.60% Task Creations/second
8 link_test 30217.11 30322.10 104.99 0.35% Link/Unlink Pairs/second
Performance gains on page_test, brk_test, exec_test and fork_test. Not bad
going for complex overhead. creat-clo took a beating, but what workload
opens and closes files at that rate?
This is an old, small machine. If I hotplug this, I'll be lucky if it ever
turns on again. The aim9 benchmarks on two machines show that there is
similar and, in some cases better, performance with these patches. If a
workload does suffer badly, an additional patch has been supplied that
disables anti-defrag. A run in -mm will tell us whether this is the
general case or whether my two test boxes are running on magic beans.
So, the small number of users that want this get it. The rest of the
users, who just run the code, should not notice or care. This brings us
back to the main stickler, code complexity. I think that the code has been
very well isolated from the core allocator code, and people looking at the
allocator could avoid it if they really wanted while still knowing what
the buddy allocator was doing.
> > If we make the zones growable+shrinkable, we run into all the same
> > problems that anti-defrag has today.
> >
>
> But we don't have the extra zones layer that anti defrag has today.
>
So, we just have an extra layer on the side that has to be configured,
with all of the problems that configuration brings.
> And anti defrag needs limits if it is to be reliable anyway.
>
I'm confident that, given time, I can make this manage itself with a very
good degree of reliability.
--
Mel Gorman
Part-time PhD Student                          Java Applications Developer
University of Limerick                         IBM Dublin Software Lab
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 253+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 - Summary
2005-11-03 12:19 ` Mel Gorman
@ 2005-11-10 18:47 ` Steve Lord
0 siblings, 0 replies; 253+ messages in thread
From: Steve Lord @ 2005-11-10 18:47 UTC (permalink / raw)
To: Mel Gorman
Cc: Nick Piggin, Dave Hansen, Ingo Molnar, Martin J. Bligh,
Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List,
lhms
Maybe I'm flogging a dead horse here; I missed this whole thread when it
was live, and someone may already have covered this.
Another reason for avoiding memory fragmentation, which may have been lost
in the discussion, is avoiding scatter/gather in I/O. The block layer now
has the smarts to join together physically contiguous pages into a single
scatter/gather element. It always had the smarts to deal with I/O from lots
of small chunks of memory, and let the hardware do the work of reassembling
it. This does not come for free though.
I have come across situations where a raid controller gets cpu bound dealing
with I/O from Linux, but not from Windows. The reason being that Windows seems
to manage to present the same amount of memory in less scatter gather entries.
Because the number of DMA elements is another limiting factor, Windows also
managed to submit larger individual requests. Once Linux reaches steady state,
it ends up submitting one page per scatter gather entry.
OK, if you are going via the page cache, then this is not going to mean anything
unless the idea of having PAGE_CACHE_SIZE > PAGE_SIZE gets dusted off. However,
for direct userspace <-> disk direct I/O, having the address space of a process
be more physically contiguous could help here. Specifically allocated huge pages
is another way to achieve this, but it does require special coding in an app
to do it.
I'll go back to my day job now ;-)
Steve
Mel Gorman wrote:
> On Thu, 3 Nov 2005, Nick Piggin wrote:
> [...]
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 - Summary
2005-11-03 3:14 ` Nick Piggin
2005-11-03 12:19 ` Mel Gorman
@ 2005-11-03 15:34 ` Martin J. Bligh
1 sibling, 0 replies; 253+ messages in thread
From: Martin J. Bligh @ 2005-11-03 15:34 UTC (permalink / raw)
To: Nick Piggin, Mel Gorman
Cc: Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms
>> Physical hotplug remove: Vendors of the hardware that support this -
>> Fujitsu, HP (I think), IBM etc
>>
>> Virtualization hotplug remove: Sellers of virtualization software, some
>> hardware like any IBM machine that lists LPAR in its list of
>> features. Probably software solutions like Xen are also affected
>> if they want to be able to grow and shrink the virtual machines on
>> demand
>
> Ingo said that Xen is fine with per page granular freeing - this covers
> embedded, desktop and small server users of VMs into the future I'd say.
Not using large page mappings for the kernel area will be a substantial
performance hit. It's a less efficient approach inside the hypervisor,
and not all VMs / hardware can support it.
>> High order allocations: Ultimately, hugepage users. Today, that is a
>> feature only big server users like Oracle care about. In the
>> future I reckon applications will be able to use them for things
>> like backing the heap by huge pages. Other users like GigE,
>> loopback devices with large MTUs, some filesystems like CIFS are
>> all interested although they are also being told to use smaller
>> pages.
>
> I think that saying it's now OK to use higher order allocations is wrong
> because as I said even with your patches they are going to run into
> problems.
>
> Actually I think one reason your patches may perform so well is because
> there aren't actually a lot of higher order allocations in the kernel.
>
> I think that probably leaves us realistically with demand hugepages,
> hot unplug memory, and IBM lpars?
Sigh. You seem obsessed with this. There are various critical places in
the kernel that use higher order allocations. Yes, they're normally
smaller ones rather than larger ones, but .... please try re-reading
the earlier portions of this thread. You are NOT going to be able to
get rid of all higher-order allocations - please quit pretending you
can - living in denial is not going to help us.
If you really, really believe you can do that, please go ahead and prove
it. Until that point, please let go of the "it's only for a few specialized
users" argument, and acknowledge we DO actually use higher order allocs
in the kernel right now.
>> o Aim9 shows no significant regressions (.37% on page_test). On some
>> tests, it shows performance gains (> 5% on fork_test)
>> o Stress tests show that it manages to keep fragmentation down to a far
>> lower level even without teaching kswapd how to linear reclaim
>
> This sounds like a kind of funny test to me if nobody is actually
> using higher order allocations.
It's a regression test. To, like, test for regressions in the normal
case ;-)
>> New Zone Cons
>> o Zones historically have introduced balancing problems
>> o Been tried for hotplug and dropped because of being awkward to work with
>> o It only helps hotplug and potentially HugeTLB pages for userspace
>> o Tunable required. If you get it wrong, the system suffers a lot
>
> Pro: it keeps IBM mainframe and pseries sysadmins in a job ;) Let
> them get it right.
Having met some of them ... that's not a pro ;-) We have quite enough
meaningless tunables already. And to be honest, the bigger problem is
that it's a problem with no correct answer - workloads shift day vs.
night, etc.
> You don't need to continually tune things for each and every possible
> workload under the sun. It is like how we currently drive 16GB highmem
> systems quite nicely under most workloads with 1GB of normal memory.
> Make that an 8:1 ratio if you're worried.
Thanks for turning my 64-bit system back into a 32-bit one. Really
appreciate that. Note the last 5 years of endless whining about all
the problems with large 32 bit systems, and how they're unfixable
and we should all move to 64 bit please.
> To me it seems like it solves the hotplug, lpar hotplug, and hugepages
> problems which seem to be the main ones.
That's because you're not listening, you're going on your own preconceived
notions ...
> I think it is very cool because it means the tiny minority of Linux
> users who want this can do so without impacting the rest of the code
> or users.
Ditto.
M.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 253+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 13:56 ` Ingo Molnar
2005-11-01 14:10 ` Dave Hansen
@ 2005-11-01 14:41 ` Mel Gorman
2005-11-01 14:46 ` Ingo Molnar
` (2 more replies)
2005-11-01 18:23 ` Rob Landley
2 siblings, 3 replies; 253+ messages in thread
From: Mel Gorman @ 2005-11-01 14:41 UTC (permalink / raw)
To: Ingo Molnar
Cc: Nick Piggin, Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
linux-kernel, lhms-devel
On Tue, 1 Nov 2005, Ingo Molnar wrote:
> * Mel Gorman <mel@csn.ul.ie> wrote:
>
> > The set of patches do fix a lot and make a strong start at addressing
> > the fragmentation problem, just not 100% of the way. [...]
>
> do you have an expectation to be able to solve the 'fragmentation
> problem', all the time, in a 100% way, now or in the future?
>
Not now, but I expect to reach 100% on demand in the future for all but
GFP_ATOMIC and GFP_NOFS allocations. As GFP_ATOMIC and GFP_NOFS cannot do
any reclaim work themselves, they will still be required to use smaller
orders or private pools that are refilled using GFP_KERNEL if necessary.
The high-order pages would have to be reclaimed by another process like
kswapd, just as happens for order-0 pages today.
> > So, with this set of patches, how fragmented you get is dependent on
> > the workload and it may still break down and high order allocations
> > will fail. But the current situation is that it will definitely break
> > down. The fact is that it has been reported that memory hotplug remove
> > works with these patches and doesn't without them. Granted, this is
> > just one feature on a high-end machine, but it is one solid operation
> > we can perform with the patches and cannot without them. [...]
>
> can you always, under any circumstance hot unplug RAM with these patches
> applied? If not, do you have any expectation to reach 100%?
>
No, you cannot guarantee hot-unplugging RAM with these patches applied.
Anecdotal evidence suggests your chances are better on PPC64, which is a
start, and we have to start somewhere. The full 100% solution would be a
large set of far-reaching patches that would touch a lot of the memory
manager. This would get rejected because the patches should have
arrived piecemeal. These patches are one piece. To reach 100%, other
mechanisms are also needed, such as:
o Page migration to move unreclaimable pages like mlock()ed pages or
kernel pages that had fallen back into easy-reclaim areas. A mechanism
would also be needed to move things like kernel text. I think the memory
hotplug tree has done a lot of work here
o Mechanism for taking regions of memory offline. Again, I think the
memory hotplug crowd have something for this. If they don't, one of them
will chime in.
o Linear page reclaim that scans a region of memory and reclaims or
moves all the pages in it. I have a proof-of-concept patch that does the
linear scan and reclaim, but it's currently ugly and depends on this set
of patches being applied.
These patches are the *starting* point that other things like linear page
reclaim can be based on.
--
Mel Gorman
Part-time Phd Student Java Applications Developer
University of Limerick IBM Dublin Software Lab
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 14:41 ` [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 Mel Gorman
@ 2005-11-01 14:46 ` Ingo Molnar
2005-11-01 15:23 ` Mel Gorman
2005-11-01 18:33 ` Rob Landley
2005-11-01 14:50 ` Dave Hansen
2005-11-02 5:11 ` Andrew Morton
2 siblings, 2 replies; 253+ messages in thread
From: Ingo Molnar @ 2005-11-01 14:46 UTC (permalink / raw)
To: Mel Gorman
Cc: Nick Piggin, Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
linux-kernel, lhms-devel
* Mel Gorman <mel@csn.ul.ie> wrote:
> [...] The full 100% solution would be a large set of far reaching
> patches that would touch a lot of the memory manager. This would get
> rejected because the patches should have arrived piecemeal. These
> patches are one piece. To reach 100%, other mechanisms are also needed
> such as;
>
> o Page migration to move unreclaimable pages like mlock()ed pages or
> kernel pages that had fallen back into easy-reclaim areas. A mechanism
> would also be needed to move things like kernel text. I think the memory
> hotplug tree has done a lot of work here
> o Mechanism for taking regions of memory offline. Again, I think the
> memory hotplug crowd have something for this. If they don't, one of them
> will chime in.
> o Linear page reclaim that scans a region of memory and reclaims or
> moves all the pages in it. I have a proof-of-concept patch that does the
> linear scan and reclaim, but it's currently ugly and depends on this set
> of patches being applied.
how will the 100% solution handle a simple kmalloc()-ed kernel buffer,
that is pinned down, and to/from which live pointers may exist? That
alone can prevent RAM from being removable.
Ingo
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 14:46 ` Ingo Molnar
@ 2005-11-01 15:23 ` Mel Gorman
2005-11-01 18:33 ` Rob Landley
1 sibling, 0 replies; 253+ messages in thread
From: Mel Gorman @ 2005-11-01 15:23 UTC (permalink / raw)
To: Ingo Molnar
Cc: Nick Piggin, Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
linux-kernel, lhms-devel
On Tue, 1 Nov 2005, Ingo Molnar wrote:
>
> * Mel Gorman <mel@csn.ul.ie> wrote:
>
> > [...] The full 100% solution would be a large set of far reaching
> > patches that would touch a lot of the memory manager. This would get
> > rejected because the patches should have arrived piecemeal. These
> > patches are one piece. To reach 100%, other mechanisms are also needed
> > such as;
> >
> > o Page migration to move unreclaimable pages like mlock()ed pages or
> > kernel pages that had fallen back into easy-reclaim areas. A mechanism
> > would also be needed to move things like kernel text. I think the memory
> > hotplug tree has done a lot of work here
> > o Mechanism for taking regions of memory offline. Again, I think the
> > memory hotplug crowd have something for this. If they don't, one of them
> > will chime in.
> > o Linear page reclaim that scans a region of memory and reclaims or
> > moves all the pages in it. I have a proof-of-concept patch that does the
> > linear scan and reclaim, but it's currently ugly and depends on this set
> > of patches being applied.
>
> how will the 100% solution handle a simple kmalloc()-ed kernel buffer,
> that is pinned down, and to/from which live pointers may exist? That
> alone can prevent RAM from being removable.
>
It would require the page to have its virtual->physical mapping changed
in the pagetables for each running process and the master page table. That
would be another step on the road to 100% support.
--
Mel Gorman
Part-time Phd Student Java Applications Developer
University of Limerick IBM Dublin Software Lab
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 14:46 ` Ingo Molnar
2005-11-01 15:23 ` Mel Gorman
@ 2005-11-01 18:33 ` Rob Landley
2005-11-01 19:02 ` Ingo Molnar
1 sibling, 1 reply; 253+ messages in thread
From: Rob Landley @ 2005-11-01 18:33 UTC (permalink / raw)
To: Ingo Molnar
Cc: Mel Gorman, Nick Piggin, Martin J. Bligh, Andrew Morton, kravetz,
linux-mm, linux-kernel, lhms-devel
On Tuesday 01 November 2005 08:46, Ingo Molnar wrote:
> how will the 100% solution handle a simple kmalloc()-ed kernel buffer,
> that is pinned down, and to/from which live pointers may exist? That
> alone can prevent RAM from being removable.
Would you like to apply your "100% or nothing" argument to the virtual memory
management subsystem and see how it sounds in that context? (As an argument
that we shouldn't _have_ one?)
> Ingo
Rob
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 18:33 ` Rob Landley
@ 2005-11-01 19:02 ` Ingo Molnar
0 siblings, 0 replies; 253+ messages in thread
From: Ingo Molnar @ 2005-11-01 19:02 UTC (permalink / raw)
To: Rob Landley
Cc: Mel Gorman, Nick Piggin, Martin J. Bligh, Andrew Morton, kravetz,
linux-mm, linux-kernel, lhms-devel
* Rob Landley <rob@landley.net> wrote:
> On Tuesday 01 November 2005 08:46, Ingo Molnar wrote:
> > how will the 100% solution handle a simple kmalloc()-ed kernel buffer,
> > that is pinned down, and to/from which live pointers may exist? That
> > alone can prevent RAM from being removable.
>
> Would you like to apply your "100% or nothing" argument to the virtual
> memory management subsystem and see how it sounds in that context?
> (As an argument that we shouldn't _have_ one?)
that would be comparing apples to oranges. There is a big difference
between "VM failures under high load", and "failure of VM functionality
for no user-visible reason". The fragmentation problem here has nothing
to do with pathological workloads. It has to do with 'unlucky'
allocation patterns that pin down RAM areas which thus become
non-removable. The RAM module will be non-removable for no user-visible
reason. Possible under zero load, and with lots of free RAM otherwise.
Ingo
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 14:41 ` [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 Mel Gorman
2005-11-01 14:46 ` Ingo Molnar
@ 2005-11-01 14:50 ` Dave Hansen
2005-11-01 15:24 ` Mel Gorman
2005-11-02 5:11 ` Andrew Morton
2 siblings, 1 reply; 253+ messages in thread
From: Dave Hansen @ 2005-11-01 14:50 UTC (permalink / raw)
To: Mel Gorman
Cc: Ingo Molnar, Nick Piggin, Martin J. Bligh, Andrew Morton,
kravetz, linux-mm, Linux Kernel Mailing List, lhms
On Tue, 2005-11-01 at 14:41 +0000, Mel Gorman wrote:
> o Mechanism for taking regions of memory offline. Again, I think the
> memory hotplug crowd have something for this. If they don't, one of them
> will chime in.
I'm not sure what you're asking for here.
Right now, you can offline based on NUMA node, or physical address.
It's all revealed in sysfs. Sounds like "regions" to me. :)
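For reference, the sysfs interface being described looks roughly like the following on a 2.6-era memory-hotplug kernel. The section number is machine-specific and the transcript is illustrative, not a recipe:

```shell
# Hypothetical transcript: offlining one memory section via sysfs.
cat /sys/devices/system/memory/memory3/state
# online
echo offline > /sys/devices/system/memory/memory3/state
cat /sys/devices/system/memory/memory3/state
# offline -- only succeeds if every page in the section could be freed or migrated
```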
-- Dave
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 14:50 ` Dave Hansen
@ 2005-11-01 15:24 ` Mel Gorman
0 siblings, 0 replies; 253+ messages in thread
From: Mel Gorman @ 2005-11-01 15:24 UTC (permalink / raw)
To: Dave Hansen
Cc: Ingo Molnar, Nick Piggin, Martin J. Bligh, Andrew Morton,
kravetz, linux-mm, Linux Kernel Mailing List, lhms
On Tue, 1 Nov 2005, Dave Hansen wrote:
> On Tue, 2005-11-01 at 14:41 +0000, Mel Gorman wrote:
> > o Mechanism for taking regions of memory offline. Again, I think the
> > memory hotplug crowd have something for this. If they don't, one of them
> > will chime in.
>
> I'm not sure what you're asking for here.
>
> Right now, you can offline based on NUMA node, or physical address.
> It's all revealed in sysfs. Sounds like "regions" to me. :)
>
Ah yes, that would do the job all right.
--
Mel Gorman
Part-time Phd Student Java Applications Developer
University of Limerick IBM Dublin Software Lab
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 14:41 ` [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 Mel Gorman
2005-11-01 14:46 ` Ingo Molnar
2005-11-01 14:50 ` Dave Hansen
@ 2005-11-02 5:11 ` Andrew Morton
2 siblings, 0 replies; 253+ messages in thread
From: Andrew Morton @ 2005-11-02 5:11 UTC (permalink / raw)
To: Mel Gorman
Cc: mingo, nickpiggin, mbligh, kravetz, linux-mm, linux-kernel, lhms-devel
Mel Gorman <mel@csn.ul.ie> wrote:
>
> As GFP_ATOMIC and GFP_NOFS cannot do
> any reclaim work themselves
Both GFP_NOFS and GFP_NOIO can indeed perform direct reclaim. All
we require is __GFP_WAIT.
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 13:56 ` Ingo Molnar
2005-11-01 14:10 ` Dave Hansen
2005-11-01 14:41 ` [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 Mel Gorman
@ 2005-11-01 18:23 ` Rob Landley
2005-11-01 20:31 ` Joel Schopp
2 siblings, 1 reply; 253+ messages in thread
From: Rob Landley @ 2005-11-01 18:23 UTC (permalink / raw)
To: Ingo Molnar
Cc: Mel Gorman, Nick Piggin, Martin J. Bligh, Andrew Morton, kravetz,
linux-mm, linux-kernel, lhms-devel
On Tuesday 01 November 2005 07:56, Ingo Molnar wrote:
> * Mel Gorman <mel@csn.ul.ie> wrote:
> > The set of patches do fix a lot and make a strong start at addressing
> > the fragmentation problem, just not 100% of the way. [...]
>
> do you have an expectation to be able to solve the 'fragmentation
> problem', all the time, in a 100% way, now or in the future?
Considering anybody can allocate memory and never release it, _any_ 100%
solution is going to require migrating existing pages, regardless of
allocation strategy.
> > So, with this set of patches, how fragmented you get is dependent on
> > the workload and it may still break down and high order allocations
> > will fail. But the current situation is that it will definitely break
> > down. The fact is that it has been reported that memory hotplug remove
> > works with these patches and doesn't without them. Granted, this is
> > just one feature on a high-end machine, but it is one solid operation
> > we can perform with the patches and cannot without them. [...]
>
> can you always, under any circumstance hot unplug RAM with these patches
> applied? If not, do you have any expectation to reach 100%?
You're asking intentionally leading questions, aren't you? Without on-demand
page migration a given area of physical memory would only ever be free by
sheer coincidence. Less fragmented page allocation doesn't address _where_
the free areas are, it just tries to make them contiguous.
A page migration strategy would have to do less work if there's less
fragmention, and it also allows you to cluster the "difficult" cases (such as
kernel structures that just ain't moving) so you can much more easily
hot-unplug everything else. It also makes larger order allocations easier to
do so drivers needing that can load as modules after boot, and it also means
hugetlb comes a lot closer to general purpose infrastructure rather than a
funky boot-time reservation thing. Plus page prezeroing approaches get to
work on larger chunks, and so on.
But any strategy to demand that "this physical memory range must be freed up
now" will by definition require moving pages...
> Ingo
Rob
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 18:23 ` Rob Landley
@ 2005-11-01 20:31 ` Joel Schopp
0 siblings, 0 replies; 253+ messages in thread
From: Joel Schopp @ 2005-11-01 20:31 UTC (permalink / raw)
To: Rob Landley
Cc: Ingo Molnar, Mel Gorman, Nick Piggin, Martin J. Bligh,
Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel
>>>The set of patches do fix a lot and make a strong start at addressing
>>>the fragmentation problem, just not 100% of the way. [...]
>>
>>do you have an expectation to be able to solve the 'fragmentation
>>problem', all the time, in a 100% way, now or in the future?
>
>
> Considering anybody can allocate memory and never release it, _any_ 100%
> solution is going to require migrating existing pages, regardless of
> allocation strategy.
>
Three issues here: fragmentation of memory in general, fragmentation of usage,
and being able to have a 100% success rate at removing memory.
We will never be able to have 100% contiguous memory with no fragmentation.
Ever. Certainly not while we have non-movable pieces of memory. Even if we
could move every piece of memory it would be impractical. What these patches do
for general fragmentation is to keep the allocations that never will get freed
away from the rest of memory, so that memory has a chance to form larger
contiguous ranges when it is freed.
Separating memory based on usage has another side effect. It also makes
possible some more active defragmentation methods on the easier memory, because
it doesn't have annoying hard-to-move memory scattered throughout. Suddenly we can talk
about being able to do memory hotplug remove on significant portions of memory.
Or allocating these hugepages after boot. Or doing active defragmentation.
Or modules being able to be modules because they don't have to preallocate big
pieces of contiguous memory.
Some people will argue that we need 100% separation of usage or no separation at
all. Well, change the fallback array to not allow kernel non-reclaimable to
fall back and we are done: a 4-line change, 100% separation. But the tradeoff is
that under memory pressure we might fail allocations while we still have free
memory. There are other options for fallback, of course; the fallback_alloc()
function is easily replaceable if somebody wants to. Many of these options get
easier once memory migration is in. The way fallback is done in the current
patches is to maintain current behavior as much as possible, satisfy
allocations, and not affect performance.
As to the 100% success at removing memory, this set of patches doesn't solve
that. But it solves the 80% problem quite nicely (when combined with the memory
migration patches). 80% is great for virtualized systems where the OS has some
choice over which memory to remove, but not the quantity to remove. It is also
a good start to 100%, because we can separate and identify the easy memory from
the hard memory. Dave Hansen has outlined in separate posts how we can get to
100%, including hard memory.
>>can you always, under any circumstance hot unplug RAM with these patches
>>applied? If not, do you have any expectation to reach 100%?
>
>
> You're asking intentionally leading questions, aren't you? Without on-demand
> page migration a given area of physical memory would only ever be free by
> sheer coincidence. Less fragmented page allocation doesn't address _where_
> the free areas are, it just tries to make them contiguous.
>
> A page migration strategy would have to do less work if there's less
> fragmention, and it also allows you to cluster the "difficult" cases (such as
> kernel structures that just ain't moving) so you can much more easily
> hot-unplug everything else. It also makes larger order allocations easier to
> do so drivers needing that can load as modules after boot, and it also means
> hugetlb comes a lot closer to general purpose infrastructure rather than a
> funky boot-time reservation thing. Plus page prezeroing approaches get to
> work on larger chunks, and so on.
>
> But any strategy to demand that "this physical memory range must be freed up
> now" will by definition require moving pages...
Perfectly stated.
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 2:35 ` Nick Piggin
2005-11-01 11:57 ` Mel Gorman
@ 2005-11-01 20:59 ` Joel Schopp
2005-11-02 1:06 ` Nick Piggin
1 sibling, 1 reply; 253+ messages in thread
From: Joel Schopp @ 2005-11-01 20:59 UTC (permalink / raw)
To: Nick Piggin
Cc: Mel Gorman, Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
linux-kernel, lhms-devel, Ingo Molnar
>> The patches have gone through a large number of revisions, have been
>> heavily tested and reviewed by a few people. The memory footprint of this
>> approach is smaller than introducing new zones. If the cache footprint,
>> increased branches and instructions were a problem, I would expect
>> them to
>> show up in the aim9 benchmark or the benchmark that ran ghostscript
>> multiple times on a large file.
>>
>
> I appreciate that a lot of work has gone into them. You must appreciate
> that they add a reasonable amount of complexity and a non-zero perormance
> cost to the page allocator.
The patches do add a reasonable amount of complexity to the page allocator. In
my opinion that is the only downside of these patches, even though it is a big
one. What we need to decide as a community is whether there is a less complex
way to do this, and if there isn't, whether the benefit is worth the
increased complexity.
As to the non-zero performance cost, I think hard numbers should carry more
weight than they have been given in this area. Mel has posted hard numbers that
say the patches are a wash with respect to performance. I don't see any
evidence to contradict those results.
>> They will need high order allocations if we want to provide HugeTLB pages
>> to userspace on-demand rather than reserving at boot-time. This is a
>> future problem, but it's one that is not worth tackling until the
>> fragmentation problem is fixed first.
>>
>
> Sure. In what form, we haven't agreed. I vote zones! :)
I'd like to hear more details of how zones would be less complex while still
solving the problem. I just don't get it.
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 20:59 ` Joel Schopp
@ 2005-11-02 1:06 ` Nick Piggin
2005-11-02 1:41 ` Martin J. Bligh
` (2 more replies)
0 siblings, 3 replies; 253+ messages in thread
From: Nick Piggin @ 2005-11-02 1:06 UTC (permalink / raw)
To: Joel Schopp
Cc: Mel Gorman, Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
linux-kernel, lhms-devel, Ingo Molnar
Joel Schopp wrote:
> The patches do add a reasonable amount of complexity to the page
> allocator. In my opinion that is the only downside of these patches,
> even though it is a big one. What we need to decide as a community is
> if there is a less complex way to do this, and if there isn't a less
> complex way then is the benefit worth the increased complexity.
>
> As to the non-zero performance cost, I think hard numbers should carry
> more weight than they have been given in this area. Mel has posted hard
> numbers that say the patches are a wash with respect to performance. I
> don't see any evidence to contradict those results.
>
The numbers I have seen show that performance is decreased. People
like Ken Chen spend months trying to find a 0.05% improvement in
performance. Not long ago I just spent days getting our cached
kbuild performance back to where 2.4 is on my build system.
I can simply see they will cost more icache, more dcache, more branches,
etc. in what is the hottest part of the kernel in some workloads (kernel
compiles, for one).
I'm sorry if I sound like a wet blanket. I just don't look at a patch
and think "wow all those 3 guys with Linux on IBM mainframes and using
lpars are going to be so much happier now, this is something we need".
>>> They will need high order allocations if we want to provide HugeTLB pages
>>> to userspace on-demand rather than reserving at boot-time. This is a
>>> future problem, but it's one that is not worth tackling until the
>>> fragmentation problem is fixed first.
>>>
>>
>> Sure. In what form, we haven't agreed. I vote zones! :)
>
>
> I'd like to hear more details of how zones would be less complex while
> still solving the problem. I just don't get it.
>
You have an extra zone. You size that zone at boot according to the
amount of memory you need to be able to free. Only easy-reclaim stuff
goes in that zone.
It is less complex because zones are a complexity we already have to
live with. 99% of the infrastructure is already there to do this.
If you want to hot unplug memory or guarantee hugepage allocation,
this is the way to do it. Nobody has told me why this *doesn't* work.
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 1:06 ` Nick Piggin
@ 2005-11-02 1:41 ` Martin J. Bligh
2005-11-02 2:03 ` Nick Piggin
2005-11-02 11:37 ` Mel Gorman
2005-11-02 15:11 ` Mel Gorman
2 siblings, 1 reply; 253+ messages in thread
From: Martin J. Bligh @ 2005-11-02 1:41 UTC (permalink / raw)
To: Nick Piggin, Joel Schopp
Cc: Mel Gorman, Andrew Morton, kravetz, linux-mm, linux-kernel,
lhms-devel, Ingo Molnar
> The numbers I have seen show that performance is decreased. People
> like Ken Chen spend months trying to find a 0.05% improvement in
> performance. Not long ago I just spent days getting our cached
> kbuild performance back to where 2.4 is on my build system.
Ironically, we're currently trying to chase down a 'database benchmark'
regression that seems to have been caused by the last round of "let's
rewrite the scheduler again" (more details later). Nick, you've added an
awful lot of complexity to some of these code paths yourself ... seems
ironic that you're the one complaining about it ;-)
>>> Sure. In what form, we haven't agreed. I vote zones! :)
>>
>>
>> I'd like to hear more details of how zones would be less complex while
>> still solving the problem. I just don't get it.
>>
>
> You have an extra zone. You size that zone at boot according to the
> amount of memory you need to be able to free. Only easy-reclaim stuff
> goes in that zone.
>
> It is less complex because zones are a complexity we already have to
> live with. 99% of the infrastructure is already there to do this.
>
> If you want to hot unplug memory or guarantee hugepage allocation,
> this is the way to do it. Nobody has told me why this *doesn't* work.
Because the zone is statically sized, and you're back to the same crap
we had with 32bit systems of splitting ZONE_NORMAL and ZONE_HIGHMEM,
effectively. Define how much you need for system ram, and how much
for easily reclaimable memory at boot time. You can't - it doesn't work.
M.
^ permalink raw reply [flat|nested] 253+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 1:41 ` Martin J. Bligh
@ 2005-11-02 2:03 ` Nick Piggin
2005-11-02 2:24 ` Martin J. Bligh
2005-11-02 11:41 ` Mel Gorman
0 siblings, 2 replies; 253+ messages in thread
From: Nick Piggin @ 2005-11-02 2:03 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Joel Schopp, Mel Gorman, Andrew Morton, kravetz, linux-mm,
linux-kernel, lhms-devel, Ingo Molnar
Martin J. Bligh wrote:
>>The numbers I have seen show that performance is decreased. People
>>like Ken Chen spend months trying to find a 0.05% improvement in
>>performance. Not long ago I just spent days getting our cached
>>kbuild performance back to where 2.4 is on my build system.
>
>
> Ironically, we're currently trying to chase down a 'database benchmark'
> regression that seems to have been caused by the last round of "let's
> rewrite the scheduler again" (more details later). Nick, you've added an
> awful lot of complexity to some of these code paths yourself ... seems
> ironic that you're the one complaining about it ;-)
>
Yeah that's unfortunate, but I think a large portion of the problem
(if they are anything the same) has been narrowed down to some over
eager wakeup balancing for which there are a number of proposed
patches.
But in this case I was more worried about getting the groundwork done
for handling the multicore systems that everyone will soon
be using rather than several % performance regression on TPC-C (not
to say that I don't care about that at all)... I don't see the irony.
But let's move this to another thread if it is going to continue. I
would be happy to discuss scheduler problems.
>>You have an extra zone. You size that zone at boot according to the
>>amount of memory you need to be able to free. Only easy-reclaim stuff
>>goes in that zone.
>>
>>It is less complex because zones are a complexity we already have to
>>live with. 99% of the infrastructure is already there to do this.
>>
>>If you want to hot unplug memory or guarantee hugepage allocation,
>>this is the way to do it. Nobody has told me why this *doesn't* work.
>
>
> Because the zone is statically sized, and you're back to the same crap
> we had with 32bit systems of splitting ZONE_NORMAL and ZONE_HIGHMEM,
> effectively. Define how much you need for system ram, and how much
> for easily reclaimable memory at boot time. You can't - it doesn't work.
>
You can't what? What doesn't work? If you have no hard limits set,
then the frag patches can't guarantee anything either.
You can't have it both ways. Either you have limits for things or
you don't need any guarantees. Zones handle the former case nicely,
and we currently do the latter case just fine (along with the frag
patches).
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 2:03 ` Nick Piggin
@ 2005-11-02 2:24 ` Martin J. Bligh
2005-11-02 2:49 ` Nick Piggin
2005-11-02 11:41 ` Mel Gorman
1 sibling, 1 reply; 253+ messages in thread
From: Martin J. Bligh @ 2005-11-02 2:24 UTC (permalink / raw)
To: Nick Piggin
Cc: Joel Schopp, Mel Gorman, Andrew Morton, kravetz, linux-mm,
linux-kernel, lhms-devel, Ingo Molnar
>>> The numbers I have seen show that performance is decreased. People
>>> like Ken Chen spend months trying to find a 0.05% improvement in
>>> performance. Not long ago I just spent days getting our cached
>>> kbuild performance back to where 2.4 is on my build system.
>>
>> Ironically, we're currently trying to chase down a 'database benchmark'
>> regression that seems to have been caused by the last round of "let's
>> rewrite the scheduler again" (more details later). Nick, you've added an
>> awful lot of complexity to some of these code paths yourself ... seems
>> ironic that you're the one complaining about it ;-)
>
> Yeah that's unfortunate, but I think a large portion of the problem
> (if they are anything the same) has been narrowed down to some over
> eager wakeup balancing for which there are a number of proposed
> patches.
>
> But in this case I was more worried about getting the groundwork done
> for handling the multicore systems that everyone will soon
> be using rather than several % performance regression on TPC-C (not
> to say that I don't care about that at all)... I don't see the irony.
>
> But let's move this to another thread if it is going to continue. I
> would be happy to discuss scheduler problems.
My point was that most things we do add complexity to the codebase,
including the things you do yourself ... I'm not saying that we're worse
off for the changes you've made, by any means - I think they've been
mostly beneficial. I'm just pointing out that we ALL do it, so let us
not be too quick to judge when others propose adding something that does ;-)
>> Because the zone is statically sized, and you're back to the same crap
>> we had with 32bit systems of splitting ZONE_NORMAL and ZONE_HIGHMEM,
>> effectively. Define how much you need for system ram, and how much
>> for easily reclaimable memory at boot time. You can't - it doesn't work.
>
> You can't what? What doesn't work? If you have no hard limits set,
> then the frag patches can't guarantee anything either.
>
> You can't have it both ways. Either you have limits for things or
> you don't need any guarantees. Zones handle the former case nicely,
> and we currently do the latter case just fine (along with the frag
> patches).
I'll go look through Mel's current patchset again. I was under the
impression it didn't suffer from this problem, at least not as much
as zones did.
Nothing is guaranteed. You can shag the whole machine and/or VM in
any number of ways ... if we can significantly improve the probability
of existing higher order allocs working, and new functionality has
an excellent probability of success, that's as good as you're going to
get. Have a free "perfect is the enemy of good" Linus quote, on me ;-)
M.
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 2:24 ` Martin J. Bligh
@ 2005-11-02 2:49 ` Nick Piggin
2005-11-02 4:39 ` Martin J. Bligh
` (2 more replies)
0 siblings, 3 replies; 253+ messages in thread
From: Nick Piggin @ 2005-11-02 2:49 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Joel Schopp, Mel Gorman, Andrew Morton, kravetz, linux-mm,
linux-kernel, lhms-devel, Ingo Molnar
Martin J. Bligh wrote:
>>But let's move this to another thread if it is going to continue. I
>>would be happy to discuss scheduler problems.
>
>
> My point was that most things we do add complexity to the codebase,
> including the things you do yourself ... I'm not saying that we're worse
> off for the changes you've made, by any means - I think they've been
> mostly beneficial.
Heh - I like the "mostly" ;)
> I'm just pointing out that we ALL do it, so let us
> not be too quick to judge when others propose adding something that does ;-)
>
What I'm getting worried about is the marked increase in the
rate of features and complexity going in.
I am almost certainly never going to use memory hotplug or
demand paging of hugepages. I am pretty likely going to have
to wade through this code at some point in the future if it
is merged.
It is also going to slow down my kernel by maybe 1% when
doing kbuilds, but hey let's not worry about that until we've
merged 10 more such slowdowns (ok that wasn't aimed at you or
Mel, but my perception of the status quo).
>
>>You can't what? What doesn't work? If you have no hard limits set,
>>then the frag patches can't guarantee anything either.
>>
>>You can't have it both ways. Either you have limits for things or
>>you don't need any guarantees. Zones handle the former case nicely,
>>and we currently do the latter case just fine (along with the frag
>>patches).
>
>
> I'll go look through Mel's current patchset again. I was under the
> impression it didn't suffer from this problem, at least not as much
> as zones did.
>
Over time, I don't think it can offer any stronger a guarantee
than what we currently have. I'm not even sure that it would be
any better at all for problematic workloads as time -> infinity.
> Nothing is guaranteed. You can shag the whole machine and/or VM in
> any number of ways ... if we can significantly improve the probability
> of existing higher order allocs working, and new functionality has
> an excellent probability of success, that's as good as you're going to
> get. Have a free "perfect is the enemy of good" Linus quote, on me ;-)
>
I think it falls down if these higher order allocations actually
get *used* for anything. You'll simply be going through the process
of replacing your contiguous, easy-to-reclaim memory with pinned
kernel memory.
However, for the purpose of memory hot unplug, a new zone *will*
guarantee memory can be reclaimed and unplugged.
--
SUSE Labs, Novell Inc.
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 2:49 ` Nick Piggin
@ 2005-11-02 4:39 ` Martin J. Bligh
2005-11-02 5:09 ` Nick Piggin
2005-11-02 7:19 ` Yasunori Goto
2005-11-02 11:48 ` Mel Gorman
2 siblings, 1 reply; 253+ messages in thread
From: Martin J. Bligh @ 2005-11-02 4:39 UTC (permalink / raw)
To: Nick Piggin
Cc: Joel Schopp, Mel Gorman, Andrew Morton, kravetz, linux-mm,
linux-kernel, lhms-devel, Ingo Molnar
>> I'm just pointing out that we ALL do it, so let us
>> not be too quick to judge when others propose adding something that does ;-)
>
> What I'm getting worried about is the marked increase in the
> rate of features and complexity going in.
>
> I am almost certainly never going to use memory hotplug or
> demand paging of hugepages. I am pretty likely going to have
> to wade through this code at some point in the future if it
> is merged.
Mmm. Though whether any one of us will personally use each feature
is perhaps not the most ideal criteria to judge things by ;-)
> It is also going to slow down my kernel by maybe 1% when
> doing kbuilds, but hey let's not worry about that until we've
> merged 10 more such slowdowns (ok that wasn't aimed at you or
> Mel, but my perception of the status quo).
If it's really 1%, yes, that's a huge problem. And yes, I agree with
you that there's a problem with the rate of change. Part of that is
a lack of performance measurement and testing, and the quality sometimes
scares me (though the last month has actually been significantly better,
the tree mostly builds and boots now!). I've tried to do something on
the testing front, but I'm acutely aware it's not sufficient by any means.
>>> You can't what? What doesn't work? If you have no hard limits set,
>>> then the frag patches can't guarantee anything either.
>>>
>>> You can't have it both ways. Either you have limits for things or
>>> you don't need any guarantees. Zones handle the former case nicely,
>>> and we currently do the latter case just fine (along with the frag
>>> patches).
>>
>> I'll go look through Mel's current patchset again. I was under the
>> impression it didn't suffer from this problem, at least not as much
>> as zones did.
>
> Over time, I don't think it can offer any stronger a guarantee
> than what we currently have. I'm not even sure that it would be
> any better at all for problematic workloads as time -> infinity.
Sounds worth discussing. We need *some* way of dealing with fragmentation
issues. To me that means both an avoidance strategy, and an ability
to actively defragment if we need it. Linux is evolved software, it
may not be perfect at first - that's the way we work, and it's served
us well up till now. To me, that's the biggest advantage we have over
the proprietary model.
>> Nothing is guaranteed. You can shag the whole machine and/or VM in
>> any number of ways ... if we can significantly improve the probability
>> of existing higher order allocs working, and new functionality has
>> an excellent probability of success, that's as good as you're going to
>> get. Have a free "perfect is the enemy of good" Linus quote, on me ;-)
>
> I think it falls down if these higher order allocations actually
> get *used* for anything. You'll simply be going through the process
> of replacing your contiguous, easy-to-reclaim memory with pinned
> kernel memory.
It seems inevitable that we need both physically contiguous memory
sections, and virtually contiguous in kernel space (which equates to
the same thing, unless we totally break the 1-1 P-V mapping and
lose the large page mapping for kernel, which I'd hate to do.)
> However, for the purpose of memory hot unplug, a new zone *will*
> guarantee memory can be reclaimed and unplugged.
It's not just about memory hotplug. There are, as we have discussed
already, many usage for physically contiguous (and virtually contiguous)
memory segments. Focusing purely on any one of them will not solve the
issue at hand ...
M.
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 4:39 ` Martin J. Bligh
@ 2005-11-02 5:09 ` Nick Piggin
2005-11-02 5:14 ` Martin J. Bligh
0 siblings, 1 reply; 253+ messages in thread
From: Nick Piggin @ 2005-11-02 5:09 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Joel Schopp, Mel Gorman, Andrew Morton, kravetz, linux-mm,
linux-kernel, lhms-devel, Ingo Molnar
Martin J. Bligh wrote:
>>I am almost certainly never going to use memory hotplug or
>>demand paging of hugepages. I am pretty likely going to have
>>to wade through this code at some point in the future if it
>>is merged.
>
>
> Mmm. Though whether any one of us will personally use each feature
> is perhaps not the most ideal criteria to judge things by ;-)
>
Of course, but I'd say very few people will. Then again maybe
I'm just a luddite who doesn't know what's good for him ;)
>
>>It is also going to slow down my kernel by maybe 1% when
>>doing kbuilds, but hey let's not worry about that until we've
>>merged 10 more such slowdowns (ok that wasn't aimed at you or
>>Mel, but my perception of the status quo).
>
>
> If it's really 1%, yes, that's a huge problem. And yes, I agree with
> you that there's a problem with the rate of change. Part of that is
> a lack of performance measurement and testing, and the quality sometimes
> scares me (though the last month has actually been significantly better,
> the tree mostly builds and boots now!). I've tried to do something on
> the testing front, but I'm acutely aware it's not sufficient by any means.
>
To be honest I haven't tested so this is an unfounded guess. However
it is based on what I have seen of Mel's numbers, and the fact that
the kernel spends nearly 1/3rd of its time in the page allocator when
running a kbuild.
I may get around to getting some real numbers when my current patch
queues shrink.
>>Over time, I don't think it can offer any stronger a guarantee
>>than what we currently have. I'm not even sure that it would be
>>any better at all for problematic workloads as time -> infinity.
>
>
> Sounds worth discussing. We need *some* way of dealing with fragmentation
> issues. To me that means both an avoidance strategy, and an ability
> to actively defragment if we need it. Linux is evolved software, it
> may not be perfect at first - that's the way we work, and it's served
> us well up till now. To me, that's the biggest advantage we have over
> the proprietary model.
>
True and I'm also annoyed that we have these issues at all. I just
don't see that the avoidance strategy helps that much because as I
said, you don't need to keep these lovely contiguous regions just for
show (or other easy-to-reclaim user pages).
The absolute priority is to move away from higher order allocs or
use fallbacks IMO. And that doesn't necessarily mean order 1 or even
2 allocations, because we don't seem to have a problem with those.
Because I want Linux to be as robust as you do.
>>I think it falls down if these higher order allocations actually
>>get *used* for anything. You'll simply be going through the process
>>of replacing your contiguous, easy-to-reclaim memory with pinned
>>kernel memory.
>
>
> It seems inevitable that we need both physically contiguous memory
> sections, and virtually contiguous in kernel space (which equates to
> the same thing, unless we totally break the 1-1 P-V mapping and
> lose the large page mapping for kernel, which I'd hate to do.)
>
I think this isn't as bad an idea as you think. If it means those
guys doing memory hotplug take a few % performance hit and nobody else
has to bear the costs then that sounds great.
>
>>However, for the purpose of memory hot unplug, a new zone *will*
>>guarantee memory can be reclaimed and unplugged.
>
>
> It's not just about memory hotplug. There are, as we have discussed
> already, many usage for physically contiguous (and virtually contiguous)
> memory segments. Focusing purely on any one of them will not solve the
> issue at hand ...
>
True, but we don't seem to have huge problems with other things. The
main ones that have come up on lkml are e1000 which is getting fixed,
and maybe XFS which I think there are also moves to improve.
--
SUSE Labs, Novell Inc.
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 5:09 ` Nick Piggin
@ 2005-11-02 5:14 ` Martin J. Bligh
2005-11-02 6:23 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 253+ messages in thread
From: Martin J. Bligh @ 2005-11-02 5:14 UTC (permalink / raw)
To: Nick Piggin
Cc: Joel Schopp, Mel Gorman, Andrew Morton, kravetz, linux-mm,
linux-kernel, lhms-devel, Ingo Molnar
>> It's not just about memory hotplug. There are, as we have discussed
>> already, many usage for physically contiguous (and virtually contiguous)
>> memory segments. Focusing purely on any one of them will not solve the
>> issue at hand ...
>
> True, but we don't seem to have huge problems with other things. The
> main ones that have come up on lkml are e1000 which is getting fixed,
> and maybe XFS which I think there are also moves to improve.
It should be fairly easy to trawl through the list of all allocations
and pull out all the higher order ones from the whole source tree. I
suspect there's a lot ... maybe I'll play with it later on.
M.
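One way to do that trawl, as a rough sketch: scan source text for literal non-zero order arguments to a couple of allocator entry points. The regex and helper below are hypothetical illustrations, not an existing tool; they miss computed orders and large kmalloc() sizes:

```python
import re

# Match __get_free_pages(flags, order) / alloc_pages(flags, order)
# where `order` is a literal greater than zero.
HIGH_ORDER = re.compile(
    r'\b(?:__get_free_pages|alloc_pages)\s*\([^,()]+,\s*([1-9]\d*)\s*\)')

def find_high_order(source):
    """Return literal orders of higher-order allocations found in `source`."""
    return [int(m.group(1)) for m in HIGH_ORDER.finditer(source)]

sample = (
    'tbl = (void *)__get_free_pages(GFP_KERNEL, 2);\n'
    'page = alloc_pages(GFP_ATOMIC, 0);\n'  # order-0: ignored
)
print(find_high_order(sample))  # -> [2]
```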
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 5:14 ` Martin J. Bligh
@ 2005-11-02 6:23 ` KAMEZAWA Hiroyuki
2005-11-02 10:15 ` Nick Piggin
0 siblings, 1 reply; 253+ messages in thread
From: KAMEZAWA Hiroyuki @ 2005-11-02 6:23 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Nick Piggin, Joel Schopp, Mel Gorman, Andrew Morton, kravetz,
linux-mm, linux-kernel, lhms-devel, Ingo Molnar
Martin J. Bligh wrote:
>>True, but we don't seem to have huge problems with other things. The
>>main ones that have come up on lkml are e1000 which is getting fixed,
>>and maybe XFS which I think there are also moves to improve.
>
>
> It should be fairly easy to trawl through the list of all allocations
> and pull out all the higher order ones from the whole source tree. I
> suspect there's a lot ... maybe I'll play with it later on.
>
please check kmalloc(32k,64k)
For example, loopback device's default MTU=16436 means order=3 and
maybe there are other high MTU device.
I suspect the skb_makewritable()/skb_copy()/skb_linearize() functions can
suffer from fragmentation when the MTU is big. They allocate a large skb
by gathering fragmented skbs. When these skb_* funcs fail, the packet
is silently discarded by netfilter. If fragmentation is heavy, packets
(especially TCP) using a large MTU never reach their destination, even
over loopback.
Honestly, I'm not familiar with the network code - could anyone comment on this?
-- Kame
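Kame-san's order-3 arithmetic can be checked with a small Python stand-in for the kernel's get_order() helper, assuming 4KB pages (the skb also carries header overhead on top of the MTU, which only pushes the size further into the 32KB bucket):

```python
PAGE_SIZE = 4096

def get_order(size):
    """Smallest order such that (PAGE_SIZE << order) >= size,
    mirroring the kernel's get_order() helper."""
    pages = -(-size // PAGE_SIZE)  # ceiling division
    order = 0
    while (1 << order) < pages:
        order += 1
    return order

# The loopback default MTU of 16436 bytes needs 5 pages, which rounds
# up to an order-3 (8-page, 32KB) allocation:
print(get_order(16436))  # -> 3
```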
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 6:23 ` KAMEZAWA Hiroyuki
@ 2005-11-02 10:15 ` Nick Piggin
0 siblings, 0 replies; 253+ messages in thread
From: Nick Piggin @ 2005-11-02 10:15 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Martin J. Bligh, Joel Schopp, Mel Gorman, Andrew Morton, kravetz,
linux-mm, linux-kernel, lhms-devel, Ingo Molnar
KAMEZAWA Hiroyuki wrote:
> Martin J. Bligh wrote:
>
> please check kmalloc(32k,64k)
>
> For example, loopback device's default MTU=16436 means order=3 and
> maybe there are other high MTU device.
>
> I suspect the skb_makewritable()/skb_copy()/skb_linearize() functions can
> suffer from fragmentation when the MTU is big. They allocate a large skb
> by gathering fragmented skbs. When these skb_* funcs fail, the packet
> is silently discarded by netfilter. If fragmentation is heavy, packets
> (especially TCP) using a large MTU never reach their destination, even
> over loopback.
>
> Honestly, I'm not familiar with the network code - could anyone comment on this?
>
I'd be interested to know, actually. I was hoping loopback would always
use order-0 allocations, because the loopback driver is SG, FRAGLIST,
and HIGHDMA capable. However I'm likewise not familiar with network code.
--
SUSE Labs, Novell Inc.
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 2:49 ` Nick Piggin
2005-11-02 4:39 ` Martin J. Bligh
@ 2005-11-02 7:19 ` Yasunori Goto
2005-11-02 11:48 ` Mel Gorman
2 siblings, 0 replies; 253+ messages in thread
From: Yasunori Goto @ 2005-11-02 7:19 UTC (permalink / raw)
To: Nick Piggin
Cc: Martin J. Bligh, Joel Schopp, Andrew Morton, kravetz, linux-mm,
linux-kernel, lhms-devel, Ingo Molnar, Mel Gorman
Hello.
Nick-san.
I posted patches that add ZONE_REMOVABLE to LHMS.
I don't claim they are better than Mel-san's patches;
I hope they will be a base for good discussion.
There were 2 types.
The first just adds ZONE_REMOVABLE. This patch came from the VA Linux
team's early memory hotplug implementation.
http://sourceforge.net/mailarchive/forum.php?thread_id=5969508&forum_id=223
ZONE_HIGHMEM was used for this purpose in the early implementation;
we thought ZONE_HIGHMEM was easier to remove than the other zones.
But some architectures don't use it. That is why ZONE_REMOVABLE
was born.
(And I remember that ZONE_DMA32 was defined after this patch,
so the number of zones became 5 and one more bit was necessary in
page->flags. I don't know the recent progress of ZONE_DMA32.)
The second was a bit similar to Mel-san's. One motivation for this
patch was to make Removable orthogonal to DMA/Normal/Highmem, which I
thought desirable because ppc64 can treat all of its memory as one
(DMA) zone, and I thought a new zone spoiled that good feature.
http://sourceforge.net/mailarchive/forum.php?thread_id=5345977&forum_id=223
http://sourceforge.net/mailarchive/forum.php?thread_id=5345978&forum_id=223
http://sourceforge.net/mailarchive/forum.php?thread_id=5345979&forum_id=223
http://sourceforge.net/mailarchive/forum.php?thread_id=5345980&forum_id=223
Thanks.
P.S. to Mel-san.
I'm sorry for writing this so late. This thread was a mail bomb for me
to read with my poor English skills. :-(
> Martin J. Bligh wrote:
>
> >>But let's move this to another thread if it is going to continue. I
> >>would be happy to discuss scheduler problems.
> >
> >
> > My point was that most things we do add complexity to the codebase,
> > including the things you do yourself ... I'm not saying that we're worse
> > off for the changes you've made, by any means - I think they've been
> > mostly beneficial.
>
> Heh - I like the "mostly" ;)
>
> > I'm just pointing out that we ALL do it, so let us
> > not be too quick to judge when others propose adding something that does ;-)
> >
>
> What I'm getting worried about is the marked increase in the
> rate of features and complexity going in.
>
> I am almost certainly never going to use memory hotplug or
> demand paging of hugepages. I am pretty likely going to have
> to wade through this code at some point in the future if it
> is merged.
>
> It is also going to slow down my kernel by maybe 1% when
> doing kbuilds, but hey let's not worry about that until we've
> merged 10 more such slowdowns (ok that wasn't aimed at you or
> Mel, but my perception of the status quo).
>
> >
> >>You can't what? What doesn't work? If you have no hard limits set,
> >>then the frag patches can't guarantee anything either.
> >>
> >>You can't have it both ways. Either you have limits for things or
> >>you don't need any guarantees. Zones handle the former case nicely,
> >>and we currently do the latter case just fine (along with the frag
> >>patches).
> >
> >
> > I'll go look through Mel's current patchset again. I was under the
> > impression it didn't suffer from this problem, at least not as much
> > as zones did.
> >
>
> Over time, I don't think it can offer any stronger a guarantee
> than what we currently have. I'm not even sure that it would be
> any better at all for problematic workloads as time -> infinity.
>
> > Nothing is guaranteed. You can shag the whole machine and/or VM in
> > any number of ways ... if we can significantly improve the probability
> > of existing higher order allocs working, and new functionality has
> > an excellent probability of success, that's as good as you're going to
> > get. Have a free "perfect is the enemy of good" Linus quote, on me ;-)
> >
>
> I think it falls down if these higher order allocations actually
> get *used* for anything. You'll simply be going through the process
> of replacing your contiguous, easy-to-reclaim memory with pinned
> kernel memory.
>
> However, for the purpose of memory hot unplug, a new zone *will*
> guarantee memory can be reclaimed and unplugged.
>
> --
> SUSE Labs, Novell Inc.
--
Yasunori Goto
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 2:49 ` Nick Piggin
2005-11-02 4:39 ` Martin J. Bligh
2005-11-02 7:19 ` Yasunori Goto
@ 2005-11-02 11:48 ` Mel Gorman
2 siblings, 0 replies; 253+ messages in thread
From: Mel Gorman @ 2005-11-02 11:48 UTC (permalink / raw)
To: Nick Piggin
Cc: Martin J. Bligh, Joel Schopp, Andrew Morton, kravetz, linux-mm,
linux-kernel, lhms-devel, Ingo Molnar
On Wed, 2 Nov 2005, Nick Piggin wrote:
> Martin J. Bligh wrote:
>
> > > But let's move this to another thread if it is going to continue. I
> > > would be happy to discuss scheduler problems.
> >
> >
> > My point was that most things we do add complexity to the codebase,
> > including the things you do yourself ... I'm not saying that we're worse
> > off for the changes you've made, by any means - I think they've been
> > mostly beneficial.
>
> Heh - I like the "mostly" ;)
>
> > I'm just pointing out that we ALL do it, so let us
> > not be too quick to judge when others propose adding something that does ;-)
> >
>
> What I'm getting worried about is the marked increase in the
> rate of features and complexity going in.
>
> I am almost certainly never going to use memory hotplug or
> demand paging of hugepages. I am pretty likely going to have
> to wade through this code at some point in the future if it
> is merged.
>
Plenty of features in the kernel I don't use either :) .
> It is also going to slow down my kernel by maybe 1% when
> doing kbuilds, but hey let's not worry about that until we've
> merged 10 more such slowdowns (ok that wasn't aimed at you or
> Mel, but my perception of the status quo).
>
Ok, my patches show performance gains and losses on different parts of
Aim9. page_test is slightly down but fork_test was considerably up. Both
would have an effect on kbuild, so more figures are needed on more
machines. That will only come from testing on a variety of machines.
> >
> > > You can't what? What doesn't work? If you have no hard limits set,
> > > then the frag patches can't guarantee anything either.
> > >
> > > You can't have it both ways. Either you have limits for things or
> > > you don't need any guarantees. Zones handle the former case nicely,
> > > and we currently do the latter case just fine (along with the frag
> > > patches).
> >
> >
> > I'll go look through Mel's current patchset again. I was under the
> > impression it didn't suffer from this problem, at least not as much
> > as zones did.
> >
>
> Over time, I don't think it can offer any stronger a guarantee
> than what we currently have. I'm not even sure that it would be
> any better at all for problematic workloads as time -> infinity.
>
Not as they currently stand, no. As I've said elsewhere, to really
guarantee things, kswapd would need to know how to clear out UserRclm
pages from the other reserve types.
> > Nothing is guaranteed. You can shag the whole machine and/or VM in
> > any number of ways ... if we can significantly improve the probability of
> > existing higher order allocs working, and new functionality has
> > an excellent probability of success, that's as good as you're going to get.
> > Have a free "perfect is the enemy of good" Linus quote, on me ;-)
> >
>
> I think it falls down if these higher order allocations actually
> get *used* for anything. You'll simply be going through the process
> of replacing your contiguous, easy-to-reclaim memory with pinned
> kernel memory.
>
And a misconfigured zone-based approach just falls apart. Going to finish
that summary mail to avoid repetition.
> However, for the purpose of memory hot unplug, a new zone *will*
> guarantee memory can be reclaimed and unplugged.
>
>
--
Mel Gorman
Part-time Phd Student, University of Limerick
Java Applications Developer, IBM Dublin Software Lab
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 2:03 ` Nick Piggin
2005-11-02 2:24 ` Martin J. Bligh
@ 2005-11-02 11:41 ` Mel Gorman
1 sibling, 0 replies; 253+ messages in thread
From: Mel Gorman @ 2005-11-02 11:41 UTC (permalink / raw)
To: Nick Piggin
Cc: Martin J. Bligh, Joel Schopp, Andrew Morton, kravetz, linux-mm,
linux-kernel, lhms-devel, Ingo Molnar
On Wed, 2 Nov 2005, Nick Piggin wrote:
> Martin J. Bligh wrote:
> > > The numbers I have seen show that performance is decreased. People
> > > like Ken Chen spend months trying to find a 0.05% improvement in
> > > performance. Not long ago I just spent days getting our cached
> > > kbuild performance back to where 2.4 is on my build system.
> >
> >
> > Ironically, we're currently trying to chase down a 'database benchmark'
> > regression that seems to have been caused by the last round of "let's
> > rewrite the scheduler again" (more details later). Nick, you've added an
> > awful lot of complexity to some of these code paths yourself ... seems
> > ironic that you're the one complaining about it ;-)
> >
>
> Yeah that's unfortunate, but I think a large portion of the problem
> (if they are anything the same) has been narrowed down to some over
> eager wakeup balancing for which there are a number of proposed
> patches.
>
> But in this case I was more worried about getting the groundwork done
> for handling the multicore systems that everyone will soon
> be using rather than several % performance regression on TPC-C (not
> to say that I don't care about that at all)... I don't see the irony.
>
> But let's move this to another thread if it is going to continue. I
> would be happy to discuss scheduler problems.
>
> > > You have an extra zone. You size that zone at boot according to the
> > > amount of memory you need to be able to free. Only easy-reclaim stuff
> > > goes in that zone.
> > >
> > > It is less complex because zones are a complexity we already have to
> > > live with. 99% of the infrastructure is already there to do this.
> > >
> > > If you want to hot unplug memory or guarantee hugepage allocation,
> > > this is the way to do it. Nobody has told me why this *doesn't* work.
> >
> >
> > Because the zone is statically sized, and you're back to the same crap
> > we had with 32bit systems of splitting ZONE_NORMAL and ZONE_HIGHMEM,
> > effectively. Define how much you need for system ram, and how much
> > for easily reclaimable memory at boot time. You can't - it doesn't work.
> >
>
> You can't what? What doesn't work? If you have no hard limits set,
> then the frag patches can't guarantee anything either.
>
True, but the difference is:
Anti-defrag: best effort at low cost (according to Aim9), with no tunable
Zones: will work, but requires a tunable; falls apart if tuned wrong
> You can't have it both ways. Either you have limits for things or
> you don't need any guarantees. Zones handle the former case nicely,
> and we currently do the latter case just fine (along with the frag
> patches).
>
Sure, so you compromise and do best effort for as long as possible.
Always try to keep fragmentation low. If the system is configured to
really need low fragmentation, then after a long period of time, a
page-migration mechanism kicks in to move the kernel pages out of EasyRclm
areas and we continue on.
--
Mel Gorman
Part-time Phd Student Java Applications Developer
University of Limerick IBM Dublin Software Lab
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 1:06 ` Nick Piggin
2005-11-02 1:41 ` Martin J. Bligh
@ 2005-11-02 11:37 ` Mel Gorman
2005-11-02 15:11 ` Mel Gorman
2 siblings, 0 replies; 253+ messages in thread
From: Mel Gorman @ 2005-11-02 11:37 UTC (permalink / raw)
To: Nick Piggin
Cc: Joel Schopp, Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
linux-kernel, lhms-devel, Ingo Molnar
On Wed, 2 Nov 2005, Nick Piggin wrote:
> Joel Schopp wrote:
>
> > The patches do add a reasonable amount of complexity to the page allocator.
> > In my opinion that is the only downside of these patches, even though it is
> > a big one. What we need to decide as a community is if there is a less
> > complex way to do this, and if there isn't a less complex way then is the
> > benefit worth the increased complexity.
> >
> > As to the non-zero performance cost, I think hard numbers should carry more
> > weight than they have been given in this area. Mel has posted hard numbers
> > that say the patches are a wash with respect to performance. I don't see
> > any evidence to contradict those results.
> >
>
> The numbers I have seen show that performance is decreased. People
> like Ken Chen spend months trying to find a 0.05% improvement in
> performance.
Fine, that is understandable. The AIM9 benchmarks also show performance
improvements in other areas like fork_test: about a 5% difference, which is
also important for kernel builds. Wider testing would be needed to see if
the improvements are specific to my tests or not. Every set of patches
has had a performance regression test run with Aim9, so I certainly have
not been ignoring performance.
> Not long ago I just spent days getting our cached
> kbuild performance back to where 2.4 is on my build system.
>
Then it would be interesting to find out how 2.6.14-rc5-mm1 compares
against 2.6.14-rc5-mm1-mbuddy-v19?
> I can simply see they will cost more icache, more dcache, more branches,
> etc. in what is the hottest part of the kernel in some workloads (kernel
> compiles, for one).
>
> I'm sorry if I sound like a wet blanket. I just don't look at a patch
> and think "wow all those 3 guys with Linux on IBM mainframes and using
> lpars are going to be so much happier now, this is something we need".
>
I developed this as the beginning of a long term solution for on-demand
HugeTLB pages as part of a PhD. This could potentially help desktop
workloads in the future. Hotplug machines are a benefit that the work
picked up along the way. We can help hotplug to some extent today and
desktop users in the future (and given time, all of the hotplug problems
as well). But if we tell desktop users "Yeah, your applications will run a
bit better with HugeTLB pages as long as you configure the size of the
zone correctly" at any stage, we'll be told where to go.
> > > > The will need high order allocations if we want to provide HugeTLB pages
> > > > to userspace on-demand rather than reserving at boot-time. This is a
> > > > future problem, but it's one that is not worth tackling until the
> > > > fragmentation problem is fixed first.
> > > >
> > >
> > > Sure. In what form, we haven't agreed. I vote zones! :)
> >
> >
> > I'd like to hear more details of how zones would be less complex while still
> > solving the problem. I just don't get it.
> >
>
> You have an extra zone. You size that zone at boot according to the
> amount of memory you need to be able to free. Only easy-reclaim stuff
> goes in that zone.
>
Helps hotplug, no one else. Rules out HugeTLB on demand for userspace
unless we are willing to tell desktop users to configure this tunable.
> It is less complex because zones are a complexity we already have to
> live with. 99% of the infrastructure is already there to do this.
>
The simplicity of zones is still in dispute. I am putting together a mail
of pros, cons, situations and future work for both approaches. I hope to
send it out fairly soon.
> If you want to hot unplug memory or guarantee hugepage allocation,
> this is the way to do it. Nobody has told me why this *doesn't* work.
>
It can hot unplug the configured zone of memory and guarantee hugepage
allocation only for userspace. It gives kernel allocations no way to get a
huge page under any circumstance. Our approach allows the kernel to get
the large page, at the cost of fragmentation degrading slowly over time. To
stop it fragmenting slowly over time, more work is needed.
--
Mel Gorman
Part-time Phd Student Java Applications Developer
University of Limerick IBM Dublin Software Lab
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 1:06 ` Nick Piggin
2005-11-02 1:41 ` Martin J. Bligh
2005-11-02 11:37 ` Mel Gorman
@ 2005-11-02 15:11 ` Mel Gorman
2 siblings, 0 replies; 253+ messages in thread
From: Mel Gorman @ 2005-11-02 15:11 UTC (permalink / raw)
To: Nick Piggin
Cc: Joel Schopp, Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
linux-kernel, lhms-devel, Ingo Molnar
On (02/11/05 12:06), Nick Piggin didst pronounce:
> Joel Schopp wrote:
>
> >The patches do add a reasonable amount of complexity to the page
> >allocator. In my opinion that is the only downside of these patches,
> >even though it is a big one. What we need to decide as a community is
> >if there is a less complex way to do this, and if there isn't a less
> >complex way then is the benefit worth the increased complexity.
> >
> >As to the non-zero performance cost, I think hard numbers should carry
> >more weight than they have been given in this area. Mel has posted hard
> >numbers that say the patches are a wash with respect to performance. I
> >don't see any evidence to contradict those results.
> >
>
> The numbers I have seen show that performance is decreased. People
> like Ken Chen spend months trying to find a 0.05% improvement in
> performance. Not long ago I just spent days getting our cached
> kbuild performance back to where 2.4 is on my build system.
>
One contention point is the overhead this introduces. Let's say that
we do discover that kbuild is slower with this patch (still unknown);
then we have to get rid of mbuddy, disable it or replace it with an
as-yet-to-be-written zone-based approach.
I wrote a quick patch that disables anti-defrag via a config option and ran
aim9 on the test machine I have been using all along. I deliberately changed
as little of the anti-defrag code as possible, but maybe we could make this
patch even smaller, or go the other way and conditionally compile out as much
anti-defrag as possible.
Here are the Aim9 comparisons between -clean and
-mbuddy-v19-antidefrag-disabled-with-config-option (just the one run)
These are both based on 2.6.14-rc5-mm1
vanilla-mm mbuddy-disabled-via-config
1 creat-clo 16006.00 15844.72 -161.28 -1.01% File Creations and Closes/second
2 page_test 117515.83 119696.77 2180.94 1.86% System Allocations & Pages/second
3 brk_test 440289.81 439870.04 -419.77 -0.10% System Memory Allocations/second
4 jmp_test 4179466.67 4179150.00 -316.67 -0.01% Non-local gotos/second
5 signal_test 80803.20 82055.98 1252.78 1.55% Signal Traps/second
6 exec_test 61.75 61.53 -0.22 -0.36% Program Loads/second
7 fork_test 1327.01 1344.55 17.54 1.32% Task Creations/second
8 link_test 5531.53 5548.33 16.80 0.30% Link/Unlink Pairs/second
On this kernel, I forgot to disable the collection of buddy allocator
statistics. Collection introduces more overhead in both CPU and memory.
Here are the figures when statistic collection is also disabled via the
config option.
vanilla-mm mbuddy-disabled-via-config-nostats
1 creat-clo 16006.00 15906.06 -99.94 -0.62% File Creations and Closes/second
2 page_test 117515.83 120736.54 3220.71 2.74% System Allocations & Pages/second
3 brk_test 440289.81 430311.61 -9978.20 -2.27% System Memory Allocations/second
4 jmp_test 4179466.67 4181683.33 2216.66 0.05% Non-local gotos/second
5 signal_test 80803.20 87387.54 6584.34 8.15% Signal Traps/second
6 exec_test 61.75 62.14 0.39 0.63% Program Loads/second
7 fork_test 1327.01 1345.77 18.76 1.41% Task Creations/second
8 link_test 5531.53 5556.72 25.19 0.46% Link/Unlink Pairs/second
So, now we have performance gains in a number of areas. There is a nice big
jump in page_test, and the fork_test improvement probably won't hurt kbuild
either, with exec_test giving a bit of a nudge. signal_test has a big hike
for some reason; not sure who will benefit there, but hey, it can't be bad.
I am annoyed with brk_test, especially as it is very similar to page_test in
the aim9 source code, but there is no point hiding the result either. These
figures do not tell us how kbuild really performs, of course. For that, kbuild
needs to be run on both kernels and compared. The same applies to any workload.
This anti-defrag makes the code more complex and harder to read, no
argument there. However, on at least one test machine, there is a very small
difference when anti-defrag is enabled in comparison to a vanilla kernel.
With the patches applied and anti-defrag disabled via the config option,
we see a number of performance gains, on one machine at least, which is a
good thing. Wider testing would show whether these good figures are specific
to my testbed or not.
If other testbeds show nothing bad, anti-defrag with this additional
patch could give us the best of both worlds. If you have a hotplug machine
or you care about high orders, enable this option. Otherwise, choose N and
avoid the anti-defrag overhead.
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-mbuddy-v19-noconfig/include/linux/gfp.h linux-2.6.14-rc5-mm1-mbuddy-v19-withconfig/include/linux/gfp.h
--- linux-2.6.14-rc5-mm1-mbuddy-v19-noconfig/include/linux/gfp.h 2005-11-02 12:44:06.000000000 +0000
+++ linux-2.6.14-rc5-mm1-mbuddy-v19-withconfig/include/linux/gfp.h 2005-11-02 12:49:24.000000000 +0000
@@ -50,6 +50,7 @@ struct vm_area_struct;
#define __GFP_HARDWALL 0x40000u /* Enforce hardwall cpuset memory allocs */
#define __GFP_VALID 0x80000000u /* valid GFP flags */
+#ifdef CONFIG_PAGEALLOC_ANTIDEFRAG
/*
* Allocation type modifiers, these are required to be adjacent
* __GFP_EASYRCLM: Easily reclaimed pages like userspace or buffer pages
@@ -61,6 +62,11 @@ struct vm_area_struct;
#define __GFP_EASYRCLM 0x80000u /* User and other easily reclaimed pages */
#define __GFP_KERNRCLM 0x100000u /* Kernel page that is reclaimable */
#define __GFP_RCLM_BITS (__GFP_EASYRCLM|__GFP_KERNRCLM)
+#else
+#define __GFP_EASYRCLM 0
+#define __GFP_KERNRCLM 0
+#define __GFP_RCLM_BITS 0
+#endif /* CONFIG_PAGEALLOC_ANTIDEFRAG */
#define __GFP_BITS_SHIFT 21 /* Room for 21 __GFP_FOO bits */
#define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1)
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-mbuddy-v19-noconfig/include/linux/mmzone.h linux-2.6.14-rc5-mm1-mbuddy-v19-withconfig/include/linux/mmzone.h
--- linux-2.6.14-rc5-mm1-mbuddy-v19-noconfig/include/linux/mmzone.h 2005-11-02 12:44:07.000000000 +0000
+++ linux-2.6.14-rc5-mm1-mbuddy-v19-withconfig/include/linux/mmzone.h 2005-11-02 13:00:56.000000000 +0000
@@ -23,6 +23,7 @@
#endif
#define PAGES_PER_MAXORDER (1 << (MAX_ORDER-1))
+#ifdef CONFIG_PAGEALLOC_ANTIDEFRAG
/*
* The two bit field __GFP_RECLAIMBITS enumerates the following types of
* page reclaimability.
@@ -33,6 +34,14 @@
#define RCLM_FALLBACK 3
#define RCLM_TYPES 4
#define BITS_PER_RCLM_TYPE 2
+#else
+#define RCLM_NORCLM 0
+#define RCLM_EASY 0
+#define RCLM_KERN 0
+#define RCLM_FALLBACK 0
+#define RCLM_TYPES 1
+#define BITS_PER_RCLM_TYPE 0
+#endif
#define for_each_rclmtype_order(type, order) \
for (order = 0; order < MAX_ORDER; order++) \
@@ -60,6 +69,7 @@ struct zone_padding {
#define ZONE_PADDING(name)
#endif
+#ifdef CONFIG_PAGEALLOC_ANTIDEFRAG
/*
* Indices into pcpu_list
* PCPU_KERNEL: For RCLM_NORCLM and RCLM_KERN allocations
@@ -68,6 +78,11 @@ struct zone_padding {
#define PCPU_KERNEL 0
#define PCPU_EASY 1
#define PCPU_TYPES 2
+#else
+#define PCPU_KERNEL 0
+#define PCPU_EASY 0
+#define PCPU_TYPES 1
+#endif
struct per_cpu_pages {
int count[PCPU_TYPES]; /* Number of pages on each list */
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-mbuddy-v19-noconfig/init/Kconfig linux-2.6.14-rc5-mm1-mbuddy-v19-withconfig/init/Kconfig
--- linux-2.6.14-rc5-mm1-mbuddy-v19-noconfig/init/Kconfig 2005-11-02 12:42:20.000000000 +0000
+++ linux-2.6.14-rc5-mm1-mbuddy-v19-withconfig/init/Kconfig 2005-11-02 12:59:49.000000000 +0000
@@ -419,6 +419,17 @@ config CC_ALIGN_JUMPS
no dummy operations need be executed.
Zero means use compiler's default.
+config PAGEALLOC_ANTIDEFRAG
+ bool "Try and avoid fragmentation in the page allocator"
+ def_bool y
+ help
+ The standard allocator will fragment memory over time which means that
+ high order allocations will fail even if kswapd is running. If this
+ option is set, the allocator will try and group page types into
+ three groups KernNoRclm, KernRclm and EasyRclm. The gain is a best
+ effort attempt at lowering fragmentation. The loss is more complexity
+
+
endmenu # General setup
config TINY_SHMEM
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-mbuddy-v19-noconfig/mm/page_alloc.c linux-2.6.14-rc5-mm1-mbuddy-v19-withconfig/mm/page_alloc.c
--- linux-2.6.14-rc5-mm1-mbuddy-v19-noconfig/mm/page_alloc.c 2005-11-02 13:05:07.000000000 +0000
+++ linux-2.6.14-rc5-mm1-mbuddy-v19-withconfig/mm/page_alloc.c 2005-11-02 14:09:37.000000000 +0000
@@ -57,11 +57,17 @@ long nr_swap_pages;
* fallback_allocs contains the fallback types for low memory conditions
* where the preferred alloction type if not available.
*/
+#ifdef CONFIG_PAGEALLOC_ANTIDEFRAG
int fallback_allocs[RCLM_TYPES-1][RCLM_TYPES+1] = {
{RCLM_NORCLM, RCLM_FALLBACK, RCLM_KERN, RCLM_EASY, RCLM_TYPES},
{RCLM_EASY, RCLM_FALLBACK, RCLM_NORCLM, RCLM_KERN, RCLM_TYPES},
{RCLM_KERN, RCLM_FALLBACK, RCLM_NORCLM, RCLM_EASY, RCLM_TYPES}
};
+#else
+int fallback_allocs[RCLM_TYPES][RCLM_TYPES+1] = {
+ {RCLM_NORCLM, RCLM_TYPES}
+};
+#endif /* CONFIG_PAGEALLOC_ANTIDEFRAG */
/* Returns 1 if the needed percentage of the zone is reserved for fallbacks */
static inline int min_fallback_reserved(struct zone *zone)
@@ -98,6 +104,7 @@ EXPORT_SYMBOL(totalram_pages);
#error __GFP_KERNRCLM not mapping to RCLM_KERN
#endif
+#ifdef CONFIG_PAGEALLOC_ANTIDEFRAG
/*
* This function maps gfpflags to their RCLM_TYPE. It makes assumptions
* on the location of the GFP flags.
@@ -115,6 +122,12 @@ static inline int gfpflags_to_rclmtype(g
return rclmbits >> RCLM_SHIFT;
}
+#else
+static inline int gfpflags_to_rclmtype(gfp_t gfp_flags)
+{
+ return RCLM_NORCLM;
+}
+#endif /* CONFIG_PAGEALLOC_ANTIDEFRAG */
/*
* copy_bits - Copy bits between bitmaps
@@ -134,6 +147,9 @@ static inline void copy_bits(unsigned lo
int sindex_src,
int nr)
{
+ if (nr == 0)
+ return;
+
/*
* Written like this to take advantage of arch-specific
* set_bit() and clear_bit() functions
@@ -188,8 +204,12 @@ static char *zone_names[MAX_NR_ZONES] =
int min_free_kbytes = 1024;
#ifdef CONFIG_ALLOCSTATS
+#ifdef CONFIG_PAGEALLOC_ANTIDEFRAG
static char *type_names[RCLM_TYPES] = { "KernNoRclm", "EasyRclm",
"KernRclm", "Fallback"};
+#else
+static char *type_names[RCLM_TYPES] = { "KernNoRclm" };
+#endif /* CONFIG_PAGEALLOC_ANTIDEFRAG */
#endif /* CONFIG_ALLOCSTATS */
unsigned long __initdata nr_kernel_pages;
@@ -2228,8 +2248,10 @@ static void __init setup_usemap(struct p
struct zone *zone, unsigned long zonesize)
{
unsigned long usemapsize = usemap_size(zonesize);
- zone->free_area_usemap = alloc_bootmem_node(pgdat, usemapsize);
- memset(zone->free_area_usemap, RCLM_NORCLM, usemapsize);
+ if (usemapsize != 0) {
+ zone->free_area_usemap = alloc_bootmem_node(pgdat, usemapsize);
+ memset(zone->free_area_usemap, RCLM_NORCLM, usemapsize);
+ }
}
#else
static void inline setup_usemap(struct pglist_data *pgdat,
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 1:31 ` Nick Piggin
2005-11-01 2:07 ` Mel Gorman
@ 2005-11-01 15:25 ` Martin J. Bligh
2005-11-01 15:33 ` Dave Hansen
1 sibling, 1 reply; 253+ messages in thread
From: Martin J. Bligh @ 2005-11-01 15:25 UTC (permalink / raw)
To: Nick Piggin, Mel Gorman
Cc: Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel, Ingo Molnar
> I really don't think we *want* to say we support higher order allocations
> absolutely robustly, nor do we want people using them if possible. Because
> we don't. Even with your patches.
>
> Ingo also brought up this point at Ottawa.
Some of the driver issues can be fixed by scatter-gather DMA *if* the
h/w supports it. But what exactly do you propose to do about kernel
stacks, etc? By the time you've fixed all the individual usages of it,
frankly, it would be easier to provide a generic mechanism to fix the
problem ...
M.
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 15:25 ` Martin J. Bligh
@ 2005-11-01 15:33 ` Dave Hansen
2005-11-01 16:57 ` Mel Gorman
2005-11-01 18:58 ` Rob Landley
0 siblings, 2 replies; 253+ messages in thread
From: Dave Hansen @ 2005-11-01 15:33 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Nick Piggin, Mel Gorman, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms, Ingo Molnar
On Tue, 2005-11-01 at 07:25 -0800, Martin J. Bligh wrote:
> > I really don't think we *want* to say we support higher order allocations
> > absolutely robustly, nor do we want people using them if possible. Because
> > we don't. Even with your patches.
> >
> > Ingo also brought up this point at Ottawa.
>
> Some of the driver issues can be fixed by scatter-gather DMA *if* the
> h/w supports it. But what exactly do you propose to do about kernel
> stacks, etc? By the time you've fixed all the individual usages of it,
> frankly, it would be easier to provide a generic mechanism to fix the
> problem ...
That generic mechanism is the kernel virtual remapping. However, it has
a runtime performance cost, which is increased TLB footprint inside the
kernel, and a more costly implementation of __pa() and __va().
I'll admit, I'm biased toward partial solutions without runtime cost
before we start incurring constant cost across the entire kernel,
especially when those partial solutions have other potential in-kernel
users.
-- Dave
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 15:33 ` Dave Hansen
@ 2005-11-01 16:57 ` Mel Gorman
2005-11-01 17:00 ` Mel Gorman
2005-11-01 18:58 ` Rob Landley
1 sibling, 1 reply; 253+ messages in thread
From: Mel Gorman @ 2005-11-01 16:57 UTC (permalink / raw)
To: Dave Hansen
Cc: Martin J. Bligh, Nick Piggin, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms, Ingo Molnar
On Tue, 1 Nov 2005, Dave Hansen wrote:
> On Tue, 2005-11-01 at 07:25 -0800, Martin J. Bligh wrote:
> > > I really don't think we *want* to say we support higher order allocations
> > > absolutely robustly, nor do we want people using them if possible. Because
> > > we don't. Even with your patches.
> > >
> > > Ingo also brought up this point at Ottawa.
> >
> > Some of the driver issues can be fixed by scatter-gather DMA *if* the
> > h/w supports it. But what exactly do you propose to do about kernel
> > stacks, etc? By the time you've fixed all the individual usages of it,
> > frankly, it would be easier to provide a generic mechanism to fix the
> > problem ...
>
> That generic mechanism is the kernel virtual remapping. However, it has
> a runtime performance cost, which is increased TLB footprint inside the
> kernel, and a more costly implementation of __pa() and __va().
>
> I'll admit, I'm biased toward partial solutions without runtime cost
> before we start incurring constant cost across the entire kernel,
> especially when those partial solutions have other potential in-kernel
> users.
To give an idea of the increased TLB footprint, I ran an aim9 test with
cpu_has_pse disabled in include/arch-i386/cpufeature.h to force the use
of small pages for the physical memory mappings.
This is the -clean results
clean clean-nopse
1 creat-clo 16006.00 15294.90 -711.10 -4.44% File Creations and Closes/second
2 page_test 117515.83 118677.11 1161.28 0.99% System Allocations & Pages/second
3 brk_test 440289.81 436042.64 -4247.17 -0.96% System Memory Allocations/second
4 jmp_test 4179466.67 4173266.67 -6200.00 -0.15% Non-local gotos/second
5 signal_test 80803.20 78286.95 -2516.25 -3.11% Signal Traps/second
6 exec_test 61.75 60.45 -1.30 -2.11% Program Loads/second
7 fork_test 1327.01 1318.11 -8.90 -0.67% Task Creations/second
8 link_test 5531.53 5406.60 -124.93 -2.26% Link/Unlink Pairs/second
This is what mbuddy-v19 with and without pse looks like
mbuddy-v19 mbuddy-v19-nopse
1 creat-clo 15889.41 15328.22 -561.19 -3.53% File Creations and Closes/second
2 page_test 117082.15 116892.70 -189.45 -0.16% System Allocations & Pages/second
3 brk_test 437887.37 432716.97 -5170.40 -1.18% System Memory Allocations/second
4 jmp_test 4179950.00 4176087.32 -3862.68 -0.09% Non-local gotos/second
5 signal_test 85335.78 78553.57 -6782.21 -7.95% Signal Traps/second
6 exec_test 61.92 60.61 -1.31 -2.12% Program Loads/second
7 fork_test 1342.21 1292.26 -49.95 -3.72% Task Creations/second
8 link_test 5555.55 5412.90 -142.65 -2.57% Link/Unlink Pairs/second
--
Mel Gorman
Part-time Phd Student Java Applications Developer
University of Limerick IBM Dublin Software Lab
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 16:57 ` Mel Gorman
@ 2005-11-01 17:00 ` Mel Gorman
0 siblings, 0 replies; 253+ messages in thread
From: Mel Gorman @ 2005-11-01 17:00 UTC (permalink / raw)
To: Dave Hansen
Cc: Martin J. Bligh, Nick Piggin, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms, Ingo Molnar
On Tue, 1 Nov 2005, Mel Gorman wrote:
> On Tue, 1 Nov 2005, Dave Hansen wrote:
>
> > On Tue, 2005-11-01 at 07:25 -0800, Martin J. Bligh wrote:
> > > > I really don't think we *want* to say we support higher order allocations
> > > > absolutely robustly, nor do we want people using them if possible. Because
> > > > we don't. Even with your patches.
> > > >
> > > > Ingo also brought up this point at Ottawa.
> > >
> > > Some of the driver issues can be fixed by scatter-gather DMA *if* the
> > > h/w supports it. But what exactly do you propose to do about kernel
> > > stacks, etc? By the time you've fixed all the individual usages of it,
> > > frankly, it would be easier to provide a generic mechanism to fix the
> > > problem ...
> >
> > That generic mechanism is the kernel virtual remapping. However, it has
> > a runtime performance cost, which is increased TLB footprint inside the
> > kernel, and a more costly implementation of __pa() and __va().
> >
> > I'll admit, I'm biased toward partial solutions without runtime cost
> > before we start incurring constant cost across the entire kernel,
> > especially when those partial solutions have other potential in-kernel
> > users.
>
> To give an idea of the increased TLB footprint, I ran an aim9 test with
> cpu_has_pse disabled in include/arch-i386/cpufeature.h to force the use
> of small pages for the physical memory mappings.
>
> This is the -clean results
>
> clean clean-nopse
> 1 creat-clo 16006.00 15294.90 -711.10 -4.44% File Creations and Closes/second
> 2 page_test 117515.83 118677.11 1161.28 0.99% System Allocations & Pages/second
> 3 brk_test 440289.81 436042.64 -4247.17 -0.96% System Memory Allocations/second
> 4 jmp_test 4179466.67 4173266.67 -6200.00 -0.15% Non-local gotos/second
> 5 signal_test 80803.20 78286.95 -2516.25 -3.11% Signal Traps/second
> 6 exec_test 61.75 60.45 -1.30 -2.11% Program Loads/second
> 7 fork_test 1327.01 1318.11 -8.90 -0.67% Task Creations/second
> 8 link_test 5531.53 5406.60 -124.93 -2.26% Link/Unlink Pairs/second
>
> This is what mbuddy-v19 with and without pse looks like
>
> mbuddy-v19 mbuddy-v19-nopse
> 1 creat-clo 15889.41 15328.22 -561.19 -3.53% File Creations and Closes/second
> 2 page_test 117082.15 116892.70 -189.45 -0.16% System Allocations & Pages/second
> 3 brk_test 437887.37 432716.97 -5170.40 -1.18% System Memory Allocations/second
> 4 jmp_test 4179950.00 4176087.32 -3862.68 -0.09% Non-local gotos/second
> 5 signal_test 85335.78 78553.57 -6782.21 -7.95% Signal Traps/second
> 6 exec_test 61.92 60.61 -1.31 -2.12% Program Loads/second
> 7 fork_test 1342.21 1292.26 -49.95 -3.72% Task Creations/second
> 8 link_test 5555.55 5412.90 -142.65 -2.57% Link/Unlink Pairs/second
>
I forgot to include the comparison between -clean and -mbuddy-v19-nopse
clean mbuddy-v19-nopse
1 creat-clo 16006.00 15328.22 -677.78 -4.23% File Creations and Closes/second
2 page_test 117515.83 116892.70 -623.13 -0.53% System Allocations & Pages/second
3 brk_test 440289.81 432716.97 -7572.84 -1.72% System Memory Allocations/second
4 jmp_test 4179466.67 4176087.32 -3379.35 -0.08% Non-local gotos/second
5 signal_test 80803.20 78553.57 -2249.63 -2.78% Signal Traps/second
6 exec_test 61.75 60.61 -1.14 -1.85% Program Loads/second
7 fork_test 1327.01 1292.26 -34.75 -2.62% Task Creations/second
8 link_test 5531.53 5412.90 -118.63 -2.14% Link/Unlink Pairs/second
--
Mel Gorman
Part-time Phd Student Java Applications Developer
University of Limerick IBM Dublin Software Lab
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 15:33 ` Dave Hansen
2005-11-01 16:57 ` Mel Gorman
@ 2005-11-01 18:58 ` Rob Landley
1 sibling, 0 replies; 253+ messages in thread
From: Rob Landley @ 2005-11-01 18:58 UTC (permalink / raw)
To: Dave Hansen
Cc: Martin J. Bligh, Nick Piggin, Mel Gorman, Andrew Morton, kravetz,
linux-mm, Linux Kernel Mailing List, lhms, Ingo Molnar
On Tuesday 01 November 2005 09:33, Dave Hansen wrote:
> On Tue, 2005-11-01 at 07:25 -0800, Martin J. Bligh wrote:
> > > I really don't think we *want* to say we support higher order
> > > allocations absolutely robustly, nor do we want people using them if
> > > possible. Because we don't. Even with your patches.
> > >
> > > Ingo also brought up this point at Ottawa.
> >
> > Some of the driver issues can be fixed by scatter-gather DMA *if* the
> > h/w supports it. But what exactly do you propose to do about kernel
> > stacks, etc? By the time you've fixed all the individual usages of it,
> > frankly, it would be easier to provide a generic mechanism to fix the
> > problem ...
>
> That generic mechanism is the kernel virtual remapping. However, it has
> a runtime performance cost, which is increased TLB footprint inside the
> kernel, and a more costly implementation of __pa() and __va().
Ok, right now the kernel _has_ a virtual mapping, it's just a 1:1 with the
physical mapping, right?
In theory, if you restrict all kernel unmovable mappings to a physically
contiguous address range (something like ZONE_DMA) that's at the start of the
physical address space, then what you could do is have a two-kernel-monte
like situation where if you _NEED_ to move the kernel you quiesce the system
(as if you're going to swsusp), figure out where the new start of physical
memory will be when this bank goes bye-bye, memcpy the whole mess to the new
location, adjust your one VMA, and then call the swsusp unfreeze stuff.
This is ugly, and a huge latency spike, but why wouldn't it work? The problem
now becomes finding some NEW physically contiguous range to shoehorn the
kernel into, and that's a problem that Mel's already addressing...
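The quiesce/copy/retarget sequence above can be modelled in miniature. This is purely illustrative (a userspace toy, with "physical memory" as a byte array); none of these function names exist in the kernel. The point is that because the kernel is one contiguous, 1:1-mapped region, relocation reduces to a single memcpy plus adjusting one base offset.

```c
#include <string.h>
#include <stddef.h>

/* Toy model: "physical memory" is a byte array; the kernel occupies
 * [kern_base, kern_base + KERN_SIZE). */
enum { PHYS_SIZE = 4096, KERN_SIZE = 256 };
static unsigned char phys[PHYS_SIZE];
static size_t kern_base;                 /* stands in for the one VMA's physical base */

/* "memcpy the whole mess to the new location, adjust your one VMA" */
static void relocate_kernel(size_t new_base)
{
    memmove(&phys[new_base], &phys[kern_base], KERN_SIZE);
    kern_base = new_base;
}

/* Kernel virtual offset -> physical index; unchanged by relocation
 * because only kern_base moves. */
static size_t kern_va_to_pa(size_t va)
{
    return kern_base + va;
}
```

In the real case the quiesce step (freezing everything, as for swsusp) is what makes the single-offset adjustment safe: no CPU can hold a stale physical address across the copy.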
Rob
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-10-31 14:34 ` Martin J. Bligh
2005-10-31 19:24 ` Andrew Morton
2005-10-31 23:29 ` Nick Piggin
@ 2005-11-01 14:40 ` Avi Kivity
2 siblings, 0 replies; 253+ messages in thread
From: Avi Kivity @ 2005-11-01 14:40 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Andrew Morton, Nick Piggin, kravetz, mel, linux-mm, linux-kernel,
lhms-devel
Martin J. Bligh wrote:
>To me, the question is "do we support higher order allocations, or not?".
>Pretending we do, making a half-assed job of it, and then it not working
>well under pressure is not helping anyone. I'm told, for instance, that
>AMD64 requires > 4K stacks - that's pretty fundamental, as just one
>instance. I'd rather make Linux pretty bulletproof - the added feature
>stuff is just a bonus that comes for free with that.
>
>
This particular example doesn't warrant higher-order allocations. We can
easily reserve 8GB of virtual space and map 8K stacks there. This is
enough for 1M threads, and if you want more, there's plenty of virtual
address space where those 8GB came from.
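The arithmetic behind that claim checks out: an 8GB virtual window split into 8KB stacks gives exactly 2^33 / 2^13 = 2^20, i.e. 1M stacks. A trivial sketch (the helper name is mine, not from any kernel API):

```c
#include <stdint.h>

/* How many fixed-size stacks fit in a reserved virtual window. */
static uint64_t max_stacks(uint64_t window_bytes, uint64_t stack_bytes)
{
    return window_bytes / stack_bytes;
}
```

Note this sidesteps fragmentation entirely because the stacks need only be virtually contiguous; each 8K stack can be backed by two arbitrary 4K physical pages.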
The other common example (jumbo frames) can probably use scatter-gather,
though that depends on the hardware.