* [PATCH 0/3] teach kswapd about higher order allocations
@ 2004-10-27 8:00 Nick Piggin
2004-10-27 8:00 ` [PATCH 1/3] keep count of free areas Nick Piggin
0 siblings, 1 reply; 17+ messages in thread
From: Nick Piggin @ 2004-10-27 8:00 UTC (permalink / raw)
To: Andrew Morton; +Cc: Linux Memory Management, Linus Torvalds
Hi Andrew,
Can this get into mm please?
I haven't been able to do a huge amount of testing, because
there aren't many higher order allocations these days.
However:
I have got into situations where memory becomes completely
fragmented and the Gbe network card starts spewing a lot of
allocations failures. In some cases (eg. ifup) it will just
sit there indefinitely cranking out order:2 allocation
failures. These patches definitely fix those situations by
having kswapd free some higher order areas.
Higher order area watermarks are enforced lazily - that is,
if nobody is doing order 1 allocations, no attempt is ever
made to free order 1 areas, even if none are available. Also,
if kswapd can't free up the right amount of higher order areas,
it eventually gives up on them until being kicked again.
In this way, I don't think this patch has any overscanning
failure cases.
Note:
It generally doesn't take much work to free up memory to get
networking going because the skb allocations are transient,
so you only need some set number of higher order areas free,
and the network buffers just keep reusing them. Other allocations
don't tend to touch them much because the buddy allocator takes
low order pages first.
Linus was the only one with any real objections, but once I
explained myself better he thought this was fairly reasonable
(I think).
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 1/3] keep count of free areas
2004-10-27 8:00 [PATCH 0/3] teach kswapd about higher order allocations Nick Piggin
@ 2004-10-27 8:00 ` Nick Piggin
2004-10-27 8:02 ` [PATCH 2/3] higher order watermarks Nick Piggin
0 siblings, 1 reply; 17+ messages in thread
From: Nick Piggin @ 2004-10-27 8:00 UTC (permalink / raw)
To: Andrew Morton; +Cc: Linux Memory Management, Linus Torvalds
[-- Attachment #1: Type: text/plain, Size: 4 bytes --]
1/3
[-- Attachment #2: vm-free-order-pages.patch --]
[-- Type: text/x-patch, Size: 3618 bytes --]
Keep track of the number of free pages of each order in the buddy allocator.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
---
linux-2.6-npiggin/include/linux/mmzone.h | 1 +
linux-2.6-npiggin/mm/page_alloc.c | 23 +++++++++--------------
2 files changed, 10 insertions(+), 14 deletions(-)
diff -puN mm/page_alloc.c~vm-free-order-pages mm/page_alloc.c
--- linux-2.6/mm/page_alloc.c~vm-free-order-pages 2004-10-27 14:27:31.000000000 +1000
+++ linux-2.6-npiggin/mm/page_alloc.c 2004-10-27 16:41:28.000000000 +1000
@@ -209,6 +209,7 @@ static inline void __free_pages_bulk (st
BUG_ON(bad_range(zone, buddy1));
BUG_ON(bad_range(zone, buddy2));
list_del(&buddy1->lru);
+ area->nr_free--;
mask <<= 1;
order++;
area++;
@@ -216,6 +217,7 @@ static inline void __free_pages_bulk (st
page_idx &= mask;
}
list_add(&(base + page_idx)->lru, &area->free_list);
+ area->nr_free++;
}
static inline void free_pages_check(const char *function, struct page *page)
@@ -317,6 +319,7 @@ expand(struct zone *zone, struct page *p
size >>= 1;
BUG_ON(bad_range(zone, &page[size]));
list_add(&page[size].lru, &area->free_list);
+ area->nr_free++;
MARK_USED(index + size, high, area);
}
return page;
@@ -380,6 +383,7 @@ static struct page *__rmqueue(struct zon
page = list_entry(area->free_list.next, struct page, lru);
list_del(&page->lru);
+ area->nr_free--;
index = page - zone->zone_mem_map;
if (current_order != MAX_ORDER-1)
MARK_USED(index, current_order, area);
@@ -1120,7 +1124,6 @@ void show_free_areas(void)
}
for_each_zone(zone) {
- struct list_head *elem;
unsigned long nr, flags, order, total = 0;
show_node(zone);
@@ -1132,9 +1135,7 @@ void show_free_areas(void)
spin_lock_irqsave(&zone->lock, flags);
for (order = 0; order < MAX_ORDER; order++) {
- nr = 0;
- list_for_each(elem, &zone->free_area[order].free_list)
- ++nr;
+ nr = zone->free_area[order].nr_free;
total += nr << order;
printk("%lu*%lukB ", nr, K(1UL) << order);
}
@@ -1460,6 +1461,7 @@ void zone_init_free_lists(struct pglist_
bitmap_size = pages_to_bitmap_size(order, size);
zone->free_area[order].map =
(unsigned long *) alloc_bootmem_node(pgdat, bitmap_size);
+ zone->free_area[order].nr_free = 0;
}
}
@@ -1647,8 +1649,7 @@ static void frag_stop(struct seq_file *m
}
/*
- * This walks the freelist for each zone. Whilst this is slow, I'd rather
- * be slow here than slow down the fast path by keeping stats - mjbligh
+ * This walks the free areas for each zone.
*/
static int frag_show(struct seq_file *m, void *arg)
{
@@ -1664,14 +1665,8 @@ static int frag_show(struct seq_file *m,
spin_lock_irqsave(&zone->lock, flags);
seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name);
- for (order = 0; order < MAX_ORDER; ++order) {
- unsigned long nr_bufs = 0;
- struct list_head *elem;
-
- list_for_each(elem, &(zone->free_area[order].free_list))
- ++nr_bufs;
- seq_printf(m, "%6lu ", nr_bufs);
- }
+ for (order = 0; order < MAX_ORDER; ++order)
+ seq_printf(m, "%6lu ", zone->free_area[order].nr_free);
spin_unlock_irqrestore(&zone->lock, flags);
seq_putc(m, '\n');
}
diff -puN include/linux/mmzone.h~vm-free-order-pages include/linux/mmzone.h
--- linux-2.6/include/linux/mmzone.h~vm-free-order-pages 2004-10-27 14:27:31.000000000 +1000
+++ linux-2.6-npiggin/include/linux/mmzone.h 2004-10-27 16:41:28.000000000 +1000
@@ -23,6 +23,7 @@
struct free_area {
struct list_head free_list;
unsigned long *map;
+ unsigned long nr_free;
};
struct pglist_data;
_
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 2/3] higher order watermarks
2004-10-27 8:00 ` [PATCH 1/3] keep count of free areas Nick Piggin
@ 2004-10-27 8:02 ` Nick Piggin
2004-10-27 8:02 ` [PATCH 3/3] teach kswapd about higher order areas Nick Piggin
2004-11-04 8:57 ` [PATCH 2/3] higher order watermarks Marcelo Tosatti
0 siblings, 2 replies; 17+ messages in thread
From: Nick Piggin @ 2004-10-27 8:02 UTC (permalink / raw)
To: Andrew Morton; +Cc: Linux Memory Management, Linus Torvalds
[-- Attachment #1: Type: text/plain, Size: 4 bytes --]
2/3
[-- Attachment #2: vm-alloc-order-watermarks.patch --]
[-- Type: text/x-patch, Size: 3958 bytes --]
Move the watermark checking code into a single function. Extend it to account
for the order of the allocation and the number of free pages that could satisfy
such a request.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
---
linux-2.6-npiggin/include/linux/mmzone.h | 2 +
linux-2.6-npiggin/mm/page_alloc.c | 58 ++++++++++++++++++++-----------
2 files changed, 41 insertions(+), 19 deletions(-)
diff -puN mm/page_alloc.c~vm-alloc-order-watermarks mm/page_alloc.c
--- linux-2.6/mm/page_alloc.c~vm-alloc-order-watermarks 2004-10-27 16:41:32.000000000 +1000
+++ linux-2.6-npiggin/mm/page_alloc.c 2004-10-27 17:53:33.000000000 +1000
@@ -586,6 +586,37 @@ buffered_rmqueue(struct zone *zone, int
}
/*
+ * Return 1 if free pages are above 'mark'. This takes into account the order
+ * of the allocation.
+ */
+int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
+ int alloc_type, int can_try_harder, int gfp_high)
+{
+ /* free_pages my go negative - that's OK */
+ long min = mark, free_pages = z->free_pages - (1 << order) + 1;
+ int o;
+
+ if (gfp_high)
+ min -= min / 2;
+ if (can_try_harder)
+ min -= min / 4;
+
+ if (free_pages <= min + z->protection[alloc_type])
+ return 0;
+ for (o = 0; o < order; o++) {
+ /* At the next order, this order's pages become unavailable */
+ free_pages -= z->free_area[order].nr_free << o;
+
+ /* Require fewer higher order pages to be free */
+ min >>= 1;
+
+ if (free_pages <= min)
+ return 0;
+ }
+ return 1;
+}
+
+/*
* This is the 'heart' of the zoned buddy allocator.
*
* Herein lies the mysterious "incremental min". That's the
@@ -606,7 +637,6 @@ __alloc_pages(unsigned int gfp_mask, uns
struct zonelist *zonelist)
{
const int wait = gfp_mask & __GFP_WAIT;
- unsigned long min;
struct zone **zones, *z;
struct page *page;
struct reclaim_state reclaim_state;
@@ -636,9 +666,9 @@ __alloc_pages(unsigned int gfp_mask, uns
/* Go through the zonelist once, looking for a zone with enough free */
for (i = 0; (z = zones[i]) != NULL; i++) {
- min = z->pages_low + (1<<order) + z->protection[alloc_type];
- if (z->free_pages < min)
+ if (!zone_watermark_ok(z, order, z->pages_low,
+ alloc_type, 0, 0))
continue;
page = buffered_rmqueue(z, order, gfp_mask);
@@ -654,14 +684,9 @@ __alloc_pages(unsigned int gfp_mask, uns
* coming from realtime tasks to go deeper into reserves
*/
for (i = 0; (z = zones[i]) != NULL; i++) {
- min = z->pages_min;
- if (gfp_mask & __GFP_HIGH)
- min /= 2;
- if (can_try_harder)
- min -= min / 4;
- min += (1<<order) + z->protection[alloc_type];
-
- if (z->free_pages < min)
+ if (!zone_watermark_ok(z, order, z->pages_min,
+ alloc_type, can_try_harder,
+ gfp_mask & __GFP_HIGH))
continue;
page = buffered_rmqueue(z, order, gfp_mask);
@@ -697,14 +722,9 @@ rebalance:
/* go through the zonelist yet one more time */
for (i = 0; (z = zones[i]) != NULL; i++) {
- min = z->pages_min;
- if (gfp_mask & __GFP_HIGH)
- min /= 2;
- if (can_try_harder)
- min -= min / 4;
- min += (1<<order) + z->protection[alloc_type];
-
- if (z->free_pages < min)
+ if (!zone_watermark_ok(z, order, z->pages_min,
+ alloc_type, can_try_harder,
+ gfp_mask & __GFP_HIGH))
continue;
page = buffered_rmqueue(z, order, gfp_mask);
diff -puN include/linux/mmzone.h~vm-alloc-order-watermarks include/linux/mmzone.h
--- linux-2.6/include/linux/mmzone.h~vm-alloc-order-watermarks 2004-10-27 16:41:32.000000000 +1000
+++ linux-2.6-npiggin/include/linux/mmzone.h 2004-10-27 17:52:07.000000000 +1000
@@ -279,6 +279,8 @@ void get_zone_counts(unsigned long *acti
unsigned long *free);
void build_all_zonelists(void);
void wakeup_kswapd(struct zone *zone);
+int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
+ int alloc_type, int can_try_harder, int gfp_high);
/*
* zone_idx() returns 0 for the ZONE_DMA zone, 1 for the ZONE_NORMAL zone, etc.
_
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 3/3] teach kswapd about higher order areas
2004-10-27 8:02 ` [PATCH 2/3] higher order watermarks Nick Piggin
@ 2004-10-27 8:02 ` Nick Piggin
2004-10-27 8:13 ` Nick Piggin
2004-11-04 8:57 ` [PATCH 2/3] higher order watermarks Marcelo Tosatti
1 sibling, 1 reply; 17+ messages in thread
From: Nick Piggin @ 2004-10-27 8:02 UTC (permalink / raw)
To: Andrew Morton; +Cc: Linux Memory Management, Linus Torvalds
[-- Attachment #1: Type: text/plain, Size: 4 bytes --]
3/3
[-- Attachment #2: vm-kswapd-heed-order-watermarks.patch --]
[-- Type: text/x-patch, Size: 5856 bytes --]
Teach kswapd to free memory on behalf of higher order allocators. This could
be important for higher order atomic allocations because they otherwise have
no means to free the memory themselves.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
---
linux-2.6-npiggin/include/linux/mmzone.h | 5 +--
linux-2.6-npiggin/mm/page_alloc.c | 3 +
linux-2.6-npiggin/mm/vmscan.c | 48 ++++++++++++++++++-------------
3 files changed, 34 insertions(+), 22 deletions(-)
diff -puN mm/vmscan.c~vm-kswapd-heed-order-watermarks mm/vmscan.c
--- linux-2.6/mm/vmscan.c~vm-kswapd-heed-order-watermarks 2004-10-27 17:57:28.000000000 +1000
+++ linux-2.6-npiggin/mm/vmscan.c 2004-10-27 17:57:28.000000000 +1000
@@ -851,9 +851,6 @@ shrink_caches(struct zone **zones, struc
for (i = 0; zones[i] != NULL; i++) {
struct zone *zone = zones[i];
- if (zone->present_pages == 0)
- continue;
-
zone->temp_priority = sc->priority;
if (zone->prev_priority > sc->priority)
zone->prev_priority = sc->priority;
@@ -968,7 +965,7 @@ out:
* the page allocator fallback scheme to ensure that aging of pages is balanced
* across the zones.
*/
-static int balance_pgdat(pg_data_t *pgdat, int nr_pages)
+static int balance_pgdat(pg_data_t *pgdat, int nr_pages, int order)
{
int to_free = nr_pages;
int all_zones_ok;
@@ -1007,14 +1004,12 @@ loop_again:
for (i = pgdat->nr_zones - 1; i >= 0; i--) {
struct zone *zone = pgdat->node_zones + i;
- if (zone->present_pages == 0)
- continue;
-
if (zone->all_unreclaimable &&
priority != DEF_PRIORITY)
continue;
- if (zone->free_pages <= zone->pages_high) {
+ if (!zone_watermark_ok(zone, order,
+ zone->pages_high, 0, 0, 0)) {
end_zone = i;
goto scan;
}
@@ -1042,14 +1037,12 @@ scan:
for (i = 0; i <= end_zone; i++) {
struct zone *zone = pgdat->node_zones + i;
- if (zone->present_pages == 0)
- continue;
-
if (zone->all_unreclaimable && priority != DEF_PRIORITY)
continue;
if (nr_pages == 0) { /* Not software suspend */
- if (zone->free_pages <= zone->pages_high)
+ if (!zone_watermark_ok(zone, order,
+ zone->pages_high, end_zone, 0, 0))
all_zones_ok = 0;
}
zone->temp_priority = priority;
@@ -1126,6 +1119,7 @@ out:
*/
static int kswapd(void *p)
{
+ unsigned long order;
pg_data_t *pgdat = (pg_data_t*)p;
struct task_struct *tsk = current;
DEFINE_WAIT(wait);
@@ -1154,14 +1148,28 @@ static int kswapd(void *p)
*/
tsk->flags |= PF_MEMALLOC|PF_KSWAPD;
+ order = 0;
for ( ; ; ) {
+ unsigned long new_order;
if (current->flags & PF_FREEZE)
refrigerator(PF_FREEZE);
+
prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
- schedule();
+ new_order = pgdat->kswapd_max_order;
+ pgdat->kswapd_max_order = 0;
+ if (order < new_order) {
+ /*
+ * Don't sleep if someone wants a larger 'order'
+ * allocation
+ */
+ order = new_order;
+ } else {
+ schedule();
+ order = pgdat->kswapd_max_order;
+ }
finish_wait(&pgdat->kswapd_wait, &wait);
- balance_pgdat(pgdat, 0);
+ balance_pgdat(pgdat, 0, order);
}
return 0;
}
@@ -1169,12 +1177,14 @@ static int kswapd(void *p)
/*
* A zone is low on free memory, so wake its kswapd task to service it.
*/
-void wakeup_kswapd(struct zone *zone)
+void wakeup_kswapd(struct zone *zone, int order)
{
- if (zone->present_pages == 0)
- return;
- if (zone->free_pages > zone->pages_low)
+ pg_data_t *pgdat = zone->zone_pgdat;
+
+ if (zone_watermark_ok(zone, order, zone->pages_low, 0, 0, 0))
return;
+ if (pgdat->kswapd_max_order < order)
+ pgdat->kswapd_max_order = order;
if (!waitqueue_active(&zone->zone_pgdat->kswapd_wait))
return;
wake_up_interruptible(&zone->zone_pgdat->kswapd_wait);
@@ -1197,7 +1207,7 @@ int shrink_all_memory(int nr_pages)
current->reclaim_state = &reclaim_state;
for_each_pgdat(pgdat) {
int freed;
- freed = balance_pgdat(pgdat, nr_to_free);
+ freed = balance_pgdat(pgdat, nr_to_free, 0);
ret += freed;
nr_to_free -= freed;
if (nr_to_free <= 0)
diff -puN mm/page_alloc.c~vm-kswapd-heed-order-watermarks mm/page_alloc.c
--- linux-2.6/mm/page_alloc.c~vm-kswapd-heed-order-watermarks 2004-10-27 17:57:28.000000000 +1000
+++ linux-2.6-npiggin/mm/page_alloc.c 2004-10-27 17:57:28.000000000 +1000
@@ -677,7 +677,7 @@ __alloc_pages(unsigned int gfp_mask, uns
}
for (i = 0; (z = zones[i]) != NULL; i++)
- wakeup_kswapd(z);
+ wakeup_kswapd(z, order);
/*
* Go through the zonelist again. Let __GFP_HIGH and allocations
@@ -1506,6 +1506,7 @@ static void __init free_area_init_core(s
pgdat->nr_zones = 0;
init_waitqueue_head(&pgdat->kswapd_wait);
+ pgdat->kswapd_max_order = 0;
for (j = 0; j < MAX_NR_ZONES; j++) {
struct zone *zone = pgdat->node_zones + j;
diff -puN include/linux/mmzone.h~vm-kswapd-heed-order-watermarks include/linux/mmzone.h
--- linux-2.6/include/linux/mmzone.h~vm-kswapd-heed-order-watermarks 2004-10-27 17:57:28.000000000 +1000
+++ linux-2.6-npiggin/include/linux/mmzone.h 2004-10-27 17:57:28.000000000 +1000
@@ -263,8 +263,9 @@ typedef struct pglist_data {
range, including holes */
int node_id;
struct pglist_data *pgdat_next;
- wait_queue_head_t kswapd_wait;
+ wait_queue_head_t kswapd_wait;
struct task_struct *kswapd;
+ int kswapd_max_order;
} pg_data_t;
#define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages)
@@ -278,7 +279,7 @@ void __get_zone_counts(unsigned long *ac
void get_zone_counts(unsigned long *active, unsigned long *inactive,
unsigned long *free);
void build_all_zonelists(void);
-void wakeup_kswapd(struct zone *zone);
+void wakeup_kswapd(struct zone *zone, int order);
int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
int alloc_type, int can_try_harder, int gfp_high);
_
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 3/3] teach kswapd about higher order areas
2004-10-27 8:02 ` [PATCH 3/3] teach kswapd about higher order areas Nick Piggin
@ 2004-10-27 8:13 ` Nick Piggin
0 siblings, 0 replies; 17+ messages in thread
From: Nick Piggin @ 2004-10-27 8:13 UTC (permalink / raw)
To: Andrew Morton; +Cc: Linux Memory Management, Linus Torvalds
Nick Piggin wrote:
> 3/3
>
>
> ------------------------------------------------------------------------
>
>
>
> Teach kswapd to free memory on behalf of higher order allocators. This could
> be important for higher order atomic allocations because they otherwise have
> no means to free the memory themselves.
>
> Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
>
>
> ---
>
> linux-2.6-npiggin/include/linux/mmzone.h | 5 +--
> linux-2.6-npiggin/mm/page_alloc.c | 3 +
> linux-2.6-npiggin/mm/vmscan.c | 48 ++++++++++++++++++-------------
> 3 files changed, 34 insertions(+), 22 deletions(-)
>
> diff -puN mm/vmscan.c~vm-kswapd-heed-order-watermarks mm/vmscan.c
> --- linux-2.6/mm/vmscan.c~vm-kswapd-heed-order-watermarks 2004-10-27 17:57:28.000000000 +1000
> +++ linux-2.6-npiggin/mm/vmscan.c 2004-10-27 17:57:28.000000000 +1000
> @@ -851,9 +851,6 @@ shrink_caches(struct zone **zones, struc
> for (i = 0; zones[i] != NULL; i++) {
> struct zone *zone = zones[i];
>
> - if (zone->present_pages == 0)
> - continue;
> -
Sorry, slight mismerge - this gets rid of thse new fangled checks
(which it shouldn't). Other than that it looks ok though.
Let me know if you want a fixed up patch.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 2/3] higher order watermarks
2004-10-27 8:02 ` [PATCH 2/3] higher order watermarks Nick Piggin
2004-10-27 8:02 ` [PATCH 3/3] teach kswapd about higher order areas Nick Piggin
@ 2004-11-04 8:57 ` Marcelo Tosatti
2004-11-04 12:20 ` Nick Piggin
2004-11-10 16:23 ` Marcelo Tosatti
1 sibling, 2 replies; 17+ messages in thread
From: Marcelo Tosatti @ 2004-11-04 8:57 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, Linux Memory Management, Linus Torvalds
On Wed, Oct 27, 2004 at 06:02:12PM +1000, Nick Piggin wrote:
> 2/3
>
>
> Move the watermark checking code into a single function. Extend it to account
> for the order of the allocation and the number of free pages that could satisfy
> such a request.
>
> Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Hi Nick,
I have a few comments and doubts.
> linux-2.6-npiggin/include/linux/mmzone.h | 2 +
> linux-2.6-npiggin/mm/page_alloc.c | 58 ++++++++++++++++++++-----------
> 2 files changed, 41 insertions(+), 19 deletions(-)
>
> diff -puN mm/page_alloc.c~vm-alloc-order-watermarks mm/page_alloc.c
> --- linux-2.6/mm/page_alloc.c~vm-alloc-order-watermarks 2004-10-27 16:41:32.000000000 +1000
> +++ linux-2.6-npiggin/mm/page_alloc.c 2004-10-27 17:53:33.000000000 +1000
> @@ -586,6 +586,37 @@ buffered_rmqueue(struct zone *zone, int
> }
>
> /*
> + * Return 1 if free pages are above 'mark'. This takes into account the order
> + * of the allocation.
> + */
> +int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
> + int alloc_type, int can_try_harder, int gfp_high)
> +{
> + /* free_pages my go negative - that's OK */
> + long min = mark, free_pages = z->free_pages - (1 << order) + 1;
> + int o;
> +
> + if (gfp_high)
> + min -= min / 2;
> + if (can_try_harder)
> + min -= min / 4;
> +
> + if (free_pages <= min + z->protection[alloc_type])
> + return 0;
> + for (o = 0; o < order; o++) {
> + /* At the next order, this order's pages become unavailable */
> + free_pages -= z->free_area[order].nr_free << o;
> +
> + /* Require fewer higher order pages to be free */
> + min >>= 1;
I can't understand this. You decrease from free_pages
nr_order_free_pages << o, in a loop, and divide min by two.
What is the meaning of "nr_free_pages[order] << o" ? Its only meaningful
when o == order?
You're multiplying the number of free pages of the order the allocation
wants by "0, 1..order". The two values have different meanings, until
o == order.
In the first iteration of the loop, order is 0, so you decrease from free_pages
"z->free_area[order].nr_free". Again, the two values mean different things.
Can you enlight me?
I see you're trying to have some kind of extra protection, but the calculation
is difficult to understand for me.
> +
> + if (free_pages <= min)
> + return 0;
> + }
> + return 1;
> +}
> +
> +/*
> * This is the 'heart' of the zoned buddy allocator.
> *
> * Herein lies the mysterious "incremental min". That's the
> @@ -606,7 +637,6 @@ __alloc_pages(unsigned int gfp_mask, uns
> struct zonelist *zonelist)
> {
> const int wait = gfp_mask & __GFP_WAIT;
> - unsigned long min;
> struct zone **zones, *z;
> struct page *page;
> struct reclaim_state reclaim_state;
> @@ -636,9 +666,9 @@ __alloc_pages(unsigned int gfp_mask, uns
>
> /* Go through the zonelist once, looking for a zone with enough free */
> for (i = 0; (z = zones[i]) != NULL; i++) {
> - min = z->pages_low + (1<<order) + z->protection[alloc_type];
>
> - if (z->free_pages < min)
> + if (!zone_watermark_ok(z, order, z->pages_low,
> + alloc_type, 0, 0))
The original code didnt had the can_try_harder/gfp_high decrease
which is now on zone_watermark_ok.
Means that those allocations will now be successful earlier, instead
of going to the next zonelist iteration. kswapd will not be awake
when it used to be.
Hopefully it doesnt matter that much. You did this by intention?
TIA
> continue;
>
> page = buffered_rmqueue(z, order, gfp_mask);
> @@ -654,14 +684,9 @@ __alloc_pages(unsigned int gfp_mask, uns
> * coming from realtime tasks to go deeper into reserves
> */
> for (i = 0; (z = zones[i]) != NULL; i++) {
> - min = z->pages_min;
> - if (gfp_mask & __GFP_HIGH)
> - min /= 2;
> - if (can_try_harder)
> - min -= min / 4;
> - min += (1<<order) + z->protection[alloc_type];
> -
> - if (z->free_pages < min)
> + if (!zone_watermark_ok(z, order, z->pages_min,
> + alloc_type, can_try_harder,
> + gfp_mask & __GFP_HIGH))
> continue;
>
> page = buffered_rmqueue(z, order, gfp_mask);
> @@ -697,14 +722,9 @@ rebalance:
>
> /* go through the zonelist yet one more time */
> for (i = 0; (z = zones[i]) != NULL; i++) {
> - min = z->pages_min;
> - if (gfp_mask & __GFP_HIGH)
> - min /= 2;
> - if (can_try_harder)
> - min -= min / 4;
> - min += (1<<order) + z->protection[alloc_type];
> -
> - if (z->free_pages < min)
> + if (!zone_watermark_ok(z, order, z->pages_min,
> + alloc_type, can_try_harder,
> + gfp_mask & __GFP_HIGH))
> continue;
>
> page = buffered_rmqueue(z, order, gfp_mask);
> diff -puN include/linux/mmzone.h~vm-alloc-order-watermarks include/linux/mmzone.h
> --- linux-2.6/include/linux/mmzone.h~vm-alloc-order-watermarks 2004-10-27 16:41:32.000000000 +1000
> +++ linux-2.6-npiggin/include/linux/mmzone.h 2004-10-27 17:52:07.000000000 +1000
> @@ -279,6 +279,8 @@ void get_zone_counts(unsigned long *acti
> unsigned long *free);
> void build_all_zonelists(void);
> void wakeup_kswapd(struct zone *zone);
> +int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
> + int alloc_type, int can_try_harder, int gfp_high);
>
> /*
> * zone_idx() returns 0 for the ZONE_DMA zone, 1 for the ZONE_NORMAL zone, etc.
>
> _
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 2/3] higher order watermarks
2004-11-04 12:20 ` Nick Piggin
@ 2004-11-04 9:55 ` Marcelo Tosatti
2004-11-05 1:06 ` Nick Piggin
2004-11-04 10:02 ` Marcelo Tosatti
1 sibling, 1 reply; 17+ messages in thread
From: Marcelo Tosatti @ 2004-11-04 9:55 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, Linux Memory Management, Linus Torvalds
Hi Nick!
On Thu, Nov 04, 2004 at 11:20:54PM +1100, Nick Piggin wrote:
> Marcelo Tosatti wrote:
> >On Wed, Oct 27, 2004 at 06:02:12PM +1000, Nick Piggin wrote:
> >
> >>2/3
> >
> >
> >>
> >>Move the watermark checking code into a single function. Extend it to
> >>account
> >>for the order of the allocation and the number of free pages that could
> >>satisfy
> >>such a request.
> >>
> >>Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
> >
> >
> >Hi Nick,
> >
> >I have a few comments and doubts.
> >
>
> Hi Marcelo,
> Thanks for the comments and review. It is always very helpful to
> have more eyes on this area of code especially. Let's see...
>
> >
> >>linux-2.6-npiggin/include/linux/mmzone.h | 2 +
> >>linux-2.6-npiggin/mm/page_alloc.c | 58
> >>++++++++++++++++++++-----------
> >>2 files changed, 41 insertions(+), 19 deletions(-)
> >>
> >>diff -puN mm/page_alloc.c~vm-alloc-order-watermarks mm/page_alloc.c
> >>--- linux-2.6/mm/page_alloc.c~vm-alloc-order-watermarks 2004-10-27
> >>16:41:32.000000000 +1000
> >>+++ linux-2.6-npiggin/mm/page_alloc.c 2004-10-27
> >>17:53:33.000000000 +1000
> >>@@ -586,6 +586,37 @@ buffered_rmqueue(struct zone *zone, int
> >>}
> >>
> >>/*
> >>+ * Return 1 if free pages are above 'mark'. This takes into account the
> >>order
> >>+ * of the allocation.
> >>+ */
> >>+int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
> >>+ int alloc_type, int can_try_harder, int gfp_high)
> >>+{
> >>+ /* free_pages my go negative - that's OK */
> >>+ long min = mark, free_pages = z->free_pages - (1 << order) + 1;
> >>+ int o;
> >>+
> >>+ if (gfp_high)
> >>+ min -= min / 2;
> >>+ if (can_try_harder)
> >>+ min -= min / 4;
> >>+
> >>+ if (free_pages <= min + z->protection[alloc_type])
> >>+ return 0;
> >>+ for (o = 0; o < order; o++) {
> >>+ /* At the next order, this order's pages become unavailable
> >>*/
> >>+ free_pages -= z->free_area[order].nr_free << o;
> >>+
> >>+ /* Require fewer higher order pages to be free */
> >>+ min >>= 1;
> >
> >
> >I can't understand this. You decrease from free_pages
> >nr_order_free_pages << o, in a loop, and divide min by two.
> >
> >What is the meaning of "nr_free_pages[order] << o" ? Its only meaningful
> >when o == order?
> >
> >You're multiplying the number of free pages of the order the allocation
> >wants by "0, 1..order". The two values have different meanings, until
> >o == order.
> >
> >In the first iteration of the loop, order is 0, so you decrease from
> >free_pages "z->free_area[order].nr_free". Again, the two values mean
> >different things.
> >
> >Can you enlight me?
> >
> >I see you're trying to have some kind of extra protection, but the
> >calculation is difficult to understand for me.
> >
>
> OK, we store the number of "order-pages" free for each order, so for
> example, 16K worth of order-2 pages (on a 4K page architecture) will
> count towards just 1 nr_free.
>
> So now what we need to do in order to calculate, say the amount of memory
> that will satisfy order-2 *and above* (this is important) is the following:
>
> z->free_pages - (order[0].nr_free << 0) - (order[1].nr_free << 1)
Shouldnt that be then
free_pages -= z->free_area[o].nr_free << o;
instead of the current
free_pages -= z->free_area[order].nr_free << o;
No?
> to find order-3 and above, you also need to subtract (order[2].nr_free <<
> 2).
>
> I quite liked this method because it has progressively less cost on lower
> order allocations, and for order-0 we don't need to do any calculation.
OK, now I get it. The only think which bugs me is the multiplication of
values with different meanings.
> Of course it is slightly racy, which is why I say free_pages can go
> negative,
> but that should be OK.
Yeap.
>
> Probably the comment there is woefully inadequate? - I sometimes forget that
> people can't read my mind :\
>
> >
> >>+
> >>+ if (free_pages <= min)
> >>+ return 0;
> >>+ }
> >>+ return 1;
> >>+}
> >>+
> >>+/*
> >> * This is the 'heart' of the zoned buddy allocator.
> >> *
> >> * Herein lies the mysterious "incremental min". That's the
> >>@@ -606,7 +637,6 @@ __alloc_pages(unsigned int gfp_mask, uns
> >> struct zonelist *zonelist)
> >>{
> >> const int wait = gfp_mask & __GFP_WAIT;
> >>- unsigned long min;
> >> struct zone **zones, *z;
> >> struct page *page;
> >> struct reclaim_state reclaim_state;
> >>@@ -636,9 +666,9 @@ __alloc_pages(unsigned int gfp_mask, uns
> >>
> >> /* Go through the zonelist once, looking for a zone with enough free
> >> */
> >> for (i = 0; (z = zones[i]) != NULL; i++) {
> >>- min = z->pages_low + (1<<order) + z->protection[alloc_type];
> >>
> >>- if (z->free_pages < min)
> >>+ if (!zone_watermark_ok(z, order, z->pages_low,
> >>+ alloc_type, 0, 0))
> >
> >
> >
> >The original code didnt had the can_try_harder/gfp_high decrease
> >which is now on zone_watermark_ok.
> >
> >Means that those allocations will now be successful earlier, instead
> >of going to the next zonelist iteration. kswapd will not be awake
> >when it used to be.
> >
> >Hopefully it doesnt matter that much. You did this by intention?
> >
>
> That should be OK: the last two zero arguments mean that doesn't
> get evaluated; so it should work as you'd expect I think?
Oh correct, pardon me.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 2/3] higher order watermarks
2004-11-04 12:20 ` Nick Piggin
2004-11-04 9:55 ` Marcelo Tosatti
@ 2004-11-04 10:02 ` Marcelo Tosatti
2004-11-05 1:12 ` Nick Piggin
1 sibling, 1 reply; 17+ messages in thread
From: Marcelo Tosatti @ 2004-11-04 10:02 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, Linux Memory Management
On Thu, Nov 04, 2004 at 11:20:54PM +1100, Nick Piggin wrote:
> Marcelo Tosatti wrote:
> >On Wed, Oct 27, 2004 at 06:02:12PM +1000, Nick Piggin wrote:
> >
> >>2/3
> >
> >
> >>
> >>Move the watermark checking code into a single function. Extend it to
> >>account
> >>for the order of the allocation and the number of free pages that could
> >>satisfy
> >>such a request.
> >>
> >>Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
> >
> >
> >Hi Nick,
> >
> >I have a few comments and doubts.
> >
>
> Hi Marcelo,
> Thanks for the comments and review. It is always very helpful to
> have more eyes on this area of code especially. Let's see...
>
> >
> >>linux-2.6-npiggin/include/linux/mmzone.h | 2 +
> >>linux-2.6-npiggin/mm/page_alloc.c | 58
> >>++++++++++++++++++++-----------
> >>2 files changed, 41 insertions(+), 19 deletions(-)
> >>
> >>diff -puN mm/page_alloc.c~vm-alloc-order-watermarks mm/page_alloc.c
> >>--- linux-2.6/mm/page_alloc.c~vm-alloc-order-watermarks 2004-10-27
> >>16:41:32.000000000 +1000
> >>+++ linux-2.6-npiggin/mm/page_alloc.c 2004-10-27
> >>17:53:33.000000000 +1000
> >>@@ -586,6 +586,37 @@ buffered_rmqueue(struct zone *zone, int
> >>}
> >>
> >>/*
> >>+ * Return 1 if free pages are above 'mark'. This takes into account the
> >>order
> >>+ * of the allocation.
> >>+ */
> >>+int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
> >>+ int alloc_type, int can_try_harder, int gfp_high)
> >>+{
> >>+ /* free_pages my go negative - that's OK */
> >>+ long min = mark, free_pages = z->free_pages - (1 << order) + 1;
> >>+ int o;
> >>+
> >>+ if (gfp_high)
> >>+ min -= min / 2;
> >>+ if (can_try_harder)
> >>+ min -= min / 4;
> >>+
> >>+ if (free_pages <= min + z->protection[alloc_type])
> >>+ return 0;
> >>+ for (o = 0; o < order; o++) {
> >>+ /* At the next order, this order's pages become unavailable
> >>*/
> >>+ free_pages -= z->free_area[order].nr_free << o;
> >>+
> >>+ /* Require fewer higher order pages to be free */
> >>+ min >>= 1;
> >
> >
> >I can't understand this. You decrease from free_pages
> >nr_order_free_pages << o, in a loop, and divide min by two.
> >
> >What is the meaning of "nr_free_pages[order] << o" ? Its only meaningful
> >when o == order?
> >
> >You're multiplying the number of free pages of the order the allocation
> >wants by "0, 1..order". The two values have different meanings, until
> >o == order.
> >
> >In the first iteration of the loop, order is 0, so you decrease from
> >free_pages "z->free_area[order].nr_free". Again, the two values mean
> >different things.
> >
> >Can you enlight me?
> >
> >I see you're trying to have some kind of extra protection, but the
> >calculation is difficult to understand for me.
> >
>
> OK, we store the number of "order-pages" free for each order, so for
> example, 16K worth of order-2 pages (on a 4K page architecture) will
> count towards just 1 nr_free.
>
> So now what we need to do in order to calculate, say the amount of memory
> that will satisfy order-2 *and above* (this is important) is the following:
>
> z->free_pages - (order[0].nr_free << 0) - (order[1].nr_free << 1)
>
> to find order-3 and above, you also need to subtract (order[2].nr_free <<
> 2).
>
> I quite liked this method because it has progressively less cost on lower
> order allocations, and for order-0 we don't need to do any calculation.
>
> Of course it is slightly racy, which is why I say free_pages can go
> negative,
> but that should be OK.
>
> Probably the comment there is woefully inadequate? - I sometimes forget that
> people can't read my mind :\
Nick, care to add a comment on top of zone_watermark_ok explaining
the reasoning behind the calculation and its expected effects?
That would be really nice.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 2/3] higher order watermarks
2004-11-04 8:57 ` [PATCH 2/3] higher order watermarks Marcelo Tosatti
@ 2004-11-04 12:20 ` Nick Piggin
2004-11-04 9:55 ` Marcelo Tosatti
2004-11-04 10:02 ` Marcelo Tosatti
2004-11-10 16:23 ` Marcelo Tosatti
1 sibling, 2 replies; 17+ messages in thread
From: Nick Piggin @ 2004-11-04 12:20 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: Andrew Morton, Linux Memory Management, Linus Torvalds
Marcelo Tosatti wrote:
> On Wed, Oct 27, 2004 at 06:02:12PM +1000, Nick Piggin wrote:
>
>>2/3
>
>
>>
>>Move the watermark checking code into a single function. Extend it to account
>>for the order of the allocation and the number of free pages that could satisfy
>>such a request.
>>
>>Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
>
>
> Hi Nick,
>
> I have a few comments and doubts.
>
Hi Marcelo,
Thanks for the comments and review. It is always very helpful to
have more eyes on this area of code especially. Let's see...
>
>> linux-2.6-npiggin/include/linux/mmzone.h | 2 +
>> linux-2.6-npiggin/mm/page_alloc.c | 58 ++++++++++++++++++++-----------
>> 2 files changed, 41 insertions(+), 19 deletions(-)
>>
>>diff -puN mm/page_alloc.c~vm-alloc-order-watermarks mm/page_alloc.c
>>--- linux-2.6/mm/page_alloc.c~vm-alloc-order-watermarks 2004-10-27 16:41:32.000000000 +1000
>>+++ linux-2.6-npiggin/mm/page_alloc.c 2004-10-27 17:53:33.000000000 +1000
>>@@ -586,6 +586,37 @@ buffered_rmqueue(struct zone *zone, int
>> }
>>
>> /*
>>+ * Return 1 if free pages are above 'mark'. This takes into account the order
>>+ * of the allocation.
>>+ */
>>+int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
>>+ int alloc_type, int can_try_harder, int gfp_high)
>>+{
>>+ /* free_pages my go negative - that's OK */
>>+ long min = mark, free_pages = z->free_pages - (1 << order) + 1;
>>+ int o;
>>+
>>+ if (gfp_high)
>>+ min -= min / 2;
>>+ if (can_try_harder)
>>+ min -= min / 4;
>>+
>>+ if (free_pages <= min + z->protection[alloc_type])
>>+ return 0;
>>+ for (o = 0; o < order; o++) {
>>+ /* At the next order, this order's pages become unavailable */
>>+ free_pages -= z->free_area[order].nr_free << o;
>>+
>>+ /* Require fewer higher order pages to be free */
>>+ min >>= 1;
>
>
> I can't understand this. You decrease from free_pages
> nr_order_free_pages << o, in a loop, and divide min by two.
>
> What is the meaning of "nr_free_pages[order] << o" ? Its only meaningful
> when o == order?
>
> You're multiplying the number of free pages of the order the allocation
> wants by "0, 1..order". The two values have different meanings, until
> o == order.
>
> In the first iteration of the loop, order is 0, so you decrease from free_pages
> "z->free_area[order].nr_free". Again, the two values mean different things.
>
> Can you enlight me?
>
> I see you're trying to have some kind of extra protection, but the calculation
> is difficult to understand for me.
>
OK, we store the number of "order-pages" free for each order, so for
example, 16K worth of order-2 pages (on a 4K page architecture) will
count towards just 1 nr_free.
So now what we need to do in order to calculate, say the amount of memory
that will satisfy order-2 *and above* (this is important) is the following:
z->free_pages - (order[0].nr_free << 0) - (order[1].nr_free << 1)
to find order-3 and above, you also need to subtract (order[2].nr_free << 2).
I quite liked this method because it has progressively less cost on lower
order allocations, and for order-0 we don't need to do any calculation.
Of course it is slightly racy, which is why I say free_pages can go negative,
but that should be OK.
Probably the comment there is woefully inadequate? - I sometimes forget that
people can't read my mind :\
>
>>+
>>+ if (free_pages <= min)
>>+ return 0;
>>+ }
>>+ return 1;
>>+}
>>+
>>+/*
>> * This is the 'heart' of the zoned buddy allocator.
>> *
>> * Herein lies the mysterious "incremental min". That's the
>>@@ -606,7 +637,6 @@ __alloc_pages(unsigned int gfp_mask, uns
>> struct zonelist *zonelist)
>> {
>> const int wait = gfp_mask & __GFP_WAIT;
>>- unsigned long min;
>> struct zone **zones, *z;
>> struct page *page;
>> struct reclaim_state reclaim_state;
>>@@ -636,9 +666,9 @@ __alloc_pages(unsigned int gfp_mask, uns
>>
>> /* Go through the zonelist once, looking for a zone with enough free */
>> for (i = 0; (z = zones[i]) != NULL; i++) {
>>- min = z->pages_low + (1<<order) + z->protection[alloc_type];
>>
>>- if (z->free_pages < min)
>>+ if (!zone_watermark_ok(z, order, z->pages_low,
>>+ alloc_type, 0, 0))
>
>
>
> The original code didnt had the can_try_harder/gfp_high decrease
> which is now on zone_watermark_ok.
>
> Means that those allocations will now be successful earlier, instead
> of going to the next zonelist iteration. kswapd will not be awake
> when it used to be.
>
> Hopefully it doesnt matter that much. You did this by intention?
>
That should be OK: the last two zero arguments mean that doesn't
get evaluated; so it should work as you'd expect I think?
Thanks,
Nick
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 2/3] higher order watermarks
2004-11-05 1:06 ` Nick Piggin
@ 2004-11-04 22:47 ` Marcelo Tosatti
2004-11-05 2:08 ` Andrew Morton
2004-11-05 2:14 ` Nick Piggin
0 siblings, 2 replies; 17+ messages in thread
From: Marcelo Tosatti @ 2004-11-04 22:47 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, Linux Memory Management, Linus Torvalds
On Fri, Nov 05, 2004 at 12:06:21PM +1100, Nick Piggin wrote:
> Marcelo Tosatti wrote:
> >Hi Nick!
> >
> >On Thu, Nov 04, 2004 at 11:20:54PM +1100, Nick Piggin wrote:
> >
>
> >>So now what we need to do in order to calculate, say the amount of memory
> >>that will satisfy order-2 *and above* (this is important) is the
> >>following:
> >>
> >> z->free_pages - (order[0].nr_free << 0) - (order[1].nr_free << 1)
> >
> >
> >Shouldnt that be then
> >
> >free_pages -= z->free_area[o].nr_free << o;
> >
> >instead of the current
> >
> >free_pages -= z->free_area[order].nr_free << o;
> >
> >No?
> >
>
> Yes, you're absolutely right. Sorry, this is what you were getting
> at all along :P
>
> >
> >>to find order-3 and above, you also need to subtract (order[2].nr_free <<
> >>2).
> >>
> >>I quite liked this method because it has progressively less cost on lower
> >>order allocations, and for order-0 we don't need to do any calculation.
> >
> >
> >OK, now I get it. The only think which bugs me is the multiplication of
> >values with different meanings.
> >
>
> Yeah it's wrong, of course. Good catch, thanks.
>
> If you would care to send a patch Marcelo? I don't have a recent
> -mm on hand at the moment. Would that be alright?
Sure, I'll prepare a patch.
This typo probably means that the current code is not actually working as
intented - did any of you receive any feedback on this patch, wrt high order allocation
intensive driver setup/workload, Nick and Andrew?
I have been using a simple module which allocates a big number of high order
allocations in a row (for the defragmentation code testing), I'll give it a
shot tomorrow to test Nick's high-order-kswapd scheme effectiveness.
Anyway, here is the patch:
Description:
Fix typo in Nick's kswapd-high-order awareness patch
--- linux-2.6.10-rc1-mm2/mm/page_alloc.c.orig 2004-11-04 22:52:00.505365136 -0200
+++ linux-2.6.10-rc1-mm2/mm/page_alloc.c 2004-11-04 22:52:03.121967352 -0200
@@ -733,7 +733,7 @@
return 0;
for (o = 0; o < order; o++) {
/* At the next order, this order's pages become unavailable */
- free_pages -= z->free_area[order].nr_free << o;
+ free_pages -= z->free_area[o].nr_free << o;
/* Require fewer higher order pages to be free */
min >>= 1;
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 2/3] higher order watermarks
2004-11-04 9:55 ` Marcelo Tosatti
@ 2004-11-05 1:06 ` Nick Piggin
2004-11-04 22:47 ` Marcelo Tosatti
0 siblings, 1 reply; 17+ messages in thread
From: Nick Piggin @ 2004-11-05 1:06 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: Andrew Morton, Linux Memory Management, Linus Torvalds
Marcelo Tosatti wrote:
> Hi Nick!
>
> On Thu, Nov 04, 2004 at 11:20:54PM +1100, Nick Piggin wrote:
>
>>So now what we need to do in order to calculate, say the amount of memory
>>that will satisfy order-2 *and above* (this is important) is the following:
>>
>> z->free_pages - (order[0].nr_free << 0) - (order[1].nr_free << 1)
>
>
> Shouldnt that be then
>
> free_pages -= z->free_area[o].nr_free << o;
>
> instead of the current
>
> free_pages -= z->free_area[order].nr_free << o;
>
> No?
>
Yes, you're absolutely right. Sorry, this is what you were getting
at all along :P
>
>>to find order-3 and above, you also need to subtract (order[2].nr_free <<
>>2).
>>
>>I quite liked this method because it has progressively less cost on lower
>>order allocations, and for order-0 we don't need to do any calculation.
>
>
> OK, now I get it. The only think which bugs me is the multiplication of
> values with different meanings.
>
Yeah it's wrong, of course. Good catch, thanks.
If you would care to send a patch Marcelo? I don't have a recent
-mm on hand at the moment. Would that be alright?
Thanks,
Nick
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 2/3] higher order watermarks
2004-11-04 10:02 ` Marcelo Tosatti
@ 2004-11-05 1:12 ` Nick Piggin
0 siblings, 0 replies; 17+ messages in thread
From: Nick Piggin @ 2004-11-05 1:12 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: Andrew Morton, Linux Memory Management
Marcelo Tosatti wrote:
> On Thu, Nov 04, 2004 at 11:20:54PM +1100, Nick Piggin wrote:
>>Probably the comment there is woefully inadequate? - I sometimes forget that
>>people can't read my mind :\
>
>
> Nick, care to add a comment on top of zone_watermark_ok explaining
> the reasoning behind the calculation and its expected effects?
>
> That would be really nice.
Yep, I'll come up with something for it.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 2/3] higher order watermarks
2004-11-04 22:47 ` Marcelo Tosatti
@ 2004-11-05 2:08 ` Andrew Morton
2004-11-05 2:14 ` Nick Piggin
1 sibling, 0 replies; 17+ messages in thread
From: Andrew Morton @ 2004-11-05 2:08 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: nickpiggin, linux-mm, torvalds
Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
>
> This typo probably means that the current code is not actually working as
> intented - did any of you receive any feedback on this patch, wrt high order allocation
> intensive driver setup/workload, Nick and Andrew?
None whatsoever.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 2/3] higher order watermarks
2004-11-04 22:47 ` Marcelo Tosatti
2004-11-05 2:08 ` Andrew Morton
@ 2004-11-05 2:14 ` Nick Piggin
1 sibling, 0 replies; 17+ messages in thread
From: Nick Piggin @ 2004-11-05 2:14 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: Andrew Morton, Linux Memory Management, Linus Torvalds
Marcelo Tosatti wrote:
> On Fri, Nov 05, 2004 at 12:06:21PM +1100, Nick Piggin wrote:
>
>>Yeah it's wrong, of course. Good catch, thanks.
>>
>>If you would care to send a patch Marcelo? I don't have a recent
>>-mm on hand at the moment. Would that be alright?
>
>
> Sure, I'll prepare a patch.
>
> This typo probably means that the current code is not actually working as
> intented - did any of you receive any feedback on this patch, wrt high order allocation
> intensive driver setup/workload, Nick and Andrew?
>
I had tested it and it did do the right thing when fragmentation meant
higer order memory was completely depleted - previously atomic allocations
would just stop working, but the patches got them going again.
I did have one guy who tested the patches in a production sort of system
that was getting some allocation failures. He said they didn't make much
difference (increasing min_free_kbytes was effective).
So hopefully this should make things work better.
What would happen without the patch, is that if you had 0 order-2 pages free
and wanted to make an order-2 allocation, free_pages would never get
decremented, so it wouldn't detect the shortage.
> I have been using a simple module which allocates a big number of high order
> allocations in a row (for the defragmentation code testing), I'll give it a
> shot tomorrow to test Nick's high-order-kswapd scheme effectiveness.
>
> Anyway, here is the patch:
>
ACK. Thanks.
> Description:
> Fix typo in Nick's kswapd-high-order awareness patch
>
> --- linux-2.6.10-rc1-mm2/mm/page_alloc.c.orig 2004-11-04 22:52:00.505365136 -0200
> +++ linux-2.6.10-rc1-mm2/mm/page_alloc.c 2004-11-04 22:52:03.121967352 -0200
> @@ -733,7 +733,7 @@
> return 0;
> for (o = 0; o < order; o++) {
> /* At the next order, this order's pages become unavailable */
> - free_pages -= z->free_area[order].nr_free << o;
> + free_pages -= z->free_area[o].nr_free << o;
>
> /* Require fewer higher order pages to be free */
> min >>= 1;
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 2/3] higher order watermarks
2004-11-04 8:57 ` [PATCH 2/3] higher order watermarks Marcelo Tosatti
2004-11-04 12:20 ` Nick Piggin
@ 2004-11-10 16:23 ` Marcelo Tosatti
2004-11-11 1:41 ` Nick Piggin
1 sibling, 1 reply; 17+ messages in thread
From: Marcelo Tosatti @ 2004-11-10 16:23 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, Linux Memory Management
On Thu, Nov 04, 2004 at 06:57:45AM -0200, Marcelo Tosatti wrote:
> The original code didnt had the can_try_harder/gfp_high decrease
> which is now on zone_watermark_ok.
>
> Means that those allocations will now be successful earlier, instead
> of going to the next zonelist iteration. kswapd will not be awake
> when it used to be.
>
> Hopefully it doesnt matter that much. You did this by intention?
Another thing Nick is that now balance_pgdat uses zone_watermark_ok,
and that sums "z->protection[alloc_type]".
if (free_pages <= min + z->protection[alloc_type])
return 0;
Since balance_pgdat calls with alloc_type=0, the code will sum ZONE_DMA
(alloc_type = 0) protection, and it should not.
kswapd should be working on the bare min/low/high watermarks AFAICT,
without the protections.
Comments?
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 2/3] higher order watermarks
2004-11-10 16:23 ` Marcelo Tosatti
@ 2004-11-11 1:41 ` Nick Piggin
2004-11-11 10:18 ` Marcelo Tosatti
0 siblings, 1 reply; 17+ messages in thread
From: Nick Piggin @ 2004-11-11 1:41 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: Andrew Morton, Linux Memory Management
Marcelo Tosatti wrote:
> On Thu, Nov 04, 2004 at 06:57:45AM -0200, Marcelo Tosatti wrote:
>
>
>>The original code didnt had the can_try_harder/gfp_high decrease
>>which is now on zone_watermark_ok.
>>
>>Means that those allocations will now be successful earlier, instead
>>of going to the next zonelist iteration. kswapd will not be awake
>>when it used to be.
>>
>>Hopefully it doesnt matter that much. You did this by intention?
>
>
> Another thing Nick is that now balance_pgdat uses zone_watermark_ok,
> and that sums "z->protection[alloc_type]".
>
> if (free_pages <= min + z->protection[alloc_type])
> return 0;
>
> Since balance_pgdat calls with alloc_type=0, the code will sum ZONE_DMA
> (alloc_type = 0) protection, and it should not.
>
> kswapd should be working on the bare min/low/high watermarks AFAICT,
> without the protections.
>
> Comments?
>
>
Yeah.. I think z->protection[0] should always be 0, shouldn't it?
I was just hesitant to add another parameter to the function and
have yet another case to check.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 2/3] higher order watermarks
2004-11-11 1:41 ` Nick Piggin
@ 2004-11-11 10:18 ` Marcelo Tosatti
0 siblings, 0 replies; 17+ messages in thread
From: Marcelo Tosatti @ 2004-11-11 10:18 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, Linux Memory Management
On Thu, Nov 11, 2004 at 12:41:02PM +1100, Nick Piggin wrote:
> Marcelo Tosatti wrote:
> >On Thu, Nov 04, 2004 at 06:57:45AM -0200, Marcelo Tosatti wrote:
> >
> >
> >>The original code didnt had the can_try_harder/gfp_high decrease
> >>which is now on zone_watermark_ok.
> >>
> >>Means that those allocations will now be successful earlier, instead
> >>of going to the next zonelist iteration. kswapd will not be awake
> >>when it used to be.
> >>
> >>Hopefully it doesnt matter that much. You did this by intention?
> >
> >
> >Another thing Nick is that now balance_pgdat uses zone_watermark_ok,
> >and that sums "z->protection[alloc_type]".
> >
> > if (free_pages <= min + z->protection[alloc_type])
> > return 0;
> >
> >Since balance_pgdat calls with alloc_type=0, the code will sum ZONE_DMA
> >(alloc_type = 0) protection, and it should not.
> >
> >kswapd should be working on the bare min/low/high watermarks AFAICT,
> >without the protections.
> >
> >Comments?
> >
> >
>
> Yeah.. I think z->protection[0] should always be 0, shouldn't it?
Oh yes, fine.
> I was just hesitant to add another parameter to the function and
> have yet another case to check.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>
^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2004-11-11 10:18 UTC | newest]
Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-10-27 8:00 [PATCH 0/3] teach kswapd about higher order allocations Nick Piggin
2004-10-27 8:00 ` [PATCH 1/3] keep count of free areas Nick Piggin
2004-10-27 8:02 ` [PATCH 2/3] higher order watermarks Nick Piggin
2004-10-27 8:02 ` [PATCH 3/3] teach kswapd about higher order areas Nick Piggin
2004-10-27 8:13 ` Nick Piggin
2004-11-04 8:57 ` [PATCH 2/3] higher order watermarks Marcelo Tosatti
2004-11-04 12:20 ` Nick Piggin
2004-11-04 9:55 ` Marcelo Tosatti
2004-11-05 1:06 ` Nick Piggin
2004-11-04 22:47 ` Marcelo Tosatti
2004-11-05 2:08 ` Andrew Morton
2004-11-05 2:14 ` Nick Piggin
2004-11-04 10:02 ` Marcelo Tosatti
2004-11-05 1:12 ` Nick Piggin
2004-11-10 16:23 ` Marcelo Tosatti
2004-11-11 1:41 ` Nick Piggin
2004-11-11 10:18 ` Marcelo Tosatti
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox