* how not to write a search algorithm
@ 2002-08-04 8:35 Andrew Morton
2002-08-04 13:16 ` Rik van Riel
0 siblings, 1 reply; 20+ messages in thread
From: Andrew Morton @ 2002-08-04 8:35 UTC (permalink / raw)
To: linux-mm
Worked out why my box is going into a 3-5 minute coma with one test.
Think what the LRUs look like when the test first hits page reclaim
on this 2.5G ia32 box:
head tail
active_list: <800M of ZONE_NORMAL> <200M of ZONE_HIGHMEM>
inactive_list: <1.5G of ZONE_HIGHMEM>
now, somebody does a GFP_KERNEL allocation.
uh-oh.
VM calls refill_inactive. That moves 25 ZONE_HIGHMEM pages onto
the inactive list. It then scans 5000 pages, achieving nothing.
VM calls refill_inactive. That moves 25 ZONE_HIGHMEM pages onto
the inactive list. It then scans about 10000 pages, achieving nothing.
VM calls refill_inactive. That moves 25 ZONE_HIGHMEM pages onto
the inactive list. It then scans about 20000 pages, achieving nothing.
VM calls refill_inactive. That moves 25 ZONE_HIGHMEM pages onto
the inactive list. It then scans about 40000 pages, achieving nothing.
VM calls refill_inactive. That moves 25 ZONE_HIGHMEM pages onto
the inactive list. It then scans about 80000 pages, achieving nothing.
VM calls refill_inactive. That moves 25 ZONE_HIGHMEM pages onto
the inactive list. It then scans about 160000 pages, achieving nothing.
VM calls refill_inactive. That moves 25 ZONE_HIGHMEM pages onto
the inactive list. It then scans about 320000 pages, achieving nothing.
The page allocation fails. So __alloc_pages tries it all again.
This all gets rather boring.
Per-zone LRUs will fix it up. We need that anyway, because a ZONE_NORMAL
request will bogusly refile, on average, memory_size/800M pages to the
head of the inactive list, thus wrecking page aging.
Alan's kernel has a nice-looking implementation. I'll lift that out
next week unless someone beats me to it.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
* Re: how not to write a search algorithm
2002-08-04 8:35 how not to write a search algorithm Andrew Morton
@ 2002-08-04 13:16 ` Rik van Riel
2002-08-04 20:00 ` Andrew Morton
0 siblings, 1 reply; 20+ messages in thread
From: Rik van Riel @ 2002-08-04 13:16 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-mm
On Sun, 4 Aug 2002, Andrew Morton wrote:
> head tail
> active_list: <800M of ZONE_NORMAL> <200M of ZONE_HIGHMEM>
> inactive_list: <1.5G of ZONE_HIGHMEM>
>
> now, somebody does a GFP_KERNEL allocation.
>
> uh-oh.
> Per-zone LRUs will fix it up. We need that anyway, because a ZONE_NORMAL
> request will bogusly refile, on average, memory_size/800M pages to the
> head of the inactive list, thus wrecking page aging.
>
> Alan's kernel has a nice-looking implementation. I'll lift that out
> next week unless someone beats me to it.
Good to hear that you found this one ;)
cheers,
Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://distro.conectiva.com/
* Re: how not to write a search algorithm
2002-08-04 13:16 ` Rik van Riel
@ 2002-08-04 20:00 ` Andrew Morton
2002-08-04 19:54 ` Rik van Riel
2002-08-04 20:38 ` William Lee Irwin III
0 siblings, 2 replies; 20+ messages in thread
From: Andrew Morton @ 2002-08-04 20:00 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-mm
Rik van Riel wrote:
>
> ...
> > Alan's kernel has a nice-looking implementation. I'll lift that out
> > next week unless someone beats me to it.
>
> Good to hear that you found this one ;)
The same test panics Alan's kernel with pte_chain oom, so I can't
check whether/how well it fixes it :(
2.5 is no better off wrt pte_chain oom, and I expect it'll oops
with this test when per-zone-LRUs are implemented.
Is there a proposed way of recovering from pte_chain oom?
* Re: how not to write a search algorithm
2002-08-04 20:00 ` Andrew Morton
@ 2002-08-04 19:54 ` Rik van Riel
2002-08-04 20:38 ` William Lee Irwin III
1 sibling, 0 replies; 20+ messages in thread
From: Rik van Riel @ 2002-08-04 19:54 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-mm, William Lee Irwin III
On Sun, 4 Aug 2002, Andrew Morton wrote:
> Rik van Riel wrote:
> >
> > ...
> > > Alan's kernel has a nice-looking implementation. I'll lift that out
> > > next week unless someone beats me to it.
> >
> > Good to hear that you found this one ;)
>
> The same test panics Alan's kernel with pte_chain oom, so I can't
> check whether/how well it fixes it :(
> Is there a proposed way of recovering from pte_chain oom?
I think wli is working on a patch for this.
regards,
Rik
* Re: how not to write a search algorithm
2002-08-04 20:00 ` Andrew Morton
2002-08-04 19:54 ` Rik van Riel
@ 2002-08-04 20:38 ` William Lee Irwin III
2002-08-04 21:09 ` Andrew Morton
1 sibling, 1 reply; 20+ messages in thread
From: William Lee Irwin III @ 2002-08-04 20:38 UTC (permalink / raw)
To: Andrew Morton; +Cc: Rik van Riel, linux-mm
Rik van Riel wrote:
>> Good to hear that you found this one ;)
On Sun, Aug 04, 2002 at 01:00:14PM -0700, Andrew Morton wrote:
> The same test panics Alan's kernel with pte_chain oom, so I can't
> check whether/how well it fixes it :(
> 2.5 is no better off wrt pte_chain oom, and I expect it'll oops
> with this test when per-zone-LRUs are implemented.
> Is there a proposed way of recovering from pte_chain oom?
Yes. I'll outline my strategy here.
(1) alter pte-highmem semantics to sleep holding pagetable kmaps
(A) reserve some virtual address space for mm-local mappings
(B) shove pagetable mappings for "self" and "other" into it
(2) separate pte_chain allocation from insertion operations
(A) provide a hook to preallocate batches outside the locks
(B) convert the allocation to sleeping allocations
(C) rearrange pagetable modifications for error recovery and
to call preallocation hooks and pass in reserved state
(3) recovery from pte_chain proliferation by unmapping things on demand
(A) per-mm and per-vma accounting of pte_chain space consumption
(B) pte_chain memory recovery routine run on-demand
(C) budget-based allocation and mm-local pte_chain recycling
(4) recovery from pagetable proliferation by unmapping files on demand
(A) per-mm and per-vma accounting of pagetable space consumption
(B) per 3rd-level pagetable accounting of occupancy
(C) budget-based pagetable allocation and mm-local recycling
(D) pagetable memory recovery routine run on-demand
(5) recovery from pagetable proliferation by swapping anonymous pagetables
(A) per-mm and per-vma accounting of anonymous pagetable space
(B) per 3rd-level pagetable accounting of occupancy
(C) fault handling for non-present pmd's
(D) swap I/O for pagetable pages
(E) recovery of anonymous pagetable memory run on-demand
(6) Assign a global hard limit on the amount of space permissible to
allocate for pagetables & pte_chains and enforce it with (1)-(5).
Cheers,
Bill
* Re: how not to write a search algorithm
2002-08-04 20:38 ` William Lee Irwin III
@ 2002-08-04 21:09 ` Andrew Morton
2002-08-04 22:02 ` William Lee Irwin III
2002-08-04 22:45 ` Daniel Phillips
0 siblings, 2 replies; 20+ messages in thread
From: Andrew Morton @ 2002-08-04 21:09 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: Rik van Riel, linux-mm
William Lee Irwin III wrote:
>
> Rik van Riel wrote:
> >> Good to hear that you found this one ;)
>
> On Sun, Aug 04, 2002 at 01:00:14PM -0700, Andrew Morton wrote:
> > The same test panics Alan's kernel with pte_chain oom, so I can't
> > check whether/how well it fixes it :(
> > 2.5 is no better off wrt pte_chain oom, and I expect it'll oops
> > with this test when per-zone-LRUs are implemented.
> > Is there a proposed way of recovering from pte_chain oom?
>
> Yes. I'll outline my strategy here.
>
> (1) alter pte-highmem semantics to sleep holding pagetable kmaps
> (A) reserve some virtual address space for mm-local mappings
> (B) shove pagetable mappings for "self" and "other" into it
Why?
> (2) separate pte_chain allocation from insertion operations
> (A) provide a hook to preallocate batches outside the locks
> (B) convert the allocation to sleeping allocations
> (C) rearrange pagetable modifications for error recovery and
> to call preallocation hooks and pass in reserved state
Seems that simply changing the page_add_rmap() interface to require the
caller to pass in one (err, two) pte_chains would suffice. The tricky
one is copy_page_range(), which is probably where -ac panics.
I suppose we could hang the pool of pte_chains off task_struct
and have a little "precharge the pte_chains" function. Gack.
> (3) recovery from pte_chain proliferation by unmapping things on demand
> (A) per-mm and per-vma accounting of pte_chain space consumption
> (B) pte_chain memory recovery routine run on-demand
> (C) budget-based allocation and mm-local pte_chain recycling
>
> (4) recovery from pagetable proliferation by unmapping files on demand
> (A) per-mm and per-vma accounting of pagetable space consumption
> (B) per 3rd-level pagetable accounting of occupancy
> (C) budget-based pagetable allocation and mm-local recycling
> (D) pagetable memory recovery routine run on-demand
>
> (5) recovery from pagetable proliferation by swapping anonymous pagetables
> (A) per-mm and per-vma accounting of anonymous pagetable space
> (B) per 3rd-level pagetable accounting of occupancy
> (C) fault handling for non-present pmd's
> (D) swap I/O for pagetable pages
> (E) recovery of anonymous pagetable memory run on-demand
>
> (6) Assign a global hard limit on the amount of space permissible to
> allocate for pagetables & pte_chains and enforce it with (1)-(5).
Different problem ;)
* Re: how not to write a search algorithm
2002-08-04 21:09 ` Andrew Morton
@ 2002-08-04 22:02 ` William Lee Irwin III
2002-08-04 22:43 ` Andrew Morton
2002-08-04 22:45 ` Daniel Phillips
1 sibling, 1 reply; 20+ messages in thread
From: William Lee Irwin III @ 2002-08-04 22:02 UTC (permalink / raw)
To: Andrew Morton; +Cc: Rik van Riel, linux-mm
William Lee Irwin III wrote:
>> (1) alter pte-highmem semantics to sleep holding pagetable kmaps
>> (A) reserve some virtual address space for mm-local mappings
>> (B) shove pagetable mappings for "self" and "other" into it
On Sun, Aug 04, 2002 at 02:09:22PM -0700, Andrew Morton wrote:
> Why?
kmapped pte's are passed into the rmap interface, so they're forbidden
from sleeping for allocations if they do them. The additional twist is
that separating locking from holding a reference as in the later
portions opens up the opportunity to sleep while ensuring the pagetable
page will still exist after waking.
i.e.
(1) grab ->page_table_lock long enough to find 3rd-level pagetable
(2) inc refcount
(3) call things with the lock held
(4) if they do some sleeping allocations, they can drop the lock
(5) grab lock
(6) do the real stuff
(7) drop refcount & move on to a different page
Mostly needed for interactions with various kinds of pagetable pruning
and ZONE_NORMAL conservation as the locking requirements get stiffer
when things are prunable at times other than exit() and kmapped and/or
swappable above the 3rd-level of the pagetable. And this scheme, already
used by FreeBSD, has far lower TLB overhead than per-page TLB
invalidation even in the normal case. And the weird mapping stuff in the
generic code vaporizes too.
Note 32K tasks * 16K pmdspace/task = 512MB, i.e. impossible to allocate
from ZONE_NORMAL on i386 with sufficiently large mem_map[] and/or
hashtable bloat. Also, rmap never touches pmd's or pgd's to get at pages
in vmscan.c, so it needs no more than a single pte to scan pte_chains.
William Lee Irwin III wrote:
>> (2) separate pte_chain allocation from insertion operations
>> (A) provide a hook to preallocate batches outside the locks
>> (B) convert the allocation to sleeping allocations
>> (C) rearrange pagetable modifications for error recovery and
>> to call preallocation hooks and pass in reserved state
On Sun, Aug 04, 2002 at 02:09:22PM -0700, Andrew Morton wrote:
> Seems that simply changing the page_add_rmap() interface to require the
> caller to pass in one (err, two) pte_chains would suffice. The tricky
> one is copy_page_range(), which is probably where -ac panics.
> I suppose we could hang the pool of pte_chains off task_struct
> and have a little "precharge the pte_chains" function. Gack.
This is (A) and (B), where the notion for (A) I had was more of simply
grabbing the most pte_chains needed for a single 3rd-level pagetable
copy and keeping the reference to it on the stack to keep the arrival
rates down. (C) is just
if ((A) failed)
goto nomem;
with proper drops of locks and refcounts and freeing of memory for the
failed operation.
William Lee Irwin III wrote:
>> (6) Assign a global hard limit on the amount of space permissible to
>> allocate for pagetables & pte_chains and enforce it with (1)-(5).
On Sun, Aug 04, 2002 at 02:09:22PM -0700, Andrew Morton wrote:
> Different problem ;)
I had fixing "this box should be able to run a lot of tasks but drops
dead instead" in mind. What subset of this were you looking for?
Cheers,
Bill
* Re: how not to write a search algorithm
2002-08-04 22:02 ` William Lee Irwin III
@ 2002-08-04 22:43 ` Andrew Morton
2002-08-04 22:47 ` William Lee Irwin III
0 siblings, 1 reply; 20+ messages in thread
From: Andrew Morton @ 2002-08-04 22:43 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: Rik van Riel, linux-mm
William Lee Irwin III wrote:
>
> ...
> I had fixing "this box should be able to run a lot of tasks but drops
> dead instead" in mind. What subset of this were you looking for?
Getting the kernel back to the level of performance and stability
which it had before the rmap patch has to be the first step.
1) 50% increase in system load on fork/exec/exit workloads
2) Will oops on pte_chain oom
3) pte_highmem is bust
4) tripled ZONE_NORMAL consumption
5) pte chains go wrong with ntpd
6) Poor swapout bandwidth
The first three or four here are fatal to the retention of the
reverse map, IMO. Futzing around fixing them is taking time
and is holding up other work.
I may have a handle on 1). Still working it.
* Re: how not to write a search algorithm
2002-08-04 22:43 ` Andrew Morton
@ 2002-08-04 22:47 ` William Lee Irwin III
2002-08-05 3:00 ` Andrew Morton
0 siblings, 1 reply; 20+ messages in thread
From: William Lee Irwin III @ 2002-08-04 22:47 UTC (permalink / raw)
To: Andrew Morton; +Cc: Rik van Riel, linux-mm
On Sun, Aug 04, 2002 at 03:43:56PM -0700, Andrew Morton wrote:
> Getting the kernel back to the level of performance and stability
> which it had before the rmap patch has to be the first step.
> 1) 50% increase in system load on fork/exec/exit workloads
> 2) Will oops on pte_chain oom
> 3) pte_highmem is bust
> 4) tripled ZONE_NORMAL consumption
> 5) pte chains go wrong with ntpd
> The first three or four here are fatal to the retention of the
> reverse map, IMO. Futzing around fixing them is taking time
> and is holding up other work.
> I may have a handle on 1). Still working it.
(2) only needs the reservation bits from the preceding post if it's
just dealing with kmem_cache_alloc() returning NULL.
(3) I ground out the half-assed quick & dirty "fish the pfn out of the
kmap pte" a.k.a. virt_to_fix() and use physaddrs in pte_chains
thingie and handed it off to others to debug/clean up/push.
(4) is part of the known tradeoff AFAIK, but phillips may have something
taking it down to only double or so
Cheers,
Bill
* Re: how not to write a search algorithm
2002-08-04 22:47 ` William Lee Irwin III
@ 2002-08-05 3:00 ` Andrew Morton
2002-08-05 2:55 ` Rik van Riel
2002-08-05 7:40 ` William Lee Irwin III
0 siblings, 2 replies; 20+ messages in thread
From: Andrew Morton @ 2002-08-05 3:00 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: Rik van Riel, linux-mm
William Lee Irwin III wrote:
>
> On Sun, Aug 04, 2002 at 03:43:56PM -0700, Andrew Morton wrote:
> > Getting the kernel back to the level of performance and stability
> > which it had before the rmap patch has to be the first step.
> > 1) 50% increase in system load on fork/exec/exit workloads
> > 2) Will oops on pte_chain oom
> > 3) pte_highmem is bust
> > 4) tripled ZONE_NORMAL consumption
> > 5) pte chains go wrong with ntpd
> > The first three or four here are fatal to the retention of the
> > reverse map, IMO. Futzing around fixing them is taking time
> > and is holding up other work.
> > I may have a handle on 1). Still working it.
>
> (2) only needs the reservation bits from the preceding post if it's
> just dealing with kmem_cache_alloc() returning NULL.
Well I think we'll need a per-cpu-pages thing to amortise zone->lock
contention anyway. So what we can do is:
fill_up_the_per_cpu_buffer(GFP_KERNEL); /* disables preemption */
spin_lock(lock);
allocate(GFP_ATOMIC);
spin_unlock(lock);
preempt_enable();
We also prevent interrupt-time allocations from
stealing the final four pages from the per-cpu buffer.
The allocation is guaranteed to succeed, yes? Can use
it for ratnodes as well.
* Re: how not to write a search algorithm
2002-08-05 3:00 ` Andrew Morton
@ 2002-08-05 2:55 ` Rik van Riel
2002-08-05 7:40 ` William Lee Irwin III
1 sibling, 0 replies; 20+ messages in thread
From: Rik van Riel @ 2002-08-05 2:55 UTC (permalink / raw)
To: Andrew Morton; +Cc: William Lee Irwin III, linux-mm
On Sun, 4 Aug 2002, Andrew Morton wrote:
> Well I think we'll need a per-cpu-pages thing to amortise zone->lock
> contention anyway. So what we can do is:
>
> fill_up_the_per_cpu_buffer(GFP_KERNEL); /* disables preemption */
> spin_lock(lock);
> allocate(GFP_ATOMIC);
> spin_unlock(lock);
> preempt_enable();
>
> We also prevent interrupt-time allocations from
> stealing the final four pages from the per-cpu buffer.
>
> The allocation is guaranteed to succeed, yes? Can use
> it for ratnodes as well.
Yes, that would work.
One page for the process, one page table page, one ratcache page
and one pte chain page ... anything else ?
regards,
Rik
* Re: how not to write a search algorithm
2002-08-05 3:00 ` Andrew Morton
2002-08-05 2:55 ` Rik van Riel
@ 2002-08-05 7:40 ` William Lee Irwin III
2002-08-05 8:44 ` Andrew Morton
1 sibling, 1 reply; 20+ messages in thread
From: William Lee Irwin III @ 2002-08-05 7:40 UTC (permalink / raw)
To: Andrew Morton; +Cc: Rik van Riel, linux-mm
William Lee Irwin III wrote:
>> (2) only needs the reservation bits from the preceding post if it's
>> just dealing with kmem_cache_alloc() returning NULL.
On Sun, Aug 04, 2002 at 08:00:27PM -0700, Andrew Morton wrote:
> Well I think we'll need a per-cpu-pages thing to amortise zone->lock
> contention anyway. So what we can do is:
> fill_up_the_per_cpu_buffer(GFP_KERNEL); /* disables preemption */
> spin_lock(lock);
> allocate(GFP_ATOMIC);
> spin_unlock(lock);
> preempt_enable();
> We also prevent interrupt-time allocations from
> stealing the final four pages from the per-cpu buffer.
> The allocation is guaranteed to succeed, yes? Can use
> it for ratnodes as well.
NFI how this is supposed to work with slab caches and/or get around the
GFP_ATOMIC failing. I understand how to bomb out of loops & return
-ENOMEM though. I also think it best to let this sleep, as it's not
happening in interrupt context. Or maybe I'm missing something.
Better ideas are of course welcome.
Cheers,
Bill
* Re: how not to write a search algorithm
2002-08-05 7:40 ` William Lee Irwin III
@ 2002-08-05 8:44 ` Andrew Morton
2002-08-05 10:50 ` William Lee Irwin III
0 siblings, 1 reply; 20+ messages in thread
From: Andrew Morton @ 2002-08-05 8:44 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: Rik van Riel, linux-mm
William Lee Irwin III wrote:
>
> William Lee Irwin III wrote:
> >> (2) only needs the reservation bits from the preceding post if it's
> >> just dealing with kmem_cache_alloc() returning NULL.
>
> On Sun, Aug 04, 2002 at 08:00:27PM -0700, Andrew Morton wrote:
> > Well I think we'll need a per-cpu-pages thing to amortise zone->lock
> > contention anyway. So what we can do is:
> > fill_up_the_per_cpu_buffer(GFP_KERNEL); /* disables preemption */
> > spin_lock(lock);
> > allocate(GFP_ATOMIC);
> > spin_unlock(lock);
> > preempt_enable();
> > We also prevent interrupt-time allocations from
> > stealing the final four pages from the per-cpu buffer.
> > The allocation is guaranteed to succeed, yes? Can use
> > it for ratnodes as well.
>
> NFI how this is supposed to work with slab caches and/or get around the
> GFP_ATOMIC failing.
The GFP_ATOMIC allocation can't fail. We've gone and arranged for
this CPU to have (say) eight free pages, all to itself. If we
also arrange for interrupt context to leave at least four behind,
we *know* that there are pages available to the atomic allocation.
So if the slab allocation needs a page, it will get it from the
cpu-local pool.
It can fail if the allocation is higher-order, but we won't do that.
The nice thing is that it 99% leverages a per-cpu-pages mechanism.
We'd have to make fill_up_the_per_cpu_buffer() loop for ever
(but the page allocator does that anyway) or handle a failure
from that. Just loop, I'd say. Provided the caller isn't holding any
semaphores.
* Re: how not to write a search algorithm
2002-08-05 8:44 ` Andrew Morton
@ 2002-08-05 10:50 ` William Lee Irwin III
0 siblings, 0 replies; 20+ messages in thread
From: William Lee Irwin III @ 2002-08-05 10:50 UTC (permalink / raw)
To: Andrew Morton; +Cc: Rik van Riel, linux-mm
On Mon, Aug 05, 2002 at 01:44:06AM -0700, Andrew Morton wrote:
> The nice thing is that it 99% leverages a per-cpu-pages mechanism.
> We'd have to make fill_up_the_per_cpu_buffer() loop for ever
> (but the page allocator does that anyway) or handle a failure
> from that. Just loop, I'd say. Provided the caller isn't holding any
> semaphores.
It should hold the mm->mmap_sem
Cheers,
Bill
* Re: how not to write a search algorithm
2002-08-04 21:09 ` Andrew Morton
2002-08-04 22:02 ` William Lee Irwin III
@ 2002-08-04 22:45 ` Daniel Phillips
2002-08-04 23:03 ` Andrew Morton
1 sibling, 1 reply; 20+ messages in thread
From: Daniel Phillips @ 2002-08-04 22:45 UTC (permalink / raw)
To: Andrew Morton, William Lee Irwin III; +Cc: Rik van Riel, linux-mm
On Sunday 04 August 2002 23:09, Andrew Morton wrote:
> Seems that simply changing the page_add_rmap() interface to require the
> caller to pass in one (err, two) pte_chains would suffice. The tricky
> one is copy_page_range(), which is probably where -ac panics.
Hmm, seems to me my recent patch did exactly that. Somebody called
it 'ugly' ;-)
I did intend to move the initialization of that little pool outside
copy_page_range, and never free the remainder.
Why two pte_chains, by the way?
> I suppose we could hang the pool of pte_chains off task_struct
> and have a little "precharge the pte_chains" function. Gack.
It's not that bad. It's much nicer than hanging onto the rmap lock
while kmem_cache_alloc does its thing.
--
Daniel
* Re: how not to write a search algorithm
2002-08-04 22:45 ` Daniel Phillips
@ 2002-08-04 23:03 ` Andrew Morton
2002-08-04 23:00 ` William Lee Irwin III
` (2 more replies)
0 siblings, 3 replies; 20+ messages in thread
From: Andrew Morton @ 2002-08-04 23:03 UTC (permalink / raw)
To: Daniel Phillips; +Cc: William Lee Irwin III, Rik van Riel, linux-mm
Daniel Phillips wrote:
>
> On Sunday 04 August 2002 23:09, Andrew Morton wrote:
> > Seems that simply changing the page_add_rmap() interface to require the
> > caller to pass in one (err, two) pte_chains would suffice. The tricky
> > one is copy_page_range(), which is probably where -ac panics.
>
> Hmm, seems to me my recent patch did exactly that. Somebody called
> it 'ugly' ;-)
>
> I did intend to move the initialization of that little pool outside
> copy_page_range, and never free the remainder.
>
> Why two pte_chains, by the way?
Converting from a PageDirect representation to a shared-by-two
representation needs two pte_chains.
> > I suppose we could hang the pool of pte_chains off task_struct
> > and have a little "precharge the pte_chains" function. Gack.
>
> It's not that bad. It's much nicer than hanging onto the rmap lock
> while kmem_cache_alloc does its thing.
The list walk is killing us now. I think we need:
struct pte_chain {
struct pte_chain *next;
pte_t *ptes[L1_CACHE_BYTES/4 - 4];
};
Still poking...
* Re: how not to write a search algorithm
2002-08-04 23:03 ` Andrew Morton
@ 2002-08-04 23:00 ` William Lee Irwin III
2002-08-04 23:02 ` Daniel Phillips
2002-08-05 0:03 ` Daniel Phillips
2 siblings, 0 replies; 20+ messages in thread
From: William Lee Irwin III @ 2002-08-04 23:00 UTC (permalink / raw)
To: Andrew Morton; +Cc: Daniel Phillips, Rik van Riel, linux-mm
On Sun, Aug 04, 2002 at 04:03:11PM -0700, Andrew Morton wrote:
> The list walk is killing us now. I think we need:
> struct pte_chain {
> struct pte_chain *next;
> pte_t *ptes[L1_CACHE_BYTES/4 - 4];
> };
> Still poking...
Could I get a
pte_t *ptes[(L1_CACHE_BYTES - sizeof(struct pte_chain *))/(sizeof(pte_t *))] ?
Well, regardless, the mean pte_chain length for chains of length > 1 is
around 6, and the std. dev. is around 12, and the distribution is *very*
long-tailed, so this is just about guaranteed to help at the cost of some
slight internal fragmentation.
Cheers,
Bill
* Re: how not to write a search algorithm
2002-08-04 23:03 ` Andrew Morton
2002-08-04 23:00 ` William Lee Irwin III
@ 2002-08-04 23:02 ` Daniel Phillips
2002-08-04 23:21 ` Andrew Morton
2002-08-05 0:03 ` Daniel Phillips
2 siblings, 1 reply; 20+ messages in thread
From: Daniel Phillips @ 2002-08-04 23:02 UTC (permalink / raw)
To: Andrew Morton; +Cc: William Lee Irwin III, Rik van Riel, linux-mm
On Monday 05 August 2002 01:03, Andrew Morton wrote:
> The list walk is killing us now. I think we need:
>
> struct pte_chain {
> struct pte_chain *next;
> pte_t *ptes[L1_CACHE_BYTES/4 - 4];
> };
Which list walk, the remove or the page_referenced?
--
Daniel
* Re: how not to write a search algorithm
2002-08-04 23:02 ` Daniel Phillips
@ 2002-08-04 23:21 ` Andrew Morton
0 siblings, 0 replies; 20+ messages in thread
From: Andrew Morton @ 2002-08-04 23:21 UTC (permalink / raw)
To: Daniel Phillips; +Cc: William Lee Irwin III, Rik van Riel, linux-mm
Daniel Phillips wrote:
>
> On Monday 05 August 2002 01:03, Andrew Morton wrote:
> > The list walk is killing us now. I think we need:
> >
> > struct pte_chain {
> > struct pte_chain *next;
> > pte_t *ptes[L1_CACHE_BYTES/4 - 4];
> > };
>
> Which list walk, the remove or the page_referenced?
The remove in this case. I'll post some numbers
in the other thread.
* Re: how not to write a search algorithm
2002-08-04 23:03 ` Andrew Morton
2002-08-04 23:00 ` William Lee Irwin III
2002-08-04 23:02 ` Daniel Phillips
@ 2002-08-05 0:03 ` Daniel Phillips
2 siblings, 0 replies; 20+ messages in thread
From: Daniel Phillips @ 2002-08-05 0:03 UTC (permalink / raw)
To: Andrew Morton; +Cc: William Lee Irwin III, Rik van Riel, linux-mm
On Monday 05 August 2002 01:03, Andrew Morton wrote:
> The list walk is killing us now. I think we need:
>
> struct pte_chain {
> struct pte_chain *next;
> pte_t *ptes[L1_CACHE_BYTES/4 - 4];
> };
Strongly agreed. A full 64 bytes might be a little much though. Let me see,
for 32 bytes the space breakeven is 4X sharing, for 64 bytes it's 8X and
we'll rarely hit that, except in contrived benchmarks.
A variation on this idea makes the size of the node a property of an entire
page's worth of nodes, so that nodes of different sizes can be allocated.
The node size can be recorded at the base of the page, or in a vector of
pointers to such pages. Moving from size to size is by copying rather than
list insertion, and only the largest size needs a list link.
--
Daniel
end of thread, other threads:[~2002-08-05 10:50 UTC | newest]
Thread overview: 20+ messages
2002-08-04 8:35 how not to write a search algorithm Andrew Morton
2002-08-04 13:16 ` Rik van Riel
2002-08-04 20:00 ` Andrew Morton
2002-08-04 19:54 ` Rik van Riel
2002-08-04 20:38 ` William Lee Irwin III
2002-08-04 21:09 ` Andrew Morton
2002-08-04 22:02 ` William Lee Irwin III
2002-08-04 22:43 ` Andrew Morton
2002-08-04 22:47 ` William Lee Irwin III
2002-08-05 3:00 ` Andrew Morton
2002-08-05 2:55 ` Rik van Riel
2002-08-05 7:40 ` William Lee Irwin III
2002-08-05 8:44 ` Andrew Morton
2002-08-05 10:50 ` William Lee Irwin III
2002-08-04 22:45 ` Daniel Phillips
2002-08-04 23:03 ` Andrew Morton
2002-08-04 23:00 ` William Lee Irwin III
2002-08-04 23:02 ` Daniel Phillips
2002-08-04 23:21 ` Andrew Morton
2002-08-05 0:03 ` Daniel Phillips