* inactive_dirty list

From: Andrew Morton @ 2002-09-06 20:42 UTC
To: Rik van Riel; Cc: linux-mm

Rik, it seems that the time has come...

I was doing some testing overnight with mem=1024m. Page reclaim was
pretty inefficient at that level: kswapd consumed 6% of CPU on a
permanent basis (workload was heavy dbench plus looping make -j6
bzImage). kswapd was reclaiming only 3% of the pages which it was
looking at. This doesn't happen at mem=768m, and I'm sure it won't
happen at mem=1.5G.

What is happening here is that the logic which clamps dirty+writeback
pagecache at 40% of memory is working nicely, and the allocate-from-
highmem-first logic is ensuring that all of ZONE_HIGHMEM is dirty
or under writeback all the time. kswapd isn't allowed to block against
that pagecache, so it's scanning zillions of pages.

This is a fundamental problem when the size of the highmem zone is
approximately equal to 40% of total memory. We could fix it by
changing the page allocator to balance its allocations across zones,
but I don't think we want to do that.

I think it's best to split the inactive list into reclaimable and
unreclaimable (inactive_clean/inactive_dirty). I'll code that tonight;
please let me run some thoughts by you:

- inactive_dirty holds pages which are dirty or under writeback.

- end_page_writeback() will move the page onto inactive_clean.

- everywhere we add a page to the inactive list will now add it to
  either inactive_clean or inactive_dirty, based on its
  PageDirty || PageWriteback state.

- the inactive target logic will remain the same. So
  zone->nr_inactive_pages will be the sum of the pages on
  zone->inactive_clean and zone->inactive_dirty.

- swapcache pages don't go on inactive_dirty(!). They remain on
  inactive_clean, so if a page allocator or kswapd hits a swapcache
  page, they block on it (swapout throttling).

  A result of this is that we never need to scan inactive_dirty.
  Those pages will always be written out in balance_dirty_pages by
  the write(2) caller, or by pdflush. (Hence: we don't need
  inactive_dirty at all. We could just cut those pages off the LRU
  altogether. But let's not do that).

- Hence: the only pages which are written out from within the VM are
  swapcache.

- So the only real source of throttling for tasks which aren't running
  generic_file_write() is the call to blk_congestion_wait() in
  try_to_free_pages(). Which seems sane to me - this will wake up
  after 1/4 of a second, or after someone frees a write request
  against *any* queue. We know that the pages which were covered by
  that request were just placed onto inactive_clean, so off we go
  again. Should work (heh).

- with this scheme, we don't actually need
  zone->nr_inactive_dirty_pages and zone->nr_inactive_clean_pages,
  but I may as well maintain them - it's easy enough.

- MAP_SHARED pages will be on inactive_clean, but if we change the
  logic in there to give these pages a second round on the LRU then
  the pages will automatically be added to inactive_dirty on the way
  out of shrink_zone().

How does that all sound?

btw, it is approximately the case that the pages will come clean in
LRU order (oldest-first) because of the writeback logic:
fs-writeback.c walks the inodes in oldest-dirtied to newest-dirtied
order, and it walks each inode's pages in oldest-dirtied to
newest-dirtied order.
But I think that end_page_writeback() should still move cleaned pages
onto the far (hot) end of inactive_clean?

I don't think any of this will send the zone balancing logic into a
tailspin. I'm just a bit worried about corner cases when the number of
reclaimable pages in highmem is getting low - the classzone balancing
code may keep on trying to refill that zone's free memory pools too
much. We'll see...

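[Editor's note: a minimal sketch, in plain userspace C, of the
end_page_writeback() transition proposed in the message above. The
structures and list helpers here are invented for illustration;
locking, wait queues and the real struct page are all omitted.]

	#include <stdbool.h>

	struct page {
		struct page *prev, *next;	/* LRU links */
		bool dirty;			/* PageDirty */
		bool writeback;			/* PageWriteback */
	};

	struct zone {
		struct page inactive_dirty;	/* list-head sentinels */
		struct page inactive_clean;
		unsigned long nr_inactive_dirty;
		unsigned long nr_inactive_clean;
	};

	static void lru_del(struct page *p)
	{
		p->prev->next = p->next;
		p->next->prev = p->prev;
	}

	static void lru_add_hot(struct page *p, struct page *head)
	{
		p->next = head->next;
		p->prev = head;
		head->next->prev = p;
		head->next = p;
	}

	/* IO completion: the page is no longer under writeback; if it
	 * was not redirtied while in flight, move it from
	 * inactive_dirty to the hot end of inactive_clean, so reclaim
	 * never scans it while it is unreclaimable. */
	void end_page_writeback(struct zone *z, struct page *p)
	{
		p->writeback = false;
		if (p->dirty)
			return;	/* redirtied: stays on inactive_dirty */
		lru_del(p);
		z->nr_inactive_dirty--;
		lru_add_hot(p, &z->inactive_clean);
		z->nr_inactive_clean++;
	}
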
* Re: inactive_dirty list

From: Rik van Riel @ 2002-09-06 21:03 UTC
To: Andrew Morton; Cc: linux-mm

On Fri, 6 Sep 2002, Andrew Morton wrote:

> What is happening here is that the logic which clamps dirty+writeback
> pagecache at 40% of memory is working nicely, and the allocate-from-
> highmem-first logic is ensuring that all of ZONE_HIGHMEM is dirty
> or under writeback all the time.

Does this mean that your 1024MB machine can degrade into the situation
where userspace has an effective 128MB of memory available for its
working set?

Or is balancing between the zones still happening?

> We could fix it by changing the page allocator to balance its
> allocations across zones, but I don't think we want to do that.

Some balancing is needed, otherwise you'll have 800 MB of old data
sitting in ZONE_NORMAL and userspace getting its hot data evicted from
ZONE_HIGHMEM all the time.

OTOH, you don't want to overdo things, of course ;)

> I think it's best to split the inactive list into reclaimable and
> unreclaimable (inactive_clean/inactive_dirty).
>
> I'll code that tonight; please let me run some thoughts by you:

Sounds like you're reinventing the whole 2.4.0 -> 2.4.7 -> 2.4.9-ac ->
2.4.13-rmap -> 2.4.19-rmap evolution ;)

> - inactive_dirty holds pages which are dirty or under writeback.
>
> - everywhere we add a page to the inactive list will now add it to
>   either inactive_clean or inactive_dirty, based on its
>   PageDirty || PageWriteback state.

If I had veto power I'd use it here ;)

We did this in early 2.4 kernels and it was a disaster. The reason it
was a disaster was that in many workloads we'd always have some clean
pages, and we'd end up always reclaiming those before even starting
writeout on any of the dirty pages.

It also meant we could have dirty (or formerly dirty) inactive pages
eating up memory and never being recycled for more active data.

What you need to do instead is:

- inactive_dirty contains pages of which we're not sure whether
  they're dirty or clean

- everywhere we add a page to the inactive list now, we add the page
  to the inactive_dirty list

This means we'll have a fairer scan and eviction rate between clean
and dirty pages.

> - swapcache pages don't go on inactive_dirty(!). They remain on
>   inactive_clean, so if a page allocator or kswapd hits a swapcache
>   page, they block on it (swapout throttling).

We can also get rid of this logic. There is no difference between swap
pages and mmap'd file pages. If blk_congestion_wait() works, we can
get rid of this special magic and just use it. If it doesn't work, we
need to fix blk_congestion_wait(), since otherwise the VM would fall
over under heavy mmap() usage.

> - So the only real source of throttling for tasks which aren't
>   running generic_file_write() is the call to blk_congestion_wait()
>   in try_to_free_pages(). Which seems sane to me - this will wake up
>   after 1/4 of a second, or after someone frees a write request
>   against *any* queue. We know that the pages which were covered by
>   that request were just placed onto inactive_clean, so off we go
>   again. Should work (heh).

With this scheme, we can restrict tasks to scanning only the
inactive_clean list.

Kswapd's scanning of the inactive_dirty list is always asynchronous,
so we don't need to worry about latency.
No need to waste CPU by having other tasks also scan this very same
list and submit IO.

> - with this scheme, we don't actually need
>   zone->nr_inactive_dirty_pages and zone->nr_inactive_clean_pages,
>   but I may as well maintain them - it's easy enough.

Agreed, good statistics are essential when you're trying to balance a
VM.

> How does that all sound?

Most of the plan sounds good, but your dirty/clean split is a tried
and tested recipe for disaster ;)

> order. But I think that end_page_writeback() should still move
> cleaned pages onto the far (hot) end of inactive_clean?

IMHO inactive_clean should just contain KNOWN FREEABLE pages, as an
area beyond the inactive_dirty list.

> I don't think any of this will send the zone balancing logic into a
> tailspin. I'm just a bit worried about corner cases when the number
> of reclaimable pages in highmem is getting low - the classzone
> balancing code may keep on trying to refill that zone's free memory
> pools too much. We'll see...

There's a simple trick we can use here. If we _know_ that all the
inactive_clean pages can be immediately reclaimed, we can count those
as free pages for balancing purposes. This should make life easier
when one of the zones is under heavy writeback pressure.

kind regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/   http://distro.conectiva.com/
Spamtraps of the month: september@surriel.com trac@trac.org

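[Editor's note: a minimal sketch of Rik's count-known-freeable-pages-
as-free balancing trick, with invented names; the real classzone
watermark code is more involved.]

	struct zone_stats {
		unsigned long free_pages;
		unsigned long nr_inactive_clean;  /* known freeable */
		unsigned long pages_low;          /* refill watermark */
	};

	/* Pages on inactive_clean can be reclaimed immediately,
	 * without IO, so treat them as free when deciding whether a
	 * zone needs refilling.  A zone full of clean reclaimable
	 * cache then no longer looks starved to the balancing code. */
	int zone_needs_refill(const struct zone_stats *z)
	{
		return z->free_pages + z->nr_inactive_clean < z->pages_low;
	}
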
* Re: inactive_dirty list

From: Andrew Morton @ 2002-09-06 21:40 UTC
To: Rik van Riel; Cc: linux-mm

Rik van Riel wrote:

hm. Did that digeo.com address bounce? grr.

> On Fri, 6 Sep 2002, Andrew Morton wrote:
>
> > What is happening here is that the logic which clamps
> > dirty+writeback pagecache at 40% of memory is working nicely, and
> > the allocate-from-highmem-first logic is ensuring that all of
> > ZONE_HIGHMEM is dirty or under writeback all the time.
>
> Does this mean that your 1024MB machine can degrade into the
> situation where userspace has an effective 128MB of memory available
> for its working set?
>
> Or is balancing between the zones still happening?

No, that's OK. This problem is a consequence of the per-zone LRU.
Whether it is kswapd or a direct-reclaimer, it always looks at highmem
first. But we allocate pages from highmem first, too.

With the non-blocking stuff, we blow a lot of CPU scanning past pages.
Prior to the nonblocking stuff, we would get stuck on request queues
trying to refill ZONE_HIGHMEM, probably needlessly, because there's
lots of reclaimable stuff in ZONE_NORMAL. Maybe.

> > We could fix it by changing the page allocator to balance its
> > allocations across zones, but I don't think we want to do that.
>
> Some balancing is needed, otherwise you'll have 800 MB of old data
> sitting in ZONE_NORMAL and userspace getting its hot data evicted
> from ZONE_HIGHMEM all the time.
>
> OTOH, you don't want to overdo things, of course ;)

Well, everyone still takes a pass across all zones, bringing them up
to pages_high. It's just that the ZONE_HIGHMEM pass is expensive,
because that is where all the dirty pagecache happens to be.

See, the zone balancing is out of whack wrt the page allocation: we
balance the zones nicely in reclaim, and we deliberately *unbalance*
them in the allocator.

> ...
>
> We did this in early 2.4 kernels and it was a disaster. The reason
> it was a disaster was that in many workloads we'd always have some
> clean pages, and we'd end up always reclaiming those before even
> starting writeout on any of the dirty pages.

OK.

> It also meant we could have dirty (or formerly dirty) inactive pages
> eating up memory and never being recycled for more active data.

The interrupt-time page motion should reduce that...

> What you need to do instead is:
>
> - inactive_dirty contains pages of which we're not sure whether
>   they're dirty or clean
>
> - everywhere we add a page to the inactive list now, we add the page
>   to the inactive_dirty list
>
> This means we'll have a fairer scan and eviction rate between clean
> and dirty pages.

And how do they get onto inactive_clean?

> > - swapcache pages don't go on inactive_dirty(!). They remain on
> >   inactive_clean, so if a page allocator or kswapd hits a swapcache
> >   page, they block on it (swapout throttling).
>
> We can also get rid of this logic. There is no difference between
> swap pages and mmap'd file pages. If blk_congestion_wait() works, we
> can get rid of this special magic and just use it. If it doesn't
> work, we need to fix blk_congestion_wait(), since otherwise the VM
> would fall over under heavy mmap() usage.

That would probably work. We'd need to do the pte_dirty->PageDirty
translation carefully.
> > - So the only real source of throttling for tasks which aren't
> >   running generic_file_write() is the call to blk_congestion_wait()
> >   in try_to_free_pages(). Which seems sane to me - this will wake
> >   up after 1/4 of a second, or after someone frees a write request
> >   against *any* queue. We know that the pages which were covered by
> >   that request were just placed onto inactive_clean, so off we go
> >   again. Should work (heh).
>
> With this scheme, we can restrict tasks to scanning only the
> inactive_clean list.
>
> Kswapd's scanning of the inactive_dirty list is always asynchronous,
> so we don't need to worry about latency. No need to waste CPU by
> having other tasks also scan this very same list and submit IO.

Why does kswapd need to scan that list?

> > - with this scheme, we don't actually need
> >   zone->nr_inactive_dirty_pages and zone->nr_inactive_clean_pages,
> >   but I may as well maintain them - it's easy enough.
>
> Agreed, good statistics are essential when you're trying to balance
> a VM.
>
> > How does that all sound?
>
> Most of the plan sounds good, but your dirty/clean split is a tried
> and tested recipe for disaster ;)

That's good to know, thanks.

> > order. But I think that end_page_writeback() should still move
> > cleaned pages onto the far (hot) end of inactive_clean?
>
> IMHO inactive_clean should just contain KNOWN FREEABLE pages, as an
> area beyond the inactive_dirty list.

Confused. So where do anon pages go?

> > I don't think any of this will send the zone balancing logic into
> > a tailspin. I'm just a bit worried about corner cases when the
> > number of reclaimable pages in highmem is getting low - the
> > classzone balancing code may keep on trying to refill that zone's
> > free memory pools too much. We'll see...
>
> There's a simple trick we can use here. If we _know_ that all the
> inactive_clean pages can be immediately reclaimed, we can count
> those as free pages for balancing purposes.

OK.

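[Editor's note: an illustrative sketch of the allocate-from-highmem-
first policy being discussed - not the real allocator. The zonelist is
ordered HIGHMEM, NORMAL, DMA; allocation takes the first zone above
its watermark, so new pagecache piles into ZONE_HIGHMEM first,
deliberately unbalancing what reclaim later rebalances.]

	#include <stddef.h>

	struct zone_model {
		unsigned long free_pages;
		unsigned long pages_low;	/* allocation watermark */
	};

	/* Walk the zonelist (index 0 == ZONE_HIGHMEM) and take the
	 * first zone with free pages above its watermark; NULL means
	 * the caller must fall back to reclaim. */
	struct zone_model *pick_zone(struct zone_model **zonelist, int n)
	{
		int i;

		for (i = 0; i < n; i++)
			if (zonelist[i]->free_pages > zonelist[i]->pages_low)
				return zonelist[i];
		return NULL;
	}
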
* Re: inactive_dirty list

From: Rik van Riel @ 2002-09-06 21:49 UTC
To: Andrew Morton; Cc: linux-mm

On Fri, 6 Sep 2002, Andrew Morton wrote:

> > It also meant we could have dirty (or formerly dirty) inactive
> > pages eating up memory and never being recycled for more active
> > data.
>
> The interrupt-time page motion should reduce that...

Not if you won't scan the dirty list as long as there are "enough"
clean pages.

> > What you need to do instead is:
> >
> > - inactive_dirty contains pages of which we're not sure whether
> >   they're dirty or clean
> >
> > - everywhere we add a page to the inactive list now, we add the
> >   page to the inactive_dirty list
> >
> > This means we'll have a fairer scan and eviction rate between
> > clean and dirty pages.
>
> And how do they get onto inactive_clean?

Once IO completes they get moved onto the clean list.

> > We can also get rid of this logic. There is no difference between
> > swap pages and mmap'd file pages. If blk_congestion_wait() works,
> > we can get rid of this special magic and just use it. If it
> > doesn't work, we need to fix blk_congestion_wait(), since
> > otherwise the VM would fall over under heavy mmap() usage.
>
> That would probably work. We'd need to do the pte_dirty->PageDirty
> translation carefully.

Indeed. We probably want to give such pages a second chance on the
inactive_dirty list without starting the writeout, so we've unmapped
and PageDirtied all its friends for one big writeout.

> > With this scheme, we can restrict tasks to scanning only the
> > inactive_clean list.
> >
> > Kswapd's scanning of the inactive_dirty list is always
> > asynchronous, so we don't need to worry about latency. No need to
> > waste CPU by having other tasks also scan this very same list and
> > submit IO.
>
> Why does kswapd need to scan that list?

The list should preferably only be scanned by one thread. Scanning
with multiple threads is a waste of CPU.

It doesn't really matter which thread is scanning, but I think we want
some preferably simple way to prevent all CPUs in the system from
going wild over the nonfreeable lists.

> > > order. But I think that end_page_writeback() should still move
> > > cleaned pages onto the far (hot) end of inactive_clean?
> >
> > IMHO inactive_clean should just contain KNOWN FREEABLE pages, as
> > an area beyond the inactive_dirty list.
>
> Confused. So where do anon pages go?

All pages go onto the inactive_dirty list. When they reach the end of
the list, either we move them to the inactive_clean list, we submit
IO, or (in the case of a mapped page) we give them another go-around
on the list in order to build up a cluster from the other still-mapped
pages near it.

regards,

Rik

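[Editor's note: a small sketch of the tail-of-list policy Rik just
described, in illustrative C. A page reaching the cold end of
inactive_dirty is moved to inactive_clean, submitted for IO, or - if
still mapped - rotated for another trip so its neighbours can be
unmapped and written out as one cluster.]

	#include <stdbool.h>

	enum tail_action { MOVE_TO_CLEAN, SUBMIT_IO, SECOND_CHANCE };

	enum tail_action classify_tail_page(bool mapped, bool dirty,
					    bool writeback)
	{
		if (mapped)
			return SECOND_CHANCE;	/* let a writeout cluster form */
		if (writeback)
			return SECOND_CHANCE;	/* IO in flight, not freeable yet */
		if (dirty)
			return SUBMIT_IO;	/* start asynchronous writeout */
		return MOVE_TO_CLEAN;		/* known freeable */
	}
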
* Re: inactive_dirty list

From: Andrew Morton @ 2002-09-06 21:58 UTC
To: Rik van Riel; Cc: linux-mm

Rik van Riel wrote:
>
> On Fri, 6 Sep 2002, Andrew Morton wrote:
>
> > > It also meant we could have dirty (or formerly dirty) inactive
> > > pages eating up memory and never being recycled for more active
> > > data.
> >
> > The interrupt-time page motion should reduce that...
>
> Not if you won't scan the dirty list as long as there are "enough"
> clean pages.

Well, something needs to start writeback of dirty pages. That is
either:

- The VM kicked pdflush, or
- We're over the dirty limit, so the write(2) callers do it, or
- The kupdate function syncs the data.

So why do we need to perform writeback from the VM? Just for
swapcache, which pdflush doesn't do.

> > > What you need to do instead is:
> > >
> > > - inactive_dirty contains pages of which we're not sure whether
> > >   they're dirty or clean
> > >
> > > - everywhere we add a page to the inactive list now, we add the
> > >   page to the inactive_dirty list
> > >
> > > This means we'll have a fairer scan and eviction rate between
> > > clean and dirty pages.
> >
> > And how do they get onto inactive_clean?
>
> Once IO completes they get moved onto the clean list.

But if there are clean pages accidentally on inactive_dirty, we need
to scan for them. If that list only contains dirty pagecache and
pagecache which is under writeback, then there should be no need to
scan it? Those pages will automatically come back onto inactive_clean
via pdflush/balance_dirty_pages writeout.

> > > We can also get rid of this logic. There is no difference between
> > > swap pages and mmap'd file pages. If blk_congestion_wait() works,
> > > we can get rid of this special magic and just use it. If it
> > > doesn't work, we need to fix blk_congestion_wait(), since
> > > otherwise the VM would fall over under heavy mmap() usage.
> >
> > That would probably work. We'd need to do the pte_dirty->PageDirty
> > translation carefully.
>
> Indeed. We probably want to give such pages a second chance on the
> inactive_dirty list without starting the writeout, so we've unmapped
> and PageDirtied all its friends for one big writeout.
>
> > > With this scheme, we can restrict tasks to scanning only the
> > > inactive_clean list.
> > >
> > > Kswapd's scanning of the inactive_dirty list is always
> > > asynchronous, so we don't need to worry about latency. No need
> > > to waste CPU by having other tasks also scan this very same list
> > > and submit IO.
> >
> > Why does kswapd need to scan that list?
>
> The list should preferably only be scanned by one thread. Scanning
> with multiple threads is a waste of CPU.
>
> It doesn't really matter which thread is scanning, but I think we
> want some preferably simple way to prevent all CPUs in the system
> from going wild over the nonfreeable lists.

Let me rephrase: why does *anybody* need to scan inactive_dirty?

> > > > order. But I think that end_page_writeback() should still move
> > > > cleaned pages onto the far (hot) end of inactive_clean?
> > >
> > > IMHO inactive_clean should just contain KNOWN FREEABLE pages, as
> > > an area beyond the inactive_dirty list.
> >
> > Confused. So where do anon pages go?
>
> All pages go onto the inactive_dirty list.
> When they reach the end of the list, either we move them to the
> inactive_clean list, we submit IO, or (in the case of a mapped page)
> we give them another go-around on the list in order to build up a
> cluster from the other still-mapped pages near it.

hum. I'm trying to find a model where the VM can just ignore
dirty|writeback pagecache. We know how many pages are out there, sure.
But we don't scan them. Possible?

* Re: inactive_dirty list

From: Rik van Riel @ 2002-09-06 22:04 UTC
To: Andrew Morton; Cc: linux-mm

On Fri, 6 Sep 2002, Andrew Morton wrote:

> hum. I'm trying to find a model where the VM can just ignore
> dirty|writeback pagecache. We know how many pages are out there,
> sure. But we don't scan them. Possible?

Owww duh, I see it now.

So basically pages should _only_ go into the inactive_dirty list when
they are under writeout.

Note that leaving dirty pages on the list can result in a waste of
memory. Imagine the dirty limit being 40% and 30% of memory being
dirty but not written out at the moment...

regards,

Rik

* Re: inactive_dirty list

From: Andrew Morton @ 2002-09-06 22:19 UTC
To: Rik van Riel; Cc: linux-mm

Rik van Riel wrote:
>
> On Fri, 6 Sep 2002, Andrew Morton wrote:
>
> > hum. I'm trying to find a model where the VM can just ignore
> > dirty|writeback pagecache. We know how many pages are out there,
> > sure. But we don't scan them. Possible?
>
> Owww duh, I see it now.
>
> So basically pages should _only_ go into the inactive_dirty list
> when they are under writeout.

Or if they're just dirty. The thing I'm trying to achieve is to
minimise the amount of scanning of unreclaimable pages.

So park them elsewhere, and don't scan them. We know how many pages
are there, so we can make decisions based on that. But let IO
completion bring them back onto the inactive_reclaimable(?) list.

> Note that leaving dirty pages on the list can result in a waste of
> memory. Imagine the dirty limit being 40% and 30% of memory being
> dirty but not written out at the moment...

Right. So the VM needs to kick pdflush at the right time to get that
happening. The nonblocking pdflush is very effective - I think it can
keep a ton of queues saturated with just a single process.

swapcache is a wart, because pdflush doesn't write swapcache. It
certainly _could_, but you had reasons why that was the wrong thing
to do?

And something needs to be done with clean but unreclaimable pages.
These will be on inactive_clean - I guess we just continue to activate
these.

* Re: inactive_dirty list

From: Rik van Riel @ 2002-09-06 22:23 UTC
To: Andrew Morton; Cc: linux-mm

On Fri, 6 Sep 2002, Andrew Morton wrote:

> > So basically pages should _only_ go into the inactive_dirty list
> > when they are under writeout.
>
> Or if they're just dirty. The thing I'm trying to achieve is to
> minimise the amount of scanning of unreclaimable pages.
>
> So park them elsewhere, and don't scan them. We know how many pages
> are there, so we can make decisions based on that. But let IO
> completion bring them back onto the inactive_reclaimable(?) list.

I guess this means the dirty limit should be near 1% for the VM.

Every time there is a noticeable amount of dirty pages, kick pdflush
and have it write out a few of them, maybe the number of pages needed
to reach zone->pages_high?

regards,

Rik

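[Editor's note: a sketch of the kick Rik suggests, sized to the zone's
shortfall. pdflush_operation() and background_writeout() are the real
names quoted later in this thread; the zone-targeted sizing and the
struct layout are illustrative.]

	struct zone { unsigned long free_pages, pages_high; };

	int pdflush_operation(void (*fn)(unsigned long), unsigned long arg);
	void background_writeout(unsigned long nr_pages);

	/* Ask pdflush for just enough writeback to bring the zone back
	 * up to pages_high, instead of flushing everything dirty. */
	void kick_pdflush_for_zone(const struct zone *z)
	{
		if (z->free_pages >= z->pages_high)
			return;
		pdflush_operation(background_writeout,
				  z->pages_high - z->free_pages);
	}
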
* Re: inactive_dirty list

From: Andrew Morton @ 2002-09-06 22:48 UTC
To: Rik van Riel; Cc: linux-mm

Rik van Riel wrote:
>
> On Fri, 6 Sep 2002, Andrew Morton wrote:
>
> > > So basically pages should _only_ go into the inactive_dirty list
> > > when they are under writeout.
> >
> > Or if they're just dirty. The thing I'm trying to achieve is to
> > minimise the amount of scanning of unreclaimable pages.
> >
> > So park them elsewhere, and don't scan them. We know how many pages
> > are there, so we can make decisions based on that. But let IO
> > completion bring them back onto the inactive_reclaimable(?) list.
>
> I guess this means the dirty limit should be near 1% for the VM.

What is the thinking behind that?

> Every time there is a noticeable amount of dirty pages, kick pdflush
> and have it write out a few of them, maybe the number of pages
> needed to reach zone->pages_high?

Well, we can certainly do that - the current wakeup_bdflush() is
pretty crude:

	void wakeup_bdflush(void)
	{
		struct page_state ps;

		get_page_state(&ps);
		pdflush_operation(background_writeout, ps.nr_dirty);
	}

We can pass background_writeout 42 pages if necessary.

That's not aware of zones, of course. It will just write back the
oldest 42 pages from the oldest dirty inode against the last-mounted
superblock.

I still have not got my head around:

> We did this in early 2.4 kernels and it was a disaster. The reason
> it was a disaster was that in many workloads we'd always have some
> clean pages, and we'd end up always reclaiming those before even
> starting writeout on any of the dirty pages.

Does this imply that we need to block on writeout *instead* of
reclaiming clean pagecache?

We could do something like:

	if (zone->nr_inactive_dirty > zone->nr_inactive_clean) {
		wakeup_bdflush();	/* Hope this writes the correct zone */
		yield();
	}

which would get the IO underway promptly. But the caller would still
go in and gobble remaining clean pagecache.

The thing which happened (basically by accident) from my Wednesday
hackery was a partitioning of the machine. 40% of memory is available
to pagecache writeout, and that's clamped (ignoring MAP_SHARED for
now..). And everyone else just walks around it.

So a 1G box running dbench 1000 acts like a 600M box. Which is not a
bad model, perhaps. If we can twiddle that 40% up and down based on
<mumble> criteria...

But that separation of the 40% of unusable memory from the 60% of
usable memory is done by scanning at present, and it costs a bit of
CPU. Not much, but a bit.

(btw, is there any reason at all for having page reserves in
ZONE_HIGHMEM? I have a suspicion that this is just wasted memory...)

* Re: inactive_dirty list

From: Rik van Riel @ 2002-09-06 23:03 UTC
To: Andrew Morton; Cc: linux-mm

On Fri, 6 Sep 2002, Andrew Morton wrote:

> Rik van Riel wrote:
> > On Fri, 6 Sep 2002, Andrew Morton wrote:
> >
> > I guess this means the dirty limit should be near 1% for the VM.
>
> What is the thinking behind that?

Dirty pages could sit on the list practically forever if there are
enough clean pages. This means we can have a significant amount of
memory "parked" on the dirty list, without it ever getting reclaimed,
even if we could use the memory for something better.

> I still have not got my head around:
>
> > We did this in early 2.4 kernels and it was a disaster. The reason
> > it was a disaster was that in many workloads we'd always have some
> > clean pages, and we'd end up always reclaiming those before even
> > starting writeout on any of the dirty pages.
>
> Does this imply that we need to block on writeout *instead* of
> reclaiming clean pagecache?

No, it means that whenever we reclaim clean pagecache pages, we should
also start the writeout of some dirty pages.

> We could do something like:
>
>	if (zone->nr_inactive_dirty > zone->nr_inactive_clean) {
>		wakeup_bdflush();	/* Hope this writes the correct zone */
>		yield();
>	}
>
> which would get the IO underway promptly. But the caller would still
> go in and gobble remaining clean pagecache.

This is nice, but it would still be possible to have oodles of pages
"parked" on the dirty list, which we definitely need to prevent.

> So a 1G box running dbench 1000 acts like a 600M box. Which is not a
> bad model, perhaps. If we can twiddle that 40% up and down based on
> <mumble> criteria...

Writing out dirty pages whenever we reclaim free pages could fix that
problem.

> But that separation of the 40% of unusable memory from the 60% of
> usable memory is done by scanning at present, and it costs a bit of
> CPU. Not much, but a bit.

There are other reasons we're wasting CPU in scanning:

1) the scanning isn't really rate limited yet (or is it?)

2) every thread in the system can fall into the scanning function, so
   if we have 50 page allocators they'll all happily scan the list,
   even though the first of these threads already found there wasn't
   anything freeable

> (btw, is there any reason at all for having page reserves in
> ZONE_HIGHMEM? I have a suspicion that this is just wasted memory...)

Dunno, but I guess it is to prevent a 4GB box from acting like a 900MB
box under corner conditions ;)

regards,

Rik

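[Editor's note: a sketch of the coupling Rik proposes above - every
batch of clean-page reclaim also queues some dirty-page writeout, so
dirty pages cannot sit parked forever behind a supply of clean ones.
All names and the 1:1 ratio are illustrative.]

	struct zone;

	unsigned long reclaim_clean_pages(struct zone *z, unsigned long nr);
	void start_async_writeout(struct zone *z, unsigned long nr);
	unsigned long nr_inactive_dirty(const struct zone *z);

	unsigned long shrink_clean_list(struct zone *z, unsigned long nr)
	{
		unsigned long reclaimed = reclaim_clean_pages(z, nr);

		/* Match every reclaimed clean page with one page of
		 * dirty writeout, keeping the dirty list draining at
		 * the same rate as the clean list. */
		if (reclaimed && nr_inactive_dirty(z))
			start_async_writeout(z, reclaimed);
		return reclaimed;
	}
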
* Re: inactive_dirty list

From: Andrew Morton @ 2002-09-06 23:34 UTC
To: Rik van Riel; Cc: linux-mm

Rik van Riel wrote:
>
> On Fri, 6 Sep 2002, Andrew Morton wrote:
> > Rik van Riel wrote:
> > > On Fri, 6 Sep 2002, Andrew Morton wrote:
> > >
> > > I guess this means the dirty limit should be near 1% for the VM.
> >
> > What is the thinking behind that?
>
> Dirty pages could sit on the list practically forever if there are
> enough clean pages. This means we can have a significant amount of
> memory "parked" on the dirty list, without it ever getting reclaimed,
> even if we could use the memory for something better.

yes. We could have up to 10% (the default value of
dirty_background_ratio) of physical memory just sitting there for up
to 30 seconds (the default value of dirty_expire_centisecs).

(And that 10% may well go back to 30% or 40% - starting writeback
earlier will hurt some things, such as copying 100M of files on a 256M
machine).

You're proposing that we get that IO underway sooner if there is page
reclaim pressure, and that one way to do that is to write one page for
every reclaimed one. Guess that makes sense as much as anything
else ;)

> > I still have not got my head around:
> >
> > > We did this in early 2.4 kernels and it was a disaster. The
> > > reason it was a disaster was that in many workloads we'd always
> > > have some clean pages, and we'd end up always reclaiming those
> > > before even starting writeout on any of the dirty pages.
> >
> > Does this imply that we need to block on writeout *instead* of
> > reclaiming clean pagecache?
>
> No, it means that whenever we reclaim clean pagecache pages, we
> should also start the writeout of some dirty pages.
>
> > We could do something like:
> >
> >	if (zone->nr_inactive_dirty > zone->nr_inactive_clean) {
> >		wakeup_bdflush();	/* Hope this writes the correct zone */
> >		yield();
> >	}
> >
> > which would get the IO underway promptly. But the caller would
> > still go in and gobble remaining clean pagecache.
>
> This is nice, but it would still be possible to have oodles of pages
> "parked" on the dirty list, which we definitely need to prevent.
>
> > So a 1G box running dbench 1000 acts like a 600M box. Which is not
> > a bad model, perhaps. If we can twiddle that 40% up and down based
> > on <mumble> criteria...
>
> Writing out dirty pages whenever we reclaim free pages could fix
> that problem.

OK, I'll give that a whizz.

> > But that separation of the 40% of unusable memory from the 60% of
> > usable memory is done by scanning at present, and it costs a bit
> > of CPU. Not much, but a bit.
>
> There are other reasons we're wasting CPU in scanning:
> 1) the scanning isn't really rate limited yet (or is it?)

Not sure what you mean by this?

My current code wastes CPU in the situation where the zone is choked
with dirty pagecache. It works happily with mem=768M, because only 40%
of the pages in the zone are dirty - worst case, we get a 60% reclaim
success rate.

So I'm looking for ways to fix that. The proposal is to move those
known-to-be-unreclaimable pages elsewhere.

Another possibility might be to say "gee, all dirty. Try the next
zone".
> 2) every thread in the system can fall into the scanning function,
>    so if we have 50 page allocators they'll all happily scan the
>    list, even though the first of these threads already found there
>    wasn't anything freeable

hm. Well, if we push dirty pages onto a different list, and pinned
pages onto the active list, then a zone with no freeable memory should
have a short list to scan.

more hm. It's possible that, because of the per-zone LRU, we end up
putting way too much swap pressure onto anon pages in highmem. For the
1G boxes. This is an interaction between the per-zone LRU and the page
allocator's highmem-first policy.

Have you seen this in 2.4-rmap? It would happen there, I suspect.

> > (btw, is there any reason at all for having page reserves in
> > ZONE_HIGHMEM? I have a suspicion that this is just wasted
> > memory...)
>
> Dunno, but I guess it is to prevent a 4GB box from acting like a
> 900MB box under corner conditions ;)

But we have a meg or so of emergency reserve in ZONE_HIGHMEM which can
only be used by a __GFP_HIGH|__GFP_HIGHMEM allocator, and some more
memory reserved for PF_MEMALLOC|__GFP_HIGHMEM. I don't think anybody
actually does that. Bounce buffers can sometimes do
__GFP_HIGHMEM|__GFP_HIGH, I think.

Strikes me that we could just give that memory back.

* Re: inactive_dirty list

From: Rik van Riel @ 2002-09-07 0:00 UTC
To: Andrew Morton; Cc: linux-mm

On Fri, 6 Sep 2002, Andrew Morton wrote:

> My current code wastes CPU in the situation where the zone is choked
> with dirty pagecache. It works happily with mem=768M, because only
> 40% of the pages in the zone are dirty - worst case, we get a 60%
> reclaim success rate.

Which still doesn't deal with the situation where the dirty pages are
primarily anonymous or MAP_SHARED pages, which don't fall under your
dirty page accounting.

> So I'm looking for ways to fix that. The proposal is to move those
> known-to-be-unreclaimable pages elsewhere.

Basically, when scanning the zone we'll see "hmmm, all pages were
dirty and I scheduled a whole bunch for writeout" and we _know_ it
doesn't make sense for other threads to also scan this zone over and
over again, at least not until a significant amount of IO has
completed.

> Another possibility might be to say "gee, all dirty. Try the next
> zone".

Note that this also means we shouldn't submit ALL the dirty pages we
run into for IO. If we submit a GB worth of dirty pages from
ZONE_HIGHMEM for IO, it'll take _ages_ before the IO for ZONE_NORMAL
completes.

Worse, if we're keeping the IO queues busy with ZONE_HIGHMEM pages we
could create starvation of the other zones.

Another effect is that a GB of writes is sure to slow down any
subsequent reads, even if 100 MB of RAM has already been freed...

Because of this I want to make sure we only submit a sane amount of
pages for IO at once, maybe <pulls number out of hat>
max(zone->pages_high, 4 * (zone->pages_high - zone->free_pages))?

> more hm. It's possible that, because of the per-zone LRU, we end up
> putting way too much swap pressure onto anon pages in highmem. For
> the 1G boxes. This is an interaction between the per-zone LRU and
> the page allocator's highmem-first policy.
>
> Have you seen this in 2.4-rmap? It would happen there, I suspect.

Shouldn't happen in 2.4-rmap, I've been careful to avoid any kind of
worst-case scenarios like that by having a number of different
watermarks.

Basically kswapd won't free pages from a zone which isn't in severe
trouble if we don't have a global memory shortage, so we will have
allocated memory from each zone already before freeing the next batch
of highmem pages.

> I don't think anybody actually does that. Bounce buffers can
> sometimes do __GFP_HIGHMEM|__GFP_HIGH, I think.
>
> Strikes me that we could just give that memory back.

You're right, duh.

cheers,

Rik

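[Editor's note: Rik's out-of-a-hat clamp, written out as a sketch. It
bounds per-pass IO submission so ZONE_HIGHMEM writeback cannot
monopolise the queues: at least pages_high worth of pages, more when
the zone's free-page deficit is large.]

	struct zone { unsigned long free_pages, pages_high; };

	unsigned long max_writeout(const struct zone *z)
	{
		unsigned long deficit = 0;

		if (z->pages_high > z->free_pages)
			deficit = z->pages_high - z->free_pages;

		/* max(pages_high, 4 * (pages_high - free_pages)) */
		return 4 * deficit > z->pages_high ? 4 * deficit
						   : z->pages_high;
	}
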
* Re: inactive_dirty list

From: Andrew Morton @ 2002-09-07 0:29 UTC
To: Rik van Riel; Cc: linux-mm

Rik van Riel wrote:
>
> On Fri, 6 Sep 2002, Andrew Morton wrote:
>
> > My current code wastes CPU in the situation where the zone is
> > choked with dirty pagecache. It works happily with mem=768M,
> > because only 40% of the pages in the zone are dirty - worst case,
> > we get a 60% reclaim success rate.
>
> Which still doesn't deal with the situation where the dirty pages
> are primarily anonymous or MAP_SHARED pages, which don't fall under
> your dirty page accounting.

That's right - we're writing those things out as soon as we scan them
at present. If we move them over to the dirty page list when their
dirtiness is discovered then the normal writeback stuff would kick in.
But it's laggy, of course.

> > So I'm looking for ways to fix that. The proposal is to move those
> > known-to-be-unreclaimable pages elsewhere.
>
> Basically, when scanning the zone we'll see "hmmm, all pages were
> dirty and I scheduled a whole bunch for writeout" and we _know_ it
> doesn't make sense for other threads to also scan this zone over and
> over again, at least not until a significant amount of IO has
> completed.

Yup. But with this proposal it's "hmm, the inactive_clean list has
zero pages, and the inactive_dirty list has 100,000 pages". The VM
knows exactly what is going on, without any scanning. The appropriate
action would be to kick pdflush, advance to the next zone, and if that
fails, take a nap.

> > Another possibility might be to say "gee, all dirty. Try the next
> > zone".
>
> Note that this also means we shouldn't submit ALL the dirty pages we
> run into for IO. If we submit a GB worth of dirty pages from
> ZONE_HIGHMEM for IO, it'll take _ages_ before the IO for ZONE_NORMAL
> completes.

The mapping->dirty_pages-based writeback doesn't know about zones...
Which is good in a way, because we can schedule IO in
filesystem-friendly patterns.

> Worse, if we're keeping the IO queues busy with ZONE_HIGHMEM pages
> we could create starvation of the other zones.

Right. So for a really big high:low ratio, that could be a problem.

For these systems, in practice, we know where the cleanable
ZONE_NORMAL pagecache lives:
blockdev_superblock->inodes->mapping->dirty_pages. So we could easily
schedule IO specifically targeted at the normal zone if needed. But it
will be slow whatever we do, because dirty blockdev pagecache is
splattered all over the platter.

> Another effect is that a GB of writes is sure to slow down any
> subsequent reads, even if 100 MB of RAM has already been freed...
>
> Because of this I want to make sure we only submit a sane amount of
> pages for IO at once, maybe <pulls number out of hat>
> max(zone->pages_high, 4 * (zone->pages_high - zone->free_pages))?

And what, may I ask, was wrong with 42? ;)

Point taken on the IO starvation thing. But you know my opinion of the
read-vs-write policy in the IO scheduler...

> > more hm. It's possible that, because of the per-zone LRU, we end
> > up putting way too much swap pressure onto anon pages in highmem.
> > For the 1G boxes. This is an interaction between the per-zone LRU
> > and the page allocator's highmem-first policy.
> >
> > Have you seen this in 2.4-rmap? It would happen there, I suspect.
> Shouldn't happen in 2.4-rmap, I've been careful to avoid any kind of
> worst-case scenarios like that by having a number of different
> watermarks.
>
> Basically kswapd won't free pages from a zone which isn't in severe
> trouble if we don't have a global memory shortage, so we will have
> allocated memory from each zone already before freeing the next
> batch of highmem pages.

I'm not sure that works... If the machine has 800M normal and 200M
highmem and is cruising along with 190M of dirty pagecache (steady
state, via balance_dirty_state) then surely the poor little 10M of
anon pages which are in the highmem zone will be swapped out quite
quickly?

Probably it doesn't matter much - chances are they'll get swapped back
into ZONE_NORMAL and then live a happy life.

* Re: inactive_dirty list

From: Daniel Phillips @ 2002-09-08 21:21 UTC
To: Andrew Morton, Rik van Riel; Cc: linux-mm

On Saturday 07 September 2002 01:34, Andrew Morton wrote:

> You're proposing that we get that IO underway sooner if there is
> page reclaim pressure, and that one way to do that is to write one
> page for every reclaimed one. Guess that makes sense as much as
> anything else ;)

Not really. The correct formula will incorporate the allocation rate
and the inactive dirty/clean balance. The reclaim rate is not
relevant; it is a time-delayed consequence of the above. Relying on it
in a control loop is simply asking for oscillation.

--
Daniel

* Re: inactive_dirty list

From: Rik van Riel @ 2002-09-06 22:22 UTC
To: Andrew Morton; Cc: linux-mm

On Fri, 6 Sep 2002, Rik van Riel wrote:
> On Fri, 6 Sep 2002, Andrew Morton wrote:
>
> > hum. I'm trying to find a model where the VM can just ignore
> > dirty|writeback pagecache. We know how many pages are out there,
> > sure. But we don't scan them. Possible?
>
> Owww duh, I see it now.
>
> So basically pages should _only_ go into the inactive_dirty list
> when they are under writeout.

As an aside, we might want to clamp the amount of in-flight data to a
sane limit and just go to sleep for a bit if the VM has far too much
data in flight already.

If we need 2 MB of extra free memory, it doesn't make sense to
monopolise the whole IO subsystem by writing out 100 MB at once ;)

regards,

Rik

* Re: inactive_dirty list

From: Andrew Morton @ 2002-09-07 2:14 UTC
To: Rik van Riel, linux-mm

Andrew Morton wrote:
>
> Rik, it seems that the time has come...
>
> I was doing some testing overnight with mem=1024m. Page reclaim was
> pretty inefficient at that level: kswapd consumed 6% of CPU on a
> permanent basis (workload was heavy dbench plus looping make -j6
> bzImage). kswapd was reclaiming only 3% of the pages which it was
> looking at.

I have a silly feeling that setting DEF_PRIORITY to "12" will simply
fix this.

Duh.

* Re: inactive_dirty list

From: Rik van Riel @ 2002-09-07 2:10 UTC
To: Andrew Morton; Cc: linux-mm

On Fri, 6 Sep 2002, Andrew Morton wrote:

> I have a silly feeling that setting DEF_PRIORITY to "12" will simply
> fix this.
>
> Duh.

Ideally we'd get rid of DEF_PRIORITY altogether and would just scan
each zone once.

Rik

* Re: inactive_dirty list

From: Andrew Morton @ 2002-09-07 5:28 UTC
To: Rik van Riel; Cc: linux-mm

Rik van Riel wrote:
>
> On Fri, 6 Sep 2002, Andrew Morton wrote:
>
> > I have a silly feeling that setting DEF_PRIORITY to "12" will
> > simply fix this.
> >
> > Duh.
>
> Ideally we'd get rid of DEF_PRIORITY altogether and would just scan
> each zone once.

What I'm doing now is:

	#define DEF_PRIORITY 12		/* puke */

	for (priority = DEF_PRIORITY; priority; priority--) {
		int total_scanned = 0;

		shrink_caches(priority, &total_scanned);
		if (that didn't work) {
			wakeup_bdflush(total_scanned);
			blk_congestion_wait(WRITE, HZ/4);
		}
	}

and in shrink_caches():

	max_scan = zone->nr_inactive >> priority;
	if (max_scan < nr_pages * 2)
		max_scan = nr_pages * 2;
	nr_pages = shrink_zone(zone, max_scan, gfp_mask, nr_pages);

So in effect, for a 32-page reclaim attempt we'll scan 64 pages of
ZONE_HIGHMEM, then 128 pages of ZONE_NORMAL/DMA. If that doesn't yield
32 pages, we ask pdflush to write 3*64 pages. Then take a nap.

Then do it again: scan 64 pages of ZONE_HIGHMEM, then 128 of
ZONE_NORMAL/DMA, then write back 192 pages, then nap.

Then do it again: scan 128 pages of ZONE_HIGHMEM, then 256 of
ZONE_NORMAL/DMA, then write back 384 pages, then nap. etc.

Plus there are the actual pages which we started IO against during the
LRU scan - there can be up to 32 of those.

BTW, it turns out that the main reason why kswapd was going silly was
that the VM is *not* treating the `priority' as a logarithmic thing at
all:

	int max_scan = nr_inactive_pages / priority;

so the claims about scanning 1/64th of the list are crap. That thing
scans 1/6th of the queue on the first pass. In the mem=1G case, that's
30,000 damn pages. Maybe someone should take a look at Marcelo's
kernel?

There are a few warts: pdflush_operation will fail if all pdflush
threads are out doing something (pretty unlikely with the nonblocking
stuff; might happen if writeback has to run get_block()). But we'll be
writing back stuff anyway.

I changed blk_congestion_wait a bit too. The first version would
return immediately if no queues were congested (> 75% full). Now, it
will sleep even if no queues are congested. It will return as soon as
someone puts back a write request. If someone is silly enough to call
blk_congestion_wait() when there are no write requests in flight at
all, they get to take the full 1/4 second sleep.
The mem=1G corner case is fixed, and page reclaim just doesn't figure:

	c012c034   288  0.317709  do_wp_page
	c0144ae0   316  0.348597  __block_commit_write
	c012c910   342  0.377279  do_anonymous_page
	c0143efc   353  0.389414  __find_get_block
	c012f7e0   356  0.392724  find_lock_page
	c012f9f0   356  0.392724  do_generic_file_read
	c01832bc   367  0.404858  ext2_free_branches
	c0136e70   371  0.409271  __free_pages_ok
	c010e7b4   386  0.425818  timer_interrupt
	c01e3cfc   414  0.456707  radix_tree_lookup
	c0141894   434  0.47877   vfs_write
	c012f580   474  0.522896  unlock_page
	c0134348   500  0.551578  kmem_cache_alloc
	c01347d0   531  0.585776  kmem_cache_free
	c013712c   574  0.633212  rmqueue
	c0141320   605  0.667409  generic_file_llseek
	c0156924   616  0.679544  count_list
	c0142c04   617  0.680647  fget
	c01091e0   793  0.874803  system_call
	c0155914   860  0.948714  __d_lookup
	c0144674  1076  1.187     __block_prepare_write
	c014c63c  1184  1.30614   link_path_walk
	c012fcd4 10932 12.0597    file_read_actor
	c0130674 16443 18.1392    generic_file_write_nolock
	c0107048 31293 34.5211    poll_idle

The balancing of the zones looks OK at first glance, and of course the
change in system behaviour under heavy writeout loads is profound.

Let's do the MAP_SHARED-pages-get-a-second-round thing, and it'd be
good if we could come up with some algorithm for setting the current
dirty pagecache clamping level, rather than relying on the dopey
/proc/sys/vm/dirty_async_ratio magic number.

I'm thinking that dirty_async_ratio becomes a maximum ratio, and that
we dynamically lower it when large amounts of dirty pagecache would be
embarrassing. Or maybe there's just no need for this. Dunno.

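[Editor's note: a speculative sketch of the "dynamically lower it"
idea that closes the thread - the <mumble> criteria were never
specified, so the reclaim-pressure input and the floor of one quarter
are pure invention.]

	/* Treat dirty_async_ratio as a ceiling and scale the effective
	 * ratio down linearly as reclaim pressure (0..100) rises,
	 * never dropping below a quarter of the configured value. */
	int effective_dirty_ratio(int dirty_async_ratio, int reclaim_pressure)
	{
		int floor = dirty_async_ratio / 4;

		return dirty_async_ratio -
		       (dirty_async_ratio - floor) * reclaim_pressure / 100;
	}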