* RE: la la la la ... swappiness
From: Linus Torvalds @ 2006-12-05 16:17 UTC
To: Aucoin
Cc: 'Nick Piggin', 'Tim Schmielau',
'Andrew Morton',
clameter, Linux Memory Management List
On Tue, 5 Dec 2006, Aucoin wrote:
>
> > Louis, exactly how do you allocate that big 1.6GB shared area?
>
> Ummm, shm_open, ftruncate, mmap ? Is it a trick question ? The process
> responsible for initially setting up the shared area doesn't stay resident.
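For concreteness, a minimal sketch of that sequence. The segment name, the
size constant, and the mlock() step are illustrative assumptions; the thread
only names the three calls:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_SIZE (1600UL * 1024 * 1024)	/* ~1.6GB */

int main(void)
{
	void *p;
	int fd = shm_open("/bigshm", O_CREAT | O_RDWR, 0600);

	if (fd < 0) { perror("shm_open"); return 1; }
	if (ftruncate(fd, SHM_SIZE) < 0) { perror("ftruncate"); return 1; }

	p = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) { perror("mmap"); return 1; }

	/* mlock() is an assumption here; note a lock only lasts as long
	 * as the locking process's mapping does. */
	if (mlock(p, SHM_SIZE) < 0)
		perror("mlock");

	/* ... initialize the shared area, then exit; the shm object
	 * itself persists for the processes that attach later. */
	return 0;
}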
Not a trick question, I just suddenly realized that I really should have
expected the SHM pages to show up in the LRU lists (either inactive or
active) and show up as "cached" pages too. AFAIK, the SHM routines all
end up using the page cache and the LRU for the backing store.
But your 1.6GB thing doesn't show up anywhere.
(I'm sure it's intentional, and I've just forgotten some detail. We
probably remove pages from the LRU lists when they are locked. Anyway, my
original point was that since the pages _aren't_ on the LRU lists, the VM
really should basically act as if they didn't exist at all, but there are
probably things that still base their decisions on the _total_ amount of
memory)
Linus
* Re: la la la la ... swappiness
From: Andrew Morton @ 2006-12-05 16:59 UTC
To: Linus Torvalds
Cc: Aucoin, 'Nick Piggin', 'Tim Schmielau',
clameter, Linux Memory Management List
On Tue, 5 Dec 2006 08:17:51 -0800 (PST)
Linus Torvalds <torvalds@osdl.org> wrote:
>
>
> On Tue, 5 Dec 2006, Aucoin wrote:
> >
> > > Louis, exactly how do you allocate that big 1.6GB shared area?
> >
> > Ummm, shm_open, ftruncate, mmap ? Is it a trick question ? The process
> > responsible for initially setting up the shared area doesn't stay resident.
>
> Not a trick question, I just suddenly realized that I really should have
> expected the SHM pages to show up in the LRU lists (either inactive or
> active) and show up as "cached" pages too. AFAIK, the SHM routines all
> end up using the page cache and the LRU for the backing store.
>
> But your 1.6GB thing doesn't show up anywhere.
>
> (I'm sure it's intentional, and I've just forgotten some detail. We
> probably remove pages from the LRU lists when they are locked. Anyway, my
> original point was that since the pages _aren't_ on the LRU lists, the VM
> really should basically act as if they didn't exist at all, but there are
> probably things that still base their decisions on the _total_ amount of
> memory)
>
Yes, those pages should be on the LRU. I suspect they never got paged in
or something. But that would mean they weren't mlocked. It's a mystery.
* Re: la la la la ... swappiness
From: aucoin @ 2006-12-05 17:41 UTC
To: Andrew Morton
Cc: Linus Torvalds, Aucoin, 'Nick Piggin',
'Tim Schmielau',
clameter, Linux Memory Management List
> Yes, those pages should be on the LRU. I suspect they never got
Oops, details, details.
These are huge pages ... apologies for leaving that out.
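The thread doesn't say which mechanism was used for the huge pages; on a
2.6-era kernel the common routes are hugetlbfs or SysV shm with SHM_HUGETLB
(shm_open() has no huge-page flag). A hedged sketch of the SysV route, with
an illustrative key and size:

#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#ifndef SHM_HUGETLB
#define SHM_HUGETLB 04000		/* from <linux/shm.h> */
#endif

#define SEG_SIZE (1600UL * 1024 * 1024)	/* must be a hugepage multiple */

int main(void)
{
	void *p;
	int id = shmget(IPC_PRIVATE, SEG_SIZE,
			SHM_HUGETLB | IPC_CREAT | SHM_R | SHM_W);

	if (id < 0) { perror("shmget"); return 1; }
	p = shmat(id, NULL, 0);
	if (p == (void *)-1) { perror("shmat"); return 1; }

	/* Huge pages never appear on the LRU lists, which is why the
	 * segment is invisible to the reclaim arithmetic. */
	return 0;
}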
* Re: la la la la ... swappiness
From: Christoph Lameter @ 2006-12-05 18:31 UTC
To: Aucoin
Cc: Andrew Morton, Linus Torvalds, 'Nick Piggin',
'Tim Schmielau',
Linux Memory Management List
On Tue, 5 Dec 2006, aucoin@houston.rr.com wrote:
> From: Andrew Morton <akpm@osdl.org>
> > Yes, those pages should be on the LRU. I suspect they never got
> Oops, details, details.
> These are huge pages ... apologies for leaving that out.
We do not support swapping / reclaim for huge pages.
* Re: la la la la ... swappiness
From: Linus Torvalds @ 2006-12-05 18:44 UTC
To: Christoph Lameter
Cc: Aucoin, Andrew Morton, 'Nick Piggin',
'Tim Schmielau',
Linux Memory Management List
On Tue, 5 Dec 2006, Christoph Lameter wrote:
>
> We do not support swapping / reclaim for huge pages.
Well, Louis doesn't actually _want_ swapping or reclaim on them. He just
wants the system to run well with the remaining 400MB of memory in his
machine.
Which it doesn't. It just OOM's for some reason.
We still haven't seen the oom debug output though, I think. It should talk
about some of the state (it calls "show_mem()", which should call
"show_free_areas()", which should tell a lot about why the heck it
thought it was out of memory).
But maybe Louis posted it and I just missed it.
Anyway, if it's hugepages, then I don't see why Louis even _wants_ to turn
down swappiness. The hugepages won't be swapped out regardless.
Linus
* Re: la la la la ... swappiness
From: Christoph Lameter @ 2006-12-05 19:32 UTC
To: Linus Torvalds
Cc: Aucoin, Andrew Morton, 'Nick Piggin',
'Tim Schmielau',
Linux Memory Management List
On Tue, 5 Dec 2006, Linus Torvalds wrote:
> On Tue, 5 Dec 2006, Christoph Lameter wrote:
> > We do not support swapping / reclaim for huge pages.
>
> Well, Louis doesn't actually _want_ swapping or reclaim on them. He just
> wants the system to run well with the remaining 400MB of memory in his
> machine.
>
> Which it doesn't. It just OOM's for some reason.
If you take huge chunks of memory out of a zone then the dirty limits as
well as the min_free_kbytes etc. are all off. As a result the VM may
behave strangely. F.e. too many dirty pages may cause an OOM since we do
not enter synchronous writeout during reclaim.
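To make that arithmetic concrete, a worked example with assumed numbers
(a 2GB box with 1.6GB in huge pages and dirty_ratio of 40; none of these
values are taken from the thread):

#include <stdio.h>

int main(void)
{
	unsigned long total_mb    = 2048;	/* machine size         */
	unsigned long hugetlb_mb  = 1600;	/* gone from the LRU    */
	unsigned long dirty_ratio = 40;		/* vm.dirty_ratio, in % */

	/* The limit is derived from total memory ...                 */
	unsigned long dirty_limit_mb = total_mb * dirty_ratio / 100;
	/* ... but only this much is actually reclaimable.            */
	unsigned long reclaimable_mb = total_mb - hugetlb_mb;

	/* Prints "dirty limit: 819MB, reclaimable: 448MB": writers can
	 * never be throttled before the LRU fills with dirty pages. */
	printf("dirty limit: %luMB, reclaimable: %luMB\n",
	       dirty_limit_mb, reclaimable_mb);
	return 0;
}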
* Re: la la la la ... swappiness
From: Andrew Morton @ 2006-12-05 20:02 UTC
To: Christoph Lameter
Cc: Linus Torvalds, Aucoin, 'Nick Piggin',
'Tim Schmielau',
Linux Memory Management List
On Tue, 5 Dec 2006 11:32:21 -0800 (PST)
Christoph Lameter <clameter@sgi.com> wrote:
> On Tue, 5 Dec 2006, Linus Torvalds wrote:
> > On Tue, 5 Dec 2006, Christoph Lameter wrote:
> > > We do not support swapping / reclaim for huge pages.
> >
> > Well, Louis doesn't actually _want_ swapping or reclaim on them. He just
> > wants the system to run well with the remaining 400MB of memory in his
> > machine.
> >
> > Which it doesn't. It just OOM's for some reason.
>
> If you take huge chunks of memory out of a zone then the dirty limits as
> well as the min_free_kbytes etc. are all off. As a result the VM may
> behave strangely. F.e. too many dirty pages may cause an OOM since we do
> not enter synchronous writeout during reclaim.
Yes, it's quite possible that this setup would cause the page reclaim
arithmetic to go wrong.
But otoh, it's a very common scenario, and nobody has observed it before.
For example:
akpm2:/home/akpm# echo 4000 > /proc/sys/vm/nr_hugepages
Free memory on this box instantly fell from 7G down to ~250MB. It's now
happily chugging its way through a `dbench 512' run.
But this is a 64-bit machine. Could be that there are problems on 32-bit.
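For anyone repeating the experiment, a minimal sketch of checking that the
reservation took effect, assuming the standard HugePages_* fields in
/proc/meminfo:

#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[256];
	FILE *f = fopen("/proc/meminfo", "r");

	if (!f) { perror("fopen"); return 1; }
	/* Print the huge page counters plus free memory. */
	while (fgets(line, sizeof(line), f))
		if (!strncmp(line, "HugePages_", 10) ||
		    !strncmp(line, "MemFree", 7))
			fputs(line, stdout);
	fclose(f);
	return 0;
}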
* Re: la la la la ... swappiness
From: Christoph Lameter @ 2006-12-05 20:15 UTC
To: Andrew Morton
Cc: Linus Torvalds, Aucoin, 'Nick Piggin',
'Tim Schmielau',
Linux Memory Management List
On Tue, 5 Dec 2006, Andrew Morton wrote:
> But otoh, it's a very common scenario, and nobody has observed it before.
This is the same scenario as mlocked memory. Kame-san has recently posted
an occurrence in ZONE_DMA. I have 3 customers where I have seen similar VM
behavior with a special shared memory thingy locking down lots of
memory.
In fact in the NUMA case with cpusets the limits being off is a very
common problem. F.e. the dirty balancing logic does not take into account
that the application can just run on a subset of the machine. So if a
cpuset is just 1/10th of the whole machine then we will never be able to
reach the dirty limits, all the nodes of a cpuset may be filled up with
dirty pages. A simple cp of a large file will bring the machine into a
continual reclaim on all nodes.
I am working on a solution for the dirty throttling but we have similar
issues for the other limits. I wonder if we should not account for
unreclaimable memory per zone and recalculate the limits if they change
significantly. A series of huge page allocations would then retune the
limits.
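A purely hypothetical sketch of that retuning idea; nothing like this
exists in 2.6.19, and the names and formula here are invented for
illustration only:

/* Stand-in for the real struct zone. */
struct zone_sketch {
	unsigned long present_pages;	/* pages physically in the zone */
	unsigned long unreclaimable;	/* hugetlb, mlocked, kernel ... */
	unsigned long min_free;		/* watermark derived from the rest */
};

/* Call whenever unreclaimable pages enter or leave a zone, e.g. on
 * a huge page allocation; watermarks then follow what the VM can
 * actually reclaim instead of the raw zone size. */
static void zone_retune(struct zone_sketch *z, long delta)
{
	z->unreclaimable += delta;
	/* Placeholder formula for the recalculated watermark. */
	z->min_free = (z->present_pages - z->unreclaimable) / 1024;
}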
* Re: la la la la ... swappiness
From: aucoin @ 2006-12-05 20:39 UTC
To: Linus Torvalds
Cc: Christoph Lameter, Andrew Morton, 'Nick Piggin',
'Tim Schmielau',
Linux Memory Management List
I didn't post it yet; I don't have a recent build with oom enabled at
the moment, so I was digging through old bugzillas to see what I could
find. Here are some pieces from one oom firing. They're from old runs,
and based on the bugzilla context I can't swear it's exactly the same
problem; I'm looking for more. The "ae" process being killed is one of
the three processes attached to the 1.6G shm.
Oct 11 19:06:38 QAR2MOVDB2 kernel: [<c01049a4>] show_trace+0xd/0xf
Oct 11 19:06:38 QAR2MOVDB2 kernel: [<c0104a43>] dump_stack+0x17/0x19
Oct 11 19:06:38 QAR2MOVDB2 kernel: [<c0138f44>] out_of_memory+0x27/0x12f
Oct 11 19:06:38 QAR2MOVDB2 kernel: [<c013a617>] __alloc_pages+0x1e1/0x261
Oct 11 19:06:38 QAR2MOVDB2 kernel: [<c013a6bf>] __get_free_pages+0x28/0x37
Oct 11 19:06:38 QAR2MOVDB2 kernel: [<c015f066>] __pollwait+0x33/0x9e
Oct 11 19:06:38 QAR2MOVDB2 kernel: [<c01eb25c>] mqueue_poll_file+0x27/0x57
Oct 11 19:06:38 QAR2MOVDB2 kernel: [<c015fb9b>] do_sys_poll+0x165/0x2da
Oct 11 19:06:38 QAR2MOVDB2 kernel: [<c015ff24>] sys_poll+0x43/0x47
Oct 11 19:06:38 QAR2MOVDB2 kernel: [<c0103513>] sysenter_past_esp+0x54/0x75
Oct 11 19:08:19 QAR2MOVDB2 kernel: Free swap = 421008kB
Oct 11 19:08:19 QAR2MOVDB2 kernel: Total swap = 524276kB
Oct 11 19:08:19 QAR2MOVDB2 kernel: Free swap: 421008kB
Oct 11 19:08:19 QAR2MOVDB2 kernel: 524224 pages of RAM
Oct 11 19:08:19 QAR2MOVDB2 kernel: 294848 pages of HIGHMEM
Oct 11 19:08:19 QAR2MOVDB2 kernel: 5437 reserved pages
Oct 11 19:08:19 QAR2MOVDB2 kernel: 1340645 pages shared
Oct 11 19:08:19 QAR2MOVDB2 kernel: 25817 pages swap cached
Oct 11 19:08:19 QAR2MOVDB2 kernel: 107 pages dirty
Oct 11 19:08:19 QAR2MOVDB2 kernel: 45405 pages writeback
Oct 11 19:08:19 QAR2MOVDB2 kernel: 2638 pages mapped
Oct 11 19:08:19 QAR2MOVDB2 kernel: 29632 pages slab
Oct 11 19:08:19 QAR2MOVDB2 kernel: 385 pages pagetables
Oct 11 19:08:19 QAR2MOVDB2 kernel: Out of Memory: Kill process 1636 (ae) score 556471 and children.
Oct 11 19:08:19 QAR2MOVDB2 kernel: Out of memory: Killed process 1636 (ae).
* Re: la la la la ... swappiness
From: Andrew Morton @ 2006-12-05 20:48 UTC
To: Christoph Lameter
Cc: Linus Torvalds, Aucoin, 'Nick Piggin',
'Tim Schmielau',
Linux Memory Management List
On Tue, 5 Dec 2006 12:15:46 -0800 (PST)
Christoph Lameter <clameter@sgi.com> wrote:
> On Tue, 5 Dec 2006, Andrew Morton wrote:
>
> > But otoh, it's a very common scenario, and nobody has observed it before.
>
> This is the same scenario as mlocked memory.
Not quite - mlocked pages are on the page LRU and hence contribute to the
arithmetic in there. The hugetlb pages are simply gone.
> Kame-san has recently posted
> an occurrence in ZONE_DMA. I have 3 customers where I have seen similar VM
> behavior with a special shared memory thingy locking down lots of
> memory.
I expect the mechanisms are different. The mlocked shared-memory segment
will fill the LRU with unreclaimable pages and the machine will do lots of
scanning. That's inefficient, but it is unexpected that this would lead to
a false declaration of OOM.
> In fact in the NUMA case with cpusets the limits being off is a very
> common problem. F.e. the dirty balancing logic does not take into account
> that the application can just run on a subset of the machine.
Yup.
> So if a
> cpuset is just 1/10th of the whole machine then we will never be able to
> reach the dirty limits, all the nodes of a cpuset may be filled up with
> dirty pages. A simple cp of a large file will bring the machine into a
> continual reclaim on all nodes.
It shouldn't be continual and it shouldn't be on all nodes. What _should_
happen in this situation is that the dirty pages in those zones are written
back off the LRU by the vm scanner.
That's less efficient from an IO scheduling POV than writing them back via
the inodes, but it should work OK and it shouldn't affect other zones.
If the activity is really "continual" and "on all nodes" then we have some
bugs to fix.
> I am working on a solution for the dirty throttling but we have similar
> issues for the other limits. I wonder if we should not account for
> unreclaimable memory per zone and recalculate the limits if they change
> significantly. A series of huge page allocations would then retune the
> limits.
We should fix the existing code before even thinking about this sort of
thing. Or at least, gain a full understanding of why it is failing.
* Re: la la la la ... swappiness
From: Andrew Morton @ 2006-12-05 20:52 UTC
To: Christoph Lameter, Linus Torvalds, Aucoin, 'Nick Piggin',
'Tim Schmielau',
Linux Memory Management List
On Tue, 5 Dec 2006 12:02:56 -0800
Andrew Morton <akpm@osdl.org> wrote:
> But otoh, it's a very common scenario, and nobody has observed it before.
> For example:
>
> akpm2:/home/akpm# echo 4000 > /proc/sys/vm/nr_hugepages
>
> Free memory on this box instantly fell from 7G down to ~250MB. It's now
> happily chugging its way through a `dbench 512' run.
FS(small)VO "happily". It's running like a complete dog (but I guess
dbench 512 in 256M is a bit mean). But it's still running!
* Re: la la la la ... swappiness
From: Christoph Lameter @ 2006-12-05 20:59 UTC
To: Andrew Morton
Cc: Linus Torvalds, Aucoin, 'Nick Piggin',
'Tim Schmielau',
Linux Memory Management List
On Tue, 5 Dec 2006, Andrew Morton wrote:
> > This is the same scenario as mlocked memory.
>
> Not quite - mlocked pages are on the page LRU and hence contribute to the
> arithmetic in there. The hugetlb pages are simply gone.
They cannot be swapped out and AFAICT the ratio calculations are assuming
that pages can be evicted.
> > So if a
> > cpuset is just 1/10th of the whole machine then we will never be able to
> > reach the dirty limits, all the nodes of a cpuset may be filled up with
> > dirty pages. A simple cp of a large file will bring the machine into a
> > continual reclaim on all nodes.
>
> It shouldn't be continual and it shouldn't be on all nodes. What _should_
I meant all nodes of the cpuset.
> happen in this situation is that the dirty pages in those zones are written
> back off the LRU by the vm scanner.
Right, in the best case that occurs. However, since we do not recognize
that we are in a dirty overload situation we may not do synchronous
writes but return without having reclaimed any memory (a particular
problem exists here in connection with NFS's well-known memory
problems). If memory gets completely clogged then we OOM.
> That's less efficient from an IO scheduling POV than writing them back via
> the inodes, but it should work OK and it shouldn't affect other zones.
Could we get to the inode from the reclaim path and just start writing out
all dirty pages of the inode?
> If the activity is really "continual" and "on all nodes" then we have some
> bugs to fix.
It's continual on the nodes of the cpuset. Reclaim is constantly running
and becomes very inefficient.
* Re: la la la la ... swappiness
From: Andrew Morton @ 2006-12-05 21:39 UTC
To: Christoph Lameter
Cc: Linus Torvalds, Aucoin, 'Nick Piggin',
'Tim Schmielau',
Linux Memory Management List
On Tue, 5 Dec 2006 12:59:14 -0800 (PST)
Christoph Lameter <clameter@sgi.com> wrote:
> On Tue, 5 Dec 2006, Andrew Morton wrote:
>
> > > This is the same scenario as mlocked memory.
> >
> > Not quite - mlocked pages are on the page LRU and hence contribute to the
> > arithmetic in there. The hugetlb pages are simply gone.
>
> They cannot be swapped out and AFAICT the ratio calculations are assuming
> that pages can be evicted.
Some calculations assume that. But a lot (most) of the reclaim code is
paced by number-of-pages-scanned. mlocked pages on the LRU will be noted
by the scanner and will cause priority elevation, throttling, etc. Pages
which have been gobbled by hugetlb will not.
> > > So if a
> > > cpuset is just 1/10th of the whole machine then we will never be able to
> > > reach the dirty limits, all the nodes of a cpuset may be filled up with
> > > dirty pages. A simple cp of a large file will bring the machine into a
> > > continual reclaim on all nodes.
> >
> > It shouldn't be continual and it shouldn't be on all nodes. What _should_
>
> I meant all nodes of the cpuset.
>
> > happen in this situation is that the dirty pages in those zones are written
> > back off the LRU by the vm scanner.
>
> Right in the best case that occurs.
We want it to work in all cases.
> However, since we do not recognize
> that we are in a dirty overload situation we may not do synchronous
> writes but return without having reclaimed any memory
Return from what? try_to_free_pages() or balance_dirty_pages()?
The behaviour of page reclaim is independent of the level of dirty memory
and of the dirty-memory thresholds, as far as I recall...
> (a particular
> problem exists here in connection with NFS's well-known memory
> problems). If memory gets completely clogged then we OOM.
NFS causes problems because it needs to allocate memory (skbs) to be able
to write back dirty memory. There have been fixes and things have
improved, but I wouldn't be surprised if there are still problems.
> > That's less efficient from an IO scheduling POV than writing them back via
> > the inodes, but it should work OK and it shouldn't affect other zones.
>
> Could we get to the inode from the reclaim path and just start writing out
> all dirty pages of the inode?
Yeah, maybe. But of course the pages on the inode can be from any zone at
all so the problem is that in some scenarios, we could write out tremendous
numbers of pages from zones which don't need that writeout.
> > If the activity is really "continual" and "on all nodes" then we have some
> > bugs to fix.
>
> It's continual on the nodes of the cpuset. Reclaim is constantly running
> and becomes very inefficient.
I think what you're saying is that we're not throttling in
balance_dirty_pages(). So a large write() which is performed by a process
inside your one-tenth-of-memory cpuset will just go and dirty all of the
pages in that cpuset's nodes and things get all gummed up.
That can certainly happen, and I suppose we can make changes to
balance_dirty_pages() to fix it (although it will have the
we-wrote-lots-of-pages-we-didnt-need-to failure mode).
But right now in 2.6.19 the machine should _not_ declare oom in this
situation. If it does, then we should fix that. If it's only happening
with NFS then yeah, OK, mumble, NFS still needs work.
* Re: la la la la ... swappiness
From: Christoph Lameter @ 2006-12-05 23:20 UTC
To: Andrew Morton
Cc: Linus Torvalds, Aucoin, 'Nick Piggin',
'Tim Schmielau',
Linux Memory Management List
On Tue, 5 Dec 2006, Andrew Morton wrote:
> > However, since we do not recognize
> > that we are in a dirty overload situation we may not do synchronous
> > writes but return without having reclaimed any memory
>
> Return from what? try_to_free_pages() or balance_dirty_pages()?
If we do not reach the dirty_ratio then we will not block but simply
trigger writeouts.
try_to_free_pages() will trigger pdflush and we may wait 1/10th of a
second in congestion_wait() and in throttle_vm_writeout() (well not
really since we check global limits) but we will not block. I think what
happens is that try_to_free_pages() (given sufficient slowness of the
writeout) at some point will start to return 0 and thus we OOM.
> The behaviour of page reclaim is independent of the level of dirty memory
> and of the dirty-memory thresholds, as far as I recall...
You cannot easily free a dirty page. We can only trigger writeout.
> > Could we get to the inode from the reclaim path and just start writing out
> > all dirty pages of the inode?
>
> Yeah, maybe. But of course the pages on the inode can be from any zone at
> all so the problem is that in some scenarios, we could write out tremendous
> numbers of pages from zones which don't need that writeout.
But we know that at least one page was in the correct zone. Writeout will
be much faster if we can write a series of blocks in sequence via the inode.
> > It's continual on the nodes of the cpuset. Reclaim is constantly running
> > and becomes very inefficient.
>
> I think what you're saying is that we're not throttling in
> balance_dirty_pages(). So a large write() which is performed by a process
> inside your one-tenth-of-memory cpuset will just go and dirty all of the
> pages in that cpuset's nodes and things get all gummed up.
Correct.
> That can certainly happen, and I suppose we can make changes to
> balance_dirty_pages() to fix it (although it will have the
> we-wrote-lots-of-pages-we-didnt-need-to failure mode).
Right. In addition to checking the limits of the nodes in the current
cpuset (requires looping over all nodes and adding up the counters we
need), I made some modifications to pass a set of nodes in the
writeback_control structure. We can then check if there are sufficient
pages of the inode within the nodes of the cpuset. But I am a bit
concerned about performance.
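A hypothetical sketch of that modification; the nodemask field and the
helper below are invented here for illustration and are not in the 2.6.19
tree:

/* Stand-ins for nodemask_t and writeback_control. */
typedef unsigned long nodemask_sketch_t;

struct wbc_sketch {
	long nr_to_write;
	nodemask_sketch_t *nodes;	/* only write pages on these nodes */
};

/* In the per-inode writeback loop: skip pages living outside the
 * cpuset's nodes so one cpuset's reclaim doesn't write the world. */
static int page_node_wanted(int nid, struct wbc_sketch *wbc)
{
	if (!wbc->nodes)		/* no restriction: global writeback */
		return 1;
	return (*wbc->nodes >> nid) & 1;
}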
> But right now in 2.6.19 the machine should _not_ declare oom in this
> situation. If it does, then we should fix that. If it's only happening
> with NFS then yeah, OK, mumble, NFS still needs work.
We OOM only in some rare cases. Mostly it seems that the
machines just become extremely slow and the LRU locks become hot.
* RE: la la la la ... swappiness
From: Aucoin @ 2006-12-12 15:12 UTC
To: 'Christoph Lameter', 'Andrew Morton'
Cc: 'Linus Torvalds', 'Nick Piggin',
'Tim Schmielau', 'Linux Memory Management List'
For what it's worth, we tried a version of tar recompiled with calls to
posix_fadvise() and the no-reuse flag, but it had no effect on the issue.
Inactive pages still accumulated to the point of invoking swap instead of
being reclaimed.
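A minimal sketch of that experiment; posix_fadvise() and POSIX_FADV_NOREUSE
are standard POSIX, but the actual changes made to tar are not shown in the
thread:

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int err, fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s file\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) { perror("open"); return 1; }

	/* Hint that the data will be read once and not reused; note
	 * posix_fadvise() returns the error number directly. */
	err = posix_fadvise(fd, 0, 0, POSIX_FADV_NOREUSE);
	if (err)
		fprintf(stderr, "posix_fadvise: %s\n", strerror(err));

	/* ... stream the file as tar would ... */

	close(fd);
	return 0;
}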