Re: 0-order allocation problem

* Re: 0-order allocation problem
       [not found] <Pine.LNX.4.21.0108152049100.973-100000@localhost.localdomain>
@ 2001-08-15 20:45 ` Linus Torvalds
  2001-08-15 20:55   ` Marcelo Tosatti
                     ` (4 more replies)
  0 siblings, 5 replies; 21+ messages in thread
From: Linus Torvalds @ 2001-08-15 20:45 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Marcelo Tosatti, linux-mm

[ cc'd to linux-mm and Marcelo, as this was kind of interesting ]

On Wed, 15 Aug 2001, Hugh Dickins wrote:
>
> Exactly as you predict.  A batch of the printks at the usual point,
> then it recovers and proceeds happily on its way.  Same again each
> time (except first time clean as usual).  I should be pleased, but
> I feel dissatisfied.  I guess it's right for create_buffers() to
> try harder, but I'm surprised it got into that state at all.
> I'll try to understand it better.

Ok, then I understand the scenario.

This could _possibly_ be triggered by other things than swapoff too, but
it would probably be much harder. What happens is:

 - we have tons of free memory - so much that both inactive_shortage() and
   free_shortage() are happy as clams, and kswapd or anybody else won't
   ever try to balance out the fact that we have unusually low counts of
   inactive data while having a high "inactive_target".

   The only strange thing is that _despite_ having tons of memory, we are
   really short on inactive pages, because swapoff() really ate them all
   up.

   This part is fine. We're doing the right thing - if we have tons of
   memory, we shouldn't care. I'm just saying that it's unusual to be both
   short on some things and extremely well off on others.

 - Because we have lots of memory, we can easily allocate that free memory
   to user pages etc, and nobody will start checking the VM balance
   because the allocations themselves work out really well and never even
   feel that they have to wake up kswapd. So we quickly deplete the free
   pages that used to hide the imbalance.

   Now we're in a situation where we're low on memory, but we're _also_ in
   the unusual situation that we have almost no inactive pages, while at
   the same time having a high inactive target.

So fairly suddenly _everybody_ goes from "oh, we have tons of memory" to
"uhhuh, we're several thousand pages short of our inactive target".
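
To put rough numbers on that flip, here is a user-space sketch. It models
the global check as "inactive_dirty + inactive_clean + free must reach
freepages.high + inactive_target", which is how the inactive_shortage()
code touched by the vmscan.c hunk in this thread adds things up. The
struct, the function name and every page count are made up for
illustration; this is not kernel code.

```c
/* Hypothetical snapshot of the global VM counters, all in pages. */
struct vm_counts {
	unsigned long inactive_dirty;
	unsigned long inactive_clean;
	unsigned long free_pages;
	unsigned long freepages_high;	/* stands in for freepages.high */
	unsigned long inactive_target;
};

/*
 * Number of pages missing from the global target, or 0 if the target
 * is met. Free pages count towards the target, so a large free pool
 * hides an almost empty inactive list.
 */
unsigned long inactive_shortfall(const struct vm_counts *c)
{
	unsigned long have = c->inactive_dirty + c->inactive_clean +
			     c->free_pages;
	unsigned long want = c->freepages_high + c->inactive_target;

	return have < want ? want - have : 0;
}
```

With 50 inactive pages, a target of 1000 + 4000 and 60000 pages free, the
shortfall is 0. Let the free count drop to 250 and the same call reports
4700 pages missing, although nothing about the inactive lists changed.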

Now, this is really not much of a problem normally, because normal
applications will just loop on try_to_free_pages() until they're happy
again. So for normal allocations, the worst that can happen is that
because of the sudden shift in balance, we'll get a lot of queue activity.
Not a big deal - in fact that's exactly what we want.

Not a big deal _except_ for GFP_NOFS (ie buffer) allocations and in
particular kswapd. Because those are special-cased, and return NULL
earlier (GFP_NOFS because __GFP_FS isn't set, and kswapd because
PF_MEMALLOC is set).

Which is _exactly_ why refill_freelist() will do its extra song-and-dance
number.

And guess what? create_buffers() for the "struct page" case doesn't do
that. It just yields and hopes the situation goes away. And as that is the
thing that we want to use for writing out swap etc, we get the situation
where one of the most probable yielders in this case is kswapd. And the
situation never improves, surprise surprise. Most everybody will be in
PF_MEMALLOC and not make any progress.

This is why when you do the full song-and-dance in the create_buffers()
case too, the problem just goes away. Instead of waiting for things to
improve, we will actively try to improve them, and sure as hell, we have
lots of pages that we can evict if we just try. So instead of getting a
comatose machine, you get one that says a few times "I had trouble getting
memory", and then it continues happily.

Case solved.

Moral of the story: don't just hope things will improve. Do something
about it.

Other moral of the story: this "let's hope things improve" problem was
probably hidden by previously having refill_inactive() scan forever until
it hit its target. Or rather - I suspect that code was written exactly
because Rik or somebody _did_ hit this, and made refill_inactive() work
that way to make up for the simple fact that fs/buffer.c was broken.

And finally: It's not a good idea to try to make the VM make up for broken
kernel code.

Btw, the whole comment around the fs/buffer.c braindamage is telling:

        /* We're _really_ low on memory. Now we just
         * wait for old buffer heads to become free due to
         * finishing IO.  Since this is an async request and
         * the reserve list is empty, we're sure there are
         * async buffer heads in use.
         */
        run_task_queue(&tq_disk);

        current->policy |= SCHED_YIELD;
        __set_current_state(TASK_RUNNING);
        schedule();
        goto try_again;

It used to be correct, say about a few years ago. It's simply not true any
more: yes, we obviously have async buffer heads in use, but they don't
just free up when IO completes. They are the buffer heads that we've
allocated to a "struct page" in order to push it out - and they'll be
free'd only by page_launder(). Not by IO completion.

In short: we do have freeable memory. But it won't just come back to us.
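
A toy user-space model of that point, as a sketch only: buffer heads
attached to already-written pages go back to the free pool when somebody
launders the pages, not when the IO finishes. All names here (toy_vm,
toy_io_complete, toy_launder, toy_get_bh) are invented for the
illustration and none of this is kernel code.

```c
#include <stdbool.h>

struct toy_vm {
	int free_bh;		/* buffer heads available for allocation */
	int bh_on_pages;	/* buffer heads still attached to pages */
	int io_in_flight;	/* attached buffer heads with unfinished IO */
};

/* IO completion: data is on disk, but the buffer heads stay attached. */
static void toy_io_complete(struct toy_vm *vm)
{
	vm->io_in_flight = 0;
}

/* Laundering strips buffer heads whose IO is done and frees them. */
static void toy_launder(struct toy_vm *vm)
{
	int done = vm->bh_on_pages - vm->io_in_flight;

	vm->bh_on_pages -= done;
	vm->free_bh += done;
}

/*
 * Try to take one buffer head, retrying up to max_tries times.
 * Between attempts pending IO completes; if `launder` is set the
 * caller also launders pages itself instead of only yielding.
 * Returns the attempt number that succeeded, or -1.
 */
int toy_get_bh(struct toy_vm *vm, bool launder, int max_tries)
{
	int attempt;

	for (attempt = 1; attempt <= max_tries; attempt++) {
		if (vm->free_bh > 0) {
			vm->free_bh--;
			return attempt;
		}
		toy_io_complete(vm);
		if (launder)
			toy_launder(vm);
	}
	return -1;
}
```

Starting from { 0, 8, 8 }, the yield-only variant (launder == false)
never succeeds however many tries it gets, while the laundering variant
gets its buffer head on the second attempt.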

So I'd suggest:
 - the one I already suggested: instead of just yielding, do the same
   thing refill_freelist() does.
 - also apply the one-liner patch which Marcelo already suggested some
   time ago, to just make 0-order allocations of GFP_NOFS loop inside the
   memory allocator until happy, because they _will_ eventually make
   progress.

(The one-liner in itself will probably already help us balance things much
faster and make it harder to hit the problem spot - but the "don't just
yield" thing is probably worth it anyway because when you get into this
situation many page allocators tend to be of the PF_MEMALLOC type, and
they will want to avoid recursion in try_to_free_pages() and will not
trigger the one-liner)
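
For reference, the decision the one-liner changes, reduced to a
user-space sketch. The flag values are placeholders chosen for the
sketch, and should_retry() is an invented name that only mirrors the
shape of the test in __alloc_pages(); the PF_MEMALLOC early exit models
the "avoid recursion" behaviour described above.

```c
#include <stdbool.h>

/* Placeholder bit values, not copied from any kernel header. */
#define SK_GFP_WAIT	0x10u
#define SK_GFP_IO	0x40u
#define SK_GFP_FS	0x100u

/*
 * Would the allocator jump back to try_again?
 * Pass SK_GFP_FS as retry_flag for the old test and SK_GFP_IO for the
 * patched one.
 */
bool should_retry(unsigned int gfp_mask, unsigned int retry_flag,
		  int order, bool free_short, int progress,
		  bool pf_memalloc)
{
	if (pf_memalloc)		/* must not recurse into try_to_free_pages() */
		return false;
	if (!(gfp_mask & SK_GFP_WAIT))	/* caller is not allowed to sleep */
		return false;
	if (order && !free_short)	/* high-order, no free shortage */
		return false;
	return progress || (gfp_mask & retry_flag);
}
```

A 0-order request with WAIT and IO set but FS clear, and no progress made,
gives up under the old test and loops under the patched one. With
pf_memalloc set it gives up either way, which is why the "don't just
yield" change is still wanted.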

So something like the appended (UNTESTED!) should be better. How does it
work for you?

		Linus

-----
diff -u --recursive --new-file pre4/linux/mm/page_alloc.c linux/mm/page_alloc.c
--- pre4/linux/mm/page_alloc.c	Wed Aug 15 02:39:44 2001
+++ linux/mm/page_alloc.c	Wed Aug 15 13:35:02 2001
@@ -450,7 +450,7 @@
 		if (gfp_mask & __GFP_WAIT) {
 			if (!order || free_shortage()) {
 				int progress = try_to_free_pages(gfp_mask);
-				if (progress || (gfp_mask & __GFP_FS))
+				if (progress || (gfp_mask & __GFP_IO))
 					goto try_again;
 				/*
 				 * Fail in case no progress was made and the
diff -u --recursive --new-file pre4/linux/mm/vmscan.c linux/mm/vmscan.c
--- pre4/linux/mm/vmscan.c	Wed Aug 15 02:39:44 2001
+++ linux/mm/vmscan.c	Wed Aug 15 02:37:07 2001
@@ -788,6 +788,9 @@
 			zone_t *zone = pgdat->node_zones + i;
 			unsigned int inactive;

+			if (!zone->size)
+				continue;
+
 			inactive  = zone->inactive_dirty_pages;
 			inactive += zone->inactive_clean_pages;
 			inactive += zone->free_pages;
diff -u --recursive --new-file pre4/linux/fs/buffer.c linux/fs/buffer.c
--- pre4/linux/fs/buffer.c	Wed Aug 15 02:39:41 2001
+++ linux/fs/buffer.c	Wed Aug 15 13:37:35 2001
@@ -794,6 +794,17 @@
 		goto retry;
 }

+static void free_more_memory(void)
+{
+	balance_dirty(NODEV);
+	page_launder(GFP_NOFS, 0);
+	wakeup_bdflush();
+	wakeup_kswapd();
+	current->policy |= SCHED_YIELD;
+	__set_current_state(TASK_RUNNING);
+	schedule();
+}
+
 /*
  * We used to try various strange things. Let's not.
  * We'll just try to balance dirty buffers, and possibly
@@ -802,15 +813,8 @@
  */
 static void refill_freelist(int size)
 {
-	if (!grow_buffers(size)) {
-		balance_dirty(NODEV);
-		page_launder(GFP_NOFS, 0);
-		wakeup_bdflush();
-		wakeup_kswapd();
-		current->policy |= SCHED_YIELD;
-		__set_current_state(TASK_RUNNING);
-		schedule();
-	}
+	if (!grow_buffers(size))
+		free_more_memory();
 }

 void init_buffer(struct buffer_head *bh, bh_end_io_t *handler, void *private)
@@ -1408,9 +1412,7 @@
 	 */
 	run_task_queue(&tq_disk);

-	current->policy |= SCHED_YIELD;
-	__set_current_state(TASK_RUNNING);
-	schedule();
+	free_more_memory();
 	goto try_again;
 }

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

* Re: 0-order allocation problem
  2001-08-15 20:45 ` 0-order allocation problem Linus Torvalds
@ 2001-08-15 20:55   ` Marcelo Tosatti
  2001-08-15 22:30     ` Linus Torvalds
  2001-08-15 23:27     ` Hugh Dickins
  2001-08-15 22:00   ` Rik van Riel
                     ` (3 subsequent siblings)
  4 siblings, 2 replies; 21+ messages in thread
From: Marcelo Tosatti @ 2001-08-15 20:55 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Hugh Dickins, linux-mm

On Wed, 15 Aug 2001, Linus Torvalds wrote:

> So something like the appended (UNTESTED!) should be better.

__GFP_IO is not going to help us that much on anon-intensive workloads
(e.g. swapoff). Remember we are _never_ going to block on buffer_heads of
in-flight swap pages because we can't see them in page_launder(). (If a
page is locked, we simply skip it.)

Hugh, could you check which kind of allocation is failing and from where?
(allocation flags, etc).

> How does it work for you?



* Re: 0-order allocation problem
  2001-08-15 23:09   ` Hugh Dickins
@ 2001-08-15 21:54     ` Marcelo Tosatti
  2001-08-15 23:38     ` Rik van Riel
  1 sibling, 0 replies; 21+ messages in thread
From: Marcelo Tosatti @ 2001-08-15 21:54 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Linus Torvalds, linux-mm


On Thu, 16 Aug 2001, Hugh Dickins wrote:

> On Wed, 15 Aug 2001, Linus Torvalds wrote:
> > 
> > So something like the appended (UNTESTED!) should be better. How does it
> > work for you?
> 
> Many thanks for your explanation.  You've convinced me that
> create_buffers() has very good reason to make that effort.
> 
> Your patch works fine for me, for getting things moving again.
> I'm not sure if you thought it would stop my "0-order allocation failed"
> messages: no, I still get a batch of those before it settles back to work.

What is the gfp_mask of the failing allocations?


* Re: 0-order allocation problem
  2001-08-15 20:45 ` 0-order allocation problem Linus Torvalds
  2001-08-15 20:55   ` Marcelo Tosatti
@ 2001-08-15 22:00   ` Rik van Riel
  2001-08-15 22:15   ` Rik van Riel
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 21+ messages in thread
From: Rik van Riel @ 2001-08-15 22:00 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Hugh Dickins, Marcelo Tosatti, linux-mm

On Wed, 15 Aug 2001, Linus Torvalds wrote:

> Btw, the whole comment around the fs/buffer.c braindamage is telling:
>
>         /* We're _really_ low on memory. Now we just
>          * wait for old buffer heads to become free due to
>          * finishing IO.  Since this is an async request and
>          * the reserve list is empty, we're sure there are
>          * async buffer heads in use.
>          */
>         run_task_queue(&tq_disk);
>
>         current->policy |= SCHED_YIELD;
>         __set_current_state(TASK_RUNNING);
>         schedule();
>         goto try_again;
>
> It used to be correct, say about a few years ago.

IIRC this code was introduced less than two months ago
due to a race condition in the old code, where the
allocator just went to sleep waiting for things to
improve. ;)

It's good to see you've reversed your position that
there would be nothing we could do in this situation.

The patch looks good at first sight, let's hope there
are no hidden locking issues in obscure situations...

regards,

Rik
--
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


* Re: 0-order allocation problem
  2001-08-15 20:45 ` 0-order allocation problem Linus Torvalds
  2001-08-15 20:55   ` Marcelo Tosatti
  2001-08-15 22:00   ` Rik van Riel
@ 2001-08-15 22:15   ` Rik van Riel
  2001-08-15 23:09   ` Hugh Dickins
  2001-08-16  8:30   ` Daniel Phillips
  4 siblings, 0 replies; 21+ messages in thread
From: Rik van Riel @ 2001-08-15 22:15 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Hugh Dickins, Marcelo Tosatti, linux-mm

On Wed, 15 Aug 2001, Linus Torvalds wrote:

> diff -u --recursive --new-file pre4/linux/mm/page_alloc.c linux/mm/page_alloc.c
> --- pre4/linux/mm/page_alloc.c	Wed Aug 15 02:39:44 2001
> +++ linux/mm/page_alloc.c	Wed Aug 15 13:35:02 2001
> @@ -450,7 +450,7 @@
>  		if (gfp_mask & __GFP_WAIT) {
>  			if (!order || free_shortage()) {
>  				int progress = try_to_free_pages(gfp_mask);
> -				if (progress || (gfp_mask & __GFP_FS))
> +				if (progress || (gfp_mask & __GFP_IO))
>  					goto try_again;
>  				/*
>  				 * Fail in case no progress was made and the

Hmmm, thinking about it a bit more I'm not sure about
this part. It could lead to us looping infinitely while
not being able to free pages because we'd need __GFP_FS
in order to call the various ->writepage() functions.

In case a GFP_BUFFER (or similar) allocation really cannot
make any progress here, we need to exit instead of looping
forever. My intuition is that letting the allocation loop
forever can cause system hangs, whereas failing the allocation
would allow the code path in buffer.c or one of the filesystems
to bail out in another way...

regards,

Rik
--
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


* Re: 0-order allocation problem
  2001-08-15 23:27     ` Hugh Dickins
@ 2001-08-15 22:15       ` Marcelo Tosatti
  0 siblings, 0 replies; 21+ messages in thread
From: Marcelo Tosatti @ 2001-08-15 22:15 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Linus Torvalds, linux-mm


On Thu, 16 Aug 2001, Hugh Dickins wrote:

> On Wed, 15 Aug 2001, Marcelo Tosatti wrote:
> > 
> > Hugh, could you check which kind of allocation is failing and from where?
> > (allocation flags, etc).
> 
> Whenever I looked the allocation flags were 0x70,
> __GFP_IO|__GFP_HIGH|__GFP_WAIT; but presumably PF_MEMALLOC too.
> 
> What I was doing was running a memory hog (for 600MB with 256MB
> RAM and 512MB swap), exiting that, doing swapoff -a and swapon -a
> (being interested in timing different swapoff methods).  First
> run no problem at all, but when immediately run again after,
> collapsed into endless 0-order allocation failure messages.
> Didn't happen in 2.4.8.  Linus' patch to 2.4.9-pre4 gets it
> back to work again, after a burst of those messages.
> 
> The stack trace was usually some high-level function, _alloc_pages,
> __alloc_pages, try_to_free_pages, do_try_to_free_pages, page_launder,
> swap_writepage, rw_swap_page, rw_swap_page_base, brw_page,
> create_empty_buffers, create_buffers, get_unused_buffer_head,
> kmem_cache_alloc, kmem_cache_grow, __get_free_pages,
> _alloc_pages, __alloc_pages, printk.
> 
> But on one occasion it was kswapd calling
> do_try_to_free_pages, page_launder, swap_writepage... as above.

Linus, 

The problem is probably "showing up" due to the reduced scan of the
inactive dirty list in 2.4.9pre.

It looks like allocations keep failing until page_launder() finds clean
buffers to free. Since the scan rate is much smaller now, that is likely
to happen.


* Re: 0-order allocation problem
  2001-08-15 20:55   ` Marcelo Tosatti
@ 2001-08-15 22:30     ` Linus Torvalds
  2001-08-15 22:34       ` Rik van Riel
  2001-08-15 23:27     ` Hugh Dickins
  1 sibling, 1 reply; 21+ messages in thread
From: Linus Torvalds @ 2001-08-15 22:30 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Hugh Dickins, linux-mm

On Wed, 15 Aug 2001, Marcelo Tosatti wrote:
>
> __GFP_IO is not going to help us that much on anon-intensive workloads
> (e.g. swapoff). Remember we are _never_ going to block on buffer_heads of
> in-flight swap pages because we can't see them in page_launder(). (If a
> page is locked, we simply skip it.)

Note that that is what we have the page_alloc (and buffer head) reserves
for - and it doesn't take that much to get the ball rolling. Certainly not
even close to our low-water-marks.. And once it snowballs it _does_ help
that people call page_launder().

		Linus


* Re: 0-order allocation problem
  2001-08-15 22:30     ` Linus Torvalds
@ 2001-08-15 22:34       ` Rik van Riel
  0 siblings, 0 replies; 21+ messages in thread
From: Rik van Riel @ 2001-08-15 22:34 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Marcelo Tosatti, Hugh Dickins, linux-mm

On Wed, 15 Aug 2001, Linus Torvalds wrote:
> On Wed, 15 Aug 2001, Marcelo Tosatti wrote:
> >
> > __GFP_IO is not going to help us that much on anon-intensive workloads
> > (e.g. swapoff). Remember we are _never_ going to block on buffer_heads of
> > in-flight swap pages because we can't see them in page_launder(). (If a
> > page is locked, we simply skip it.)
>
> Note that that is what we have the page_alloc (and buffer head)
> reserves for - and it doesn't take that much to get the ball rolling.
> Certainly not even close to our low-water-marks.. And once it
> snowballs it _does_ help that people call page_launder().

Also, page_launder() tends to "strip" the buffer heads
from pages as soon as they get cleaned, making them
immediately available to the process trying to allocate
a buffer head and calling page_launder() from buffer.c

regards,

Rik
--
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


* Re: 0-order allocation problem
  2001-08-16  0:07       ` Hugh Dickins
@ 2001-08-15 22:44         ` Marcelo Tosatti
  2001-08-16  0:50           ` Linus Torvalds
  0 siblings, 1 reply; 21+ messages in thread
From: Marcelo Tosatti @ 2001-08-15 22:44 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Rik van Riel, Linus Torvalds, linux-mm


On Thu, 16 Aug 2001, Hugh Dickins wrote:

> On Wed, 15 Aug 2001, Rik van Riel wrote:
> > On Thu, 16 Aug 2001, Hugh Dickins wrote:
> > 
> > > 1. Why test free_shortage() in the high-order case?  The caller has
> > >    asked for a high-order allocation, and is prepared to wait: we
> > >    haven't found what the caller needs yet, we certainly should not
> > >    wait forever, but we should try harder: it's irrelevant whether
> > >    there's a free shortage or not - we've found a contiguity shortage.
> > 
> > It may be irrelevant, but remember that try_to_free_pages()
> > doesn't free any pages if there is no free shortage.
> 
> I think you've caught me out there.  When "try_to_free_pages()"
> actually tries to free pages is something that changes from time
> to time, and I hadn't looked to see what current behaviour is.
> 
> All the more reason not to call free_shortage(), if try_to_free_pages()
> will make its own decision.  The important bit is probably to recycle
> round to page_launder(); or perhaps it's just to spend a little time
> in the hope that something will turn up.... (not Linus' favoured
> strategy, but currently contiguity is given no weight at all in
> choosing pages).

Try this: Add a "priority" argument to page_launder(), and make the
refill_freelist() call to page_launder() use a very low priority, and keep
DEF_PRIORITY in the other callers.

That will confirm if my theory is correct. 


* Re: 0-order allocation problem
  2001-08-15 20:45 ` 0-order allocation problem Linus Torvalds
                     ` (2 preceding siblings ...)
  2001-08-15 22:15   ` Rik van Riel
@ 2001-08-15 23:09   ` Hugh Dickins
  2001-08-15 21:54     ` Marcelo Tosatti
  2001-08-15 23:38     ` Rik van Riel
  2001-08-16  8:30   ` Daniel Phillips
  4 siblings, 2 replies; 21+ messages in thread
From: Hugh Dickins @ 2001-08-15 23:09 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Marcelo Tosatti, linux-mm

On Wed, 15 Aug 2001, Linus Torvalds wrote:
> 
> So something like the appended (UNTESTED!) should be better. How does it
> work for you?

Many thanks for your explanation.  You've convinced me that
create_buffers() has very good reason to make that effort.

Your patch works fine for me, for getting things moving again.
I'm not sure if you thought it would stop my "0-order allocation failed"
messages: no, I still get a batch of those before it settles back to work.

A variant of your patch appended below.  Ignore me if I'm blowing
you off track, but I just noticed "The Curse of the Incas" in vmscan.c;
and cannot look at that block of __alloc_pages() without remarking:

1. Why test free_shortage() in the high-order case?  The caller has
   asked for a high-order allocation, and is prepared to wait: we
   haven't found what the caller needs yet, we certainly should not
   wait forever, but we should try harder: it's irrelevant whether
   there's a free shortage or not - we've found a contiguity shortage.
2. It should not silently return NULL on failure at that point; it
   should print the allocation failure message before returning.
3. Allocation failure message would do well to show gfp_mask too.

Hugh

--- linux-2.4.9-pre4/fs/buffer.c	Wed Aug 15 06:51:47 2001
+++ linux/fs/buffer.c	Wed Aug 15 22:23:16 2001
@@ -794,6 +794,17 @@
 		goto retry;
 }
 
+static void free_more_memory(void)
+{
+	balance_dirty(NODEV);
+	page_launder(GFP_NOFS, 0);
+	wakeup_bdflush();
+	wakeup_kswapd();
+	current->policy |= SCHED_YIELD;
+	__set_current_state(TASK_RUNNING);
+	schedule();
+}
+
 /*
  * We used to try various strange things. Let's not.
  * We'll just try to balance dirty buffers, and possibly
@@ -802,15 +813,8 @@
  */
 static void refill_freelist(int size)
 {
-	if (!grow_buffers(size)) {
-		balance_dirty(NODEV);
-		page_launder(GFP_NOFS, 0);		
-		wakeup_bdflush();
-		wakeup_kswapd();
-		current->policy |= SCHED_YIELD;
-		__set_current_state(TASK_RUNNING);
-		schedule();
-	}
+	if (!grow_buffers(size))
+		free_more_memory();
 }
 
 void init_buffer(struct buffer_head *bh, bh_end_io_t *handler, void *private)
@@ -1408,9 +1412,7 @@
 	 */
 	run_task_queue(&tq_disk);
 
-	current->policy |= SCHED_YIELD;
-	__set_current_state(TASK_RUNNING);
-	schedule();
+	free_more_memory();
 	goto try_again;
 }
 
--- linux-2.4.9-pre4/mm/page_alloc.c	Wed Aug 15 06:51:49 2001
+++ linux/mm/page_alloc.c	Wed Aug 15 23:02:11 2001
@@ -283,6 +283,7 @@
 {
 	zone_t **zone;
 	int direct_reclaim = 0;
+	int loop = 0;
 	struct page * page;
 
 	/*
@@ -448,16 +449,17 @@
 		 * to give up than to deadlock the kernel looping here.
 		 */
 		if (gfp_mask & __GFP_WAIT) {
-			if (!order || free_shortage()) {
-				int progress = try_to_free_pages(gfp_mask);
-				if (progress || (gfp_mask & __GFP_FS))
+			int progress = try_to_free_pages(gfp_mask);
+			if (order) {
+				if (loop++ < 4)
 					goto try_again;
-				/*
-				 * Fail in case no progress was made and the
-				 * allocation may not be able to block on IO.
-				 */
-				return NULL;
-			}
+			} else if (progress || (gfp_mask & __GFP_IO))
+				goto try_again;
+			/*
+			 * Fail in case no progress was made and the
+			 * allocation may not be able to block on IO.
+			 */
+			goto fail;
 		}
 	}
 
@@ -501,8 +503,9 @@
 			return page;
 	}
 
+fail:
 	/* No luck.. */
-	printk(KERN_ERR "__alloc_pages: %lu-order allocation failed.\n", order);
+	printk(KERN_ERR "__alloc_pages: %lu-order allocation failed (gfp_mask 0x%x).\n", order, gfp_mask);
 	return NULL;
 }
 
--- linux-2.4.9-pre4/mm/vmscan.c	Wed Aug 15 06:51:49 2001
+++ linux/mm/vmscan.c	Wed Aug 15 23:09:54 2001
@@ -779,7 +779,7 @@
 {
 	pg_data_t *pgdat;
 	unsigned int global_target = freepages.high + inactive_target;
-	unsigned int global_incative = 0;
+	unsigned int global_inactive = 0;
 
 	pgdat = pgdat_list;
 	do {
@@ -788,6 +788,9 @@
 			zone_t *zone = pgdat->node_zones + i;
 			unsigned int inactive;
 
+			if (!zone->size)
+				continue;
+
 			inactive  = zone->inactive_dirty_pages;
 			inactive += zone->inactive_clean_pages;
 			inactive += zone->free_pages;
@@ -796,13 +799,13 @@
 			if (inactive < zone->pages_high)
 				return 1;
 
-			global_incative += inactive;
+			global_inactive += inactive;
 		}
 		pgdat = pgdat->node_next;
 	} while (pgdat);
 
 	/* Global shortage? */
-	return global_incative < global_target;
+	return global_inactive < global_target;
 }
 
 /*

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/


* Re: 0-order allocation problem
  2001-08-15 20:55   ` Marcelo Tosatti
  2001-08-15 22:30     ` Linus Torvalds
@ 2001-08-15 23:27     ` Hugh Dickins
  2001-08-15 22:15       ` Marcelo Tosatti
  1 sibling, 1 reply; 21+ messages in thread
From: Hugh Dickins @ 2001-08-15 23:27 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Linus Torvalds, linux-mm

On Wed, 15 Aug 2001, Marcelo Tosatti wrote:
> 
> Hugh, could you check which kind of allocation is failing and from where?
> (allocation flags, etc).

Whenever I looked the allocation flags were 0x70,
__GFP_IO|__GFP_HIGH|__GFP_WAIT; but presumably PF_MEMALLOC too.

What I was doing was running a memory hog (for 600MB with 256MB
RAM and 512MB swap), exiting that, doing swapoff -a and swapon -a
(being interested in timing different swapoff methods).  The first
run had no problem at all, but when run again immediately afterwards
it collapsed into endless 0-order allocation failure messages.
This didn't happen in 2.4.8.  Linus' patch to 2.4.9-pre4 gets it
working again, after a burst of those messages.

The stack trace was usually some high-level function, _alloc_pages,
__alloc_pages, try_to_free_pages, do_try_to_free_pages, page_launder,
swap_writepage, rw_swap_page, rw_swap_page_base, brw_page,
create_empty_buffers, create_buffers, get_unused_buffer_head,
kmem_cache_alloc, kmem_cache_grow, __get_free_pages,
_alloc_pages, __alloc_pages, printk.

But on one occasion it was kswapd calling
do_try_to_free_pages, page_launder, swap_writepage... as above.

Hugh
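
For readers decoding the mask printed by the new printk, a small user-space helper along these lines may help. The bit values are an assumption, chosen only so that the three flags named above OR together to 0x70 as reported; check include/linux/mm.h in the tree at hand for the real definitions. The `sk_` names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Assumed bit values: WAIT|HIGH|IO == 0x70, matching the report above. */
#define SK_GFP_WAIT 0x10u
#define SK_GFP_HIGH 0x20u
#define SK_GFP_IO   0x40u

static const struct {
	unsigned bit;
	const char *name;
} sk_gfp_names[] = {
	{ SK_GFP_WAIT, "__GFP_WAIT" },
	{ SK_GFP_HIGH, "__GFP_HIGH" },
	{ SK_GFP_IO,   "__GFP_IO"   },
};

/*
 * Writes a '|'-separated list of the known flag names set in "mask"
 * into "buf" (always NUL-terminated) and returns the bits that were
 * not recognised.
 */
static unsigned sk_decode_gfp(unsigned mask, char *buf, size_t len)
{
	size_t i;

	if (len == 0)
		return mask;
	buf[0] = '\0';

	for (i = 0; i < sizeof(sk_gfp_names) / sizeof(sk_gfp_names[0]); i++) {
		if (!(mask & sk_gfp_names[i].bit))
			continue;
		if (buf[0] != '\0')
			strncat(buf, "|", len - strlen(buf) - 1);
		strncat(buf, sk_gfp_names[i].name, len - strlen(buf) - 1);
		mask &= ~sk_gfp_names[i].bit;
	}
	return mask;
}
```

With these assumed values, decoding 0x70 into a 64-byte buffer gives `__GFP_WAIT|__GFP_HIGH|__GFP_IO` and leaves no unrecognised bits. PF_MEMALLOC is a per-process flag rather than a gfp bit, so it would not show up in the mask.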


* Re: 0-order allocation problem
  2001-08-15 23:09   ` Hugh Dickins
  2001-08-15 21:54     ` Marcelo Tosatti
@ 2001-08-15 23:38     ` Rik van Riel
  2001-08-16  0:07       ` Hugh Dickins
  1 sibling, 1 reply; 21+ messages in thread
From: Rik van Riel @ 2001-08-15 23:38 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Linus Torvalds, Marcelo Tosatti, linux-mm

On Thu, 16 Aug 2001, Hugh Dickins wrote:

> 1. Why test free_shortage() in the high-order case?  The caller has
>    asked for a high-order allocation, and is prepared to wait: we
>    haven't found what the caller needs yet, we certainly should not
>    wait forever, but we should try harder: it's irrelevant whether
>    there's a free shortage or not - we've found a contiguity shortage.

It may be irrelevant, but remember that try_to_free_pages()
doesn't free any pages if there is no free shortage.

Besides, even if it did, chances are you wouldn't be able
to allocate that 2MB contiguous area any time next week ;)

> 3. Allocation failure message would do well to show gfp_mask too.

Agreed, gfp_mask and PF_MEMALLOC would be useful things
to know here...

regards,

Rik
--
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


* Re: 0-order allocation problem
  2001-08-15 23:38     ` Rik van Riel
@ 2001-08-16  0:07       ` Hugh Dickins
  2001-08-15 22:44         ` Marcelo Tosatti
  0 siblings, 1 reply; 21+ messages in thread
From: Hugh Dickins @ 2001-08-16  0:07 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Linus Torvalds, Marcelo Tosatti, linux-mm

On Wed, 15 Aug 2001, Rik van Riel wrote:
> On Thu, 16 Aug 2001, Hugh Dickins wrote:
> 
> > 1. Why test free_shortage() in the high-order case?  The caller has
> >    asked for a high-order allocation, and is prepared to wait: we
> >    haven't found what the caller needs yet, we certainly should not
> >    wait forever, but we should try harder: it's irrelevant whether
> >    there's a free shortage or not - we've found a contiguity shortage.
> 
> It may be irrelevant, but remember that try_to_free_pages()
> doesn't free any pages if there is no free shortage.

I think you've caught me out there.  When "try_to_free_pages()"
actually tries to free pages is something that changes from time
to time, and I hadn't looked to see what current behaviour is.

All the more reason not to call free_shortage(), if try_to_free_pages()
will make its own decision.  The important bit is probably to recycle
round to page_launder(); or perhaps it's just to spend a little time
in the hope that something will turn up.... (not Linus' favoured
strategy, but currently contiguity is given no weight at all in
choosing pages).

> Besides, even if it did, chances are you wouldn't be able
> to allocate that 2MB contiguous area any time next week ;)

I'll settle for less...

Hugh


* Re: 0-order allocation problem
  2001-08-15 22:44         ` Marcelo Tosatti
@ 2001-08-16  0:50           ` Linus Torvalds
  0 siblings, 0 replies; 21+ messages in thread
From: Linus Torvalds @ 2001-08-16  0:50 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Hugh Dickins, Rik van Riel, linux-mm

On Wed, 15 Aug 2001, Marcelo Tosatti wrote:
>
> Try this: Add a "priority" argument to page_launder(), and make the
> refill_freelist() call to page_launder() use a very low priority, and keep
> DEF_PRIORITY in the other callers.

No. Don't do this. That is 100% equivalent to just calling the function
multiple times.

And you shouldn't do that EITHER. Not alone. There may be other forms of
imbalance, and trying to address just one is bad.

Look at do_try_to_free_pages(). Read it. Grok it.

		Linus


* Re: 0-order allocation problem
  2001-08-15 20:45 ` 0-order allocation problem Linus Torvalds
                     ` (3 preceding siblings ...)
  2001-08-15 23:09   ` Hugh Dickins
@ 2001-08-16  8:30   ` Daniel Phillips
  2001-08-16 10:26     ` Stephen C. Tweedie
  4 siblings, 1 reply; 21+ messages in thread
From: Daniel Phillips @ 2001-08-16  8:30 UTC (permalink / raw)
  To: Linus Torvalds, Hugh Dickins; +Cc: Marcelo Tosatti, linux-mm

On August 15, 2001 10:45 pm, Linus Torvalds wrote:
> In short: we do have freeable memory. But it won't just come back to us.

Side note: we have 100% guaranteed not a snowball's chance in hell of
returning the correct result for out_of_memory until we can prove that
we always obtain a halfway correct statistic for total freeable memory,
and an algorithm that delivers same to the free lists when we need it.

<warning: ramble coming>In a sense, except for process data, almost
all pages are freeable, the only variable is the amount of time it
takes to free them.  Sometimes we'll have to wait for writeouts to
file or swap to complete, in other cases we have to wait for users
to drop their use counts on pages and/or buffers.  The significant
exception to this is pinned pages.  IMHO, the VM needs to know how
many pages are pinned and right now it has no reliable way to tell
because the use count is overloaded.  So how about adding a PG_pinned
flag, and users need to set it for any page they intend to pin.  We
can supply pin_page(page) and unpin_page(page) mm ops to bury the
details of keeping the necessary stats.  I've thought this through a
little more than I've written here, but I'll stop now and wait for
flames, fuzzies, whatever on the basic concept[1].</warning>

--
Daniel

[1] 2.5 of course

* Re: 0-order allocation problem
  2001-08-16  8:30   ` Daniel Phillips
@ 2001-08-16 10:26     ` Stephen C. Tweedie
  2001-08-16 12:18       ` Daniel Phillips
  0 siblings, 1 reply; 21+ messages in thread
From: Stephen C. Tweedie @ 2001-08-16 10:26 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Linus Torvalds, Hugh Dickins, Marcelo Tosatti, linux-mm

Hi,

On Thu, Aug 16, 2001 at 10:30:35AM +0200, Daniel Phillips wrote:

> because the use count is overloaded.  So how about adding a PG_pinned
> flag, and users need to set it for any page they intend to pin.

It needs to be a count, not a flag (consider multiple mlock() calls
from different processes, or multiple direct IO writeouts from the
same memory to disk.)  

But yes, being able to distinguish freeable from unfreeable references
to a page would be very useful, especially if we want to support very
large memory allocations dynamically for things like i86 PSE 2MB/4MB
page tables.

Cheers,
 Stephen
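
The flag-versus-count point can be seen in a toy model. The sketch below is only an illustration: `struct sk_page` and the `sk_` helpers are hypothetical and share nothing with the real struct page. Both schemes are tracked side by side on the same object so that they can be compared after the same sequence of pin and unpin calls.

```c
/* Toy page carrying both bookkeeping schemes for comparison. */
struct sk_page {
	unsigned pinned_flag;  /* single-bit scheme */
	unsigned pin_count;    /* counting scheme */
};

/* One user pins the page (e.g. an mlock() or a direct IO in flight). */
static void sk_pin(struct sk_page *p)
{
	p->pinned_flag = 1;
	p->pin_count++;
}

/* One user releases its pin. The flag cannot tell whether other
 * pinners remain, so it is simply cleared. */
static void sk_unpin(struct sk_page *p)
{
	p->pinned_flag = 0;
	if (p->pin_count)
		p->pin_count--;
}

static int sk_flag_says_pinned(const struct sk_page *p)
{
	return p->pinned_flag != 0;
}

static int sk_count_says_pinned(const struct sk_page *p)
{
	return p->pin_count != 0;
}
```

After two `sk_pin()` calls and one `sk_unpin()`, the flag reports the page as unpinned although one pinner is still active, while the count still reports it as pinned.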

* Re: 0-order allocation problem
  2001-08-16 10:26     ` Stephen C. Tweedie
@ 2001-08-16 12:18       ` Daniel Phillips
  2001-08-16 15:35         ` Eric W. Biederman
  0 siblings, 1 reply; 21+ messages in thread
From: Daniel Phillips @ 2001-08-16 12:18 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Linus Torvalds, Hugh Dickins, Marcelo Tosatti, linux-mm

On August 16, 2001 12:26 pm, Stephen C. Tweedie wrote:
> Hi,
> 
> On Thu, Aug 16, 2001 at 10:30:35AM +0200, Daniel Phillips wrote:
> 
> > because the use count is overloaded.  So how about adding a PG_pinned
> > flag, and users need to set it for any page they intend to pin.
> 
> It needs to be a count, not a flag (consider multiple mlock() calls
> from different processes, or multiple direct IO writeouts from the
> same memory to disk.)  

Yes, the question is how to do this without adding yet another field
to struct page.

> But yes, being able to distinguish freeable from unfreeable references
> to a page would be very useful, especially if we want to support very
> large memory allocations dynamically for things like i86 PSE 2MB/4MB
> page tables.

--
Daniel

* Re: 0-order allocation problem
  2001-08-16 12:18       ` Daniel Phillips
@ 2001-08-16 15:35         ` Eric W. Biederman
  2001-08-16 16:37           ` Stephen C. Tweedie
  0 siblings, 1 reply; 21+ messages in thread
From: Eric W. Biederman @ 2001-08-16 15:35 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Stephen C. Tweedie, Linus Torvalds, Hugh Dickins,
	Marcelo Tosatti, linux-mm

Daniel Phillips <phillips@bonn-fries.net> writes:

> On August 16, 2001 12:26 pm, Stephen C. Tweedie wrote:
> > Hi,
> > 
> > On Thu, Aug 16, 2001 at 10:30:35AM +0200, Daniel Phillips wrote:
> > 
> > > because the use count is overloaded.  So how about adding a PG_pinned
> > > flag, and users need to set it for any page they intend to pin.
> > 
> > It needs to be a count, not a flag (consider multiple mlock() calls
> > from different processes, or multiple direct IO writeouts from the
> > same memory to disk.)  
> 
> Yes, the question is how to do this without adding yet another field
> to struct page.

atomic_add(65536, &page->count);  Basically you can add the high bits.  
But we only need the count separate so that when a page becomes
demand freeable we can remove it from the global unfreeable page count.
But please let's not call a non-freeable page pinned.  We already use that
term for pages that are temporarily pinned for I/O.  And pinning in my mind
is not a permanent situation.

Actually, except for mlock on a user space page, we only need a single bit,
so it might make more sense in the munlock case to walk the list of vmas
and see if the page is still mlocked somewhere else.

Something like:
if (test_bit(PG_Unfreeable, &page->flags)) {
        if (page->mapping && (page->mapping->i_mmap || page->mapping->i_mmap_shared)) {
                /* walk page->mapping->i_mmap & page->mapping->i_mmap_shared */
                /* if the page is no longer mlocked clear PG_Unfreeable */
        } else {
                clear_bit(PG_Unfreeable, &page->flags);
        }
        if (!test_bit(PG_Unfreeable, &page->flags)) {
                atomic_dec(&unfreeable_pages);
                /* Actually because of the limited range of the atomic
                 * types we probably need a spinlock...
                 */
        }
}

kmalloc, the slab cache, and the inode cache are where we get most of
the pages that aren't freeable.  And since those cases don't mmap the
pages it shouldn't be too much overhead in the common cases.

Additionally if we do have a variant on free_page that only does the
tests for unlocking when we know we are freeing something from a
locked vma, we should be able to keep the overhead down quite nicely.

Eric

* Re: 0-order allocation problem
  2001-08-16 15:35         ` Eric W. Biederman
@ 2001-08-16 16:37           ` Stephen C. Tweedie
  2001-08-17  3:20             ` Eric W. Biederman
  0 siblings, 1 reply; 21+ messages in thread
From: Stephen C. Tweedie @ 2001-08-16 16:37 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Daniel Phillips, Stephen C. Tweedie, Linus Torvalds,
	Hugh Dickins, Marcelo Tosatti, linux-mm

Hi,

On Thu, Aug 16, 2001 at 09:35:50AM -0600, Eric W. Biederman wrote:

> > > It needs to be a count, not a flag (consider multiple mlock() calls
> > > from different processes, or multiple direct IO writeouts from the
> > > same memory to disk.)  
> > 
> > Yes, the question is how to do this without adding yet another field
> > to struct page.
> 
> atomic_add(65536, &page->count);

That only leaves 8 bits for the pinned references (some architectures
limit atomic_t to 24 bits), and 16 bits for genuine references isn't
enough for some pages such as the zero page.

Cheers,
 Stephen
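
The bit budget behind this objection works out as in the sketch below. It is an illustration only: the 24-bit width is the figure cited above for some architectures, 65536 is the unit from the quoted atomic_add() call, and the `sk_` names are hypothetical.

```c
#define SK_ATOMIC_BITS 24u                  /* usable atomic_t width cited above */
#define SK_PIN_SHIFT   16u                  /* 65536 == 1 << 16 */
#define SK_PIN_UNIT    (1u << SK_PIN_SHIFT)

/* Largest number of ordinary references the low field can hold. */
static unsigned sk_max_refs(void)
{
	return SK_PIN_UNIT - 1;
}

/* Largest number of pins the remaining high bits can hold. */
static unsigned sk_max_pins(void)
{
	return (1u << (SK_ATOMIC_BITS - SK_PIN_SHIFT)) - 1;
}

/* Split a packed counter into its two fields. */
static unsigned sk_pins(unsigned count)
{
	return count >> SK_PIN_SHIFT;
}

static unsigned sk_refs(unsigned count)
{
	return count & (SK_PIN_UNIT - 1);
}
```

Under these assumptions the low field tops out at 65535 references and the high field at 255 pins. One more ordinary reference on a page already at 65535 carries into the high field and reads back as one pin and zero references, which is the failure mode for a heavily shared page such as the zero page.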

* Re: 0-order allocation problem
  2001-08-16 16:37           ` Stephen C. Tweedie
@ 2001-08-17  3:20             ` Eric W. Biederman
  2001-08-17 11:45               ` Stephen C. Tweedie
  0 siblings, 1 reply; 21+ messages in thread
From: Eric W. Biederman @ 2001-08-17  3:20 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Daniel Phillips, Linus Torvalds, Hugh Dickins, Marcelo Tosatti, linux-mm

"Stephen C. Tweedie" <sct@redhat.com> writes:

> Hi,
> 
> On Thu, Aug 16, 2001 at 09:35:50AM -0600, Eric W. Biederman wrote:
> 
> > > > It needs to be a count, not a flag (consider multiple mlock() calls
> > > > from different processes, or multiple direct IO writeouts from the
> > > > same memory to disk.)  
> > > 
> > > Yes, the question is how to do this without adding yet another field
> > > to struct page.
> > 
> > atomic_add(65536, &page->count);
> 
> That only leaves 8 bits for the pinned references (some architectures
> limit atomic_t to 24 bits), and 16 bits for genuine references isn't
> enough for some pages such as the zero page.

O.k. So that angle is out, but the other suggested approach, where
we scan the list of vmas, will still work.  Question: do you know if
this logic would need to apply to things like ext3 and the journalling
filesystems?

If we can limit the logic for accounting to things we have absolutely
no control over, it might just be reasonable.  Otherwise it starts
looking very tricky.

Eric

* Re: 0-order allocation problem
  2001-08-17  3:20             ` Eric W. Biederman
@ 2001-08-17 11:45               ` Stephen C. Tweedie
  0 siblings, 0 replies; 21+ messages in thread
From: Stephen C. Tweedie @ 2001-08-17 11:45 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Stephen C. Tweedie, Daniel Phillips, Linus Torvalds,
	Hugh Dickins, Marcelo Tosatti, linux-mm

Hi,

On Thu, Aug 16, 2001 at 09:20:21PM -0600, Eric W. Biederman wrote:

> O.k. So that angle is out, but the other suggested approach, where
> we scan the list of vmas, will still work.  Question: do you know if
> this logic would need to apply to things like ext3 and the journalling
> filesystems?

No.  The logic needed for those is _very_ different.  Advanced fs
features such as journaling or deferred block allocation can result in
situations where any dirty memory page can be flushed to disk, but the
kernel requires more memory to do so.  For journaling, we can't flush
to disk without a commit, and a commit will require that all syscalls
currently in progress are allowed to run to completion to get a
consistent on-disk image.  For deferred block allocation, we may need
to read fs metadata structures into memory to allocate the in-core
pages to on-disk blocks before we can do the writes.  

So the journaling case requires that we keep enough freeable memory to
satisfy the writeout memory allocation requirements for such dirty
pages, but as long as enough freeable memory is available, journaling
doesn't imply any permanent pin on the pages.

Cheers,
 Stephen

end of thread, other threads:[~2001-08-17 11:45 UTC | newest]

Thread overview: 21+ messages
     [not found] <Pine.LNX.4.21.0108152049100.973-100000@localhost.localdomain>
2001-08-15 20:45 ` 0-order allocation problem Linus Torvalds
2001-08-15 20:55   ` Marcelo Tosatti
2001-08-15 22:30     ` Linus Torvalds
2001-08-15 22:34       ` Rik van Riel
2001-08-15 23:27     ` Hugh Dickins
2001-08-15 22:15       ` Marcelo Tosatti
2001-08-15 22:00   ` Rik van Riel
2001-08-15 22:15   ` Rik van Riel
2001-08-15 23:09   ` Hugh Dickins
2001-08-15 21:54     ` Marcelo Tosatti
2001-08-15 23:38     ` Rik van Riel
2001-08-16  0:07       ` Hugh Dickins
2001-08-15 22:44         ` Marcelo Tosatti
2001-08-16  0:50           ` Linus Torvalds
2001-08-16  8:30   ` Daniel Phillips
2001-08-16 10:26     ` Stephen C. Tweedie
2001-08-16 12:18       ` Daniel Phillips
2001-08-16 15:35         ` Eric W. Biederman
2001-08-16 16:37           ` Stephen C. Tweedie
2001-08-17  3:20             ` Eric W. Biederman
2001-08-17 11:45               ` Stephen C. Tweedie
