* Re: blk_congestion_wait racy?
@ 2004-03-08 13:38 Martin Schwidefsky
  2004-03-08 23:50 ` Nick Piggin
From: Martin Schwidefsky @ 2004-03-08 13:38 UTC
  To: Andrew Morton; +Cc: linux-kernel, linux-mm

> Gad, that'll make the VM scan its guts out.
Yes, I expected something like this.

> > 2.6.4-rc2 + "fix" with 1 cpu
> > sys     0m0.880s
> >
> > 2.6.4-rc2 + "fix" with 2 cpu
> > sys     0m1.560s
>
> system time was doubled though.
That would be the additional cost for not waiting.

> Nope, something is obviously broken. I'll take a look.
That would be very much appreciated.

> Perhaps with two CPUs you are able to get kswapd and mempig running page
> reclaim at the same time, which causes seekier swap I/O patterns than with
> one CPU, where we only run one app or the other at any time.
>
> Serialising balance_pgdat() and try_to_free_pages() with a global semaphore
> would be a way of testing that theory.

Just tried the following patch:

Index: mm/vmscan.c
===================================================================
RCS file: /home/cvs/linux-2.5/mm/vmscan.c,v
retrieving revision 1.45
diff -u -r1.45 vmscan.c
--- mm/vmscan.c   18 Feb 2004 17:45:28 -0000    1.45
+++ mm/vmscan.c   8 Mar 2004 13:30:56 -0000
@@ -848,6 +848,7 @@
  * excessive rotation of the inactive list, which is _supposed_ to be an LRU,
  * yes?
  */
+static DECLARE_MUTEX(reclaim_sem);
 int try_to_free_pages(struct zone **zones,
            unsigned int gfp_mask, unsigned int order)
 {
@@ -858,6 +859,8 @@
      struct reclaim_state *reclaim_state = current->reclaim_state;
      int i;

+     down(&reclaim_sem);
+
      inc_page_state(allocstall);

      for (i = 0; zones[i] != 0; i++)
@@ -884,7 +887,10 @@
            wakeup_bdflush(total_scanned);

            /* Take a nap, wait for some writeback to complete */
+           up(&reclaim_sem);
            blk_congestion_wait(WRITE, HZ/10);
+           down(&reclaim_sem);
+
            if (zones[0] - zones[0]->zone_pgdat->node_zones < ZONE_HIGHMEM) {
                  shrink_slab(total_scanned, gfp_mask);
                  if (reclaim_state) {
@@ -898,6 +904,9 @@
 out:
      for (i = 0; zones[i] != 0; i++)
            zones[i]->prev_priority = zones[i]->temp_priority;
+
+     up(&reclaim_sem);
+
      return ret;
 }

@@ -926,6 +935,8 @@
      int i;
      struct reclaim_state *reclaim_state = current->reclaim_state;

+     down(&reclaim_sem);
+
      inc_page_state(pageoutrun);

      for (i = 0; i < pgdat->nr_zones; i++) {
@@ -974,8 +985,11 @@
            }
            if (all_zones_ok)
                  break;
-           if (to_free > 0)
+           if (to_free > 0) {
+                 up(&reclaim_sem);
                  blk_congestion_wait(WRITE, HZ/10);
+                 down(&reclaim_sem);
+           }
      }

      for (i = 0; i < pgdat->nr_zones; i++) {
@@ -983,6 +997,9 @@

            zone->prev_priority = zone->temp_priority;
      }
+
+     up(&reclaim_sem);
+
      return nr_pages - to_free;
 }


It didn't help. The test still needs almost a minute.

blue skies,
   Martin

* Re: blk_congestion_wait racy?
@ 2004-03-11 19:04 Martin Schwidefsky
  2004-03-11 23:25 ` Andrew Morton
  2004-03-12  2:31 ` Nick Piggin
From: Martin Schwidefsky @ 2004-03-11 19:04 UTC
  To: Andrew Morton; +Cc: linux-kernel, linux-mm, piggin

> Yes, sorry, all the world's an x86 :( Could you please send me whatever
> diffs were needed to get it all going?

I am just preparing that mail :-)

> I thought you were running a 256MB machine?  Two seconds for 400 megs of
> swapout?  What's up?

Roughly 400 MB of swapout. And two seconds isn't that bad ;-)

> An ouch-per-second sounds reasonable.  It could simply be that the CPUs
> were off running other tasks - those timeouts are less than scheduling
> quanta.

I don't understand why an ouch-per-second is reasonable. The mempig is
the only process running on the machine, and blk_congestion_wait
uses HZ/10 as the timeout value, so I'd expect about 100 ouches for
the 10 seconds the test runs.

The 4x performance difference remains unexplained.


blue skies,
   Martin

Linux/390 Design & Development, IBM Deutschland Entwicklung GmbH
Schonaicherstr. 220, D-71032 Boblingen, Telefon: 49 - (0)7031 - 16-2247
E-Mail: schwidefsky@de.ibm.com


* Re: blk_congestion_wait racy?
@ 2004-03-11 18:24 Martin Schwidefsky
  2004-03-11 18:55 ` Andrew Morton
From: Martin Schwidefsky @ 2004-03-11 18:24 UTC
  To: Nick Piggin; +Cc: Andrew Morton, linux-kernel, linux-mm

> Martin, have you tried adding this printk?

Sorry for the delay. I had to get 2.6.4-mm1 working before doing the
"ouch" test. The new pte_to_pgprot/pgoff_prot_to_pte stuff wasn't easy.
I tested 2.6.4-mm1 with the blk_run_queues move and the ouch printk.
The first interesting observation is that 2.6.4-mm1 behaves MUCH better
than 2.6.4:

2.6.4-mm1 with 1 cpu
# time ./mempig 600
Count (1Meg blocks) = 600
600  of 600
Done.

real    0m2.587s
user    0m0.100s
sys     0m0.730s
#

2.6.4-mm1 with 2 cpus
# time ./mempig 600
Count (1Meg blocks) = 600
600  of 600
Done.

real    0m10.313s
user    0m0.160s
sys     0m0.780s
#

2.6.4 takes > 1min for the test with 2 cpus.

The second observation is that I get only a few "ouch" messages. They
all come from the blk_congestion_wait in try_to_free_pages, as expected.
What I did not expect is that I only got 9 "ouches" for the run with
2 cpus.
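
For reference, the "ouch" instrumentation presumably boils down to
checking the return value of io_schedule_timeout in
blk_congestion_wait. A sketch of the idea (Nick's exact printk is not
quoted here, so the details are illustrative):

	blk_run_queues();
	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
	/* io_schedule_timeout() returns the jiffies left; 0 means the
	 * full HZ/10 elapsed without any wakeup from the block layer. */
	if (io_schedule_timeout(timeout) == 0)
		printk("ouch: blk_congestion_wait slept the full timeout\n");
	finish_wait(wqh, &wait);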

blue skies,
   Martin

Linux/390 Design & Development, IBM Deutschland Entwicklung GmbH
Schonaicherstr. 220, D-71032 Boblingen, Telefon: 49 - (0)7031 - 16-2247
E-Mail: schwidefsky@de.ibm.com


* Re: blk_congestion_wait racy?
@ 2004-03-09 17:54 Martin Schwidefsky
  2004-03-10  5:23 ` Nick Piggin
From: Martin Schwidefsky @ 2004-03-09 17:54 UTC
  To: Nick Piggin; +Cc: Andrew Morton, linux-kernel, linux-mm

Hi Nick,

> Another problem is that if there are no requests anywhere in the system,
> sleepers in blk_congestion_wait will not get kicked. blk_congestion_wait
> could probably have blk_run_queues moved after prepare_to_wait, which
> might help.
I tried moving blk_run_queues after prepare_to_wait; it worked, but it
didn't help. The test still needs close to a minute.
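
For clarity, the variant I tested looks like this (a sketch from
memory of the 2.6-era code, not the exact patch):

	void blk_congestion_wait(int rw, long timeout)
	{
		DEFINE_WAIT(wait);
		wait_queue_head_t *wqh = &congestion_wqh[rw];

		prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
		/* run the queues only after we are on the wait queue,
		 * so a wakeup triggered by the queue run is not missed */
		blk_run_queues();
		io_schedule_timeout(timeout);
		finish_wait(wqh, &wait);
	}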

blue skies,
   Martin

Linux/390 Design & Development, IBM Deutschland Entwicklung GmbH
Schonaicherstr. 220, D-71032 Boblingen, Telefon: 49 - (0)7031 - 16-2247
E-Mail: schwidefsky@de.ibm.com


* blk_congestion_wait racy?
@ 2004-03-08  9:59 Martin Schwidefsky
  2004-03-08 12:24 ` Andrew Morton
From: Martin Schwidefsky @ 2004-03-08  9:59 UTC
  To: linux-kernel, linux-mm

Hi,
we have a stupid little program that linearly allocates and touches
memory. We use this to see how fast s390 can swap. If this is combined
with the fastest block device we have (xpram), we see a very strange
effect:

2.6.4-rc2 with 1 cpu
# time ./mempig 600
Count (1Meg blocks) = 600
600  of 600
Done.

real    0m2.516s
user    0m0.150s
sys     0m0.570s
#

2.6.4-rc2 with 2 cpus
# time ./mempig 600
Count (1Meg blocks) = 600
600  of 600
Done.

real    0m56.086s
user    0m0.110s
sys     0m0.630s
#
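
For reference, mempig is nothing special; judging from its output it
does roughly the following (a sketch, the real source is not part of
this mail):

	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>

	#define BLOCK (1024 * 1024)

	int main(int argc, char **argv)
	{
		int i, count = (argc > 1) ? atoi(argv[1]) : 0;

		printf("Count (1Meg blocks) = %d\n", count);
		for (i = 0; i < count; i++) {
			char *p = malloc(BLOCK);
			if (!p) {
				perror("malloc");
				return 1;
			}
			/* touch every page so it must be backed by real
			 * memory and eventually pushed out to swap */
			memset(p, 0, BLOCK);
			printf("\r%d  of %d", i + 1, count);
			fflush(stdout);
		}
		printf("\nDone.\n");
		return 0;
	}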

I have the suspicion that the call to blk_congestion_wait in
try_to_free_pages is part of the problem. It initiates a wait for
a queue to exit congestion, but this could already have happened
on another cpu before blk_congestion_wait has set up the wait
queue. In that case the process sleeps for the full 0.1 seconds. With
the swap test setup this happens all the time. If I "fix"
blk_congestion_wait not to wait:

diff -urN linux-2.6/drivers/block/ll_rw_blk.c linux-2.6-fix/drivers/block/ll_rw_blk.c
--- linux-2.6/drivers/block/ll_rw_blk.c	Fri Mar  5 14:50:28 2004
+++ linux-2.6-fix/drivers/block/ll_rw_blk.c	Fri Mar  5 14:51:05 2004
@@ -1892,7 +1892,9 @@
 
 	blk_run_queues();
 	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+#if 0
 	io_schedule_timeout(timeout);
+#endif
 	finish_wait(wqh, &wait);
 }
 
then the system reacts normally again:

2.6.4-rc2 + "fix" with 1 cpu
# time ./mempig 600
Count (1Meg blocks) = 600
600  of 600
Done.

real    0m2.523s
user    0m0.200s
sys     0m0.880s
#

2.6.4-rc2 + "fix" with 2 cpu
# time ./mempig 600
Count (1Meg blocks) = 600
600  of 600
Done.

real    0m2.029s
user    0m0.250s
sys     0m1.560s
#

Since removing the call to io_schedule_timeout isn't a solution,
I tried to understand what event blk_congestion_wait is actually
waiting for. The comment says it waits for a queue to exit congestion;
that is, starting from prepare_to_wait it waits for a call to
clear_queue_congested. In my test scenario NO queue is congested on
entry to blk_congestion_wait. I'd like to see a proper wait_event
there, but it is non-trivial to define the event to wait for.
Any useful hints?
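
To make the question concrete, the direction I have in mind is
something like the sketch below. The event counter is invented for
illustration, and I am assuming a wait_event_timeout-style primitive
is available; the point is only that a clear_queue_congested racing
with the sleeper would no longer be lost:

	/* illustrative only: count congestion-clear events */
	static atomic_t congestion_clears[2];	/* one per rw direction */

	/* clear_queue_congested(..., rw) would additionally do
	 *	atomic_inc(&congestion_clears[rw]);
	 * before waking up congestion_wqh[rw]. */

	void blk_congestion_wait(int rw, long timeout)
	{
		int seen = atomic_read(&congestion_clears[rw]);

		blk_run_queues();
		/* sleep only until the counter moves; if the event fired
		 * between the snapshot and here, we do not block at all */
		wait_event_timeout(congestion_wqh[rw],
				   atomic_read(&congestion_clears[rw]) != seen,
				   timeout);
	}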

blue skies,
   Martin

