Re: blk_congestion_wait racy?


From: Andrew Morton <akpm@osdl.org>
To: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: blk_congestion_wait racy?
Date: Mon, 8 Mar 2004 04:24:11 -0800
Message-ID: <20040308042411.3b2cc9dd.akpm@osdl.org>
In-Reply-To: <20040308095919.GA1117@mschwid3.boeblingen.de.ibm.com>

Martin Schwidefsky <schwidefsky@de.ibm.com> wrote:
>
> Hi,
> we have a stupid little program that linearly allocates and touches
> memory. We use this to see how fast s390 can swap. If this is combined
> with the fastest block device we have (xpram) we see a very strange
> effect:
> 
> 2.6.4-rc2 with 1 cpu
> # time ./mempig 600
> Count (1Meg blocks) = 600
> 600  of 600
> Done.
> 
> real    0m2.516s
> user    0m0.150s
> sys     0m0.570s
> #
> 
> 2.6.4-rc2 with 2 cpus
> # time ./mempig 600
> Count (1Meg blocks) = 600
> 600  of 600
> Done.
> 
> real    0m56.086s
> user    0m0.110s
> sys     0m0.630s
> #

Interesting.

> I have the suspicion that the call to blk_congestion_wait in
> try_to_free_pages is part of the problem. It initiates a wait for
> a queue to exit congestion but this could already have happened
> on another cpu before blk_congestion_wait has set up the wait
> queue. In this case the process sleeps for 0.1 seconds.

The comment may be a bit stale.  The idea is that the VM needs to take a
nap while the disk system retires some writes.  So we go to sleep until a
write request gets put back.  We do this regardless of the queue's
congestion state - the queue could have thousands of request slots and may
never even become congested.
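
The napping behaviour described above can be sketched in userspace. This is a
minimal illustration only, not the kernel code: write_waitq, write_completed()
and nap_until_write() are hypothetical names, and a pthread condition variable
stands in for the kernel wait queue and io_schedule_timeout(). The caller naps
until the next write completes or the timeout expires. A completion that lands
before the nap begins is not seen, so the caller sleeps for the whole timeout.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical userspace stand-in for the kernel's congestion wait queue. */
struct write_waitq {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    unsigned long   completed;  /* number of writes retired so far */
};

#define WRITE_WAITQ_INIT \
    { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0 }

/* Completion side: called when a write request is put back. */
void write_completed(struct write_waitq *q)
{
    pthread_mutex_lock(&q->lock);
    q->completed++;
    pthread_cond_broadcast(&q->cond);
    pthread_mutex_unlock(&q->lock);
}

/*
 * Reclaim side: nap until a write completes after we started waiting,
 * or until timeout_ms elapses. Returns true if a completion woke us,
 * false if the timeout expired.
 */
bool nap_until_write(struct write_waitq *q, long timeout_ms)
{
    struct timespec deadline;
    int rc = 0;

    assert(timeout_ms >= 0);

    /* Build an absolute deadline, as pthread_cond_timedwait() expects. */
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec  += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec++;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&q->lock);
    /* Only completions that happen from here on count as a wakeup. */
    unsigned long seen = q->completed;
    while (q->completed == seen && rc != ETIMEDOUT)
        rc = pthread_cond_timedwait(&q->cond, &q->lock, &deadline);
    bool woken = (q->completed != seen);
    pthread_mutex_unlock(&q->lock);

    return woken;
}
```

With a queue initialised from WRITE_WAITQ_INIT, calling write_completed()
first and nap_until_write(&q, 100) afterwards sleeps for the full 100 ms,
because the only completion happened before the nap started.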

> In the swap test setup this happens all the time. If I "fix"
> blk_congestion_wait not to wait:
> 
> diff -urN linux-2.6/drivers/block/ll_rw_blk.c linux-2.6-fix/drivers/block/ll_rw_blk.c
> --- linux-2.6/drivers/block/ll_rw_blk.c	Fri Mar  5 14:50:28 2004
> +++ linux-2.6-fix/drivers/block/ll_rw_blk.c	Fri Mar  5 14:51:05 2004
> @@ -1892,7 +1892,9 @@
>  
>  	blk_run_queues();
>  	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> +#if 0
>  	io_schedule_timeout(timeout);
> +#endif
>  	finish_wait(wqh, &wait);
>  }

Gad, that'll make the VM scan its guts out.

> then the system reacts normally again:
> 
> 2.6.4-rc2 + "fix" with 1 cpu
> # time ./mempig 600
> Count (1Meg blocks) = 600
> 600  of 600
> Done.
> 
> real    0m2.523s
> user    0m0.200s
> sys     0m0.880s
> #
> 
> 2.6.4-rc2 + "fix" with 2 cpu
> # time ./mempig 600
> Count (1Meg blocks) = 600
> 600  of 600
> Done.
> 
> real    0m2.029s
> user    0m0.250s
> sys     0m1.560s
> #

System time more than doubled on two CPUs though (0.63s to 1.56s).

> Since it isn't a solution to remove the call to io_schedule_timeout
> I tried to understand what the event is, that blk_congestion_wait
> is waiting for. The comment says it waits for a queue to exit congestion.

It's just waiting for a write request to complete.  It's a pretty crude way
of throttling page reclaim to the I/O system's speed.

> That is starting from prepare_to_wait it waits for a call to
> clear_queue_congested. In my test scenario NO queue is congested on
> entry to blk_congestion_wait. I'd like to see a proper wait_event
> there but it is non-trivial to define the event to wait for.
> Any useful hints ?

Nope, something is obviously broken.  I'll take a look.

Perhaps with two CPUs you are able to get kswapd and mempig running page
reclaim at the same time, which causes seekier swap I/O patterns than with
one CPU, where we only run one app or the other at any time.

Serialising balance_pgdat() and try_to_free_pages() with a global semaphore
would be a way of testing that theory.
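
The shape of that experiment can be sketched in userspace. This is only an
illustration under stated assumptions: a pthread mutex plays the role of the
global semaphore, and reclaim_pass(), background_reclaim(), direct_reclaim()
and max_concurrent_reclaimers() are hypothetical stand-ins for the kswapd
path (balance_pgdat()) and the direct path (try_to_free_pages()). No real
page reclaim is performed.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Stand-in for the global semaphore that serialises both reclaim paths. */
static pthread_mutex_t reclaim_lock = PTHREAD_MUTEX_INITIALIZER;

static atomic_int inside;      /* reclaimers currently in a pass */
static atomic_int max_inside;  /* highest value of 'inside' ever observed */

/* One serialised reclaim pass; the real scanning work is elided. */
static void reclaim_pass(void)
{
    pthread_mutex_lock(&reclaim_lock);

    int now = atomic_fetch_add(&inside, 1) + 1;
    assert(now == 1);  /* the lock admits one reclaimer at a time */

    /* Record the maximum concurrency seen so far. */
    int prev = atomic_load(&max_inside);
    while (now > prev &&
           !atomic_compare_exchange_weak(&max_inside, &prev, now))
        ;

    /* ... page scanning and writeback would happen here ... */

    atomic_fetch_sub(&inside, 1);
    pthread_mutex_unlock(&reclaim_lock);
}

/* Stand-in for kswapd calling balance_pgdat(). */
void *background_reclaim(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000; i++)
        reclaim_pass();
    return NULL;
}

/* Stand-in for an allocating task calling try_to_free_pages(). */
void *direct_reclaim(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000; i++)
        reclaim_pass();
    return NULL;
}

/* Reports how many reclaimers ever overlapped. */
int max_concurrent_reclaimers(void)
{
    return atomic_load(&max_inside);
}
```

Running background_reclaim() and direct_reclaim() in two threads and then
reading max_concurrent_reclaimers() shows that the two paths never overlap.
If the two-CPU slowdown went away under that kind of serialisation, it would
point at interleaved swap I/O rather than at the wait itself.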

