From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from digeo-nav01.digeo.com (digeo-nav01.digeo.com [192.168.1.233])
	by packet.digeo.com (8.9.3+Sun/8.9.3) with SMTP id AAA12900
	for ; Mon, 23 Sep 2002 00:43:31 -0700 (PDT)
Message-ID: <3D8EC621.9ACEF459@digeo.com>
Date: Mon, 23 Sep 2002 00:43:29 -0700
From: Andrew Morton
MIME-Version: 1.0
Subject: Re: 2.5.38-mm2
References: <3D8E96AA.C2FA7D8@digeo.com> <20020923071633.GA15479@suse.de>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
Return-Path:
To: Jens Axboe
Cc: lkml , "linux-mm@kvack.org"
List-ID:

Jens Axboe wrote:
>
> On Sun, Sep 22 2002, Andrew Morton wrote:
> > +read-latency.patch
> >
> >   Fix the writer-starves-reader elevator problem.  This is basically
> >   the read_latency2 patch from -ac kernels.
> >
> >   On IDE it provides a 100x improvement in read throughput when there
> >   is heavy writeback happening.  40x on SCSI.  You need to disable
>
> Ah, interesting.  I do still think that it is worth investigating _why_
> both elevator_linus and deadline do not prevent the read starvation.

I did.  See below.

> The read-latency is a hack, not a solution imo.

Well it clearly _is_ a solution.  To a grave problem.  But hopefully not
the best solution.

Really, this is just me saying "ouch".  This is your stuff ;)

> > tagged command queueing on scsi - it appears to be quite stupidly
> > implemented.
>
> Ahem, I think you are being excessively harsh, or maybe passing
> judgement on something you haven't even looked at.  Did you consider
> that your _drive_ may be the broken component?  Excessive turnaround
> times for requests when using deep TCQ are not unusual, by far.

It's a Fujitsu SCA-2 thing.  Could be that other drive manufacturers
have a slight clue, but I doubt it.  I bet they just went and designed
the queueing for optimum throughput, on the assumption that reads and
writes are much the same thing.

But they're not.  They are vastly different things.  Your fancy 2GHz
processor twiddles its thumbs waiting for reads, but not for writes.
The "hack" _recognises_ this fact - that reads are very different
things from writes.

Let's run the numbers: a 128-slot write request queue, 512k writes,
30 Mbyte/sec of bandwidth.  That's two seconds' worth of writes in the
request queue.  The reads have basically no chance of getting inserted
between those writes, so the first read has a two-second latency - and
that's before adding in any of the passovers which additional incoming
writes will enjoy.  It works out that the latency per read is about
three seconds.  I have all the traces of this.

Now think about what userspace wants to do.  It reads a block from the
directory: three seconds.  Parse the directory, go read an inode block:
three seconds.  Go read the file: three seconds if it's less than 56k,
six seconds otherwise.

That's nine seconds since we read the directory block.  I'm running
with mem=192m, so by now the directory block has been reclaimed.  Move
on to the next file.

So there is no bug or coding error present in the elevator; everything
is working as it is designed to.  But a streaming write slows read
performance by a factor of 4000.
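
To make the arithmetic concrete, the numbers above work out like this.
This is only a back-of-the-envelope sketch in C of the figures quoted
in the message (128 slots, 512k writes, 30 Mbyte/sec); the three
seconds per read is the measured value from the traces, not something
this sketch derives:

	#include <stdio.h>

	int main(void)
	{
		/* The numbers quoted above. */
		const double slots     = 128;                 /* request queue depth */
		const double wsize     = 512.0 * 1024;        /* bytes per write */
		const double bandwidth = 30.0 * 1024 * 1024;  /* disk bytes/sec */

		/* A full queue holds slots * wsize bytes of writeback. */
		double queued = slots * wsize;

		/* Time to drain the queue: the minimum wait for the first
		 * read stuck behind it, before counting any passovers by
		 * newly arriving writes. */
		double drain = queued / bandwidth;

		printf("queued writeback: %.0f MB\n", queued / (1024 * 1024));
		printf("first-read latency, lower bound: %.1f sec\n", drain);

		/* Dependent chain: directory block + inode block + file
		 * data, at the measured ~3 sec per read. */
		printf("dir + inode + file: %.0f sec\n", 3 * 3.0);
		return 0;
	}

That prints 64 MB of queued writeback and a 2.1 second lower bound,
which matches the two-seconds figure above.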
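
The userspace pattern being described is a chain of dependent,
synchronous reads - each one cannot be issued until the previous one
completes, so the per-read latency adds up serially.  Purely as an
illustration (the latencies in the comments are the measured figures
from above, and the 64k buffer is arbitrary):

	#include <stdio.h>
	#include <sys/stat.h>
	#include <dirent.h>
	#include <fcntl.h>
	#include <unistd.h>

	void walk(const char *dirname)
	{
		DIR *dir = opendir(dirname);      /* directory block: ~3 sec */
		struct dirent *de;
		static char buf[64 * 1024];
		char path[4096];

		if (!dir)
			return;
		while ((de = readdir(dir)) != NULL) {
			struct stat st;
			int fd;

			snprintf(path, sizeof(path), "%s/%s", dirname, de->d_name);
			if (stat(path, &st))      /* inode block: ~3 sec */
				continue;
			fd = open(path, O_RDONLY);
			if (fd < 0)
				continue;
			read(fd, buf, sizeof(buf));  /* file data: ~3-6 sec */
			close(fd);
			/* ~9 sec after it was read, the directory block has
			 * been reclaimed (mem=192m) and must be reread. */
		}
		closedir(dir);
	}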
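
For completeness, the concept behind the read-latency approach can be
sketched as a bounded passover at insertion time.  This is _not_ the
actual read-latency.patch code - the names, the list structure and the
passover bound here are all invented - it only shows why bounding a
read's queue position caps its latency no matter how much writeback is
queued:

	#include <stddef.h>

	#define MAX_PASSOVERS	16	/* made-up bound */

	struct req {
		struct req *next;
		int is_read;
	};

	/* Insert 'rq' into the queue headed by '*head'. */
	static void insert_request(struct req **head, struct req *rq)
	{
		struct req **p = head;
		int pos = 0;

		if (rq->is_read) {
			/* A read sits at most MAX_PASSOVERS slots from the
			 * head, so its latency is bounded by the service
			 * time of that many requests. */
			while (*p && pos++ < MAX_PASSOVERS)
				p = &(*p)->next;
		} else {
			/* Writes go to the tail as before - they can
			 * afford to wait. */
			while (*p)
				p = &(*p)->next;
		}
		rq->next = *p;
		*p = rq;
	}

The real patch is more involved than this sketch, but the core idea is
the same: a read's wait is bounded by a small constant instead of by
the full two-second write queue plus passovers.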