Message-ID: <20020405182738.19092.qmail@london.rubylane.com>
From: jim@rubylane.com
Subject: Re: 2.2.20 suspends everything then recovers during heavy I/O
Date: Fri, 5 Apr 2002 10:27:38 -0800 (PST)
In-Reply-To: <3CAD3632.E14560B@zip.com.au> from "Andrew Morton" at Apr 04, 2002 09:29:22 PM
To: Andrew Morton
Cc: linux-mm@kvack.org

> Jim Wilcoxson wrote:
> >
> > I'm setting up a new system with 2.2.20, Ingo's raid patches, plus
> > Hedrick's IDE patches.
> >
> > When doing heavy I/O, like copying partitions between drives using tar
> > in a pipeline, I've noticed that things will just stop for long periods
> > of time, presumably while buffers are written out to the destination
> > disk. The destination drive light is on and the system is not exactly
> > hung, because I can switch consoles and stuff, but a running vmstat
> > totally suspends for 10-15 seconds.
> >
> > Any tips or patches that will avoid this? If our server hangs for 15
> > seconds, we're going to have tons of web requests piled up for it when
> > it decides to wake up...
> >
> Which filesystem are you using?

ext2

> First thing to do is to ensure that your disks are achieving
> the expected bandwidth.  Measure them with `hdparm -t'.
> If the throughput is poor, and they're IDE, check the
> chipset tuning options in your kernel config and/or
> tune the disks with hdparm.

# hdparm -tT /dev/hdg

/dev/hdg:
 Timing buffer-cache reads:   128 MB in 0.65 seconds = 196.92 MB/sec
 Timing buffered disk reads:   64 MB in 1.78 seconds = 35.96 MB/sec

Is this fast? I dunno - seems fast. The Promise cards are in a 66MHz bus
slot, so I thought about using the idebus= thing to tell it that, but I'm
gun-shy. Probably not worth it for real-world accesses. All the drives are
in UDMA5 mode:

# hdparm -i /dev/hdg

/dev/hdg:

 Model=Maxtor 5T060H6, FwRev=TAH71DP0, SerialNo=T6HMF4EC
 Config={ Fixed }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=57
 BuffType=DualPortCache, BuffSize=2048kB, MaxMultSect=16, MultSect=off
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=120103200
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes: pio0 pio1 pio2 pio3 pio4
 DMA modes: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 *udma5
 Drive Supports : Reserved : ATA-1 ATA-2 ATA-3 ATA-4 ATA-5 ATA-6

 Kernel Drive Geometry LogicalCHS=119150/16/63 PhysicalCHS=119150/16/63

> If all that fails, you can probably smooth things
> out by tuning the writeback parameters in /proc/sys/vm/bdflush
> (if that's there in 2.2.  It's certainly somewhere :))
> Set the `interval' value smaller than the default five
> seconds, set `nfract' higher.  Set `age_buffer' lower..

Thanks, I'll try these tips.

IMO, one of Linux's weaknesses is that it is not easy to run I/O-bound
jobs without killing the performance of everything else on the machine
because of buffer caching. I know lots of people are working on solving
this and that 2.4 is much better in this regard. It just takes time for a
production site to have the warm fuzzies about changing their OS.

> And finally: don't go copying entire partitions around
> on a live web server :)

What would be really great is some way to indicate, maybe with an O_SEQ
flag or something, that an application is going to access a file
sequentially, so caching it is a no-win proposition.
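Roughly what I have in mind, as a sketch only: O_SEQ is made up, and the
calls below are the posix_fadvise() interface from the new POSIX spec,
which 2.2 doesn't implement - so treat this purely as an illustration of
the kind of hint I mean.

/* seqread.c - illustration only: read a file front to back while
 * hinting to the kernel that the data won't be reused, so it needn't
 * push everything else out of the buffer cache. */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    char buf[64 * 1024];
    ssize_t n;
    int fd;

    if (argc != 2) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return 1;
    }
    if ((fd = open(argv[1], O_RDONLY)) < 0) {
        perror("open");
        return 1;
    }

    /* Hint that access will be strictly sequential, so the kernel can
     * read ahead aggressively. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    while ((n = read(fd, buf, sizeof buf)) > 0)
        ;   /* ... feed the data to tar, a socket, whatever ... */

    /* Done with the data: ask the kernel to drop it from the cache
     * rather than evicting somebody else's working set. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

    close(fd);
    return 0;
}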
Production servers do have situations where lots of data has to be copied
or accessed - to do a backup, for example - but doing a backup shouldn't
mean that all of the important stuff gets continuously thrown out of
memory while the backup is running. Keeping metadata cached during a
backup is useful; keeping the file data cached isn't. It seems hard to do
this without an application hint, because I may scan a database
sequentially and still want those buffers to stay resident.

Linux's I/O strategy (2.2.20), IMO, is kinda flawed because a very
high-priority process (kswapd) is used to clean up the mess that other
I/O-bound processes leave behind. To me, it would be better to penalize
the applications that are causing the physical I/O and slow them down,
rather than giving them free rein: whenever there is buffer space
available they instantly fill it, and then this high-priority process has
to be invoked in quasi-emergency mode to flush the buffers.

The other thing I suggested to Alan Cox is a new ulimit that limits how
many file buffers a process can acquire. If a buffer is referenced by a
process other than the one that caused it to be created, then maybe it
isn't counted against the limit (sharing). This way I can set the ulimit
before a backup procedure without having to change any applications.

Another suggestion is to limit disk and network I/O bandwidth per process
using ulimits. If I have a 1Gbit link between machines, I don't
necessarily want to kill two computers to transfer a large file across the
link; maybe I don't care how long it takes. I know some applications are
adding support for throttling, and there are various other ways to do it -
shaper, QoS, throttled pipes, etc. - but a general, easy-to-use mechanism
would be very helpful to production sites. We don't always have a lot of
time to learn the ins and outs of setting up complex (to us) things like
QoS. Hell, I couldn't even wade through all the kernel build options for
QoS. :) It's a great feature for sites using Linux as routers, but too
complex for general-purpose use, IMO.

I've been reading a bit about the new O(1) CPU scheduler and it sounds
interesting, but scheduling CPUs is only part of the problem. In an
I/O-bound situation there is plenty of CPU to go around; the problem is
fair, smooth access to the drives for all processes that need that
resource, while recognizing that different processes have different
completion constraints.

Right now I have to copy a 30GB partition to another drive in order to do
a RAID upgrade. I don't care if it takes 3 days, because I still have to
rsync it afterwards, but I do have to run it on a live server. I had to
write a pipe-throttling thingy to run the tar data through so it didn't
kill our server (a rough sketch of the idea is below, after my sig).

Okay, end of my rant. I have my RAID running now, my IDE problems have
subsided, and I'm a happy Linux camper again. Thanks again for the tips.

Jim
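P.S. Here is roughly the shape of the pipe throttler I mean - a minimal
sketch only, not the exact program I used; the name "throttle" and the
one-second quota scheme are just for illustration:

/* throttle.c - copy stdin to stdout, capping throughput at roughly the
 * given number of bytes per second, e.g.
 *
 *     tar cf - /data | ./throttle 5000000 | tar xf - -C /new
 *
 * Crude on purpose: once a second's quota has been copied we simply
 * sleep for a second, so the cap is approximate. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    char buf[64 * 1024];
    long limit, copied = 0;
    ssize_t n, off, w;

    if (argc != 2 || (limit = atol(argv[1])) <= 0) {
        fprintf(stderr, "usage: %s bytes-per-second\n", argv[0]);
        return 1;
    }

    while ((n = read(STDIN_FILENO, buf, sizeof buf)) > 0) {
        for (off = 0; off < n; off += w) {      /* handle short writes */
            w = write(STDOUT_FILENO, buf + off, n - off);
            if (w < 0) {
                perror("write");
                return 1;
            }
        }
        copied += n;
        if (copied >= limit) {                  /* quota for this second used up */
            sleep(1);
            copied = 0;
        }
    }
    return 0;
}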