2.5.59-mm5


* 2.5.59-mm5
@ 2003-01-24  3:50 Andrew Morton
  2003-01-24 11:03 ` 2.5.59-mm5 Alex Bligh - linux-kernel
                   ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Andrew Morton @ 2003-01-24  3:50 UTC (permalink / raw)
  To: linux-kernel, linux-mm

http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.59/2.5.59-mm5/

.  -mm3 and -mm4 were not announced - they were sync-up patches as we
  worked on the I/O scheduler.

.  -mm5 has the first cut of Nick Piggin's anticipatory I/O scheduler.
  Here's the scoop:

  The problem being addressed here is (mainly) kernel behaviour when there
  is a stream of writeout happening, and someone submits a read.

  In 2.4.x, the disk queues contain up to 30 megabytes of writes (say, one
  second's worth).  When a read is submitted the 2.4 I/O scheduler will try
  to insert that at the right place between the writes.  Usually, there is no
  right place and the read is appended to the queue.  That is: it will be
  serviced in one second.

  But the problem with reads is that they are dependent - neither the
  application nor the kernel can submit read #N until read #N-1 has
  completed.  So something as simple as

	cat /usr/src/linux/kernel/*.c > /dev/null

  requires several hundred dependent reads.  And in the presence of a
  streaming write, each and every one of those reads gets stuck at the end of
  the queue, and takes a second to propagate to the head.  The `cat' takes
  hundreds of seconds.
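
  A back-of-the-envelope sketch of that arithmetic, not taken from any
  kernel code.  dependent_read_ms() is a made-up helper, and 300 reads is
  only an assumed stand-in for "several hundred":

```c
#include <assert.h>

/*
 * Total wall-clock time, in milliseconds, for a chain of dependent reads
 * when each read must first wait behind the writes already queued.
 * Because read #N cannot be submitted before read #N-1 completes, the
 * per-read waits add up instead of overlapping.
 */
long dependent_read_ms(long n_reads, long queue_wait_ms, long service_ms)
{
    assert(n_reads >= 0 && queue_wait_ms >= 0 && service_ms >= 0);
    return n_reads * (queue_wait_ms + service_ms);
}
```

  With 300 reads and a one-second queue, dependent_read_ms(300, 1000, 0)
  comes to 300000 ms, which is the "hundreds of seconds" above.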

  The celebrated read-latency2 patch recognises the fact that appending a
  read to a tail of writes is dumb, and puts the read near the head of the
  queue of writes.  It provides an improvement of up to 30x.  The deadline
  I/O scheduler in 2.5 does the same thing: if reads are queued up, promote
  them past writes, even if those writes have been waiting longer.

  So far so good, but these fixes are still dumb.  Because we're solving
  the dependent read problem by creating a seek storm.  Every time someone
  submits a read, we stop writing, seek over and service the read, and then
  *immediately* seek back and start servicing writes again.

  But in the common case, the application which submitted a read is about
  to go and submit another one, closeby on-disk to the first.  So whoops, we
  have to seek back to service that one as well.

  So what anticipatory scheduling does is very simple: if an application
  has performed a read, do *nothing at all* for a few milliseconds.  Just
  return to userspace (or to the filesystem) in the expectation that the
  application or filesystem will quickly submit another read which is
  closeby.

  If the application _does_ submit the read then fine - we service that
  quickly.  If it does not submit a read then we lose.  Time out and go back
  to doing writes.
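
  The decision just described can be sketched roughly as follows.  This is
  an illustration only, not the code in the patch: antic_decide(), the
  5 ms timeout and the 1024-sector "nearby" window are all assumptions,
  since the text above says only "a few milliseconds" and "closeby".

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values; the mail gives no concrete numbers. */
#define ANTIC_TIMEOUT_MS    5
#define ANTIC_NEAR_SECTORS  1024

enum antic_action {
    ANTIC_SERVICE_READ,   /* a nearby read showed up: service it now */
    ANTIC_KEEP_WAITING,   /* deliberately leave the disk idle        */
    ANTIC_RESUME_WRITES   /* timed out: go back to the writes        */
};

/*
 * Called after a read has completed.  waited_ms is how long the disk has
 * been held idle since then.  A pending read that is far from the last
 * one is treated like no read at all; the mail does not say what should
 * happen in that case, so this is a simplification.
 */
enum antic_action antic_decide(long last_read_sector, long waited_ms,
                               bool read_pending, long pending_sector)
{
    assert(waited_ms >= 0);

    if (read_pending) {
        long dist = pending_sector - last_read_sector;
        if (dist < 0)
            dist = -dist;
        if (dist <= ANTIC_NEAR_SECTORS)
            return ANTIC_SERVICE_READ;
    }
    if (waited_ms >= ANTIC_TIMEOUT_MS)
        return ANTIC_RESUME_WRITES;
    return ANTIC_KEEP_WAITING;
}
```

  The point of the sketch is the middle case: returning ANTIC_KEEP_WAITING
  while writes are queued is what saves the seek back and forth.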

  The end result is a large reduction in seeking - decreased read latency,
  increased read bandwidth and increased write bandwidth.

  The code as-is has rough spots and still needs quite some work.  But it
  appears to be stable.  The test which I have concentrated on is "how long
  does my laptop take to compile util-linux when there is a continuous write
  happening".  On ext2, mounted noatime:

	2.4.20:                 538 seconds
	2.5.59:                 400 seconds
	2.5.59-mm5:             70 seconds
	No streaming write:     48 seconds

  A couple of VFS changes were needed as well.

  More details on anticipatory scheduling may be found at

	http://www.cs.rice.edu/~ssiyer/r/antsched/

Changes since 2.5.59-mm2:

+preempt-locking.patch

 Speed up the smp preempt locking.

+ext2-allocation-failure-fix.patch

 ext2 ENOSPC crash fix

+ext2_new_block-fixes.patch

 ext2 cleanups

+hangcheck-timer.patch

 A form of software watchdog

+slab-irq-fix.patch

 Fix a BUG() in slab when memory exhaustion happens at a bad time.

+sendfile-security-hooks.patch

 Reinstate lost security hooks around sendfile()

+buffer-io-accounting.patch

 Fix IO-wait accounting

+aic79xx-linux-2.5.59-20030122.patch

 aic7xxx driver update

+topology-remove-underbars.patch

 cleanup

+mandlock-oops-fix.patch

 file locking fix

+reiserfs_file_write.patch

 reworked reiserfs write code.

-exit_mmap-fix2.patch

 Dropped

+generic_file_readonly_mmap-fix.patch

 Fix MAP_PRIVATE mmaps for filesystems which don't support ->writepage()

+seq_file-page-defn.patch

 Compile fix

+exit_mmap-fix-ppc64.patch
+exit_mmap-ia64-fix.patch

 Fix the exit_mmap() problem in arch code.

+show_task-fix.patch

 Fix oops in show_task()

+scsi-iothread.patch

 software suspend fix

+numaq-ioapic-fix2.patch

 NUMAQ stuff

+misc.patch

 Random fixes

+writeback-sync-cleanup.patch

 remove some junk from fs-writeback.c

+dont-wait-on-inode.patch

 Fix large delays in the writeback path

+unlink-latency-fix.patch

 Fix large delays in unlink()

+anticipatory_io_scheduling-2_5_59-mm3.patch

 Anticipatory scheduling implementation

All 65 patches:

kgdb.patch

devfs-fix.patch

deadline-np-42.patch
  (undescribed patch)

deadline-np-43.patch
  (undescribed patch)

setuid-exec-no-lock_kernel.patch
  remove lock_kernel() from exec of setuid apps

buffer-debug.patch
  buffer.c debugging

warn-null-wakeup.patch

reiserfs-readpages.patch
  reiserfs v3 readpages support

fadvise.patch
  implement posix_fadvise64()

ext3-scheduling-storm.patch
  ext3: fix scheduling storm and lockups

auto-unplug.patch
  self-unplugging request queues

less-unplugging.patch
  Remove most of the blk_run_queues() calls

lockless-current_kernel_time.patch
  Lockless current_kernel_time()

scheduler-tunables.patch
  scheduler tunables

htlb-2.patch
  hugetlb: fix MAP_FIXED handling

kirq.patch

kirq-up-fix.patch
  Subject: Re: 2.5.59-mm1

ext3-truncate-ordered-pages.patch
  ext3: explicitly free truncated pages

prune-icache-stats.patch
  add stats for page reclaim via inode freeing

vma-file-merge.patch

mmap-whitespace.patch

read_cache_pages-cleanup.patch
  cleanup in read_cache_pages()

remove-GFP_HIGHIO.patch
  remove __GFP_HIGHIO

quota-lockfix.patch
  quota locking fix

quota-offsem.patch
  quota semaphore fix

oprofile-p4.patch

oprofile_cpu-as-string.patch
  oprofile cpu-as-string

preempt-locking.patch
  Subject: spinlock efficiency problem [was 2.5.57 IO slowdown with CONFIG_PREEMPT enabled]

wli-11_pgd_ctor.patch
  (undescribed patch)

wli-11_pgd_ctor-update.patch
  pgd_ctor update

stack-overflow-fix.patch
  stack overflow checking fix

ext2-allocation-failure-fix.patch
  Subject: [PATCH] ext2 allocation failures

ext2_new_block-fixes.patch
  ext2_new_block cleanups and fixes

hangcheck-timer.patch
  hangcheck-timer

slab-irq-fix.patch
  slab IRQ fix

Richard_Henderson_for_President.patch
  Subject: [PATCH] Richard Henderson for President!

parenthesise-pgd_index.patch
  Subject: i386 pgd_index() doesn't parenthesize its arg

sendfile-security-hooks.patch
  Subject: [RFC][PATCH] Restore LSM hook calls to sendfile

macro-double-eval-fix.patch
  Subject: Re: i386 pgd_index() doesn't parenthesize its arg

mmzone-parens.patch
  asm-i386/mmzone.h macro paren/eval fixes

blkdev-fixes.patch
  blkdev.h fixes

remove-will_become_orphaned_pgrp.patch
  remove will_become_orphaned_pgrp()

buffer-io-accounting.patch
  correct wait accounting in wait_on_buffer()

aic79xx-linux-2.5.59-20030122.patch
  aic7xxx update

MAX_IO_APICS-ifdef.patch
  MAX_IO_APICS #ifdef'd wrongly

dac960-error-retry.patch
  Subject: [PATCH] linux2.5.56 patch to DAC960 driver for error retry

topology-remove-underbars.patch
  Remove __ from topology macros

mandlock-oops-fix.patch
  ftruncate/truncate oopses with mandatory locking

put_user-warning-fix.patch
  Subject: Re: Linux 2.5.59

reiserfs_file_write.patch
  Subject: reiserfs file_write patch

vmlinux-fix.patch
  vmlinux fix

smalldevfs.patch
  smalldevfs

sound-firmware-load-fix.patch
  soundcore.c referenced non-existent errno variable

generic_file_readonly_mmap-fix.patch
  Fix generic_file_readonly_mmap()

seq_file-page-defn.patch
  Include <asm/page.h> in fs/seq_file.c, as it uses PAGE_SIZE

exit_mmap-fix-ppc64.patch

exit_mmap-ia64-fix.patch
  Fix ia64's 64bit->32bit app switching

show_task-fix.patch
  Subject: [PATCH] 2.5.59: show_task() oops

scsi-iothread.patch
  scsi_eh_* needs to run even during suspend

numaq-ioapic-fix2.patch
  NUMAQ io_apic programming fix

misc.patch
  misc fixes

writeback-sync-cleanup.patch

dont-wait-on-inode.patch

unlink-latency-fix.patch

anticipatory_io_scheduling-2_5_59-mm3.patch
  Subject: [PATCH] 2.5.59-mm3 antic io sched

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/


* Re: 2.5.59-mm5
  2003-01-24  3:50 2.5.59-mm5 Andrew Morton
@ 2003-01-24 11:03 ` Alex Bligh - linux-kernel
  2003-01-24 11:16   ` 2.5.59-mm5 Andrew Morton
  2003-01-24 11:23   ` 2.5.59-mm5 Jens Axboe
  2003-01-24 13:59 ` 2.5.59-mm5 got stuck during boot Helge Hafting
  2003-01-25  8:33 ` 2.5.59-mm5 Andres Salomon
  2 siblings, 2 replies; 32+ messages in thread
From: Alex Bligh - linux-kernel @ 2003-01-24 11:03 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel, linux-mm; +Cc: Alex Bligh - linux-kernel

--On 23 January 2003 19:50 -0800 Andrew Morton <akpm@digeo.com> wrote:

>   So what anticipatory scheduling does is very simple: if an application
>   has performed a read, do *nothing at all* for a few milliseconds.  Just
>   return to userspace (or to the filesystem) in the expectation that the
>   application or filesystem will quickly submit another read which is
>   closeby.

I'm sure this is a really dumb question, as I've never played
with this subsystem, in which case I apologize in advance.

Why not follow (by default) the old system where you put the reads
effectively at the back of the queue. Then rather than doing nothing
for a few milliseconds, you carry on with doing the writes. However,
promote the reads to the front of the queue when you have a "good
lump" of them. If you get further reads while you are processing
a lump of them, put them behind the lump. Switch back to putting
reads at the end when we have done "a few lumps worth" of
reads, or exhausted the reads at the start of the queue (or
perhaps are short of memory).

IE (with a "lump" = 20 and "a few" = 3).

W0 W1 W2 ... W50 W51

[Read arrives, we process some writes]

W5 ... W50 W51 R0

[More reads arrive, more writes processed]

W10 ... W50 W51 R0 R1 R2 .. R7

[Haven't got a big enough lump, but a
write arrives]

W12 W13... W50 W51 W52 R0 R1 R2 .. R7

[More reads arrive, more writes processed]

W14 W15 ... W50 W51 W52 R0 R1 R2 .. R7 R8 R9.. R19

[Another read arrives, after 4 more writes have
been processed, and we move the lump to the
front]

R0 R1 R2 .. R7 R8 R9.. R19 R20 W18 W19 ... W50 W51 W52

[Some reads are processed, and some more arrive, which
we insert into our lump at the front]

R0 R1 R2 .. R7 R8 R9.. R19 R20 R21 R22 W18 W19 ... W50 W51 W52

Then either if the reads are processed at the front of
the queue faster than they arrive, and the "lump" disappears,
or if we've processed 3 x 20 = 60 reads, we revert to
sticking reads back at the end.

All this does is lump between 20 and 60 reads together.

The advantage being that you don't "do nothing" for a few
milliseconds, and can attract larger lumps than by waiting,
without incurring additional latency.

Now of course you have the ordering problem (in that I've
assumed you can insert things into the queue at will),
but you have that anyway.

--
Alex Bligh

* Re: 2.5.59-mm5
  2003-01-24 11:03 ` 2.5.59-mm5 Alex Bligh - linux-kernel
@ 2003-01-24 11:16   ` Andrew Morton
  2003-01-24 11:23     ` 2.5.59-mm5 Alex Tomas
  2003-01-24 12:14     ` 2.5.59-mm5 Nikita Danilov
  2003-01-24 11:23   ` 2.5.59-mm5 Jens Axboe
  1 sibling, 2 replies; 32+ messages in thread
From: Andrew Morton @ 2003-01-24 11:16 UTC (permalink / raw)
  To: Alex Bligh - linux-kernel; +Cc: linux-kernel, linux-mm

Alex Bligh - linux-kernel <linux-kernel@alex.org.uk> wrote:
>
> 
> 
> --On 23 January 2003 19:50 -0800 Andrew Morton <akpm@digeo.com> wrote:
> 
> >   So what anticipatory scheduling does is very simple: if an application
> >   has performed a read, do *nothing at all* for a few milliseconds.  Just
> >   return to userspace (or to the filesystem) in the expectation that the
> >   application or filesystem will quickly submit another read which is
> >   closeby.
> 
> I'm sure this is a really dumb question, as I've never played
> with this subsystem, in which case I apologize in advance.
> 
> Why not follow (by default) the old system where you put the reads
> effectively at the back of the queue. Then rather than doing nothing
> for a few milliseconds, you carry on with doing the writes. However,
> promote the reads to the front of the queue when you have a "good
> lump" of them.

That is the problem.  Reads do not come in "lumps".  They are dependent. 
Consider the case of reading a file:

1: Read the directory.

   This is a single read, and we cannot do anything until it has
   completed.

2: The directory told us where the inode is.  Go read the inode.

   This is a single read, and we cannot do anything until it has
   completed.

3: Go read the first 12 blocks of the file and the first indirect.

   This is a single read, and we cannot do anything until it has
   completed.

The above process can take up to three trips through the request queue.

In this very common scenario, the only way we'll ever get "lumps" of reads is
if some other processes come in and happen to want to read nearby sectors. 
In the best case, the size of the lump is proportional to the number of
processes which are concurrently trying to read something.  This just doesn't
happen enough to be significant or interesting.

But writes are completely different.  There is no dependency between them and
at any point in time we know where on-disk a lot of writes will be placed. 
We don't know that for reads, which is why we need to twiddle thumbs until the
application or filesystem makes up its mind.


* Re: 2.5.59-mm5
  2003-01-24 11:03 ` 2.5.59-mm5 Alex Bligh - linux-kernel
  2003-01-24 11:16   ` 2.5.59-mm5 Andrew Morton
@ 2003-01-24 11:23   ` Jens Axboe
  1 sibling, 0 replies; 32+ messages in thread
From: Jens Axboe @ 2003-01-24 11:23 UTC (permalink / raw)
  To: Alex Bligh - linux-kernel, Andrew Morton, linux-kernel, linux-mm

On Fri, Jan 24 2003, Alex Bligh - linux-kernel wrote:
> 
> --On 23 January 2003 19:50 -0800 Andrew Morton <akpm@digeo.com> wrote:
> 
> >  So what anticipatory scheduling does is very simple: if an application
> >  has performed a read, do *nothing at all* for a few milliseconds.  Just
> >  return to userspace (or to the filesystem) in the expectation that the
> >  application or filesystem will quickly submit another read which is
> >  closeby.
> 
> I'm sure this is a really dumb question, as I've never played
> with this subsystem, in which case I apologize in advance.
> 
> Why not follow (by default) the old system where you put the reads
> effectively at the back of the queue. Then rather than doing nothing
> for a few milliseconds, you carry on with doing the writes. However,
> promote the reads to the front of the queue when you have a "good
> lump" of them. If you get further reads while you are processing
> a lump of them, put them behind the lump. Switch back to putting
> reads at the end when we have done "a few lumps worth" of
> reads, or exhausted the reads at the start of the queue (or
> perhaps are short of memory).

The whole point of anticipatory disk scheduling is that the one process
that submits a read is not going to do anything before that read
completes. However, maybe it will issue a _new_ read right after the
first one completes. The anticipation being that the same process will
submit a close read immediately.

-- 
Jens Axboe


* Re: 2.5.59-mm5
  2003-01-24 11:16   ` 2.5.59-mm5 Andrew Morton
@ 2003-01-24 11:23     ` Alex Tomas
  2003-01-24 11:50       ` 2.5.59-mm5 Andrew Morton
  2003-01-24 12:14     ` 2.5.59-mm5 Nikita Danilov
  1 sibling, 1 reply; 32+ messages in thread
From: Alex Tomas @ 2003-01-24 11:23 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alex Bligh - linux-kernel, linux-kernel, linux-mm

>>>>> Andrew Morton (AM) writes:

 AM> But writes are completely different.  There is no dependency
 AM> between them and at any point in time we know where on-disk a lot
 AM> of writes will be placed.  We don't know that for reads, which is
 AM> why we need to twiddle thumbs until the application or filesystem
 AM> makes up its mind.

it's significant that application doesn't want to wait read completion
long and doesn't wait for write completion in most cases.


* Re: 2.5.59-mm5
  2003-01-24 11:23     ` 2.5.59-mm5 Alex Tomas
@ 2003-01-24 11:50       ` Andrew Morton
  2003-01-24 12:05         ` 2.5.59-mm5 Alex Tomas
  2003-01-24 15:56         ` 2.5.59-mm5 Oliver Xymoron
  0 siblings, 2 replies; 32+ messages in thread
From: Andrew Morton @ 2003-01-24 11:50 UTC (permalink / raw)
  To: Alex Tomas; +Cc: linux-kernel, linux-kernel, linux-mm

Alex Tomas <bzzz@tmi.comex.ru> wrote:
>
> >>>>> Andrew Morton (AM) writes:
> 
>  AM> But writes are completely different.  There is no dependency
>  AM> between them and at any point in time we know where on-disk a lot
>  AM> of writes will be placed.  We don't know that for reads, which is
>  AM> why we need to twiddle thumbs until the application or filesystem
>  AM> makes up its mind.
> 
> 
> it's significant that application doesn't want to wait read completion
> long and doesn't wait for write completion in most cases.

That's correct.  Reads are usually synchronous and writes are rarely
synchronous.

The most common place where the kernel forces a user process to wait on
completion of a write is actually in unlink (truncate, really).  Because
truncate must wait for in-progress I/O to complete before allowing the
filesystem to free (and potentially reuse) the affected blocks.

If there's a lot of writeout happening then truncate can take _ages_.  Hence
this patch:

Truncates can take a very long time.  Especially if there is a lot of
writeout happening, because truncate must wait on in-progress I/O.

And sys_unlink() is performing that truncate while holding the parent
directory's i_sem.  This basically shuts down new accesses to the entire
directory until the synchronous I/O completes.

In the testing I've been doing, that directory is /tmp, and this hurts.

So change sys_unlink() to perform the actual truncate outside i_sem.

When there is a continuous streaming write to the same disk, this patch
reduces the time for `make -j4 bzImage' from 370 seconds to 220.

 namei.c |   12 ++++++++++++
 1 files changed, 12 insertions(+)

diff -puN fs/namei.c~unlink-latency-fix fs/namei.c
--- 25/fs/namei.c~unlink-latency-fix	2003-01-24 02:41:04.000000000 -0800
+++ 25-akpm/fs/namei.c	2003-01-24 02:47:36.000000000 -0800
@@ -1659,12 +1659,19 @@ int vfs_unlink(struct inode *dir, struct
 	return error;
 }

+/*
+ * Make sure that the actual truncation of the file will occur outside its
+ * directory's i_sem.  truncate can take a long time if there is a lot of
+ * writeout happening, and we don't want to prevent access to the directory
+ * while waiting on the I/O.
+ */
 asmlinkage long sys_unlink(const char * pathname)
 {
 	int error = 0;
 	char * name;
 	struct dentry *dentry;
 	struct nameidata nd;
+	struct inode *inode = NULL;

 	name = getname(pathname);
 	if(IS_ERR(name))
@@ -1683,6 +1690,9 @@ asmlinkage long sys_unlink(const char * 
 		/* Why not before? Because we want correct error value */
 		if (nd.last.name[nd.last.len])
 			goto slashes;
+		inode = dentry->d_inode;
+		if (inode)
+			inode = igrab(inode);
 		error = vfs_unlink(nd.dentry->d_inode, dentry);
 	exit2:
 		dput(dentry);
@@ -1693,6 +1703,8 @@ exit1:
 exit:
 	putname(name);

+	if (inode)
+		iput(inode);	/* truncate the inode here */
 	return error;

 slashes:

_
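
The ordering the patch relies on can be shown with a toy userspace model.
This is a sketch under loud assumptions, not kernel code: a plain counter
stands in for the inode reference count, a boolean stands in for i_sem,
and file_obj, obj_get(), obj_put() and remove_entry() are made-up names.
Taking an extra reference before the lock (as igrab() does in the patch)
means the drop inside the lock is no longer the last one, so the expensive
teardown moves to the final put after the lock is released.

```c
#include <assert.h>
#include <stdbool.h>

struct dir {
    bool locked;              /* stands in for the directory's i_sem */
};

struct file_obj {
    int  refs;                /* stands in for the inode refcount    */
    bool freed;               /* has the slow teardown run?          */
    bool freed_under_lock;    /* did it run while the dir was locked */
    struct dir *parent;
};

static void obj_get(struct file_obj *f)
{
    f->refs++;
}

static void obj_put(struct file_obj *f)
{
    assert(f->refs > 0);
    if (--f->refs == 0) {
        /* Last reference gone: this is where the slow truncate would run. */
        f->freed = true;
        f->freed_under_lock = f->parent->locked;
    }
}

/* pin_first = false models the old behaviour, true models the patch. */
void remove_entry(struct file_obj *f, bool pin_first)
{
    struct dir *d = f->parent;

    if (pin_first)
        obj_get(f);           /* analogous to igrab()                 */

    d->locked = true;
    obj_put(f);               /* drop the directory entry's reference */
    d->locked = false;

    if (pin_first)
        obj_put(f);           /* analogous to the trailing iput()     */
}
```

Without the pin the teardown is recorded as happening under the lock; with
it, the same teardown happens after the lock has been dropped.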


* Re: 2.5.59-mm5
  2003-01-24 11:50       ` 2.5.59-mm5 Andrew Morton
@ 2003-01-24 12:05         ` Alex Tomas
  2003-01-24 19:12           ` 2.5.59-mm5 Andrew Morton
  2003-01-24 15:56         ` 2.5.59-mm5 Oliver Xymoron
  1 sibling, 1 reply; 32+ messages in thread
From: Alex Tomas @ 2003-01-24 12:05 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alex Tomas, linux-kernel, linux-kernel, linux-mm

>>>>> Andrew Morton (AM) writes:

 AM> That's correct.  Reads are usually synchronous and writes are
 AM> rarely synchronous.

 AM> The most common place where the kernel forces a user process to
 AM> wait on completion of a write is actually in unlink (truncate,
 AM> really).  Because truncate must wait for in-progress I/O to
 AM> complete before allowing the filesystem to free (and potentially
 AM> reuse) the affected blocks.

looks like I miss something here.

why do we wait for write completion in truncate?

getblk (blockmap);
getblk (bitmap);
set 0 in blockmap->b_data[N];
mark_buffer_dirty (blockmap);
clear_bit (N, &bitmap);
mark_buffer_dirty (bitmap);

isn't that enough?


* Re: 2.5.59-mm5
  2003-01-24 11:16   ` 2.5.59-mm5 Andrew Morton
  2003-01-24 11:23     ` 2.5.59-mm5 Alex Tomas
@ 2003-01-24 12:14     ` Nikita Danilov
  2003-01-24 16:00       ` 2.5.59-mm5 Nick Piggin
  1 sibling, 1 reply; 32+ messages in thread
From: Nikita Danilov @ 2003-01-24 12:14 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alex Bligh - linux-kernel, linux-kernel, linux-mm

Andrew Morton writes:

[...]

 > 
 > In this very common scenario, the only way we'll ever get "lumps" of reads is
 > if some other processes come in and happen to want to read nearby sectors. 

Or if you have read-ahead for meta-data, which is quite useful. Isn't
read ahead targeting the same problem as this anticipatory scheduling?

 > In the best case, the size of the lump is proportional to the number of
 > processes which are concurrently trying to read something.  This just doesn't
 > happen enough to be significant or interesting.
 > 
 > But writes are completely different.  There is no dependency between them and
 > at any point in time we know where on-disk a lot of writes will be placed. 
 > We don't know that for reads, which is why we need to twiddle thumbs until the
 > application or filesystem makes up its mind.
 > 

Nikita.

* Re: 2.5.59-mm5 got stuck during boot
  2003-01-24  3:50 2.5.59-mm5 Andrew Morton
  2003-01-24 11:03 ` 2.5.59-mm5 Alex Bligh - linux-kernel
@ 2003-01-24 13:59 ` Helge Hafting
  2003-01-24 17:44   ` Ed Tomlinson
  2003-01-25  8:33 ` 2.5.59-mm5 Andres Salomon
  2 siblings, 1 reply; 32+ messages in thread
From: Helge Hafting @ 2003-01-24 13:59 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm

Andrew Morton wrote:

> .  -mm5 has the first cut of Nick Piggin's anticipatory I/O scheduler.

Interesting, but it didn't boot completely.
It came all the way to mount root from /dev/md0  (dirty raid1)
freed 316k of kernel memory, and then nothing happened.
numlock and capslock worked, and so did sysrq.
It was as if the kernel "forgot" to run init.
Nothing happened, but it wasn't hanging either.

sysrq "show pc" told me something about default idle.
I noticed that the root raid-1 came up dirty.  (2.5.X
seems unable to shut down a raid-1 device "clean" if
it happens to be the root fs, so there's _always_
a bootup resync that starts as soon as the raid
is autodetected, before mounting root.)

This is a UP P4, preempt, no module support,
compiled with gcc 2.95.4 from debian.

Stock 2.5.59 works, the only config change is to enable
that new CONFIG_HANGCHECK_TIMER.  

Helge Hafting

* Re: 2.5.59-mm5
  2003-01-24 11:50       ` 2.5.59-mm5 Andrew Morton
  2003-01-24 12:05         ` 2.5.59-mm5 Alex Tomas
@ 2003-01-24 15:56         ` Oliver Xymoron
  2003-01-24 16:04           ` 2.5.59-mm5 Nick Piggin
  1 sibling, 1 reply; 32+ messages in thread
From: Oliver Xymoron @ 2003-01-24 15:56 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alex Tomas, linux-kernel, linux-kernel, linux-mm

On Fri, Jan 24, 2003 at 03:50:17AM -0800, Andrew Morton wrote:
> Alex Tomas <bzzz@tmi.comex.ru> wrote:
> >
> > >>>>> Andrew Morton (AM) writes:
> > 
> >  AM> But writes are completely different.  There is no dependency
> >  AM> between them and at any point in time we know where on-disk a lot
> >  AM> of writes will be placed.  We don't know that for reads, which is
> >  AM> why we need to twiddle thumbs until the application or filesystem
> >  AM> makes up its mind.
> > 
> > 
> > it's significant that application doesn't want to wait read completion
> > long and doesn't wait for write completion in most cases.
> 
> That's correct.  Reads are usually synchronous and writes are rarely
> synchronous.
> 
> The most common place where the kernel forces a user process to wait on
> completion of a write is actually in unlink (truncate, really).  Because
> truncate must wait for in-progress I/O to complete before allowing the
> filesystem to free (and potentially reuse) the affected blocks.
> 
> If there's a lot of writeout happening then truncate can take _ages_.  Hence
> this patch:

An alternate approach might be to change the way the scheduler splits
things. That is, rather than marking I/O read vs write and scheduling
based on that, add a flag bit to mark them all sync vs async since
that's the distinction we actually care about. The normal paths can
all do read+sync and write+async, but you can now do things like
marking your truncate writes sync and readahead async.

And dependent/nondependent or stalling/nonstalling might be a clearer
terminology.

-- 
 "Love the dolphins," she advised him. "Write by W.A.S.T.E.." 

* Re: 2.5.59-mm5
  2003-01-24 12:14     ` 2.5.59-mm5 Nikita Danilov
@ 2003-01-24 16:00       ` Nick Piggin
  0 siblings, 0 replies; 32+ messages in thread
From: Nick Piggin @ 2003-01-24 16:00 UTC (permalink / raw)
  To: Nikita Danilov
  Cc: Andrew Morton, Alex Bligh - linux-kernel, linux-kernel, linux-mm

Nikita Danilov wrote:

>Andrew Morton writes:
>
>[...]
>
> > 
> > In this very common scenario, the only way we'll ever get "lumps" of reads is
> > if some other processes come in and happen to want to read nearby sectors. 
>
>Or if you have read-ahead for meta-data, which is quite useful. Isn't
>read ahead targeting the same problem as this anticipatory scheduling?
>
Finesse vs brute force. A bit of readahead is good.


* Re: 2.5.59-mm5
  2003-01-24 15:56         ` 2.5.59-mm5 Oliver Xymoron
@ 2003-01-24 16:04           ` Nick Piggin
  2003-01-24 17:09             ` 2.5.59-mm5 Giuliano Pochini
  0 siblings, 1 reply; 32+ messages in thread
From: Nick Piggin @ 2003-01-24 16:04 UTC (permalink / raw)
  To: Oliver Xymoron
  Cc: Andrew Morton, Alex Tomas, linux-kernel, linux-kernel, linux-mm

Oliver Xymoron wrote:

>On Fri, Jan 24, 2003 at 03:50:17AM -0800, Andrew Morton wrote:
>
>>Alex Tomas <bzzz@tmi.comex.ru> wrote:
>>
>>>>>>>>Andrew Morton (AM) writes:
>>>>>>>>
>>> AM> But writes are completely different.  There is no dependency
>>> AM> between them and at any point in time we know where on-disk a lot
>>> AM> of writes will be placed.  We don't know that for reads, which is
>>> AM> why we need to twiddle thumbs until the application or filesystem
>>> AM> makes up its mind.
>>>
>>>
>>>it's significant that application doesn't want to wait read completion
>>>long and doesn't wait for write completion in most cases.
>>>
>>That's correct.  Reads are usually synchronous and writes are rarely
>>synchronous.
>>
>>The most common place where the kernel forces a user process to wait on
>>completion of a write is actually in unlink (truncate, really).  Because
>>truncate must wait for in-progress I/O to complete before allowing the
>>filesystem to free (and potentially reuse) the affected blocks.
>>
>>If there's a lot of writeout happening then truncate can take _ages_.  Hence
>>this patch:
>>
>
>An alternate approach might be to change the way the scheduler splits
>things. That is, rather than marking I/O read vs write and scheduling
>based on that, add a flag bit to mark them all sync vs async since
>that's the distinction we actually care about. The normal paths can
>all do read+sync and write+async, but you can now do things like
>marking your truncate writes sync and readahead async.
>
>And dependent/nondependent or stalling/nonstalling might be a clearer
>terminology.
>
That will be worth investigating to see if the complexity is worth it.
I think from a disk point of view, we still want to split batches between
reads and writes. Could be wrong.

Nick


* Re: 2.5.59-mm5
  2003-01-24 16:04           ` 2.5.59-mm5 Nick Piggin
@ 2003-01-24 17:09             ` Giuliano Pochini
  2003-01-24 17:22               ` 2.5.59-mm5 Nick Piggin
  0 siblings, 1 reply; 32+ messages in thread
From: Giuliano Pochini @ 2003-01-24 17:09 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-mm, linux-kernel, linux-kernel, Alex Tomas, Andrew Morton,
	Oliver Xymoron

>>An alternate approach might be to change the way the scheduler splits
>>things. That is, rather than marking I/O read vs write and scheduling
>>based on that, add a flag bit to mark them all sync vs async since
>>that's the distinction we actually care about. The normal paths can
>>all do read+sync and write+async, but you can now do things like
>>marking your truncate writes sync and readahead async.

> That will be worth investigating to see if the complexity is worth it.
> I think from a disk point of view, we still want to split batches between
> reads and writes. Could be wrong.

Yes, sync vs async is a better way to classify I/O requests than
read vs write, and it's more correct from the OS point of view. IMHO
it's no more complex than what we have now. Just replace r/w with
sy/as and it will work.


Bye.


* Re: 2.5.59-mm5
  2003-01-24 17:09             ` 2.5.59-mm5 Giuliano Pochini
@ 2003-01-24 17:22               ` Nick Piggin
  2003-01-24 19:34                 ` 2.5.59-mm5 Valdis.Kletnieks
  0 siblings, 1 reply; 32+ messages in thread
From: Nick Piggin @ 2003-01-24 17:22 UTC (permalink / raw)
  To: Giuliano Pochini
  Cc: linux-mm, linux-kernel, linux-kernel, Alex Tomas, Andrew Morton,
	Oliver Xymoron

Giuliano Pochini wrote:

>>>An alternate approach might be to change the way the scheduler splits
>>>things. That is, rather than marking I/O read vs write and scheduling
>>>based on that, add a flag bit to mark them all sync vs async since
>>>that's the distinction we actually care about. The normal paths can
>>>all do read+sync and write+async, but you can now do things like
>>>marking your truncate writes sync and readahead async.
>>>
>
>>That will be worth investigating to see if the complexity is worth it.
>>I think from a disk point of view, we still want to split batches between
>>reads and writes. Could be wrong.
>>
>
>Yes, sync vs async is a better way to classify I/O requests than
>read vs write, and it's more correct from the OS point of view. IMHO
>it's no more complex than what we have now. Just replace r/w with
>sy/as and it will work.
>
We probably wouldn't want to go that far, as you obviously can
only merge reads with reads and writes with writes; a flag would
be fine. We have to get the basics working first though ;)
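
A minimal sketch of the "flag" variant, to make the distinction concrete.
Everything here is hypothetical (struct io_req, can_back_merge(),
dispatch_before()); none of it is taken from the -mm patch or the block
layer.  The point is only that the data direction keeps deciding what may
be merged, while a separate sync bit decides what gets promoted:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical types for illustration; not the 2.5 block layer's structures. */
enum io_dir { IO_READ, IO_WRITE };

struct io_req {
	enum io_dir dir;       /* data direction: decides what may be merged */
	bool sync;             /* true if a task is blocked waiting on this I/O */
	unsigned long sector;  /* first sector of the request */
	unsigned long nr;      /* length in sectors */
};

/*
 * Back-merge test: next may be appended to req only if both move data in
 * the same direction and next starts exactly where req ends.
 */
static bool can_back_merge(const struct io_req *req, const struct io_req *next)
{
	assert(req && next);
	return req->dir == next->dir && req->sector + req->nr == next->sector;
}

/*
 * Dispatch preference: returns true if a should be serviced before b.
 * Sync requests go first regardless of direction; ties fall back to
 * ascending sector order.
 */
static bool dispatch_before(const struct io_req *a, const struct io_req *b)
{
	assert(a && b);
	if (a->sync != b->sync)
		return a->sync;
	return a->sector < b->sector;
}
```

With this split, a truncate write marked sync would be promoted ahead of an
async readahead, yet it still could not be merged with an adjacent read.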


* Re: 2.5.59-mm5 got stuck during boot
  2003-01-24 13:59 ` 2.5.59-mm5 got stuck during boot Helge Hafting
@ 2003-01-24 17:44   ` Ed Tomlinson
  2003-01-24 17:56     ` Nick Piggin
  0 siblings, 1 reply; 32+ messages in thread
From: Ed Tomlinson @ 2003-01-24 17:44 UTC (permalink / raw)
  To: Andrew Morton, Nick Piggin; +Cc: linux-mm

On January 24, 2003 08:59 am, Helge Hafting wrote:
> Andrew Morton wrote:
> > .  -mm5 has the first cut of Nick Piggin's anticipatory I/O scheduler.
>
> Interesting, but it didn't boot completely.
> It came all the way to mount root from /dev/md0  (dirty raid1)
> freed 316k of kernel memory, and then nothing happened.
> numlock and capslock worked, and so did sysrq.
> It was as if the kernel "forgot" to run init.
> Nothing happened, but it wasn't hanging either.
>
> sysrq "show pc" told me something about default idle.
> I noticed that the root raid-1 came up dirty. (2.5.X
> seems unable to shut down a raid-1 device "clean" if
> it happens to be the root fs, so there's _always_
> a bootup resync that starts as soon as the raid
> is autodetected, before mounting root.)
>
>
> This is a UP P4, preempt, no module support,
> compiled with gcc 2.95.4 from debian.
>
> Stock 2.5.59 works, the only config change is to enable
> that new CONFIG_HANGCHECK_TIMER.

Same story here - almost.  No raid, using debian and the same
compiler along with multiple disks and fs(es).

Following are the messages and a sysrq+T:

Hope this helps,
Ed Tomlinson

---------
Linux version 2.5.59-mm5 (ed@oscar) (gcc version 2.95.4 20011002 (Debian prerelease)) #1 Fri Jan 24 12:09:29 EST 2003
Video mode to be used for restore is f00
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 00000000000a0000 (usable)
 BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 000000001fff0000 (usable)
 BIOS-e820: 000000001fff0000 - 000000001fff3000 (ACPI NVS)
 BIOS-e820: 000000001fff3000 - 0000000020000000 (ACPI data)
 BIOS-e820: 00000000ffff0000 - 0000000100000000 (reserved)
511MB LOWMEM available.
On node 0 totalpages: 131056
  DMA zone: 4096 pages, LIFO batch:1
  Normal zone: 126960 pages, LIFO batch:16
  HighMem zone: 0 pages, LIFO batch:1
Building zonelist for node : 0
Kernel command line: auto BOOT_IMAGE=Linux ro root=2103 console=tty0 console=ttyS0,38400 vga=ask idebus=33 profile=1
ide_setup: idebus=33
kernel profiling enabled
Initializing CPU#0
PID hash table entries: 2048 (order 11: 16384 bytes)
Detected 400.850 MHz processor.
Console: colour VGA+ 80x25
Calibrating delay loop... 790.52 BogoMIPS
Memory: 513308k/524224k available (1336k kernel code, 10184k reserved, 713k data, 80k init, 0k highmem)
Dentry cache hash table entries: 65536 (order: 7, 524288 bytes)
Inode-cache hash table entries: 32768 (order: 6, 262144 bytes)
Mount-cache hash table entries: 512 (order: 0, 4096 bytes)
-> /dev
-> /dev/console
-> /root
Enabling new style K6 write allocation for 511 Mb
CPU: L1 I Cache: 32K (32 bytes/line), D cache 32K (32 bytes/line)
CPU: L2 Cache: 256K (32 bytes/line)
CPU: AMD-K6(tm) 3D+ Processor stepping 01
Checking 'hlt' instruction... OK.
POSIX conformance testing by UNIFIX
Linux NET4.0 for Linux 2.4
Based upon Swansea University Computer Society NET3.039
Initializing RT netlink socket
mtrr: v2.0 (20020519)
PCI: PCI BIOS revision 2.10 entry at 0xfb520, last bus=1
PCI: Using configuration type 1
BIO: pool of 256 setup, 15Kb (60 bytes/bio)
biovec pool[0]:   1 bvecs: 256 entries (12 bytes)
biovec pool[1]:   4 bvecs: 256 entries (48 bytes)
biovec pool[2]:  16 bvecs: 256 entries (192 bytes)
biovec pool[3]:  64 bvecs: 256 entries (768 bytes)
biovec pool[4]: 128 bvecs: 256 entries (1536 bytes)
biovec pool[5]: 256 bvecs: 256 entries (3072 bytes)
Linux Plug and Play Support v0.94 (c) Adam Belay
pnp: Enabling Plug and Play Card Services.
PnPBIOS: Found PnP BIOS installation structure at 0xc00fc160
PnPBIOS: PnP BIOS version 1.0, entry 0xf0000:0xc188, dseg 0xf0000
PnPBIOS: 14 nodes reported by PnP BIOS; 14 recorded by driver
isapnp: Scanning for PnP cards...
isapnp: No Plug & Play device found
block request queues:
 128 requests per read queue
 128 requests per write queue
 8 requests per batch
 enter congestion at 15
 exit congestion at 17
drivers/usb/core/usb.c: registered new driver usbfs
drivers/usb/core/usb.c: registered new driver hub
PCI: Probing PCI hardware
PCI: Probing PCI hardware (bus 00)
PCI: Using IRQ router VIA [1106/0586] at 00:07.0
aio_setup: sizeof(struct page) = 40
Journalled Block Device driver loaded
Initializing Cryptographic API
Activating ISA DMA hang workarounds.
Serial: 8250/16550 driver $Revision: 1.90 $ IRQ sharing disabled
ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
ttyS2 at I/O 0x3e8 (irq = 4) is a 16550A
pty: 256 Unix98 ptys configured
Linux agpgart interface v0.100 (c) Dave Jones
agpgart: Detected VIA MVP3 chipset
agpgart: Maximum main memory to use for agp memory: 439M
agpgart: AGP aperture is 64M @ 0xe0000000
[drm] Initialized mga 3.1.0 20021029 on minor 0
Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 33MHz system bus speed for PIO modes
VP_IDE: IDE controller at PCI slot 00:07.1
VP_IDE: chipset revision 6
VP_IDE: not 100% native mode: will probe irqs later
VP_IDE: VIA vt82c586b (rev 47) IDE UDMA33 controller on pci00:07.1
    ide0: BM-DMA at 0xa000-0xa007, BIOS settings: hda:DMA, hdb:DMA
    ide1: BM-DMA at 0xa008-0xa00f, BIOS settings: hdc:DMA, hdd:DMA
hda: QUANTUM FIREBALLP KA13.6, ATA DISK drive
hda: DMA disabled
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
hdc: AOPEN 16XDVD-ROM/AMH 20020328, ATAPI CD/DVD-ROM drive
hdd: HP COLORADO 20GB, ATAPI TAPE drive
hdc: DMA disabled
hdd: DMA disabled
ide1 at 0x170-0x177,0x376 on irq 15
PDC20267: IDE controller at PCI slot 00:09.0
PCI: Found IRQ 12 for device 00:09.0
PDC20267: chipset revision 2
PDC20267: not 100% native mode: will probe irqs later
PDC20267: ROM enabled at 0xeb000000
PDC20267: (U)DMA Burst Bit ENABLED Primary PCI Mode Secondary PCI Mode.
    ide2: BM-DMA at 0xbc00-0xbc07, BIOS settings: hde:DMA, hdf:pio
    ide3: BM-DMA at 0xbc08-0xbc0f, BIOS settings: hdg:DMA, hdh:DMA
hde: QUANTUM FIREBALLP AS40.0, ATA DISK drive
ide2 at 0xac00-0xac07,0xb002 on irq 12
hdg: QUANTUM FIREBALLP AS40.0, ATA DISK drive
ide3 at 0xb400-0xb407,0xb802 on irq 12
hda: host protected area => 1
hda: 27067824 sectors (13859 MB) w/371KiB Cache, CHS=26853/16/63, UDMA(33)
 hda: hda1 hda2 hda3 hda4 < hda5 >
hde: host protected area => 1
hde: 78177792 sectors (40027 MB) w/1902KiB Cache, CHS=77557/16/63, UDMA(100)
 hde: hde1 hde2 hde3 hde4 < hde5 >
hdg: host protected area => 1
hdg: 78177792 sectors (40027 MB) w/1902KiB Cache, CHS=77557/16/63, UDMA(100)
 hdg: hdg1 hdg2 hdg3 hdg4 < hdg5 >
drivers/usb/host/uhci-hcd.c: USB Universal Host Controller Interface driver v2.0
uhci-hcd 00:07.2: VIA Technologies, In USB
uhci-hcd 00:07.2: irq 10, io base 0000a400
Please use the 'usbfs' filetype instead, the 'usbdevfs' name is deprecated.
uhci-hcd 00:07.2: new USB bus registered, assigned bus number 1
hub 1-0:0: USB hub found
hub 1-0:0: 2 ports detected
mice: PS/2 mouse device common for all mice
input: AT Set 2 keyboard on isa0060/serio0
serio: i8042 KBD port at 0x60,0x64 irq 1
NET4: Linux TCP/IP 1.0 for NET4.0
IP: routing cache hash table of 4096 buckets, 32Kbytes
TCP: Hash tables configured (established 32768 bind 32768)
NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.
found reiserfs format "3.6" with standard journal
hub 1-0:0: debounce: port 1: delay 100ms stable 4 status 0x101
hub 1-0:0: new USB device on port 1, assigned address 2
hub 1-1:0: USB hub found
Reiserfs journal params: device ide2(33,3), size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30
reiserfs: checking transaction log (ide2(33,3)) for (ide2(33,3))
hub 1-1:0: 4 ports detected
Using r5 hash to sort names
VFS: Mounted root (reiserfs filesystem) readonly.
Freeing unused kernel memory: 80k freed
hub 1-0:0: debounce: port 2: delay 100ms stable 4 status 0x301
hub 1-0:0: new USB device on port 2, assigned address 3
SysRq : Show State

                         free                        sibling
  task             PC    stack   pid father child younger older
init          D 00000086 12112     1      0     2               (NOTLB)
Call Trace:
 [<c0113f5a>] io_schedule+0xe/0x18
 [<c0127654>] __lock_page+0x90/0xac
 [<c0114694>] autoremove_wake_function+0x0/0x38
 [<c0114694>] autoremove_wake_function+0x0/0x38
 [<c01284cb>] filemap_nopage+0x16b/0x2ac
 [<c01322d4>] do_no_page+0x78/0x2b4
 [<c013257d>] handle_mm_fault+0x6d/0x10c
 [<c0111cb7>] do_page_fault+0x137/0x414
 [<c0111b80>] do_page_fault+0x0/0x414
 [<c013e9aa>] __fput+0xe6/0x108
 [<c0133f01>] unmap_vma+0x69/0x70
 [<c0133f1c>] unmap_vma_list+0x14/0x20
 [<c013423b>] do_munmap+0x127/0x134
 [<c013428c>] sys_munmap+0x44/0x60
 [<c0108cbd>] error_code+0x2d/0x40

ksoftirqd/0   S 00000046 4294963856     2      1             3       (L-TLB)
Call Trace:
 [<c01196e9>] ksoftirqd+0x59/0xc8
 [<c0119711>] ksoftirqd+0x81/0xc8
 [<c0119690>] ksoftirqd+0x0/0xc8
 [<c0106e45>] kernel_thread_helper+0x5/0xc

events/0      D 00000046 4294953780     3      1    12       4     2 (L-TLB)
Call Trace:
 [<c0113463>] wait_for_completion+0x1b/0xe0
 [<c01134e5>] wait_for_completion+0x9d/0xe0
 [<c01132c8>] default_wake_function+0x0/0x2c
 [<c01132c8>] default_wake_function+0x0/0x2c
 [<c0115cba>] do_fork+0x10e/0x130
 [<c0106ec5>] kernel_thread+0x79/0x94
 [<c0121758>] ____call_usermodehelper+0x0/0x3c
 [<c0106e40>] kernel_thread_helper+0x0/0xc
 [<c01217a9>] __call_usermodehelper+0x15/0x28
 [<c0121758>] ____call_usermodehelper+0x0/0x3c
 [<c0121cf2>] worker_thread+0x1fa/0x2dc
 [<c0121af8>] worker_thread+0x0/0x2dc
 [<c0121794>] __call_usermodehelper+0x0/0x28
 [<c01132c8>] default_wake_function+0x0/0x2c
 [<c01132c8>] default_wake_function+0x0/0x2c
 [<c0106e45>] kernel_thread_helper+0x5/0xc

khubd         D 00000046 4292756256     4      1             5     3 (L-TLB)
Call Trace:
 [<c0113463>] wait_for_completion+0x1b/0xe0
 [<c01134e5>] wait_for_completion+0x9d/0xe0
 [<c01132c8>] default_wake_function+0x0/0x2c
 [<c01132c8>] default_wake_function+0x0/0x2c
 [<c0121903>] call_usermodehelper+0x147/0x15c
 [<c01ec6d0>] usb_hotplug+0x0/0x1d8
 [<c0121794>] __call_usermodehelper+0x0/0x28
 [<c0121794>] __call_usermodehelper+0x0/0x28
 [<c01b0fc9>] do_hotplug+0x1e9/0x21c
 [<c01b102c>] dev_hotplug+0x30/0x3c
 [<c01ec6d0>] usb_hotplug+0x0/0x1d8
 [<c01af34e>] device_add+0x112/0x148
 [<c01ed112>] usb_new_device+0x366/0x4c4
 [<c0116a26>] printk+0x11e/0x140
 [<c01eec0f>] usb_hub_port_connect_change+0x24f/0x2e4
 [<c01eeddb>] usb_hub_events+0x137/0x2c4
 [<c01eef98>] usb_hub_thread+0x30/0xd8
 [<c01eef68>] usb_hub_thread+0x0/0xd8
 [<c01132c8>] default_wake_function+0x0/0x2c
 [<c0106e45>] kernel_thread_helper+0x5/0xc

pdflush       S 00000046 4292616332     5      1             6     4 (L-TLB)
Call Trace:
 [<c012ba65>] __pdflush+0xf5/0x1f8
 [<c012bb68>] pdflush+0x0/0x14
 [<c012bb73>] pdflush+0xb/0x14
 [<c0106e45>] kernel_thread_helper+0x5/0xc

pdflush       S 00000046 14412     6      1             7     5 (L-TLB)
Call Trace:
 [<c012ba65>] __pdflush+0xf5/0x1f8
 [<c012bb68>] pdflush+0x0/0x14
 [<c012bb73>] pdflush+0xb/0x14
 [<c0106e45>] kernel_thread_helper+0x5/0xc

kswapd0       S 00000046 4294958936     7      1             8     6 (L-TLB)
Call Trace:
 [<c012fb7a>] kswapd+0xea/0x10c
 [<c012fa90>] kswapd+0x0/0x10c
 [<c0109c3b>] math_state_restore+0x27/0x38
 [<c0108d15>] device_not_available+0x25/0x2a
 [<c010e170>] save_init_fpu+0x1c/0x38
 [<c01132b0>] preempt_schedule+0x28/0x40
 [<c0112b7c>] schedule_tail+0x1c/0x4c
 [<c0108915>] ret_from_fork+0x5/0x20
 [<c012fa90>] kswapd+0x0/0x10c
 [<c0114694>] autoremove_wake_function+0x0/0x38
 [<c0114694>] autoremove_wake_function+0x0/0x38
 [<c0106e45>] kernel_thread_helper+0x5/0xc

aio/0         S 00000046 4294886880     8      1             9     7 (L-TLB)
Call Trace:
 [<c0121c49>] worker_thread+0x151/0x2dc
 [<c0121af8>] worker_thread+0x0/0x2dc
 [<c0108915>] ret_from_fork+0x5/0x20
 [<c01132c8>] default_wake_function+0x0/0x2c
 [<c01132c8>] default_wake_function+0x0/0x2c
 [<c0106e45>] kernel_thread_helper+0x5/0xc

kpnpbiosd     T 00000046 4294880228     9      1            10     8 (L-TLB)
Call Trace:
 [<c011820c>] do_exit+0x3c4/0x3d4
 [<c0118232>] complete_and_exit+0x16/0x18
 [<c01a769d>] pnp_dock_thread+0x99/0xf4
 [<c01a7604>] pnp_dock_thread+0x0/0xf4
 [<c0106e45>] kernel_thread_helper+0x5/0xc

kseriod       S 00000046 4294112016    10      1            11     9 (L-TLB)
Call Trace:
 [<c01ff629>] serio_thread+0x9d/0x124
 [<c01ff58c>] serio_thread+0x0/0x124
 [<c01132c8>] default_wake_function+0x0/0x2c
 [<c0106e45>] kernel_thread_helper+0x5/0xc

reiserfs/0    S 00000046  8096    11      1                  10 (L-TLB)
Call Trace:
 [<c0121c49>] worker_thread+0x151/0x2dc
 [<c0121af8>] worker_thread+0x0/0x2dc
 [<c0108915>] ret_from_fork+0x5/0x20
 [<c01132c8>] default_wake_function+0x0/0x2c
 [<c01132c8>] default_wake_function+0x0/0x2c
 [<c0106e45>] kernel_thread_helper+0x5/0xc

events/0      D 00000046 4294304092    12      3                     (L-TLB)
Call Trace:
 [<c0113f5a>] io_schedule+0xe/0x18
 [<c013ec50>] __wait_on_buffer+0x78/0x94
 [<c0114694>] autoremove_wake_function+0x0/0x38
 [<c0114694>] autoremove_wake_function+0x0/0x38
 [<c013fbfc>] __bread_slow+0x6c/0x94
 [<c013fe4c>] __bread+0x28/0x30
 [<c018d5c9>] search_by_key+0x65/0xd64
 [<c01792a4>] search_by_entry_key+0x20/0x1b4
 [<c01797e9>] reiserfs_find_entry+0x7d/0x134
 [<c0179919>] reiserfs_lookup+0x79/0x168
 [<c012d14e>] kmem_cache_alloc+0x22/0x5c
 [<c01515ef>] d_alloc+0x1b/0x18c
 [<c0148b5f>] real_lookup+0x5f/0xcc
 [<c0148dfe>] do_lookup+0xb2/0x1fc
 [<c01494c7>] link_path_walk+0x57f/0x8c4
 [<c0149af4>] path_lookup+0x128/0x12c
 [<c014640b>] open_exec+0x1b/0xb8
 [<c01471ca>] do_execve+0x1e/0x204
 [<c012d14e>] kmem_cache_alloc+0x22/0x5c
 [<c014887e>] getname+0x5e/0x9c
 [<c0107584>] sys_execve+0x2c/0x64
 [<c0108a57>] syscall_call+0x7/0xb
 [<c01214e3>] exec_usermodehelper+0x333/0x360
 [<c0121785>] ____call_usermodehelper+0x2d/0x3c
 [<c0121758>] ____call_usermodehelper+0x0/0x3c
 [<c0106e45>] kernel_thread_helper+0x5/0xc

SysRq : Emergency Sync
Syncing device ide2(33,3) ... OK
Done.
SysRq : Emergency Remount R/O
Remounting device ide2(33,3) ... R/O
Done.
SysRq : Resetting


* Re: 2.5.59-mm5 got stuck during boot
  2003-01-24 17:44   ` Ed Tomlinson
@ 2003-01-24 17:56     ` Nick Piggin
  2003-01-24 19:18       ` Ed Tomlinson
  0 siblings, 1 reply; 32+ messages in thread
From: Nick Piggin @ 2003-01-24 17:56 UTC (permalink / raw)
  To: Ed Tomlinson; +Cc: Andrew Morton, linux-mm

Ed Tomlinson wrote:

>On January 24, 2003 08:59 am, Helge Hafting wrote:
>
>>Andrew Morton wrote:
>>
>>>.  -mm5 has the first cut of Nick Piggin's anticipatory I/O scheduler.
>>>
>>Interesting, but it didn't boot completely.
>>It came all the way to mount root from /dev/md0  (dirty raid1)
>>freed 316k of kernel memory, and then nothing happened.
>>numlock and capslock worked, and so did sysrq.
>>It was as if the kernel "forgot" to run init.
>>Nothing happened, but it wasn't hanging either.
>>
>>sysrq "show pc" told me something about default idle.
>>I noticed that the root raid-1 came up dirty. (2.5.X
>>seems unable to shut down a raid-1 device "clean" if
>>it happens to be the root fs, so there's _always_
>>a bootup resync that starts as soon as the raid
>>is autodetected, before mounting root.)
>>
>>
>>This is a UP P4, preempt, no module support,
>>compiled with gcc 2.95.4 from debian.
>>
>>Stock 2.5.59 works, the only config change is to enable
>>that new CONFIG_HANGCHECK_TIMER.
>>
>
>Same story here - almost.  No raid, using debian and the same
>compiler along with multiple disks and fs(es).
>
>Following are the messages and a sysrq+T:
>
>Hope this helps,
>
Yes thanks for the nice report.

>
>                         free                        sibling
>  task             PC    stack   pid father child younger older
>init          D 00000086 12112     1      0     2               (NOTLB)
>Call Trace:
> [<c0113f5a>] io_schedule+0xe/0x18
> [<c0127654>] __lock_page+0x90/0xac
> [<c0114694>] autoremove_wake_function+0x0/0x38
> [<c0114694>] autoremove_wake_function+0x0/0x38
> [<c01284cb>] filemap_nopage+0x16b/0x2ac
> [<c01322d4>] do_no_page+0x78/0x2b4
> [<c013257d>] handle_mm_fault+0x6d/0x10c
> [<c0111cb7>] do_page_fault+0x137/0x414
> [<c0111b80>] do_page_fault+0x0/0x414
> [<c013e9aa>] __fput+0xe6/0x108
> [<c0133f01>] unmap_vma+0x69/0x70
> [<c0133f1c>] unmap_vma_list+0x14/0x20
> [<c013423b>] do_munmap+0x127/0x134
> [<c013428c>] sys_munmap+0x44/0x60
> [<c0108cbd>] error_code+0x2d/0x40
>
Processes go to sleep waiting for a page and never wake up.
It doesn't seem to be an anticipatory scheduling problem, but
if you have time, try changing this in drivers/block/deadline-iosched.c:

static int antic_expire = HZ / 25;
to
static int antic_expire = 0;

And see if you can reproduce.
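
For scale: HZ / 25 is one twenty-fifth of a second expressed in timer
ticks, so the default works out to roughly 40 ms whatever HZ is, and 0
presumably leaves no anticipation window at all.  A small userspace
sketch of the arithmetic; the helper names are made up for illustration
and the HZ values used in the note below (100 and 1000) are just common
examples, not read from this kernel's config:

```c
#include <assert.h>

/* Convert a timeout in jiffies (timer ticks) to milliseconds for a given HZ. */
static unsigned int jiffies_to_ms(unsigned int jiffies, unsigned int hz)
{
	assert(hz > 0);
	return jiffies * 1000u / hz;
}

/* The initialiser from the patch, HZ / 25, parameterised on HZ. */
static unsigned int antic_expire_jiffies(unsigned int hz)
{
	return hz / 25;
}
```

jiffies_to_ms(antic_expire_jiffies(1000), 1000) and
jiffies_to_ms(antic_expire_jiffies(100), 100) both give 40.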

Nick





* Re: 2.5.59-mm5
  2003-01-24 12:05         ` 2.5.59-mm5 Alex Tomas
@ 2003-01-24 19:12           ` Andrew Morton
  2003-01-24 19:58             ` 2.5.59-mm5 Alex Tomas
  2003-01-25 17:32             ` 2.5.59-mm5 Ed Tomlinson
  0 siblings, 2 replies; 32+ messages in thread
From: Andrew Morton @ 2003-01-24 19:12 UTC (permalink / raw)
  To: Alex Tomas; +Cc: linux-kernel, linux-kernel, linux-mm

Alex Tomas <bzzz@tmi.comex.ru> wrote:
>
> >>>>> Andrew Morton (AM) writes:
> 
>  AM> That's correct.  Reads are usually synchronous and writes are
>  AM> rarely synchronous.
> 
>  AM> The most common place where the kernel forces a user process to
>  AM> wait on completion of a write is actually in unlink (truncate,
>  AM> really).  Because truncate must wait for in-progress I/O to
>  AM> complete before allowing the filesystem to free (and potentially
>  AM> reuse) the affected blocks.
> 
> looks like I'm missing something here.
> 
> why do we wait for write completion in truncate? 

We cannot free disk blocks until I/O against them has completed.  Otherwise
the block could be reused for something else, then the old IO will scribble
on the new data.

What we _can_ do is to defer the waiting - only wait on the I/O when someone
reuses the disk blocks.  So there are actually unused blocks with I/O in
flight against them.

We do that for metadata (the wait happens in unmap_underlying_metadata()) but
for file data blocks there is no mechanism in place to look them up.
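
A toy model of the two policies, as a sketch only.  It assumes a per-block
table (blocks[]) that can be indexed by block number, which is exactly the
lookup mechanism that does not exist for file data blocks; all names here
are hypothetical and wait_for_io() merely stands in for blocking on the
device:

```c
#include <assert.h>
#include <stdbool.h>

#define NBLOCKS 16  /* hypothetical tiny device */

struct blk_state {
	bool allocated;    /* owned by some file */
	bool io_inflight;  /* an earlier write has not completed yet */
};

static struct blk_state blocks[NBLOCKS];
static unsigned int waits;  /* how many times a caller had to block */

/* Stand-in for sleeping until the outstanding I/O on blk completes. */
static void wait_for_io(unsigned int blk)
{
	waits++;
	blocks[blk].io_inflight = false;
}

/* Eager policy: truncate waits for in-flight I/O before freeing. */
static void free_block_eager(unsigned int blk)
{
	assert(blk < NBLOCKS);
	if (blocks[blk].io_inflight)
		wait_for_io(blk);
	blocks[blk].allocated = false;
}

/* Deferred policy: free at once; the block is unused but has I/O in flight. */
static void free_block_deferred(unsigned int blk)
{
	assert(blk < NBLOCKS);
	blocks[blk].allocated = false;
}

/* Reuse: with deferral the wait moves here, and only if I/O is still pending. */
static void alloc_block(unsigned int blk)
{
	assert(blk < NBLOCKS && !blocks[blk].allocated);
	if (blocks[blk].io_inflight)
		wait_for_io(blk);
	blocks[blk].allocated = true;
}
```

In the deferred case the truncating process never blocks; whoever reuses
the block pays the wait, and only if the old write is still outstanding.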

* Re: 2.5.59-mm5 got stuck during boot
  2003-01-24 17:56     ` Nick Piggin
@ 2003-01-24 19:18       ` Ed Tomlinson
  0 siblings, 0 replies; 32+ messages in thread
From: Ed Tomlinson @ 2003-01-24 19:18 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-mm

On January 24, 2003 12:56 pm, Nick Piggin wrote:
> Processes go to sleep waiting for a page and never wake up.
> It doesn't seem to be an anticipatory scheduling problem but
> if you have time, try changing drivers/block/deadline-iosched.c
>
> static int antic_expire = HZ / 25;
> to
> static int antic_expire = 0;
>
> And see if you can reproduce.

It boots with this change.

Ed 


* Re: 2.5.59-mm5
  2003-01-24 17:22               ` 2.5.59-mm5 Nick Piggin
@ 2003-01-24 19:34                 ` Valdis.Kletnieks
  2003-01-24 20:04                   ` 2.5.59-mm5 Jens Axboe
  0 siblings, 1 reply; 32+ messages in thread
From: Valdis.Kletnieks @ 2003-01-24 19:34 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Giuliano Pochini, linux-mm, linux-kernel, linux-kernel,
	Alex Tomas, Andrew Morton, Oliver Xymoron

On Sat, 25 Jan 2003 04:22:39 +1100, Nick Piggin said:
> We probably wouldn't want to go that far as you obviously can
> only merge reads with reads and writes with writes, a flag would
> be fine. We have to get the basics working first though ;)

"obviously can only"?  Admittedly, merging reads and writes is a lot
trickier, and probably "too hairy to bother", but I'm not aware of a
fundamental "cant" that applies across IDE/SCSI/USB/1394/fiberchannel/etc.
-- 
				Valdis Kletnieks
				Computer Systems Senior Engineer
				Virginia Tech



* Re: 2.5.59-mm5
  2003-01-24 19:12           ` 2.5.59-mm5 Andrew Morton
@ 2003-01-24 19:58             ` Alex Tomas
  2003-01-25 17:32             ` 2.5.59-mm5 Ed Tomlinson
  1 sibling, 0 replies; 32+ messages in thread
From: Alex Tomas @ 2003-01-24 19:58 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alex Tomas, linux-kernel, linux-kernel, linux-mm

>>>>> Andrew Morton (AM) writes:

 AM> We cannot free disk blocks until I/O against them has completed.
 AM> Otherwise the block could be reused for something else, then the
 AM> old IO will scribble on the new data.

 AM> What we _can_ do is to defer the waiting - only wait on the I/O
 AM> when someone reuses the disk blocks.  So there are actually
 AM> unused blocks with I/O in flight against them.

 AM> We do that for metadata (the wait happens in
 AM> unmap_underlying_metadata()) but for file data blocks there is no
 AM> mechanism in place to look them up

yeah! indeed. my stupid mistake ...


* Re: 2.5.59-mm5
  2003-01-24 19:34                 ` 2.5.59-mm5 Valdis.Kletnieks
@ 2003-01-24 20:04                   ` Jens Axboe
  2003-01-24 22:02                     ` 2.5.59-mm5 Valdis.Kletnieks
  0 siblings, 1 reply; 32+ messages in thread
From: Jens Axboe @ 2003-01-24 20:04 UTC (permalink / raw)
  To: Valdis.Kletnieks, Nick Piggin
  Cc: Giuliano Pochini, linux-mm, linux-kernel, linux-kernel,
	Alex Tomas, Andrew Morton, Oliver Xymoron

On Fri, Jan 24 2003, Valdis.Kletnieks@vt.edu wrote:
> On Sat, 25 Jan 2003 04:22:39 +1100, Nick Piggin said:
> > We probably wouldn't want to go that far as you obviously can
> > only merge reads with reads and writes with writes, a flag would
> > be fine. We have to get the basics working first though ;)
> 
> "obviously can only"?  Admittedly, merging reads and writes is a lot
> trickier, and probably "too hairy to bother", but I'm not aware of a
> fundamental "cant" that applies across IDE/SCSI/USB/1394/fiberchannel/etc.

Nick's comment refers to the block layer situation; we obviously cannot
merge reads and writes there. You would basically have to rewrite the
entire request submission structure and break all drivers. And for zero
benefit. Face it, it would be stupid to even attempt such a maneuver.

Since you bring it up, you must know of a device which can take a single
command that says "read blocks a to b, and write blocks x to z"? Even if
such a thing existed, it would be much better implemented by the driver
pulling more requests off the queue and constructing these weirdo
commands itself. Something as ugly as that would never invade the Linux
block layer, at least not as long as I have any input on the design of
it.

So I quite agree with the "obviously".

-- 
Jens Axboe


* Re: 2.5.59-mm5
  2003-01-24 20:04                   ` 2.5.59-mm5 Jens Axboe
@ 2003-01-24 22:02                     ` Valdis.Kletnieks
  2003-01-25 12:28                       ` 2.5.59-mm5 Jens Axboe
  0 siblings, 1 reply; 32+ messages in thread
From: Valdis.Kletnieks @ 2003-01-24 22:02 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Nick Piggin, Giuliano Pochini, linux-mm, linux-kernel,
	linux-kernel, Alex Tomas, Andrew Morton, Oliver Xymoron

[-- Attachment #1: Type: text/plain, Size: 1309 bytes --]

On Fri, 24 Jan 2003 21:04:34 +0100, Jens Axboe said:

> Nick's comment refers to the block layer situation; we obviously cannot
> merge reads and writes there. You would basically have to rewrite the
> entire request submission structure and break all drivers. And for zero
> benefit. Face it, it would be stupid to even attempt such a maneuver.

As I *said* - "hairy beyond benefit", not "cant".

> Since you bring it up, you must know of a device which can take a single
> command that says "read blocks a to b, and write blocks x to z"? Even if
> such a thing existed,

They do exist.

IBM mainframe disks (the 3330/50/80 series) are able to do much more than that
in one CCW chain.  So it was *quite* possible to even express things like "Go to
this cylinder/track, search for each record that has value XYZ in the 'key'
field, and if found, write value ABC in the data field". (In fact, the DASD I/O
opcodes for CCW chains are Turing-complete).

>                      it would be much better implemented by the driver
> pulling more requests off the queue and constructing these weirdo

The only operating system I'm aware of that actually uses that stuff is MVS.

> So I quite agree with the "obviously".

My complaint was the confusion of "obviously cant" with "we have decided we
don't want to".

/Valdis


* Re: 2.5.59-mm5
  2003-01-24  3:50 2.5.59-mm5 Andrew Morton
  2003-01-24 11:03 ` 2.5.59-mm5 Alex Bligh - linux-kernel
  2003-01-24 13:59 ` 2.5.59-mm5 got stuck during boot Helge Hafting
@ 2003-01-25  8:33 ` Andres Salomon
  2 siblings, 0 replies; 32+ messages in thread
From: Andres Salomon @ 2003-01-25  8:33 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

My atyfb_base.c compile fix (from 2.5.54) still hasn't found its way into
any of the main kernel trees.  The original patch generates a reject
against 2.5.59-mm5, so here's an updated patch.


On Thu, 23 Jan 2003 19:50:44 -0800, Andrew Morton wrote:

> http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.59/2.5.59-mm5/
> 
> .  -mm3 and -mm4 were not announced - they were sync-up patches as we
>   worked on the I/O scheduler.
> 
> .  -mm5 has the first cut of Nick Piggin's anticipatory I/O scheduler.
>   Here's the scoop:
> 
[...]
> 
> anticipatory_io_scheduling-2_5_59-mm3.patch
>   Subject: [PATCH] 2.5.59-mm3 antic io sched
> 
> 


--- a/drivers/video/aty/atyfb_base.c    2003-01-25 03:02:35.000000000 -0500
+++ b/drivers/video/aty/atyfb_base.c    2003-01-25 03:21:48.000000000 -0500
@@ -2587,12 +2587,12 @@
	if (info->screen_base)
		iounmap((void *) info->screen_base);
 #ifdef __BIG_ENDIAN
-	if (info->cursor && par->cursor->ram)
+	if (par->cursor && par->cursor->ram)
		iounmap(par->cursor->ram);
 #endif
 #endif
-	if (info->cursor)
-		kfree(info->cursor);
+	if (par->cursor)
+		kfree(par->cursor);
 #ifdef __sparc__
	if (par->mmap_map)
		kfree(par->mmap_map);


* Re: 2.5.59-mm5
  2003-01-24 22:02                     ` 2.5.59-mm5 Valdis.Kletnieks
@ 2003-01-25 12:28                       ` Jens Axboe
  0 siblings, 0 replies; 32+ messages in thread
From: Jens Axboe @ 2003-01-25 12:28 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: Nick Piggin, Giuliano Pochini, linux-mm, linux-kernel,
	linux-kernel, Alex Tomas, Andrew Morton, Oliver Xymoron

On Fri, Jan 24 2003, Valdis.Kletnieks@vt.edu wrote:
> On Fri, 24 Jan 2003 21:04:34 +0100, Jens Axboe said:
> 
> > Nick's comment refers to the block layer situation; we obviously cannot
> > merge reads and writes there. You would basically have to rewrite the
> > entire request submission structure and break all drivers. And for zero
> > benefit. Face it, it would be stupid to even attempt such a maneuver.
> 
> As I *said* - "hairy beyond benefit", not "cant".

Hairy is ok as long as it provides substantial benefit in some way, and
this does definitely not qualify.

> > Since you bring it up, you must know of a device which can take a single
> > command that says "read blocks a to b, and write blocks x to z"? Even if
> > such a thing existed,
> 
> They do exist.
> 
> IBM mainframe disks (the 3330/50/80 series) are able to do much more
> than that in one CCW chain.  So it was *quite* possible to even express
> things like "Go to this cylinder/track, search for each record that
> has value XYZ in the 'key' field, and if found, write value ABC in the
> data field". (In fact, the DASD I/O opcodes for CCW chains are
> Turing-complete).

Well as interesting as that is, it is still an obscurity that will not
be generally supported. As I said, if you wanted to do such a thing you
can do it in the driver. Complicating the block layer in this way is
totally unacceptable, and is just bound to be an endless source of data
corrupting driver bugs.

> > So I quite agree with the "obviously".
> 
> My complaint was the confusion of "obviously can't" with "we have decided we
> don't want to".

Ok fair enough, make that a strong "obviously won't" instead then.

-- 
Jens Axboe


* Re: 2.5.59-mm5
  2003-01-24 19:12           ` 2.5.59-mm5 Andrew Morton
  2003-01-24 19:58             ` 2.5.59-mm5 Alex Tomas
@ 2003-01-25 17:32             ` Ed Tomlinson
  2003-01-25 17:41               ` 2.5.59-mm5 Andrew Morton
  1 sibling, 1 reply; 32+ messages in thread
From: Ed Tomlinson @ 2003-01-25 17:32 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm

Hi Andrew,

I am seeing a strange problem with mm5.  This occurs both with and without
the anticipatory scheduler changes.  What happens is I see very high system
times and X responds very very slowly.  I first noticed this when switching
between folders in kmail and have seen it rebuilding db files for squidguard.
Here is what happened during the db rebuild (no anticipatory ioscheduler):

oscar# readprofile -r; vmstat 2
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 6  0    348  15824 115900 183148    0    0   191   134 1064   770 22  6 66  6
 5  4    348  15312 115936 183420    0    0     0  4392 1027   537 28 72  0  0
 5  0    348  14872 115936 183956    0    0     0   422 1079   553 33 68  0  0
 7  0    348  14552 115936 184316    0    0     0     0 1001   536 42 58  0  0
 6  0    348  13912 116012 184900    0    0     0   126 1019   560 32 68  0  0
 5  0    348  13272 116024 185468    0    0     4     0 1002   560 27 73  0  0
 5  4    348  12696 116060 186052    0    0     0    86 1014   519 28 73  0  0
 5  0    348  12368 116060 186356    0    0     0     0 1001   509 24 76  0  0
 5  0    348  11920 116060 186772    0    0     0    34 1003   519 27 74  0  0
 6  0    348  11672 116084 187044    0    0     0    88 1186  1199 29 71  0  0
 8  1    348   8536 116276 188148    0    0   468     0 1118   761 39 61  0  0
 5  5    348   5016 114468 188120    0    0   614   304 1118   811 59 41  0  0
 6  0    348   5144 113336 186548    0    0   648     0 1036   770 54 46  0  0
 7  0    348   5080 113252 185920    0    0   132     0 1013   707 42 58  0  0
 6  0    348   4688 113188 185528    0    0   184   262 1049   784 64 36  0  0
 6  0    348   6032 111292 185160    0    0   406     0 1038   725 39 62  0  0
 6  0    348   5200 111392 185908    0    0   216  1096 1032   733 35 65  0  0
 6  0    348   4312 111392 186744    0    0   166     0 1023   668 39 62  0  0
 6  1    348   5096 111396 187196    0    0    10     0 1002   701 25 76  0  0
 6  1    348   4328 111436 187692    0    0    16  3778 1207   755 24 76  0  0
 6  1    348   6120 110460 186728    0    0    14     0 1201   841 30 70  0  0
 7  2    348   5608 110548 187108    0    0     6    64 1083   753 23 77  0  0
 6  1    348   4960 110548 187600    0    0    14    74 1105   783 24 77  0  0
 6  1    348   4448 110548 187988    0    0     8     0 1122   700 25 75  0  0
 6  1    348   5224 109732 187940    0    0     6   142 1066   813 42 59  0  0
 6  1    348   4648 109740 188380    0    0    10     0 1003   682 25 76  0  0
 8  1    348   4264 109740 188724    0    0     8     0 1110   740 27 73  0  0
 6  1    348   6184 109000 187380    0    0    18   164 1026   727 23 78  0  0
 7  1    348   5800 109000 187684    0    0     8     0 1002   694 25 76  0  0
 6  1    348   5152 109056 188048    0    0    14   126 1022   743 25 75  0  0
 7  1    348   4768 109060 188340    0    0     6     0 1002   699 24 76  0  0
 6  0    348   4384 109060 188612    0    0     6     0 1002   681 26 74  0  0
 8  1    348   5160 109032 187768    0    0     8   118 1018   709 23 78  0  0
 7  1    348   4840 109032 188024    0    0    12     0 1004   655 23 78  0  0
 6  1    348   5800 109076 186864    0    0     2  3246 1244   808 46 54  0  0
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 7  0    348   6184 109084 185740    0    0   304     0 1027   717 32 68  0  0
 6  1    348   6440 109084 185988    0    0     4     0 1001   676 34 66  0  0
 6  1    348   6112 109168 186304    0    0    12  4414 1242   813 23 77  0  0
 6  1    348   5664 109172 186636    0    0     6     0 1005   727 24 76  0  0
 6  0    348   5224 109216 186924    0    0     6   108 1173   838 24 76  0  0
 6  1    348   4840 109216 187260    0    0    16     0 1099   686 25 76  0  0
 6  1    348   4328 109220 187644    0    0     6     0 1002   637 24 76  0  0
 7  1   1016   6248 108640 185376    0    0    14   108 1021   778 25 76  0  0
 8  1   1016   5800 108644 185748    0    0     6     0 1002   627 21 79  0  0
 6  1   1016   5344 108696 186012    0    0     8   158 1025   764 44 56  0  0
 6  1   1016   4832 108696 186460    0    0    12     0 1003   735 27 73  0  0
 6  1   1016   4384 108700 186888    0    0     4     0 1002   648 26 75  0  0
 6  0   1016   4968 108152 186612    0    0    12   254 1047   764 25 76  0  0
 6  1   1016   4392 108156 187116    0    0    16     0 1002   718 24 77  0  0
 6  0   1016   7080 108080 184172    0    0     6    92 1014   720 30 71  0  0
 6  1   1016   6760 108092 184584    0    0    12     0 1004   695 24 76  0  0
 6  1   1016   6376 108096 184876    0    0     6     0 1002   675 21 79  0  0
 6  1   1016   5536 108204 185256    0    0    90  4642 1250   838 26 75  0  0
 6  1   1016   5088 108212 185628    0    0    10    36 1006   705 24 76  0  0
 8  2   1016   4776 108244 185836    0    0     6  2900 1138   783 57 43  0  0
 6  1   1016   5544 108316 184704    0    0   228  3294 1260   874 37 62  0  1
 6  1   1016   5096 108316 185088    0    0     6     0 1008   658 24 76  0  0
 6  1   1016   4192 108448 185424    0    0    18   276 1047   694 23 77  0  0
 7  0   1016   6432 108080 183236    0    0    68     0 1057   742 26 74  0  0
 6  1   1016   5848 108220 183744    0    0   126   236 1043   732 26 75  0  0
 6  1   1016   5400 108220 184072    0    0     8     0 1056   698 24 76  0  0
 7  0   1016   4824 108220 184448    0    0    16     0 1002   662 24 76  0  0
 7  1   1016   4384 108280 184796    0    0    12   118 1019   721 25 76  0  0
 6  1   1016   5728 108272 183268    0    0     4     0 1056   662 25 75  0  0
 9  2   1016   4448 107924 183288    0    0   164   304 1062   796 28 72  0  0
 6  2   1016   7512 106888 182268    0    0     8    32 1017   866 47 54  0  0
 5  1   1016   5720 106892 183048    0    0    14     0 1045   700 43 57  0  0
 2  1   1016   5776 105212 182628    0    0    24   386 1058   741 45 56  0  0
 2  1   1016   5464 105216 182828    0    0    38     0 1061   753 20 80  0  0
 3  2   1016   6112 105276 181404    0    0   234  1848 1114   774 32 68  0  0
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 3  1   1016   5416 105280 181852    0    0   150  3654 1292   848 24 76  0  0
 2  1   1016   5040 105284 182112    0    0    36     0 1090   726 23 78  0  0
 2  2   1016   6128 105344 180228    0    0    52  3724 1262   859 21 79  0  0
 2  2   1016   5360 105344 180500    0    0    40  2782 1231   758 18 82  0  0
 4  1   1016   4328 105424 180888    0    0    62  1018 1144   724 21 79  0  0
 3  0   1016   5160 105408 180100    0    0    48     0 1087   849 38 62  0  0
 2  1   1016   4776 105448 180388    0    0    36     0 1234   781 18 82  0  0
 2  1   1016   4272 105596 180676    0    0    30   122 1025   706 17 83  0  0
 0  2   1016   4656 105644 179832    0    0   104  1136 1077   761 24 70  0  6
 0  2   1016   5616 105164 174620    0    0   422  3392 1394   933 43 12  0 44
 0  2   1016   9200 105868 175916    0    0   532  1096 1152   852 50 26  0 23
 3  1   1016   5496 104644 177692    0    0  1410     2 1157   936 37 14  0 50
 0  3   1016   5336 103132 177448    0  334   292  3106 1244   784 74 13  0 14
 2  1   1020  11096 100876 168948    0    0   566  1356 1118   752 82 18  0  0
 1  1   1020  18088 100976 168120    0    0   616     0 1082   789 50  7  0 43
 0  1   1020  10856 101660 169780    0    0   562   666 1150   841 59  8  0 33
 0  1   1020   5040 101692 169428    0    2   568  1724 1112   727 43  6  0 50
 0  1   1020   6024 101080 163120    0    0   588  1368 1180   779 48  9  0 44
 0  1   1020   4360  97712 162408    0    0   568   472 1131   787 42  7  0 51
 2  0   1020   4800  91872 161560    0    0   596     8 1090   784 46  7  0 47
 1  0   1512   4608  87900 160428    0  246   548   686 1129   785 42  8  0 51
 2  1   1512   4736  83968 157512    0    0   640     0 1093   807 45  8  0 48
 1  1   1512   5320  76640 157896    0    0   604     0 1088   780 47  7  0 47
 0  1   1512   5128  71820 157204    0    0   568   444 1127   766 40  7  0 53
 1  1   1528   4808  65792 157160    0    8   600     8 1085   798 48  8  0 45
 2  0   1536   4616  63268 157108    0    4   892   464 1136   810 76  6  0 19
 1  0   2472   4488  62680 158428    0  452   890   744 1075   794 89  5  0  6
 3  0   2916   4416  61812 159912   12  222  1148   222 1056   805 81  5  0 15
 0  1   3048   5056  60108 159328    0   66   990   228 1122   858 46  5  0 50
 0  1   3328   4744  55496 159560    0  140   584   140 1095   863 39  6  0 55
 1  0   3704   4428  52604 158456    0  188   568   572 1126   801 36  6  0 58
 1  0   4800   5396  51944 154448    0  548   556   554 1088   851 43  7  0 51
 1  0   4948   5668  49528 151096   48   74   674    74 1091   793 45  8  0 48
 2  0   5896   5648  49392 146584    0  474   598   794 1132   815 38  6  0 56
 0  1   6748   6032  49364 142004   16  426   592   436 1085   765 47  8  0 46
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 2  0   7720   5376  48920 139236    0  486   554   800 1126   745 40  8  0 53
 2  0   7720   4928  46020 137268    0    0   596     0 1095   774 65  7  0 28
 0  1  12396   5440  45420 135712   16 2338   576  2628 1152   804 45 14  0 42
 0  1  15264   5184  45556 134552    0 1434   456  1806 1130   759 36  7  0 58
 1  0  17432   4864  43640 134168    0 1084   584  1084 1099   739 43  8  0 48
 8  0  22028   4928  42256 133512    0 2298   528  2592 1231   810 39  9  0 51
 0  1  24148   5940  40412 133016    0  982   524   982 1142   771 39  8  0 53
 0  1  25916   4936  37448 133184   16  884   594   884 1100   740 44  9  0 48
 1  1  28856   4892  36868 132172    0 1470   490  1766 1122   729 39  7  0 54
 3  1  30236   4292  33800 130832  144  690   836   690 1116   812 46  9  0 45
 0  0  32176   5408  33792 131384   32  970   690  1696 1180  1220 43  7 14 36
 0  0  32176   5408  33792 131396    0    0     2     0 1001   553  4  1 94  0
 0  0  32176   4896  33796 132032   16    0    46     0 1141   928 29  3 66  1
 1  0  32176   4864  33904 132036    0    0     6    90 1017   532  4  1 93  2

 55091 default_idle                             1377.2750
 62640 __copy_from_user_ll                      1204.6154
 33595 __copy_to_user_ll                        646.0577
   432 system_call                                9.0000
   100 ide_outb                                   8.3333
   488 current_kernel_time                        8.1333
   167 block_commit_write                         5.2188
   119 delay_tsc                                  4.2500
    38 syscall_call                               3.4545
    81 get_offset_tsc                             3.3750
   203 fget                                       3.1719
   349 radix_tree_lookup                          2.8145
  1549 do_anonymous_page                          2.7080
    32 ide_inb                                    2.6667
   548 reiserfs_copy_from_user_to_file_region     2.6346
   156 mark_page_accessed                         2.6000
    46 fput                                       2.3000
   131 unlock_page                                2.1833
   347 reiserfs_submit_file_region_for_write      2.1688
   302 update_atime                               2.0972
    67 init_journal_hash                          2.0938
   422 find_lock_page                             1.9906
   126 reiserfs_can_fit_pages                     1.7500
    21 user_schedule                              1.7500
   100 kmem_cache_free                            1.6667
   193 unix_poll                                  1.5078
    50 task_vsize                                 1.3889
   105 handle_IRQ_event                           1.3816
    60 reiserfs_claim_blocks_to_be_allocated      1.2500
    91 sys_pread64                                1.1974
   171 __block_commit_write                       1.1875
    56 pathrelse                                  1.1667
    90 sys_pwrite64                               1.1250
    13 ide_outl                                   1.0833
   101 atomic_dec_and_lock                        1.0521
   279 SHATransform                               1.0257

This is on a K6-3 400, 512m debian, kernel built with gcc 2.95-4

Ideas?
Ed Tomlinson



* Re: 2.5.59-mm5
  2003-01-25 17:32             ` 2.5.59-mm5 Ed Tomlinson
@ 2003-01-25 17:41               ` Andrew Morton
  2003-01-25 20:34                 ` 2.5.59-mm5 Ed Tomlinson
  0 siblings, 1 reply; 32+ messages in thread
From: Andrew Morton @ 2003-01-25 17:41 UTC (permalink / raw)
  To: Ed Tomlinson; +Cc: linux-mm

Ed Tomlinson <tomlins@cam.org> wrote:
>
> Hi Andrew,
> 
> I am seeing a strange problem with mm5.  This occurs both with and without
> the anticipatory scheduler changes.  What happens is I see very high system
> times and X responds very very slowly.  I first noticed this when switching
> between folders in kmail and have seen it rebuilding db files for squidguard.
> Here is what happened during the db rebuild (no anticipatory ioscheduler):

Could you please try reverting the reiserfs changes?

http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.59/2.5.59-mm5/broken-out/reiserfs-readpages.patch

and

http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.59/2.5.59-mm5/broken-out/reiserfs_file_write.patch



* Re: 2.5.59-mm5
  2003-01-25 17:41               ` 2.5.59-mm5 Andrew Morton
@ 2003-01-25 20:34                 ` Ed Tomlinson
  2003-01-25 22:33                   ` 2.5.59-mm5 Andrew Morton
  0 siblings, 1 reply; 32+ messages in thread
From: Ed Tomlinson @ 2003-01-25 20:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm

On January 25, 2003 12:41 pm, Andrew Morton wrote:
> Ed Tomlinson <tomlins@cam.org> wrote:
> > Hi Andrew,
> >
> > I am seeing a strange problem with mm5.  This occurs both with and
> > without the anticipatory scheduler changes.  What happens is I see very
> > high system times and X responds very very slowly.  I first noticed this
> > when switching between folders in kmail and have seen it rebuilding db
> > files for squidguard. Here is what happened during the db rebuild (no
> > anticipatory ioscheduler):
>
> Could you please try reverting the reiserfs changes?
>
> http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.59/2.5.59-mm5/broken-out/reiserfs-readpages.patch
>
> and
>
> http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.59/2.5.59-mm5/broken-out/reiserfs_file_write.patch

Reverting reiserfs_file_write.patch seems to cure the interactivity problems.
I still see the high system times but they in themselves are not a problem.
Reverting the second patch does not change the situation.  I am currently
running with reiserfs_file_write.patch removed - so far so good.

Thanks
Ed Tomlinson


* Re: 2.5.59-mm5
  2003-01-25 20:34                 ` 2.5.59-mm5 Ed Tomlinson
@ 2003-01-25 22:33                   ` Andrew Morton
  2003-01-26  1:43                     ` 2.5.59-mm5 Ed Tomlinson
  0 siblings, 1 reply; 32+ messages in thread
From: Andrew Morton @ 2003-01-25 22:33 UTC (permalink / raw)
  To: Ed Tomlinson; +Cc: linux-mm, Oleg Drokin

Ed Tomlinson <tomlins@cam.org> wrote:
>
> On January 25, 2003 12:41 pm, Andrew Morton wrote:
> > Ed Tomlinson <tomlins@cam.org> wrote:
> > > Hi Andrew,
> > >
> > > I am seeing a strange problem with mm5.  This occurs both with and
> > > without the anticipatory scheduler changes.  What happens is I see very
> > > high system times and X responds very very slowly.  I first noticed this
> > > when switching between folders in kmail and have seen it rebuilding db
> > > files for squidguard. Here is what happened during the db rebuild (no
> > > anticipatory ioscheduler):
> >
> > Could you please try reverting the reiserfs changes?
> >
> > > http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.59/2.5.59-mm5/broken-out/reiserfs-readpages.patch
> > >
> > > and
> > >
> > > http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.59/2.5.59-mm5/broken-out/reiserfs_file_write.patch
> 
> Reverting reiserfs_file_write.patch seems to cure the interactivity problems.
> I still see the high system times but they in themselves are not a problem.
> Reverting the second patch does not change the situation.  I am currently
> running with reiserfs_file_write.patch removed - so far so good.
> 

Well, high system time _is_ a problem, isn't it?  Do you always see that?

Or perhaps userspace monitoring tools are confusing I/O wait with CPU
busyness. Does a revert of

http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.59/2.5.59-mm5/broken-out/buffer-io-accounting.patch

make the numbers look different?  If so, then it's a procps bug...

WRT the excessive copy_foo_user() times: I shall forward your initial email
to Oleg, thanks.
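
To check whether procps is mislabelling I/O wait as system time, one
option is to read /proc/stat directly and compare.  The following is
only a sketch and was not part of the original mail.  It assumes the
aggregate "cpu" line lists user, nice, system, idle and (on kernels
that export it) iowait, in that order; the one-second sampling interval
and the helper names are illustrative choices.

```python
import os
import time

# Leading fields of the aggregate "cpu" line in /proc/stat, in order.
# Older kernels may stop after "idle"; the code tolerates that.
FIELDS = ("user", "nice", "system", "idle", "iowait")


def parse_cpu_line(stat_text):
    """Return {field: jiffies} for the aggregate 'cpu' line."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:1 + len(FIELDS)]]
            return dict(zip(FIELDS, values))
    raise ValueError("no aggregate 'cpu' line found")


def percentages(before, after):
    """Percentage of elapsed jiffies spent in each state between samples."""
    deltas = {k: after[k] - before[k]
              for k in FIELDS if k in before and k in after}
    total = sum(deltas.values())
    if total == 0:
        return {k: 0.0 for k in deltas}
    return {k: 100.0 * v / total for k, v in deltas.items()}


if __name__ == "__main__" and os.path.exists("/proc/stat"):
    with open("/proc/stat") as f:
        first = parse_cpu_line(f.read())
    time.sleep(1.0)  # illustrative sampling interval
    with open("/proc/stat") as f:
        second = parse_cpu_line(f.read())
    for name, pct in percentages(first, second).items():
        print(f"{name:>7}: {pct:5.1f}%")
```

If the "system" share computed this way stays high while "iowait"
stays near zero, the numbers are probably not a userspace accounting
artifact.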



* Re: 2.5.59-mm5
  2003-01-25 22:33                   ` 2.5.59-mm5 Andrew Morton
@ 2003-01-26  1:43                     ` Ed Tomlinson
  2003-01-26  2:17                       ` 2.5.59-mm5 Andrew Morton
  0 siblings, 1 reply; 32+ messages in thread
From: Ed Tomlinson @ 2003-01-26  1:43 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, Oleg Drokin

On January 25, 2003 05:33 pm, Andrew Morton wrote:
> Ed Tomlinson <tomlins@cam.org> wrote:
> > On January 25, 2003 12:41 pm, Andrew Morton wrote:
> > > Ed Tomlinson <tomlins@cam.org> wrote:
> > > > Hi Andrew,
> > > >
> > > > I am seeing a strange problem with mm5.  This occurs both with and
> > > > without the anticipatory scheduler changes.  What happens is I see
> > > > very high system times and X responds very very slowly.  I first
> > > > noticed this when switching between folders in kmail and have seen it
> > > > rebuilding db files for squidguard. Here is what happened during the
> > > > db rebuild (no anticipatory ioscheduler):
> > >
> > > Could you please try reverting the reiserfs changes?
> > >
> > > http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.59/2.5.59-mm5/broken-out/reiserfs-readpages.patch
> > >
> > > and
> > >
> > > http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.59/2.5.59-mm5/broken-out/reiserfs_file_write.patch
> >
> > Reverting reiserfs_file_write.patch seems to cure the interactivity
> > problems. I still see the high system times but they in themselves are
> > not a problem. Reverting the second patch does not change the situation. 
> > I am currently running with reiserfs_file_write.patch removed - so far so
> > good.
>
> Well, high system time _is_ a problem, isn't it?  Do you always see that?
>
> Or perhaps userspace monitoring tools are confusing I/O wait with CPU
> busyness. Does a revert of
>
> http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.59/2.5.59-mm5/broken-out/buffer-io-accounting.patch
>
> make the numbers look different?  If so, then it's a procps bug...
>
> WRT the excessive copy_foo_user() times: I shall forward your initial email
> to Oleg, thanks.

The excessive copy_foo_user times are still there with Oleg's (and Chris's)
patch removed.  Here is what I see when doing:

"apt-get install --reinstall squidguard chastity-list"

(with file_write from my first message)
 55091 default_idle                             1377.2750
 62640 __copy_from_user_ll                      1204.6154
 33595 __copy_to_user_ll                        646.0577

(without file_write)
 40259 __copy_from_user_ll                      774.2115
 18735 default_idle                             468.3750
 21524 __copy_to_user_ll                        413.9231 
   386 system_call                                8.0417
   428 current_kernel_time                        7.1333
   988 established_get_next                       6.8611
    60 ide_outb                                   5.0000
   509 reiserfs_prepare_write                     4.2417
   100 get_offset_tsc                             4.1667
    38 syscall_call                               3.4545
   159 fget                                       2.4844
   279 radix_tree_lookup                          2.2500
    61 init_journal_hash                          1.9062
    68 task_vsize                                 1.8889
   105 mark_page_accessed                         1.7500
   366 find_lock_page                             1.7264
    48 delay_tsc                                  1.7143
    89 block_prepare_write                        1.7115
   237 update_atime                               1.6458
    32 fput                                       1.6000
    90 unlock_page                                1.5000
   210 inode_update_time                          1.3816
   108 sys_pwrite64                               1.3500
    16 ide_inb                                    1.3333
    78 mark_buffer_dirty                          1.3000
   192 reiserfs_wait_on_write_block               1.2632
    93 handle_IRQ_event                           1.2237
    76 fault_in_pages_readable                    1.1875
     4 reiserfs_check_lock_depth                  1.0000

So removing file_write seems to have reduced the copy_foo_user() issue but
has not removed it.
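
A rough way to put a number on that reduction is to compare the tick
counts from the two readprofile excerpts above.  This is only a sketch:
the two runs were not necessarily the same length, so the percentages
are indicative at best, and the helper names are made up for
illustration.

```python
# Tick counts copied from the two readprofile excerpts in this message.
WITH_FILE_WRITE = """\
 55091 default_idle                             1377.2750
 62640 __copy_from_user_ll                      1204.6154
 33595 __copy_to_user_ll                        646.0577
"""

WITHOUT_FILE_WRITE = """\
 40259 __copy_from_user_ll                      774.2115
 18735 default_idle                             468.3750
 21524 __copy_to_user_ll                        413.9231
"""

COPY_SYMBOLS = ("__copy_from_user_ll", "__copy_to_user_ll")


def parse_profile(text):
    """Map symbol name -> tick count for 'ticks symbol load' lines."""
    ticks = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0].isdigit():
            ticks[parts[1]] = int(parts[0])
    return ticks


def relative_drop(before, after, symbols):
    """Fraction by which each symbol's tick count fell between profiles."""
    return {s: (before[s] - after[s]) / before[s] for s in symbols}


drops = relative_drop(parse_profile(WITH_FILE_WRITE),
                      parse_profile(WITHOUT_FILE_WRITE),
                      COPY_SYMBOLS)
for symbol, drop in drops.items():
    print(f"{symbol}: {drop:.1%} fewer ticks")
```

Both copy routines drop by roughly a third, which fits "reduced but not
removed".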

Using a vmstat hacked to show iowait with the above running...

oscar% vmstat -a 5

   procs             memory (mB)      swap          io     system         cpu
 r  b  w  swpd  free inact   act   si   so    bi    bo   in    cs us sy io id
 3  0  0    42     6    13   434    0    3    36    69 1061    61 25  3  1 71
 5  0  0    42     4    15   434    0    0  1189   893 1184 18253 28 11 10 51
 4  0  0    42     5     8   440    0   66   353   274 1070  7874 74  7 10  9
 6  0  0    42     6     9   438    0    0   468   343 1081  2936 93  7  0  0
 5  0  0    46     4     5   444    0  714  1453   976 1147  8891 87 13  0  0
 4  0  0    51     5     1   447    0 1086   626  1877 1279 23445 57 43  0  0
 4  1  1    52     4     3   446    0  290   615  1206 1219 22018 68 32  0  0
 6  0  0    53     8    10   434    0   82   690  1020 1141 14962 59 41  0  0
10  0  0    53    36    14   403    0    0     2   599 1206  1988 85 15  0  0
 5  0  0    53    27     9   417    0    0    35    94 1072  1269 94  6  0  0
 5  0  0    53    31    11   411    0    0   188   761 1089  2401 88 12  0  0
 8  0  0    53    26    11   416    0    0     1   298 1052  9013 42 28  3 27
 7  0  0    53    25    11   417    0    0     0    22 1021   574 38 62  0  0
10  0  0    53    24    11   418    0    0     0    34 1014   546 53 47  0  0
11  0  0    53    23    11   419    0    0     0  1814 1142   634 43 57  0  0
 9  0  0    53    22    11   421    0    0     2    39 1019   556 40 60  0  0
13  0  0    53    20    10   423    0    0     0    32 1031  1183 51 47  0  2
 9  0  0    53    18    10   425    0    0     0  1946 1083   560 36 64  0  0
 9  0  0    53    17    10   426    0    0     0    28 1016   575 38 62  0  0
10  0  0    53    16    10   427    0    0     0    47 1022   560 52 48  0  0
 9  0  0    53    15    10   428    0    0     0    36 1015   540 28 72  0  0
 9  0  0    53    14    10   429    0    0     0    27 1023   603 48 52  0  0
 8  0  0    53    13    10   430    0    0     0    36 1019   536 48 52  0  0
 9  0  0    53    12    10   431    0    0     0   367 1029   539 36 64  0  0
11  0  0    53    11    10   432    0    0     0  1785 1112   587 32 68  0  0
10  0  0    53    11    10   433    0    0     0    58 1030   610 75 25  0  0
10  0  0    53    10    10   433    0    0     0    38 1037   599 67 33  0  0
12  0  0    53    10    10   434    0    0     0    34 1056   679 81 19  0  0
14  0  0    53    10    10   434   26    0    26    44 1059   647 42 58  0  0
13  0  0    53     9    10   435    0    0     0    45 1050   686 56 44  0  0
10  0  0    53     9    10   435    0    0     0   585 1083   678 59 41  0  0
   procs             memory (mB)      swap          io     system         cpu
 r  b  w  swpd  free inact   act   si   so    bi    bo   in    cs us sy io id
 9  0  1    53     8    10   435    0    0     0  2518 1200   727 48 52  0  0
10  0  0    53     8    10   436    0    0     0    43 1065   660 38 62  0  0
11  0  0    53     7    10   437    0    0     0    39 1044   661 29 71  0  0
 9  0  0    53     6     9   438    0    0     0   196 1063   676 44 56  0  0
 9  0  0    53     5    10   438    0    0     0   732 1169   681 27 73  0  0
 6  4  0    53     4    10   440    0    0     0   633 1121  1987 52 48  0  0
10  0  0    53    10    12   431    0    0     2  3294 1203  8145 54 46  0  0
11  0  0    53    24    17   412    0    0     0   806 1133   686 60 40  0  0

Unless it's an accounting error, it's not iowait (confirmed on a non-busy
system too).  There is no change with or without the io_schedule() changed
back to schedule().

Ed Tomlinson










* Re: 2.5.59-mm5
  2003-01-26  1:43                     ` 2.5.59-mm5 Ed Tomlinson
@ 2003-01-26  2:17                       ` Andrew Morton
  2003-01-26  3:51                         ` 2.5.59-mm5 Ed Tomlinson
  0 siblings, 1 reply; 32+ messages in thread
From: Andrew Morton @ 2003-01-26  2:17 UTC (permalink / raw)
  To: Ed Tomlinson; +Cc: linux-mm, green

Ed Tomlinson <tomlins@cam.org> wrote:
>
> The excessive copy_foo_user times are still there with Oleg's (and Chris's)
> patch removed.  Here is what I see when doing:
> 
> "apt-get install --reinstall squidguard chastity-list"
> 
> (with file_write from my first message)
>  55091 default_idle                             1377.2750
>  62640 __copy_from_user_ll                      1204.6154
>  33595 __copy_to_user_ll                        646.0577
> 
> (without file_write)
>  40259 __copy_from_user_ll                      774.2115
>  18735 default_idle                             468.3750
>  21524 __copy_to_user_ll                        413.9231 
>    386 system_call                                8.0417
>    428 current_kernel_time                        7.1333

Is this different from 2.5.59 base?

It's beginning to look like copy_foo_user() itself has gone silly.

I don't know what's causing this, Ed.  Could you please dig into it a little
more?  Does it happen with a bare `dd'?  Or is it networking?  etcetera...

* Re: 2.5.59-mm5
  2003-01-26  2:17                       ` 2.5.59-mm5 Andrew Morton
@ 2003-01-26  3:51                         ` Ed Tomlinson
  2003-01-26  4:04                           ` 2.5.59-mm5 Andrew Morton
  0 siblings, 1 reply; 32+ messages in thread
From: Ed Tomlinson @ 2003-01-26  3:51 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, green

On January 25, 2003 09:17 pm, Andrew Morton wrote:
> Is this different from 2.5.59 base?

Same in 59 and as far back as 51(ish) which is the oldest that I 
have prebuilt here...

> It's beginning to look like copy_foo_user() itself has gone silly.
>
> I don't know what's causing this, Ed.  Could you please dig into it a
> little more?  Does it happen with a bare `dd'?  Or is it networking?
>  etcetera...

What I see is this.

apt installs squidguard

squidguard starts 5 processes 

apt installs chastity-list

and the squidguard processes proceed to take most of the cpu.  Each
of the squidguard processes takes about 17% of the cpu.  These keep
running after apt finishes and the system time drops when they end...

I started a strace of one of the offending processes and saw lots like:

pread(6, "\0\0\0\0\1\0\0\0\325\0\0\0\243\0\0\0\267\0\0\0t\1@\16\1"..., 8192, 1744896) = 8192
pwrite(6, "\0\0\0\0\1\0\0\0\267\0\0\0\325\0\0\0\320\0\0\0000\2\270"..., 8192, 1499136) = 8192
pread(6, "\0\0\0\0\1\0\0\0\305\0\0\0\330\0\0\0\332\0\0\0d\1\f\16"..., 8192, 1613824) = 8192
pwrite(6, "\0\0\0\0\1\0\0\0\273\0\0\0\323\0\0\0\327\0\0\0n\1\210\r"..., 8192, 1531904) = 8192
pread(6, "\0\0\0\0\1\0\0\0\330\0\0\0\10\0\0\0\305\0\0\0j\1`\r\1\5"..., 8192, 1769472) = 8192
pwrite(6, "\0\0\0\0\1\0\0\0\342\0\0\0\303\0\0\0\262\0\0\0.\1\314\20"..., 8192, 1851392) = 8192
pread(6, "\0\0\0\0\1\0\0\0\346\0\0\0\310\0\0\0\266\0\0\0X\1$\20\1"..., 8192, 1884160) = 8192
pwrite(6, "\0\0\0\0\1\0\0\0\6\0\0\0\317\0\0\0\315\0\0\0\34\2d\4\1"..., 8192, 49152) = 8192
pread(6, "\0\0\0\0\1\0\0\0\5\0\0\0\363\0\0\0\362\0\0\0$\1\224\21"..., 8192, 40960) = 8192
pwrite(6, "\0\0\0\0\1\0\0\0\10\0\0\0\341\0\0\0\330\0\0\0\220\1\230"..., 8192, 65536) = 8192
pread(6, "\0\0\0\0\1\0\0\0\331\0\0\0\277\0\0\0\250\0\0\0b\1l\r\1"..., 8192, 1777664) = 8192
pwrite(6, "\0\0\0\0\1\0\0\0\350\0\0\0\37\0\0\0\303\0\0\0H\1`\20\1"..., 8192, 1900544) = 8192
pread(6, "\0\0\0\0\1\0\0\0\267\0\0\0\325\0\0\0\320\0\0\0000\2\270"..., 8192, 1499136) = 8192
pwrite(6, "\0\0\0\0\1\0\0\0\310\0\0\0\362\0\0\0\346\0\0\0B\1|\20\1"..., 8192, 1638400) = 8192
pread(6, "\0\0\0\0\1\0\0\0\302\0\0\0\326\0\0\0\335\0\0\0l\1\354\16"..., 8192, 1589248) = 8192
pwrite(6, "\0\0\0\0\1\0\0\0\313\0\0\0\356\0\0\0\270\0\0\0\26\2\254"..., 8192, 1662976) = 8192
pread(6, "\0\0\0\0\1\0\0\0\307\0\0\0\354\0\0\0\347\0\0\0N\1l\17\1"..., 8192, 1630208) = 8192
pwrite(6, "\0\0\0\0\1\0\0\0\314\0\0\0\361\0\0\0\265\0\0\0\24\2d\4"..., 8192, 1671168) = 8192
pread(6, "\0\0\0\0\1\0\0\0\10\0\0\0\341\0\0\0\330\0\0\0\220\1\230"..., 8192, 65536) = 8192
pwrite(6, "\0\0\0\0\1\0\0\0\321\0\0\0\266\0\0\0\243\0\0\0p\1\320\r"..., 8192, 1712128) = 8192
pread(6, "\0\0\0\0\1\0\0\0\336\0\0\0 \0\0\0\300\0\0\0>\0010\17\1"..., 8192, 1818624) = 8192
pwrite(6, "\0\0\0\0\1\0\0\0\322\0\0\0\272\0\0\0\244\0\0\0\274\1`\t"..., 8192, 1720320) = 8192
pread(6, "\0\0\0\0\1\0\0\0\4\0\0\0\344\0\0\0\361\0\0\0(\1\240\21"..., 8192, 32768) = 8192
pwrite(6, "\0\0\0\0\1\0\0\0\262\0\0\0\342\0\0\0\316\0\0\0\350\1\340"..., 8192, 1458176) = 8192
pread(6, "\0\0\0\0\1\0\0\0\324\0\0\0\274\0\0\0!\0\0\0\250\1\220\f"..., 8192, 1736704) = 8192
pwrite(6, "\0\0\0\0\1\0\0\0!\0\0\0\324\0\0\0\356\0\0\0008\1p\21\1"..., 8192, 270336) = 8192
pread(6, "\0\0\0\0\1\0\0\0\310\0\0\0\362\0\0\0\346\0\0\0B\1|\20\1"..., 8192, 1638400) = 8192
pwrite(6, "\0\0\0\0\1\0\0\0\271\0\0\0\347\0\0\0\326\0\0\0\202\1\334"..., 8192, 1515520) = 8192
pread(6, "\0\0\0\0\1\0\0\0\266\0\0\0\346\0\0\0\321\0\0\0t\1\354\v"..., 8192, 1490944) = 8192
pwrite(6, "\0\0\0\0\1\0\0\0\3\0\0\0\351\0\0\0\357\0\0\0\"\1\270\20"..., 8192, 24576) = 8192

Does this help?

Ed

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

* Re: 2.5.59-mm5
  2003-01-26  3:51                         ` 2.5.59-mm5 Ed Tomlinson
@ 2003-01-26  4:04                           ` Andrew Morton
  0 siblings, 0 replies; 32+ messages in thread
From: Andrew Morton @ 2003-01-26  4:04 UTC (permalink / raw)
  To: Ed Tomlinson; +Cc: linux-mm, green

Ed Tomlinson <tomlins@cam.org> wrote:
>
> and the squidguard processes proceed to take most of the cpu.  Each 
> of the squidguard processes takes about 17% of the cpu.  These keep 
> running after apt finishes and the system time drops when they end...
> 
> ...
>
> Does this help?

Not a lot.  Looks like squidguard has gone berserk reading lots of stuff from
pagecache.  Could be that it has a bug which is triggered by subtly altered
kernel behaviour, or a subtle bug in the kernel broke it.

Do any other applications exhibit the same behaviour?

Can you come up with a simple, standalone use of squidguard which exhibits
this behaviour?  Does just starting the processes up trigger it?

You may need to build your own squidguard and attach gdb to one of the
processes to see what it's up to.
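
As a rough aid for reading a trace like the pread/pwrite excerpt posted
earlier in this thread, the following is a minimal sketch that tallies calls
per file offset and lists offsets that are read after having been written.
The function names (summarize, reread_after_write) and the regular expression
are illustrative assumptions, not part of strace or squidguard, and the
pattern only handles the pread/pwrite line shape seen in that excerpt.

```python
import re
from collections import Counter

# Matches strace lines shaped like:
#   pread(6, "..."..., 8192, 1769472) = 8192
# Groups: call name, fd, requested size, offset, return value.
CALL_RE = re.compile(
    r'^(pread|pwrite)\((\d+), .*, (\d+), (\d+)\)\s+=\s+(-?\d+)'
)


def parse(lines):
    """Yield (call, fd, offset) for each pread/pwrite line; skip the rest."""
    for line in lines:
        m = CALL_RE.match(line.strip())
        if m:
            call, fd, _size, offset, _ret = m.groups()
            yield call, int(fd), int(offset)


def summarize(lines):
    """Count how often each (call, fd, offset) combination appears."""
    return Counter(parse(lines))


def reread_after_write(lines):
    """Return (fd, offset) pairs read after an earlier write to the same spot."""
    written = set()
    hits = []
    for call, fd, offset in parse(lines):
        if call == "pwrite":
            written.add((fd, offset))
        elif (fd, offset) in written:
            hits.append((fd, offset))
    return hits
```

Both functions accept any iterable of lines, so an open file holding saved
strace output can be passed in directly.  Offsets that show up repeatedly, or
that are read back soon after being written, would be a hint (not proof) that
the process is cycling over the same blocks in pagecache.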

Thread overview: 32+ messages
2003-01-24  3:50 2.5.59-mm5 Andrew Morton
2003-01-24 11:03 ` 2.5.59-mm5 Alex Bligh - linux-kernel
2003-01-24 11:16   ` 2.5.59-mm5 Andrew Morton
2003-01-24 11:23     ` 2.5.59-mm5 Alex Tomas
2003-01-24 11:50       ` 2.5.59-mm5 Andrew Morton
2003-01-24 12:05         ` 2.5.59-mm5 Alex Tomas
2003-01-24 19:12           ` 2.5.59-mm5 Andrew Morton
2003-01-24 19:58             ` 2.5.59-mm5 Alex Tomas
2003-01-25 17:32             ` 2.5.59-mm5 Ed Tomlinson
2003-01-25 17:41               ` 2.5.59-mm5 Andrew Morton
2003-01-25 20:34                 ` 2.5.59-mm5 Ed Tomlinson
2003-01-25 22:33                   ` 2.5.59-mm5 Andrew Morton
2003-01-26  1:43                     ` 2.5.59-mm5 Ed Tomlinson
2003-01-26  2:17                       ` 2.5.59-mm5 Andrew Morton
2003-01-26  3:51                         ` 2.5.59-mm5 Ed Tomlinson
2003-01-26  4:04                           ` 2.5.59-mm5 Andrew Morton
2003-01-24 15:56         ` 2.5.59-mm5 Oliver Xymoron
2003-01-24 16:04           ` 2.5.59-mm5 Nick Piggin
2003-01-24 17:09             ` 2.5.59-mm5 Giuliano Pochini
2003-01-24 17:22               ` 2.5.59-mm5 Nick Piggin
2003-01-24 19:34                 ` 2.5.59-mm5 Valdis.Kletnieks
2003-01-24 20:04                   ` 2.5.59-mm5 Jens Axboe
2003-01-24 22:02                     ` 2.5.59-mm5 Valdis.Kletnieks
2003-01-25 12:28                       ` 2.5.59-mm5 Jens Axboe
2003-01-24 12:14     ` 2.5.59-mm5 Nikita Danilov
2003-01-24 16:00       ` 2.5.59-mm5 Nick Piggin
2003-01-24 11:23   ` 2.5.59-mm5 Jens Axboe
2003-01-24 13:59 ` 2.5.59-mm5 got stuck during boot Helge Hafting
2003-01-24 17:44   ` Ed Tomlinson
2003-01-24 17:56     ` Nick Piggin
2003-01-24 19:18       ` Ed Tomlinson
2003-01-25  8:33 ` 2.5.59-mm5 Andres Salomon
