linux-mm.kvack.org archive mirror
* 2.5.40-mm1
@ 2002-10-01  9:32 Andrew Morton
  2002-10-08  6:46 ` 2.5.40-mm1 Andrew Morton
  0 siblings, 1 reply; 4+ messages in thread
From: Andrew Morton @ 2002-10-01  9:32 UTC (permalink / raw)
  To: lkml, linux-mm

url: http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.40/2.5.40-mm1/

Mainly a resync.

- A few minor problems in the per-cpu-pages code have been fixed.

- Updated dcache RCU code.

- Significant brain surgery on the SARD patch.

- Decreased the disk scheduling tunable `fifo_batch' from 32 to 16 to
  improve disk read latency.

- Updated ext3 htree patch from Ted.

- Included a patch from Mala Anand which _should_ speed up kernel<->userspace
  memory copies for Intel ia32 hardware.  But I can't measure any difference
  with poorly-aligned pagecache copies.


-scsi_hack.patch
-might_sleep-2.patch
-slab-fix.patch
-hugetlb-doc.patch
-get_user_pages-PG_reserved.patch
-move_one_page_fix.patch
-zab-list_heads.patch
-remove-gfp_nfs.patch
-buddyinfo.patch
-free_area.patch
-per-node-kswapd.patch
-topology-api.patch
-topology_fixes.patch

 Merged

+misc.patch

 Trivia

+ioperm-fix.patch

 Fix the sys_ioperm() might-sleep-while-atomic bug

-sard.patch
+bd-sard.patch

 Somewhat rewritten to not key everything off minors and majors - use
 pointers instead.

+bio-get-nr-vecs.patch

 use bio_get_nr_vecs in fs/mpage.c

+dio-nr-segs.patch

 use bio_get_nr_vecs in fs/direct-io.c

-per-node-zone_normal.patch
+per-node-mem_map.patch

 Renamed

+free_area_init-cleanup.patch

 Clean up some mm init code.

+intel-user-copy.patch

 Supposedly faster copy_*_user.



ext3-dxdir.patch
  ext3 htree

spin-lock-check.patch
  spinlock/rwlock checking infrastructure

rd-cleanup.patch
  Cleanup and fix the ramdisk driver (doesn't work right yet)

misc.patch
  misc

write-deadlock.patch
  Fix the generic_file_write-from-same-mmapped-page deadlock

ioperm-fix.patch
  sys_ioperm() atomicity fix

radix_tree_gang_lookup.patch
  radix tree gang lookup

truncate_inode_pages.patch
  truncate/invalidate_inode_pages rewrite

proc_vmstat.patch
  Move the vm accounting out of /proc/stat

kswapd-reclaim-stats.patch
  Add kswapd_steal to /proc/vmstat

iowait.patch
  I/O wait statistics

bd-sard.patch
  SARD disk accounting (reworked to use pointers rather than majors/minors)

dio-bio-add-page.patch
  Use bio_add_page() in direct-io.c

tcp-wakeups.patch
  Use fast wakeups in TCP/IPV4

swapoff-deadlock.patch
  Fix a tmpfs swapoff deadlock

dirty-and-uptodate.patch
  page state cleanup

shmem_rename.patch
  shmem_rename() directory link count fix

dirent-size.patch
  tmpfs: show a non-zero size for directories

tmpfs-trivia.patch
  tmpfs: small fixlets

per-zone-vm.patch
  separate the kswapd and direct reclaim code paths

swsusp-feature.patch
  add shrink_all_memory() for swsusp

bio-get-nr-vecs.patch
  use bio_get_nr_vecs() in fs/mpage.c

dio-nr-segs.patch
  Use bio_get_nr_vecs() in direct-io.c

remove-page-virtual.patch
  remove page->virtual for !WANT_PAGE_VIRTUAL

dirty-memory-clamp.patch
  sterner dirty-memory clamping

mempool-wakeup-fix.patch
  Fix for stuck tasks in mempool_alloc()

remove-write_mapping_buffers.patch
  Remove write_mapping_buffers

buffer_boundary-scheduling.patch
  I/O scheduling for indirect blocks

ll_rw_block-cleanup.patch
  cleanup ll_rw_block()

lseek-ext2_readdir.patch
  remove lock_kernel() from ext2_readdir()

discontig-no-contig_page_data.patch
  undefine contig_page_data for discontigmem

per-node-mem_map.patch
  ia32 NUMA: per-node ZONE_NORMAL

alloc_pages_node-cleanup.patch
  alloc_pages_node cleanup

free_area_init-cleanup.patch
  free_area_init_node cleanup

batched-slab-asap.patch
  batched slab shrinking

akpm-deadline.patch
  deadline scheduler tweaks

rmqueue_bulk.patch
  bulk page allocator

free_pages_bulk.patch
  Bulk page freeing function

hot_cold_pages.patch
  Hot/Cold pages and zone->lock amortisation

readahead-cold-pages.patch
  Use cache-cold pages for pagecache reads.

pagevec-hot-cold-hint.patch
  hot/cold hints for truncate and page reclaim

intel-user-copy.patch
  Supposedly faster copy_*_user for ia32

read_barrier_depends.patch
  extended barrier primitives

rcu_ltimer.patch
  RCU core

dcache_rcu.patch
  Use RCU for dcache

* Re: 2.5.40-mm1
  2002-10-01  9:32 2.5.40-mm1 Andrew Morton
@ 2002-10-08  6:46 ` Andrew Morton
  0 siblings, 0 replies; 4+ messages in thread
From: Andrew Morton @ 2002-10-08  6:46 UTC (permalink / raw)
  To: lkml, linux-mm, Mala Anand

Andrew Morton wrote:
> 
> ...
> - Included a patch from Mala Anand which _should_ speed up kernel<->userspace
>   memory copies for Intel ia32 hardware.  But I can't measure any difference
>   with poorly-aligned pagecache copies.
> 

Well Mala, I have to take that back.  I must have forgotten to
turn on my computer or brain or something.   Your patch kicks
butt.

In this test I timed how long it took to read a fully-cached
1 gigabyte file into an 8192-byte userspace buffer.  The alignment
of the user buffer was incremented by one byte between runs.

for i in $(seq 0 32)
do
	time time-read -a $i -b 8192 -h 8192 foo 
done

time-read.c is in http://www.zip.com.au/~akpm/linux/patches/2.5/ext3-tools.tar.gz
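
For anyone without the tarball handy, a rough sketch of this kind of test
(this is _not_ the actual time-read.c, just an illustration of reading a
cached file through a byte-shifted buffer):

/*
 * Sketch only: read a cached file into an 8192-byte user buffer whose
 * start address is shifted by "align" bytes, and report the elapsed time.
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

int main(int argc, char **argv)
{
	int align = (argc > 1) ? atoi(argv[1]) : 0;
	const char *file = (argc > 2) ? argv[2] : "foo";
	size_t bufsize = 8192;
	char *raw, *buf;
	struct timeval t0, t1;
	long long total = 0;
	ssize_t n;
	int fd;

	raw = malloc(bufsize + 64);	/* slack so the start address can move */
	fd = open(file, O_RDONLY);
	if (raw == NULL || fd < 0)
		return 1;
	buf = raw + align;		/* deliberately misalign the buffer */

	gettimeofday(&t0, NULL);
	while ((n = read(fd, buf, bufsize)) > 0)	/* each read() is a kernel->user copy */
		total += n;
	gettimeofday(&t1, NULL);

	printf("align %d: %lld bytes in %.3f seconds\n", align, total,
	       (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6);
	return 0;
}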

The CPU is "Pentium III (Katmai)"

All times are in seconds:


User buffer	2.5.41		2.5.41+		2.5.41+
				patch		patch++

0x804c000	4.373		4.387		6.063
0x804c001	10.024		6.410
0x804c002	10.002		6.411
0x804c003	10.013		6.408
0x804c004	10.105		6.343
0x804c005	10.184		6.394
0x804c006	10.179		6.398
0x804c007	10.185		6.408
0x804c008	9.725		9.724		6.347
0x804c009	9.780		6.436
0x804c00a	9.779		6.421
0x804c00b	9.778		6.433
0x804c00c	9.723		6.402
0x804c00d	9.790		6.382
0x804c00e	9.790		6.381
0x804c00f	9.785		6.380
0x804c010	9.727		9.723		6.277
0x804c011	9.779		6.360
0x804c012	9.783		6.345
0x804c013	9.786		6.341
0x804c014	9.772		6.133
0x804c015	9.919		6.327
0x804c016	9.920		6.319
0x804c017	9.918		6.319
0x804c018	9.846		9.857		6.372
0x804c019	10.060		6.443
0x804c01a	10.049		6.436
0x804c01b	10.041		6.432
0x804c01c	9.931		6.356
0x804c01d	10.013		6.432
0x804c01e	10.020		6.425
0x804c01f	10.016		6.444
0x804c020	4.442		4.423		6.380

So the patch is a 30% win at all alignments except for 32-byte-aligned
destination addresses.

Now, in the patch++ I modified things so we use the copy_user_int()
function for _all_ alignments.  Look at the 0x804c008 alignment.
We sped up the copies by 30% by using copy_user_int() instead of
rep;movsl.
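
For context, the two strategies being discussed look roughly like this
(userspace sketch on the i386 model; the real copy_*_user paths also carry
exception fixups, so treat the details as illustrative only, and
copy_int_regs below is a hypothetical simplification, not the patch's code):

#include <string.h>

/* rep;movsl style: bulk longword moves plus a 0-3 byte tail */
static void copy_rep_movsl(void *to, const void *from, unsigned long n)
{
	int d0, d1, d2;

	__asm__ __volatile__(
		"rep ; movsl\n\t"	/* move n/4 longwords */
		"movl %4, %%ecx\n\t"	/* then the leftover bytes */
		"rep ; movsb"
		: "=&c" (d0), "=&D" (d1), "=&S" (d2)
		: "0" (n / 4), "g" (n & 3), "1" ((long)to), "2" ((long)from)
		: "memory");
}

/* integer-register style: an unrolled loop of plain loads and stores,
 * one 32-byte chunk per iteration */
static void copy_int_regs(void *to, const void *from, unsigned long n)
{
	unsigned long *d = to;
	const unsigned long *s = from;

	while (n >= 32) {
		d[0] = s[0]; d[1] = s[1]; d[2] = s[2]; d[3] = s[3];
		d[4] = s[4]; d[5] = s[5]; d[6] = s[6]; d[7] = s[7];
		d += 8; s += 8; n -= 32;
	}
	memcpy(d, s, n);		/* tail */
}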

This is important, because glibc malloc() returns addresses which
are N+8 aligned.  I would expect that this alignment is common.
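
(Easy to check from userspace; a throwaway sketch, nothing from the patch:)

#include <stdio.h>
#include <stdlib.h>

/* print the low bits of a few malloc() return values to see what
 * alignment the allocator actually hands back */
int main(void)
{
	int i;

	for (i = 0; i < 8; i++) {
		void *p = malloc(1000);
		printf("%p  mod 32 = %lu\n", p, (unsigned long)p % 32);
	}
	return 0;
}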

So.  Patch is a huge win as-is.  For the PIII it looks like we need
to enable it at all alignments except mod32.  And we need to test
with aligned dest, unaligned source.

Can you please do some P4 testing?

* Re: 2.5.40-mm1
  2002-10-09 23:20 2.5.40-mm1 Mala Anand
@ 2002-10-09 23:32 ` Andrew Morton
  0 siblings, 0 replies; 4+ messages in thread
From: Andrew Morton @ 2002-10-09 23:32 UTC (permalink / raw)
  To: Mala Anand; +Cc: lkml, linux-mm, Bill Hartner

Mala Anand wrote:
> 
> ...
> P4 Xeon CPU 1.50 GHz 4-way - hyperthreading disabled
> Src is aligned and dst is misaligned as follows:
> 
>  Dst      2.5.40       2.5.40+patch     2.5.40+patch++
> Align    throughput     throughput      throughput
> (bytes)   KB/sec          KB/sec        KB/sec
>   0       1360071         1314783        912359
>   1       323674           340447
>   2       329202           336425
>   4       512955           693170
>   8       523223           615097        506641
>  12       517184           558701        553700
>  16       966598           872080        932736
>  32       846937           838514        845178

Note the tremendous slowdown which the P4 suffers when you're not
cacheline aligned.  Even 32-byte-aligned is down a lot.

 
> I see too much variance in the test results so I ran
> each test 3 times. I tried increasing the iterations
> but it did not reduce the variance.
> 
> Dst is aligned and src is misaligned as follows:
> 
>  Src      2.5.40       2.5.40+patch
> Align    throughput     throughput
> (bytes)   KB/sec          KB/sec
>   0       1275372       1029815
>   1        529907        511815
>   2        534811        530850
>   4        643196        627013
>   8        568000        626676
>  12        574468        658793
>  16        631707        635979
>  32        741485        592938

This differs a little from my P4 testing - the rep;movsl approach
seemed OK for 8,16,32 alignment.

But still, that's something we can tune later.
 
> 
> However, I have seen that using floating point registers instead of integer
> registers on Pentium IV improves performance to a greater extent at some
> alignments.  I need to do more testing, and then I will create a patch for
> Pentium IV.

I believe there are "issues" with using those registers in-kernel, related
to the need to save/restore them, or to errata; I'm not too sure about that.

* Re: 2.5.40-mm1
@ 2002-10-09 23:20 Mala Anand
  2002-10-09 23:32 ` 2.5.40-mm1 Andrew Morton
  0 siblings, 1 reply; 4+ messages in thread
From: Mala Anand @ 2002-10-09 23:20 UTC (permalink / raw)
  To: akpm, lkml, linux-mm; +Cc: Bill Hartner

>Andrew Morton wrote:

>So.  Patch is a huge win as-is.  For the PIII it looks like we need
>to enable it at all alignments except mod32.  And we need to test
>with aligned dest, unaligned source.

Pentium III (Coppermine) 997MHz 2-way
Read from pagecache to a user buffer, misaligning the source.
The size of each copy is 262144 bytes and the number of iterations
for each test is 16384.
      Patch++ - uses copy_user_int whenever size > 64
      Patch   - uses copy_user_int when size > 64 and src or dst
                is not aligned on an 8-byte boundary

dst aligned on a 4k boundary and src misaligned

          2.5.40       2.5.40+patch     2.5.40+patch++
Align    throughput     throughput      throughput
(bytes)   KB/sec          KB/sec        KB/sec
0         275592          281356        285567
1         124266          197361
2         120157          200270
4         125935          197558
8         157244          156655        162189
16        167296          173202        173702
32        283731          285222        290810

Looks like the patch can be used for all the above tested
alignments on Pentium III.

> Can you please do some P4 testing?

P4 Xeon CPU 1.50 GHz 4-way - hyperthreading disabled
Src is aligned and dst is misaligned as follows:

 Dst      2.5.40       2.5.40+patch     2.5.40+patch++
Align    throughput     throughput      throughput
(bytes)   KB/sec          KB/sec        KB/sec
  0       1360071         1314783        912359
  1       323674           340447
  2       329202           336425
  4       512955           693170
  8       523223           615097        506641
 12       517184           558701        553700
 16       966598           872080        932736
 32       846937           838514        845178

I see too much variance in the test results so I ran
each test 3 times. I tried increasing the iterations
but it did not reduce the variance.

Dst is aligned and src is misaligned as follows:

 Src      2.5.40       2.5.40+patch
Align    throughput     throughput
(bytes)   KB/sec          KB/sec
  0       1275372       1029815
  1        529907        511815
  2        534811        530850
  4        643196        627013
  8        568000        626676
 12        574468        658793
 16        631707        635979
 32        741485        592938

Since there is 5-10% variance in these tests' results, I am not sure
whether we can use this data for validation.  I will try to run this
on another Pentium 4 machine.
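
(One quick way to put a number on that noise, assuming three throughput
samples per cell; the figures below are made up, not taken from the
tables above:)

#include <stdio.h>
#include <math.h>

/* coefficient of variation (stddev/mean) over repeated runs; differences
 * between kernels smaller than a few times this are probably just noise */
static double cov(const double *x, int n)
{
	double mean = 0.0, var = 0.0;
	int i;

	for (i = 0; i < n; i++)
		mean += x[i];
	mean /= n;
	for (i = 0; i < n; i++)
		var += (x[i] - mean) * (x[i] - mean);
	var /= n;
	return sqrt(var) / mean;
}

int main(void)
{
	double runs[3] = { 568000, 601200, 547300 };	/* hypothetical runs of one cell */

	printf("cov = %.1f%%\n", 100.0 * cov(runs, 3));
	return 0;
}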

However, I have seen that using floating point registers instead of integer
registers on Pentium IV improves performance to a greater extent at some
alignments.  I need to do more testing, and then I will create a patch for
Pentium IV.

Regards,
    Mala


   Mala Anand
   IBM Linux Technology Center - Kernel Performance
   E-mail:manand@us.ibm.com
   http://www-124.ibm.com/developerworks/opensource/linuxperf
   http://www-124.ibm.com/developerworks/projects/linuxperf
   Phone:838-8088; Tie-line:678-8088



