* [announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support
@ 2003-07-08 22:45 Ingo Molnar
2003-07-09 1:29 ` William Lee Irwin III
` (5 more replies)
0 siblings, 6 replies; 11+ messages in thread
From: Ingo Molnar @ 2003-07-08 22:45 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-mm
i'm pleased to announce the first public release of the "4GB/4GB VM split"
patch, for the 2.5.74 Linux kernel:
http://redhat.com/~mingo/4g-patches/4g-2.5.74-F8
The 4G/4G split feature is primarily intended for large-RAM x86 systems,
which want to (or have to) get more kernel/user VM, at the expense of
per-syscall TLB-flush overhead.
on x86, the total amount of virtual memory - as we all know - is limited
to 4GB. Of this total 4GB VM, userspace uses 3GB (0x00000000-0xbfffffff),
the kernel uses 1GB (0xc0000000-0xffffffff). This VM scheme is called
the 3/1 split. This split works perfectly fine up until 1 GB of RAM - and
it works adequately well even after that, due to 'highmem', which moves
various larger caches (and objects) into the high memory area.
But as the amount of RAM increases, the 3/1 split becomes a real
bottleneck. Despite highmem being utilized by a number of large-size
caches, one of the most crucial data structures, the mem_map[], is
allocated out of the 1 GB kernel VM. With 32 GB of RAM the remaining 0.5
GB lowmem area is quite limited and only represents 1.5% of all RAM.
Various common workloads exhaust the lowmem area and create artificial
bottlenecks. With 64 GB RAM, the mem_map[] alone takes up nearly 1 GB of
RAM, making the kernel unable to boot. Relocating the mem_map[] to highmem
is very impractical, due to the deep integration of this central data
structure into the whole kernel - the VM, lowlevel arch code, drivers,
filesystems, etc.
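A rough sizing sketch of the mem_map[] argument above: one page descriptor is needed per physical page frame, so the array grows linearly with RAM. The 4 KB page size is the x86 base page size; the 60-byte descriptor size is an assumption (it is not stated in this mail, and the real sizeof(struct page) depends on kernel version and configuration), picked only because it lands near the "nearly 1 GB" figure quoted for 64 GB.

```python
# Rough mem_map[] sizing: one page descriptor per physical page frame.
GIB = 1 << 30
PAGE_SIZE = 4096          # x86 base page size
STRUCT_PAGE_BYTES = 60    # ASSUMPTION: illustrative only, not from the mail


def mem_map_bytes(ram_bytes, struct_page_bytes=STRUCT_PAGE_BYTES):
    """Bytes of lowmem needed to describe ram_bytes of physical RAM."""
    return (ram_bytes // PAGE_SIZE) * struct_page_bytes


if __name__ == "__main__":
    for ram_gib in (4, 32, 64):
        size_mib = mem_map_bytes(ram_gib * GIB) / (1 << 20)
        print(f"{ram_gib:3d} GiB RAM -> mem_map[] ~ {size_mib:.0f} MiB")
```

With a different descriptor size the absolute numbers shift, but the linear growth, and therefore the pressure on a fixed 1 GB kernel VM, stays the same.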
with the 4G/4G patch, the kernel can be compiled in 4G/4G mode, in which
case there's a full, separate 4GB VM for the kernel, and there are
separate full (and per-process) 4GB VMs for user-space.
A typical /proc/PID/maps file of a process running on a 4G/4G kernel shows
a full 4GB address-space:
00e80000-00faf000 r-xp 00000000 03:01 175909 /lib/tls/libc-2.3.2.so
00faf000-00fb2000 rw-p 0012f000 03:01 175909 /lib/tls/libc-2.3.2.so
[...]
feffe000-ff000000 rwxp fffff000 00:00 0
the stack ends at 0xff000000 (4GB minus 16MB). The kernel has a 4GB lowmem
area, of which 3.1 GB is still usable even with 64 GB of RAM:
MemTotal: 66052020 kB
MemFree: 65958260 kB
HighTotal: 62914556 kB
HighFree: 62853140 kB
LowTotal: 3137464 kB
LowFree: 3105120 kB
the amount of lowmem is still more than 3 times the amount of lowmem
available to a 4GB system. It's more than 6 times the amount of lowmem a
32 GB system gets with the 3/1 split.
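A small consistency check on the /proc/meminfo excerpt above, as a sketch: it parses the six quoted lines, confirms that highmem plus lowmem adds up to the total, and computes what share of RAM is lowmem. The parse_meminfo() helper is a hypothetical name used only for this illustration.

```python
# Values copied from the /proc/meminfo excerpt above (all in kB).
MEMINFO = """\
MemTotal: 66052020 kB
MemFree: 65958260 kB
HighTotal: 62914556 kB
HighFree: 62853140 kB
LowTotal: 3137464 kB
LowFree: 3105120 kB
"""


def parse_meminfo(text):
    """Return {field: kB} for 'Name: value kB' lines."""
    fields = {}
    for line in text.splitlines():
        name, value = line.split(":")
        fields[name.strip()] = int(value.split()[0])
    return fields


info = parse_meminfo(MEMINFO)
low_share = 100.0 * info["LowTotal"] / info["MemTotal"]

if __name__ == "__main__":
    print("high + low == total:",
          info["HighTotal"] + info["LowTotal"] == info["MemTotal"])
    print(f"lowmem share of RAM: {low_share:.2f}%")
```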
Performance impact of the 4G/4G feature:
There's a runtime cost with the 4G/4G patch: to implement separate address
spaces for the kernel and userspace VM, the entry/exit code has to switch
between the kernel pagetables and the user pagetables. This causes TLB
flushes, which are quite expensive, not so much in terms of TLB misses
(which are quite fast on Intel CPUs if they come from caches), but in
terms of the direct TLB flushing cost (%cr3 manipulation) done on
system-entry.
RAM limits:
in theory, the 4G/4G patch could provide a mem_map[] for 200 GB (!) of
physical RAM on x86, while still having 1 GB of lowmem left. So it gives
quite some legroom. While the right solution for lots of RAM is to use a
proper 64-bit system, there's a lot of existing x86 hardware, and x86
servers will still be sold in the next couple of years, so we ought to
support them maximally.
The patch is orthogonal to wli's pgcl patch - both patches try to achieve
the same, with different methods. I can very well imagine workloads where
we want to have the combination of the two patches.
Implementational details:
the patch implements/touches a number of new lowlevel x86 infrastructures:
- it moves the GDT, IDT, TSS, LDT, vsyscall page and kernel stack up into
a high virtual memory window (trampoline) at the top 16 MB of the
4GB address space. This 16 MB window is the only area that is shared
between user-space and kernel-space pagetables.
- it splits out atomic kmaps from highmem dependencies.
- it makes LDT(s) atomic-kmap-ed.
- (and lots of other smaller details, like increasing the size of the
initial mappings and fixing the PAE code to map the full 4GB of kernel
VM.)
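The address arithmetic behind the shared window can be sketched as follows. It only restates two facts from this mail (a 4GB address space and a 16 MB window at its top); the variable names are made up for the illustration.

```python
VM_SIZE = 1 << 32              # 4 GB of 32-bit virtual address space
TRAMPOLINE_SIZE = 16 << 20     # 16 MB shared window, per the list above

# The window occupies the very top of the address space.
trampoline_start = VM_SIZE - TRAMPOLINE_SIZE
trampoline_end = VM_SIZE - 1

if __name__ == "__main__":
    print(f"shared window: {trampoline_start:#010x}-{trampoline_end:#010x}")
```

The start of the window is the same 0xff000000 boundary at which the user stack ends in the /proc/PID/maps excerpt.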
Whenever we do a syscall (or any other trap) from user-mode, the
high-address trampoline code starts to run, with a high-address esp0. This
code switches over to the kernel pagetable, then it switches the 'virtual
kernel stack' to the regular (real) kernel stack. On syscall-exit it does
it the other way around.
there are a few generic kernel changes as well:
- it implements 'indirect uaccess' primitives and implements all the
get_user/put_user/copy_to_user/... functions without relying on direct
access to user-space. This feature has already uncovered a number of
bugs in the lowlevel x86 code: there was still code that accessed
user-space memory directly.
- it splits up PAGE_OFFSET into PAGE_OFFSET_USER and PAGE_OFFSET (kernel)
- fixes a couple of assumptions about PAGE_OFFSET being PMD_SIZE aligned.
but the generic-kernel impact of the patch is quite low.
the patch optimizes kernel<->kernel context switches, which do not flush
the TLB. Also, IRQ entry only causes a TLB flush if a userspace pagetable
is loaded.
the typical cost of 4G/4G on typical x86 servers is +3 usecs of syscall
latency (this is in addition to the ~1 usec null syscall latency).
Depending on the workload this can cause a measurable wall-clock
overhead of 0% to 30% for typical application workloads (DB workload,
networking workload, etc.). Isolated microbenchmarks can show a bigger
slowdown as well - due to the syscall latency increase.
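To get a feel for how a fixed per-syscall cost turns into a wall-clock percentage, here is a deliberately simple model. It assumes the only cost is the quoted +3 usecs per syscall and ignores IRQ entry and any extra TLB misses afterwards; the syscall rates are hypothetical, not measurements.

```python
# Toy model: each syscall costs a fixed extra latency under 4G/4G.
EXTRA_USEC_PER_SYSCALL = 3.0   # "+3 usecs" figure quoted above


def wallclock_overhead(syscalls_per_baseline_sec):
    """Fractional slowdown for a workload that issues this many syscalls
    per second of its original (3/1 split) runtime."""
    return syscalls_per_baseline_sec * EXTRA_USEC_PER_SYSCALL * 1e-6


if __name__ == "__main__":
    for rate in (1_000, 10_000, 100_000):   # hypothetical syscall rates
        print(f"{rate:7d} syscalls/s -> +{wallclock_overhead(rate):.1%}")
```

Under this model a workload has to issue on the order of 100,000 syscalls per second of runtime to reach the upper end of the quoted range, which is why syscall-heavy microbenchmarks suffer most.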
i'd guess that the 4G/4G patch is not worth the overhead for systems with
less than 16 GB of RAM (although exceptions might exist, for particularly
lowmem-intensive/sensitive workloads). 32 GB RAM systems run into lowmem
limitations quite frequently so the 4G/4G patch is quite recommended
there, and for 64 GB and larger systems it's a must i think.
Status, future plans:
The patch is a work-in-progress snapshot - it still has a few TODOs and
FIXMEs, but it compiles & works fine for me. Be careful with it
nevertheless - it's an experimental patch which does very intrusive
changes to the lowlevel x86 code.
There are a couple of performance enhancements on top of this patch that
i'll integrate into this patch in the next couple of days, but i first
wanted to release the base patch.
In any case, enjoy the patch - and as usual, comments and suggestions are
more than welcome,
Ingo
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>
* Re: [announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support
2003-07-08 22:45 [announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support Ingo Molnar
@ 2003-07-09 1:29 ` William Lee Irwin III
2003-07-09 5:13 ` Martin J. Bligh
` (4 subsequent siblings)
5 siblings, 0 replies; 11+ messages in thread
From: William Lee Irwin III @ 2003-07-09 1:29 UTC (permalink / raw)
To: Ingo Molnar; +Cc: linux-kernel, linux-mm
On Wed, Jul 09, 2003 at 12:45:52AM +0200, Ingo Molnar wrote:
> The patch is orthogonal to wli's pgcl patch - both patches try to achieve
> the same, with different methods. I can very well imagine workloads where
> we want to have the combination of the two patches.
Well, your patch does have the advantage of not being a "break all
drivers" affair.
Also, even though pgcl scales "perfectly" wrt. highmem (nm the code
being a train wreck), the raw capacity increase is needed. There are
enough other reasons to go through with ABI-preserving page clustering
that they're not really in competition with each other.
Looks good to me. I'll spin it up tonight.
-- wli
* Re: [announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support
2003-07-08 22:45 [announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support Ingo Molnar
2003-07-09 1:29 ` William Lee Irwin III
@ 2003-07-09 5:13 ` Martin J. Bligh
2003-07-09 5:19 ` William Lee Irwin III
2003-07-09 6:42 ` Ingo Molnar
2003-07-09 5:16 ` Dave Hansen
` (3 subsequent siblings)
5 siblings, 2 replies; 11+ messages in thread
From: Martin J. Bligh @ 2003-07-09 5:13 UTC (permalink / raw)
To: Ingo Molnar, linux-kernel; +Cc: linux-mm
> i'm pleased to announce the first public release of the "4GB/4GB VM split"
> patch, for the 2.5.74 Linux kernel:
>
> http://redhat.com/~mingo/4g-patches/4g-2.5.74-F8
I presume this was for -bk something as it applies clean to -bk6, but not
virgin.
However, it crashes before console_init on NUMA ;-( I'll shove early printk
in there later.
M.
* Re: [announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support
2003-07-09 5:13 ` Martin J. Bligh
@ 2003-07-09 5:19 ` William Lee Irwin III
2003-07-09 5:43 ` William Lee Irwin III
2003-07-09 6:42 ` Ingo Molnar
1 sibling, 1 reply; 11+ messages in thread
From: William Lee Irwin III @ 2003-07-09 5:19 UTC (permalink / raw)
To: Martin J. Bligh; +Cc: Ingo Molnar, linux-kernel, linux-mm
At some point in the past, mingo wrote:
>> i'm pleased to announce the first public release of the "4GB/4GB VM split"
>> patch, for the 2.5.74 Linux kernel:
>> http://redhat.com/~mingo/4g-patches/4g-2.5.74-F8
On Tue, Jul 08, 2003 at 10:13:12PM -0700, Martin J. Bligh wrote:
> I presume this was for -bk something as it applies clean to -bk6, but not
> virgin.
> However, it crashes before console_init on NUMA ;-( I'll shove early printk
> in there later.
Don't worry, I'm debugging it.
-- wli
* Re: [announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support
2003-07-09 5:19 ` William Lee Irwin III
@ 2003-07-09 5:43 ` William Lee Irwin III
0 siblings, 0 replies; 11+ messages in thread
From: William Lee Irwin III @ 2003-07-09 5:43 UTC (permalink / raw)
To: Martin J. Bligh, Ingo Molnar, linux-kernel, linux-mm
On Tue, Jul 08, 2003 at 10:13:12PM -0700, Martin J. Bligh wrote:
>> I presume this was for -bk something as it applies clean to -bk6, but not
>> virgin.
>> However, it crashes before console_init on NUMA ;-( I'll shove early printk
>> in there later.
On Tue, Jul 08, 2003 at 10:19:41PM -0700, William Lee Irwin III wrote:
> Don't worry, I'm debugging it.
Rather predictably, the NUMA KVA remapping shat itself:
Script started on Tue Jul 8 22:28:53 2003
$ sscreen -x
Recovering nvi editor sessions... done.
Setting up X server socket directory /tmp/.X11-unix...done.
INIT: Entering runlevel: 2
Starting system log daemon: syslogd.
Starting kernel log daemon: klogd.
Starting internet superserver: inetd.
Starting printer spooler: lpd.
Starting network benchmark server: netserver.
Not starting NFS kernel daemon: No exports.
Starting OpenBSD Secure Shell server: sshd.
Starting the system activity data collector: sadc.
Starting NFS common utilities: statd lockd.
Starting periodic command scheduler: cron.
Debian GNU/Linux testing/unstable megeira ttyS0
megeira login: root
Password:
Last login: Tue Jul 8 21:56:18 2003 on ttyS0
Linux megeira 2.5.74 #1 SMP Mon Jul 7 22:15:57 PDT 2003 i686 GNU/Linux
megeira:~# mount /mnt/g
megeira:~# !ec
echo 1 > /proc/sys/vm/overcommit_memory ; echo 1 > /proc/sys/vm/swappiness ; echo 360000 > /proc/sys/vm/dirty_expire_centisecs ; echo 360000 > /proc/sys/vm/dirty_writeback_centisecs ; echo 99 > /proc/sys/vm/dirty_background_ratio ; echo 1 > /proc/profile
megeira:~# shutdown -h now
Broadcast message from root (ttyS0) (Tue Jul 8 22:29:37 2003):
The system is going down for system halt NOW!
INIT: INIT: Sending processes the TERM signal
megeira:~# INIT:Stopping periodic command scheduler: cron.
Stopping internet superserver: inetd.
Stopping printer spooler: lpd.
Stopping network benchmark server: netserver.
Stopping OpenBSD Secure Shell server: sshd.
Saving the System Clock time to the Hardware Clock...
Hardware Clock updated to Tue Jul 8 22:30:04 PDT 2003.
Stopping NFS common utilities: lockd statd.
Stopping NFS kernel daemon: mountd nfsd.
Unexporting directories for NFS kernel daemon...done.
Stopping kernel log daemon: klogd.
Stopping system log daemon: syslogd.
Stopping portmap daemon: portmap.
Sending all processes the TERM signal... done.
Sending all processes the KILL signal... done.
Saving random seed... done.
Unmounting remote filesystems... done.
Deconfiguring network interfaces... done.
Deactivating swap... done.
Unmounting local filesystems... mount: proc already mounted
done.
Shutting down devices
Power down.
Press any key to continue.
Press any key to continue.
Press any key to continue.
Press any key to continue.
Press any key to continue.
Press any key to continue.
Press any key to continue.
GRUB version 0.92 (639K lower / 3668992K upper memory)
  Boot Safe Kernel
  Boot check Kernel
  Boot latest kernel
  boot latest kernel from elm3b96
  2.5.44
  2.5.44-mm4
  2.5.44-mm4-erich
  2.5.44-mm4-michael
  2.5.47-stock
  2.5.47-sched
  2.5.50-sched
  2.5.50-stock
Use the ^ and v keys to select which entry is highlighted.
Press enter to boot the selected OS, 'e' to edit the
commands before booting, or 'c' for a command-line.
GRUB version 0.92 (639K lower / 3668992K upper memory)
[ Minimal BASH-like line editing is supported. For the first word, TAB
lists possible command completions. Anywhere else TAB lists the possible
completions of a device/filename. ESC at any time exits. ]
grub> root (hd0,1)
Filesystem type is ext2fs, partition type 0x83
grub> kernel /home/wli/vmlinuz-ingo root=/dev/sda2 console=ttyS0,38400n8 profile=1
[Linux-bzImage, setup=0xa00, size=0x1407e4]
grub> boot
Linux version 2.5.74-mm2 (wli@megeira) (gcc version 3.3 (Debian)) #1 SMP Tue Jul 8 22:28:26 PDT 2003
Video mode to be used for restore is ffff
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
BIOS-e820: 0000000000100000 - 00000000e0000000 (usable)
BIOS-e820: 00000000fec00000 - 00000000fec09000 (reserved)
BIOS-e820: 00000000ffe80000 - 0000000100000000 (reserved)
BIOS-e820: 0000000100000000 - 0000000800000000 (usable)
user-defined physical RAM map:
user: 0000000000000000 - 000000000009fc00 (usable)
user: 0000000000100000 - 00000000e0000000 (usable)
user: 00000000fec00000 - 00000000fec09000 (reserved)
user: 00000000ffe80000 - 0000000100000000 (reserved)
user: 0000000100000000 - 0000000800000000 (usable)
Reserving 23040 pages of KVA for lmem_map of node 1
Shrinking node 1 from 4194304 pages to 4171264 pages
Reserving 23040 pages of KVA for lmem_map of node 2
Shrinking node 2 from 6291456 pages to 6268416 pages
Reserving 23040 pages of KVA for lmem_map of node 3
Shrinking node 3 from 8388608 pages to 8365568 pages
Reserving total of 69120 pages for numa KVA remap
28832MB HIGHMEM available.
3666MB LOWMEM available.
min_low_pfn = 1045, max_low_pfn = 938496, highstart_pfn = 1007616
Low memory ends at vaddr e7200000
node 0 will remap to vaddr f8000000 - f8000000
node 1 will remap to vaddr f2600000 - f8000000
node 2 will remap to vaddr ecc00000 - f2600000
node 3 will remap to vaddr e7200000 - ecc00000
High memory starts at vaddr f8000000
found SMP MP-table at 000f6040
hm, page 000f6000 reserved twice.
hm, page 000f7000 reserved twice.
Unknown interrupt
[... 555 more identical "Unknown interrupt" lines ...]
Un
[detached]
$
Script done on Tue Jul 8 22:36:58 2003
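A quick sanity check of the remap numbers in the boot log above, as a sketch: it converts each "node N will remap to vaddr" range into 4 KB pages and compares the result with the "Reserving ... pages of KVA" lines. It only shows that the log is internally consistent; it does not diagnose the crash.

```python
PAGE_SIZE = 4096

# (node, start vaddr, end vaddr) copied from the boot log above.
REMAP_RANGES = [
    (0, 0xf8000000, 0xf8000000),
    (1, 0xf2600000, 0xf8000000),
    (2, 0xecc00000, 0xf2600000),
    (3, 0xe7200000, 0xecc00000),
]


def pages_in(start, end):
    """Number of 4 KB pages in the half-open range [start, end)."""
    return (end - start) // PAGE_SIZE


pages_per_node = {node: pages_in(s, e) for node, s, e in REMAP_RANGES}
total_pages = sum(pages_per_node.values())

if __name__ == "__main__":
    for node, pages in pages_per_node.items():
        print(f"node {node}: {pages} pages ({pages * PAGE_SIZE >> 20} MB)")
    print("total:", total_pages)
```

Nodes 1 to 3 each get a 90 MB window (23040 pages), node 0 gets an empty one, and the sum matches the 69120 pages reported for the NUMA KVA remap.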
* Re: [announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support
2003-07-09 5:13 ` Martin J. Bligh
2003-07-09 5:19 ` William Lee Irwin III
@ 2003-07-09 6:42 ` Ingo Molnar
1 sibling, 0 replies; 11+ messages in thread
From: Ingo Molnar @ 2003-07-09 6:42 UTC (permalink / raw)
To: Martin J. Bligh; +Cc: linux-kernel, linux-mm
On Tue, 8 Jul 2003, Martin J. Bligh wrote:
> > i'm pleased to announce the first public release of the "4GB/4GB VM split"
> > patch, for the 2.5.74 Linux kernel:
> >
> > http://redhat.com/~mingo/4g-patches/4g-2.5.74-F8
>
> I presume this was for -bk something as it applies clean to -bk6, but
> not virgin.
indeed - it's for BK-curr.
> However, it crashes before console_init on NUMA ;-( I'll shove early
> printk in there later.
wli found the bug meanwhile - i'll do a new patch later today.
Ingo
* Re: [announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support
2003-07-08 22:45 [announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support Ingo Molnar
2003-07-09 1:29 ` William Lee Irwin III
2003-07-09 5:13 ` Martin J. Bligh
@ 2003-07-09 5:16 ` Dave Hansen
2003-07-09 7:08 ` Geert Uytterhoeven
` (2 subsequent siblings)
5 siblings, 0 replies; 11+ messages in thread
From: Dave Hansen @ 2003-07-09 5:16 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Linux Kernel Mailing List, linux-mm
Looks very interesting. A few concerns, though, some stylistic. Although
I know, if I had done something half as complex, it would look much
worse :) If you're still planning on doing cleanups I can wait, but
otherwise, I can send patches.
Have you looked at the impact on high interrupt load workloads? I saw
you mention the per-syscall TLB overhead, but you only mentioned the
interrupt overhead in passing. Doesn't this make it increasingly
important to coalesce interrupts, especially when you're running with
lots of user time? Are there any particular workloads you've tested this
on? I can try to get a couple of large webserver benchmark runs in on
it, if you like.
It's a lot harder now to drop back to 4k stacks, because of the
hard-coded 2 page kmap sequences. But those patches are out-of-tree, so
they're of relatively little consequence.
It might be nice to have some more abstraction of the size of the trampoline
window. There's stuff like this:
pgd[PTRS_PER_PGD-2] = swapper_pg_dir[PTRS_PER_PGD-2];
pgd[PTRS_PER_PGD-1] = swapper_pg_dir[PTRS_PER_PGD-1];
Being clever, I think some of these can be the same as the generic code.
The sepmd and banana_split patches in -mjb demonstrate some relatively
nice ways to do this.
There seems to be quite a bit of duplication of code in the new
__kmap_atomic* functions. __kmap_atomic_vaddr() could replace all of
the duplicated
	idx = type + KM_TYPE_NR*smp_processor_id();
	vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
lines. Also, it might be nice to combine __kmap_atomic{,_noflush}().
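To make the duplication point concrete, here is a small model of what those two lines compute, folded into a single helper. This is a Python sketch, not kernel code: the constants are illustrative assumptions, fix_to_virt() models the i386 fixmap convention of slots growing downward from the top address as I understand it, and the kmap_atomic_vaddr(km_type, cpu) signature is hypothetical.

```python
PAGE_SHIFT = 12
KM_TYPE_NR = 16            # ASSUMPTION: number of kmap types, illustrative
FIX_KMAP_BEGIN = 32        # ASSUMPTION: first kmap fixmap index, illustrative
FIXADDR_TOP = 0xfffff000   # ASSUMPTION: top of the fixmap area, illustrative


def fix_to_virt(idx):
    # Model of __fix_to_virt(): fixmap slots grow downward from FIXADDR_TOP.
    return FIXADDR_TOP - (idx << PAGE_SHIFT)


def kmap_atomic_vaddr(km_type, cpu):
    """The two duplicated lines, folded into one helper."""
    idx = km_type + KM_TYPE_NR * cpu
    return fix_to_virt(FIX_KMAP_BEGIN + idx)


if __name__ == "__main__":
    for cpu in (0, 1):
        for km_type in (0, 1):
            print(f"cpu {cpu} type {km_type}: "
                  f"{kmap_atomic_vaddr(km_type, cpu):#010x}")
```

Each (CPU, type) pair gets its own page-sized slot, which is what lets every call site share one helper instead of repeating the index arithmetic.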
Are you hoping to get this integrated for 2.6, or will it be more of an
add-on for 2.6 distro releases?
--
Dave Hansen
haveblue@us.ibm.com
* Re: [announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support
2003-07-08 22:45 [announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support Ingo Molnar
` (2 preceding siblings ...)
2003-07-09 5:16 ` Dave Hansen
@ 2003-07-09 7:08 ` Geert Uytterhoeven
2003-07-10 1:36 ` Martin J. Bligh
2003-07-13 22:05 ` Petr Vandrovec
5 siblings, 0 replies; 11+ messages in thread
From: Geert Uytterhoeven @ 2003-07-09 7:08 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Linux Kernel Development, linux-mm
On Wed, 9 Jul 2003, Ingo Molnar wrote:
> i'm pleased to announce the first public release of the "4GB/4GB VM split"
> patch, for the 2.5.74 Linux kernel:
>
> http://redhat.com/~mingo/4g-patches/4g-2.5.74-F8
>
> The 4G/4G split feature is primarily intended for large-RAM x86 systems,
> which want to (or have to) get more kernel/user VM, at the expense of
> per-syscall TLB-flush overhead.
Great! Another enterprise feature stolen from SCO? :-)
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
* Re: [announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support
2003-07-08 22:45 [announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support Ingo Molnar
` (3 preceding siblings ...)
2003-07-09 7:08 ` Geert Uytterhoeven
@ 2003-07-10 1:36 ` Martin J. Bligh
2003-07-10 13:36 ` Martin J. Bligh
2003-07-13 22:05 ` Petr Vandrovec
5 siblings, 1 reply; 11+ messages in thread
From: Martin J. Bligh @ 2003-07-10 1:36 UTC (permalink / raw)
To: Ingo Molnar, linux-kernel; +Cc: linux-mm
> i'm pleased to announce the first public release of the "4GB/4GB VM split"
> patch, for the 2.5.74 Linux kernel:
>
> http://redhat.com/~mingo/4g-patches/4g-2.5.74-F8
>
> The 4G/4G split feature is primarily intended for large-RAM x86 systems,
> which want to (or have to) get more kernel/user VM, at the expense of
> per-syscall TLB-flush overhead.
wli pointed out that the only problem with the NUMA boxen was that you
left out "remap_numa_kva();" from pagetable_init - sticking it back at the
end works fine.
Preliminary benchmark results:
2.5.74-bk6-44 is with the patch applied
2.5.74-bk6-44-on is with the patch applied and config option turned on.
Kernbench: (make -j vmlinux, maximal tasks)
                       Elapsed    System      User       CPU
2.5.74                   46.11    115.86    571.77   1491.50
2.5.74-bk6-44            45.92    115.71    570.35   1494.75
2.5.74-bk6-44-on         48.11    134.51    583.88   1491.75
SDET 128 (see disclaimer)
                      Throughput    Std. Dev
2.5.74                    100.0%        0.1%
2.5.74-bk6-44             100.3%        0.7%
2.5.74-bk6-44-on           92.1%        0.2%
Which isn't too bad at all, considering ... highpte does this to it:
Kernbench: (make -j vmlinux, maximal tasks)
                       Elapsed    System      User       CPU
2.5.73-mm3               45.38    114.91    565.81   1497.75
2.5.73-mm3-highpte       46.54    130.41    566.84   1498.00
SDET 16 (see disclaimer)
                      Throughput    Std. Dev
2.5.73-mm3                100.0%        0.3%
2.5.73-mm3-highpte         94.8%        0.6%
(I don't have highpte results for higher SDET right now - I'll run
'em later).
diffprofile for kernbench (- is better with 4/4 on, + worse)
15066 9.2% total
10883 0.0% rw_vm
3686 170.3% do_page_fault
1652 3.4% default_idle
1380 0.0% str_vm
1256 0.0% follow_page
1012 7.2% do_anonymous_page
669 119.7% kmap_atomic
611 78.1% handle_mm_fault
563 0.0% get_user_pages
418 41.4% clear_page_tables
338 66.3% page_address
304 16.9% buffered_rmqueue
263 222.9% kunmap_atomic
161 2.0% __d_lookup
152 21.8% sys_brk
151 26.3% find_vma
138 24.3% pgd_alloc
135 9.3% schedule
133 0.6% page_remove_rmap
128 0.0% put_user_size
123 3.4% find_get_page
121 8.4% free_hot_cold_page
106 3.3% zap_pte_range
99 11.1% filemap_nopage
97 1.5% page_add_rmap
84 7.5% file_move
79 6.6% release_pages
65 0.0% get_user_size
59 15.7% file_kill
52 0.0% find_extend_vma
...
-50 -47.2% kmap_high
-63 -10.8% fd_install
-76 -100.0% bad_get_user
-86 -11.6% pte_alloc_one
-109 -100.0% direct_strncpy_from_user
-151 -100.0% __copy_user_intel
-878 -100.0% direct_strnlen_user
-3505 -100.0% __copy_from_user_ll
-5368 -100.0% __copy_to_user_ll
and for SDET:
63719 8.1% total
39097 9.8% default_idle
12494 0.0% rw_vm
4820 192.6% do_page_fault
3587 36.4% clear_page_tables
3341 0.0% follow_page
1744 0.0% str_vm
1297 138.4% kmap_atomic
1026 43.8% pgd_alloc
1010 0.0% get_user_pages
932 27.6% do_anonymous_page
877 100.2% handle_mm_fault
828 14.2% path_lookup
605 42.9% page_address
552 13.3% do_wp_page
496 216.6% kunmap_atomic
455 4.1% __d_lookup
441 2.5% zap_pte_range
415 12.8% do_no_page
408 36.7% __block_prepare_write
349 2.5% copy_page_range
331 12.3% filemap_nopage
308 0.0% put_user_size
305 43.9% find_vma
266 35.7% update_atime
212 2.3% find_get_page
209 8.4% proc_pid_stat
196 9.1% schedule
188 7.7% buffered_rmqueue
186 5.2% pte_alloc_one
166 13.7% __find_get_block
162 15.1% __mark_inode_dirty
159 9.1% current_kernel_time
155 18.1% grab_block
149 1.5% release_pages
124 2.6% follow_mount
118 7.6% ext2_new_inode
117 5.6% path_release
113 28.2% __free_pages
113 0.0% get_user_size
107 12.1% dnotify_parent
105 20.8% __alloc_pages
102 18.4% generic_file_aio_write_nolock
102 4.7% file_move
...
-101 -6.5% __set_page_dirty_buffers
-102 -30.7% kunmap_high
-104 -13.4% .text.lock.base
-108 -3.9% copy_process
-114 -13.4% unmap_vmas
-121 -5.0% link_path_walk
-127 -10.5% __read_lock_failed
-128 -24.3% set_page_address
-180 -100.0% bad_get_user
-237 -11.6% .text.lock.namei
-243 -100.0% direct_strncpy_from_user
-262 -0.3% page_remove_rmap
-310 -5.6% kmem_cache_free
-332 -4.4% atomic_dec_and_lock
-365 -35.3% kmap_high
-458 -15.7% .text.lock.dcache
-583 -22.8% .text.lock.filemap
-609 -13.4% .text.lock.dec_and_lock
-649 -54.9% .text.lock.highmem
-848 -100.0% direct_strnlen_user
-877 -100.0% __copy_user_intel
-958 -100.0% __copy_from_user_ll
-1098 -2.7% page_add_rmap
-6746 -100.0% __copy_to_user_ll
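For anyone reading the listings above: each line is a tick delta, a percentage change and a symbol. Below is a small sketch of how such a line can be derived from two profiles. It is not the actual diffprofile tool, and reporting 0.0% for symbols with no baseline (rw_vm, str_vm, etc.) is my reading of the output, not something I have checked against the script. The tick counts in the comment at the end are made-up inputs.

```c
#include <stdio.h>
#include <string.h>

/* One profile entry: a symbol name and its tick count. */
struct prof_entry {
	const char *symbol;
	long ticks;
};

/* Find `symbol` in `prof`; returns NULL if the symbol is absent. */
static const struct prof_entry *find_entry(const struct prof_entry *prof,
					   int n, const char *symbol)
{
	for (int i = 0; i < n; i++)
		if (strcmp(prof[i].symbol, symbol) == 0)
			return &prof[i];
	return NULL;
}

/* Percent change from `before` to `after`.
 * Assumption: a symbol with no baseline is reported as 0.0%. */
static double pct_delta(long before, long after)
{
	if (before == 0)
		return 0.0;
	return 100.0 * (double)(after - before) / (double)before;
}

/* Print one "delta percent symbol" line. */
static void print_delta(const char *symbol, long before, long after)
{
	printf("%10ld %8.1f%% %s\n",
	       after - before, pct_delta(before, after), symbol);
}

/* Diff two profiles: positive deltas mean more ticks in `after`. */
static void diff_profiles(const struct prof_entry *before, int nb,
			  const struct prof_entry *after, int na)
{
	for (int i = 0; i < na; i++) {
		const struct prof_entry *old =
			find_entry(before, nb, after[i].symbol);
		print_delta(after[i].symbol, old ? old->ticks : 0,
			    after[i].ticks);
	}
	/* Symbols that vanished from the second profile come out as -100%. */
	for (int i = 0; i < nb; i++)
		if (!find_entry(after, na, before[i].symbol))
			print_delta(before[i].symbol, before[i].ticks, 0);
}

/* Example call with made-up tick counts:
 *   struct prof_entry a[] = { { "do_page_fault", 2164 } };
 *   struct prof_entry b[] = { { "do_page_fault", 5850 } };
 *   diff_profiles(a, 1, b, 1);
 */
```

This would also explain the -100.0% lines: functions such as __copy_to_user_ll do not appear at all in the 4/4 profile, so their whole baseline count shows up as a negative delta.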
I'll play around some more with it later. Presumably things like
disk / network intensive workloads that generate a lot of interrupts
will be bad ... but NAPI would help?
What I *really* like is that without the config option on, there's
no degradation ;-)
M.
* Re: [announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support
2003-07-10 1:36 ` Martin J. Bligh
@ 2003-07-10 13:36 ` Martin J. Bligh
0 siblings, 0 replies; 11+ messages in thread
From: Martin J. Bligh @ 2003-07-10 13:36 UTC (permalink / raw)
To: Ingo Molnar, linux-kernel; +Cc: linux-mm
Results now with highpte
2.5.74-bk6-44 is with the patch applied
2.5.74-bk6-44-on is with the patch applied and 4/4 config option.
2.5.74-bk6-44-hi is with the patch applied and with highpte instead.
The overhead of 4/4 isn't much higher than that of highpte, and 4/4 is much more generally useful.
Kernbench: (make -j vmlinux, maximal tasks)
                       Elapsed    System      User       CPU
2.5.74                   46.11    115.86    571.77   1491.50
2.5.74-bk6-44            45.92    115.71    570.35   1494.75
2.5.74-bk6-44-on         48.11    134.51    583.88   1491.75
2.5.74-bk6-44-hi         47.06    131.13    570.79   1491.50
SDET 128 (see disclaimer)
                      Throughput    Std. Dev
2.5.74                    100.0%        0.1%
2.5.74-bk6-44             100.3%        0.7%
2.5.74-bk6-44-on           92.1%        0.2%
2.5.74-bk6-44-hi           94.5%        0.1%
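To put numbers on the Kernbench comparison, here is a small sketch that turns the table into percentage overheads relative to plain 2.5.74. The struct and function names are just for illustration; the figures are copied from the table above.

```c
#include <assert.h>
#include <stdio.h>

/* One row of the Kernbench table (times in seconds). */
struct kernbench_row {
	const char *kernel;
	double elapsed;
	double system;
	double user;
};

/* Percent change of `value` relative to `baseline`; positive is slower. */
static double pct_change(double baseline, double value)
{
	assert(baseline > 0.0);
	return (value - baseline) / baseline * 100.0;
}

/* Print elapsed/system overhead of every row relative to rows[0]. */
static void print_overhead(const struct kernbench_row *rows, int n)
{
	for (int i = 1; i < n; i++)
		printf("%-18s elapsed %+6.2f%%  system %+6.2f%%\n",
		       rows[i].kernel,
		       pct_change(rows[0].elapsed, rows[i].elapsed),
		       pct_change(rows[0].system, rows[i].system));
}

/* Figures copied from the Kernbench table; rows[0] is the baseline. */
static const struct kernbench_row kernbench[] = {
	{ "2.5.74",           46.11, 115.86, 571.77 },
	{ "2.5.74-bk6-44",    45.92, 115.71, 570.35 },
	{ "2.5.74-bk6-44-on", 48.11, 134.51, 583.88 },
	{ "2.5.74-bk6-44-hi", 47.06, 131.13, 570.79 },
};
```

Calling print_overhead(kernbench, 4) works out to roughly +4.3% elapsed and +16.1% system time for 4/4, against roughly +2.1% and +13.2% for highpte.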
* Re: [announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support
2003-07-08 22:45 [announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support Ingo Molnar
` (4 preceding siblings ...)
2003-07-10 1:36 ` Martin J. Bligh
@ 2003-07-13 22:05 ` Petr Vandrovec
5 siblings, 0 replies; 11+ messages in thread
From: Petr Vandrovec @ 2003-07-13 22:05 UTC (permalink / raw)
To: Ingo Molnar; +Cc: linux-kernel, linux-mm
On Wed, Jul 09, 2003 at 12:45:52AM +0200, Ingo Molnar wrote:
>
> i'm pleased to announce the first public release of the "4GB/4GB VM split"
> patch, for the 2.5.74 Linux kernel:
>
> http://redhat.com/~mingo/4g-patches/4g-2.5.74-F8
FYI, the VMware vmmon/vmnet modules I maintain for 2.5.x kernels at
http://platan.vc.cvut.cz/ftp/pub/vmware (currently
.../vmware-any-any-update37.tar.gz) have been updated to work correctly
with the 4G/4G kernel configuration.
Best regards,
Petr Vandrovec
vandrove@vc.cvut.cz