From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from max.fys.ruu.nl (max.fys.ruu.nl [131.211.32.73])
	by kvack.org (8.8.7/8.8.7) with ESMTP id MAA05645
	for ; Mon, 8 Dec 1997 12:10:01 -0500
Received: from mirkwood.dummy.home (root@anx1p7.fys.ruu.nl [131.211.33.96])
	by max.fys.ruu.nl (8.8.7/8.8.7/hjm) with ESMTP id SAA18254
	for ; Mon, 8 Dec 1997 18:04:15 +0100 (MET)
Message-Id: 
From: jr@petz.han.de (Joerg Rade)
Subject: Re: TTY changes to 2.1.65
Date: Thu, 4 Dec 1997 21:06:59 +0100 (MET)
In-Reply-To: from "Rik van Riel" at Nov 27, 97 01:56:49 pm
Content-Type: text
ReSent-To: linux-mm
ReSent-Message-ID: 
Sender: owner-linux-mm@kvack.org
To: Rik van Riel
List-ID: 

Hi Rik,

> Send Linux memory-management wishes to me: I'm currently looking
> for something to hack...

How about something like garbage collection for interpreted languages,
e.g. GNU Smalltalk or Java? As Paul Wilson points out in the article
below, Linux would be a suitable platform.

grtnx -j|g
-- 
Joerg Rade       | How could I know what I say |  jr@petz.han.de
Birkenstr. 32    | before I hear what I think? |  +49 511 9887497
D-30171 Hannover +-----------------------------+  S: Schlaegerstr.

----8<----

From: wilson@cs.utexas.edu (Paul Wilson)
Newsgroups: comp.arch,comp.lang.misc,comp.lang.smalltalk
Subject: hardware (and OS) support for memory management (was Re: The Architect's Trap)
Date: 21 Jul 1997 17:58:01 -0500
Organization: CS Dept, University of Texas at Austin
Lines: 150
Message-ID: <5r0php$97b$1@roar.cs.utexas.edu>
References: <5m4oqi$eer@darkstar.ucsc.edu> <33C98CAD.41C6@iil.intel.com> <5qodsl$gt1@bcarh8ab.bnr.ca> <33CFD939.446B9B3D@eng.adaptec.com>
NNTP-Posting-Host: roar.cs.utexas.edu
Ident-User: wilson

In article <33CFD939.446B9B3D@eng.adaptec.com>, Greg Gritton wrote:

>Hi,
>
>It is important to know the whole story.
>The SOAR chip, described in "What Price Smalltalk?" by Ungar and
>Patterson (IEEE Computer, January 1987, pages 67-74), contains a
>number of innovative features designed to run Smalltalk faster
>compared to a simpler RISC processor of the time, such as MIPS.
>After investigating the speed increases derived from various
>features, they found that many of the features weren't worthwhile.
>However, they found other features to be very worthwhile. The
>valuable features included register windows and trapping arithmetic
>instructions.

This isn't the whole story. As I recall, Ungar concluded that 3 features
were worthwhile (register windows, trapping arithmetic instructions, and
hardware support for a generational GC write barrier)---but since then,
the value of those features has been called into question, too.

Register windows are regarded by many as unnecessary, and were
originally motivated by statistics from code generated by compilers that
are now obsolete.

Tagged arithmetic instructions can be useful, but alternatives exist,
including having word loads trap when the address is not word-aligned.
(You can fiddle the tag representations to ensure that the tag is
stripped out by an immediate instruction-stream constant, and an
unaligned load exception is generated if in fact it wasn't the right
tag.) At least one commercial Lisp system for the SPARC didn't use the
special SPARC tags because the implementors wanted three-bit tags,
rather than the hardware-supported two.

The GC write barrier is now usually implemented in software, which often
has better performance than Ungar's original write barrier did with
hardware support. (Ungar's minimal hardware support was fast in the
common case stressed by the Smalltalk macro benchmarks, but is slow in
cases that can be common in some other programs. Ungar et al. adopted a
card-marking write barrier that's implemented purely in software, and is
very fast.)
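To make the card-marking idea concrete, here is a minimal C sketch of
the software-only write barrier described above. All names, the card
size, and the heap layout are illustrative assumptions, not any
particular collector's implementation: the heap is divided into
fixed-size "cards", and every pointer store also sets one byte in a
card table, so the collector later scans only dirty cards for
old-to-young pointers.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical card-marking write barrier sketch.
 * 512-byte cards; one byte of card table per card.
 * The barrier is just a store, a shift, and a byte store --
 * no hardware support needed. */

#define CARD_SHIFT 9                    /* log2(512-byte cards) */
#define HEAP_SIZE  (1 << 20)
#define NCARDS     (HEAP_SIZE >> CARD_SHIFT)

static uint8_t heap[HEAP_SIZE];
static uint8_t card_table[NCARDS];      /* 1 = card is dirty */

static inline void write_barrier(void **slot, void *new_value)
{
    *slot = new_value;
    /* Dirty the card containing the updated slot. */
    size_t card = ((uintptr_t)slot - (uintptr_t)heap) >> CARD_SHIFT;
    card_table[card] = 1;
}
```

The point of this scheme is that the common case (the store itself) is
only a few instructions, and the cost of finding modified regions is
deferred to GC time, when the collector walks the small card table
instead of the whole heap.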
Interestingly, Ungar and his students have done an excellent job of
developing software-only techniques for compiling OOP languages, notably
the Self compilers (first with Craig Chambers and later with Urs
Hoelzle) and Chambers' Vortex compiler for the Cecil language (and other
languages).

When it comes to GC support, please don't go designing special-purpose
hardware without reading my GC survey (the long version is available
from our web site, ). Different kinds of GC's benefit from different
kinds of write barriers, and it's unclear what's worth putting in
hardware, if anything.

My own vote for "most worthwhile hardware for memory management" is
sub-page protections in the TLB, e.g., 1-KB independently-protectable
units within 4KB or 8KB pages, like the ARM 6xx series. And VERY FAST
traps. And please don't make the pages bigger. (And get the OS kernel
people to support querying which pages are dirty, and VERY FAST
USER-LEVEL TRAPS.)

These features are desirable for a lot of things, including:

  1. persistence (pointer swizzling at page fault time),
  2. checkpointing dirty pages (for fault-tolerance, process migration,
     persistence, and time-travel debugging),
  3. distributed virtual memory,
  4. GC read and write barriers,
  5. redzoning to detect array overruns in debugging mallocs and buffer
     overruns in networking software,
  6. compressed caching virtual memory,
  7. adaptive clustering of VM pages,
  8. overlapping protection domains (e.g., in single-address-space OS's,
     or runtime systems with protection and sharing, or for fast-path
     communications software that avoids unnecessary copies),
  9. pagewise incremental translation of code from bytecodes (or
     existing ISA's) to VLIW or whatever,
 10. incremental changing of data formats for sharing data across
     multicomputers (switching endianness, pointer size, etc.) or
     networks,
 11. memory tracing and profiling,
 12. incremental materialization of database query results in a
     representation normal programs can handle,
 13.
copy-on-write, lazy messaging, etc.,
 14. remote paging against distributed RAM caches,

and I'm sure lots of other stuff. (Papers on some of these topics are
available from our web site too.)

I view the TLB as a very general-purpose piece of parallel hardware.
You can really make it do lots of tricks for you to implement fancy
memory abstractions cheaply. There are lots of things for which a TLB
can give you zero-cost checking in the common case, and trap to software
in the uncommon cases.

If you're going to do anything more radical than a nicer TLB (and
unaligned load trapping), you might want to think really hard about how
to expose some of the memory-management hardware in ways that can serve
multiple purposes. For example, people at Wisconsin (I believe) messed
with the ECC bits on some machine or other (CM-5?) to cause fine-grained
traps on word accesses. The Symbolics machines checked extra tag bits
when doing cache lookups, to implement some forms of tagged memory with
zero time cost and fairly small hardware cost.

I'd be very interested in hardware that let you program a little PLA
that did the checking of the cache tag RAM, so that you could use it for
ECC if you wanted, or fine-grained coherency, or any of a zillion other
things. The basic idea would be to use a little more hardware to get a
lot more flexibility in what your TLB, cache dictionaries, etc. are
already doing. I'd think it'd be worth putting a programmable state
machine in a PLA, with excruciatingly fast traps to something like PAL
code to sort out the uncommon cases. (I seem to recall seeing something
like this idea for cache consistency protocols. It has advantages both
in flexibility and in the ability to improve the coherence logic after
the hardware is built.) I'm not a CPU design expert, though, so I'm not
clear on how hairy this would be.

For it to be generally useful, it would also be important to get the OS
guys on board.
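Several of the uses listed above (dirty-page checkpointing, GC
barriers, persistence) reduce to the same user-level trick: protect
pages, catch the fault, record it, unprotect, and let the store retry.
Here is a minimal C sketch of that pattern using POSIX mprotect() and
a SIGSEGV handler; it assumes a system (e.g., Linux) where returning
from the handler after fixing the protection retries the faulting
store. The names and region layout are illustrative, and a production
version would worry about async-signal-safety and alternate signal
stacks.

```c
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Sketch of a user-level page-granularity write barrier:
 * pages start read-only; the first store to a page traps,
 * the handler marks it dirty and unprotects it. This is the
 * "thousands of cycles per trap" path the article complains about. */

#define REGION_PAGES 16

static uint8_t *region;
static size_t page_size;
static volatile sig_atomic_t dirty[REGION_PAGES];

static void on_fault(int sig, siginfo_t *si, void *ctx)
{
    uintptr_t addr = (uintptr_t)si->si_addr;
    uintptr_t base = (uintptr_t)region;
    (void)sig; (void)ctx;
    if (addr >= base && addr < base + REGION_PAGES * page_size) {
        size_t page = (addr - base) / page_size;
        dirty[page] = 1;
        /* Unprotect just this page; the faulting store is retried. */
        mprotect(region + page * page_size, page_size,
                 PROT_READ | PROT_WRITE);
    } else {
        _exit(1);  /* a genuine segfault elsewhere */
    }
}

static void barrier_init(void)
{
    struct sigaction sa;
    page_size = (size_t)sysconf(_SC_PAGESIZE);
    region = mmap(NULL, REGION_PAGES * page_size, PROT_READ,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED)
        exit(1);
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);
}
```

After barrier_init(), the first write to each page costs one trap and
one mprotect() call; subsequent writes to that page are free. The
per-trap kernel overhead is exactly what makes "VERY FAST USER-LEVEL
TRAPS" valuable, and sub-page protection units would shrink the
granularity at which this bookkeeping happens.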
Some OS's (e.g., Solaris) have laughably slow VM trap handling as it is,
even when the hardware is fine. (How can you spend so many thousands of
cycles in the kernel to figure out that the user has a handler for an
access-protection trap? Hyeesh.) This is definitely due in part to the
lack of benchmarks that stress these aspects of a system. Some OS's
(e.g., Linux) are several times faster than some others (e.g., Solaris),
and some kernels (e.g., L4) are much faster still. This doesn't show up
in SPECmarks.

I think it's time that everybody realized that the TLB they already have
can support much more sophisticated software, and do it efficiently---
and then started banging on their OS vendors to realize that TLB's
aren't just for disk paging anymore. Concurrently, TLB's should get
better, to support finer-grained uses than current TLB's can.

-- 
| Paul R. Wilson, Comp. Sci. Dept., U of Texas @ Austin (wilson@cs.utexas.edu)
| Papers on memory allocators, garbage collection, memory hierarchies,
| persistence and Scheme interpreters and compilers available via ftp from
| ftp.cs.utexas.edu, in pub/garbage (or http://www.cs.utexas.edu/users/wilson/)