From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from max.fys.ruu.nl (max.fys.ruu.nl [131.211.32.73])
	by kvack.org (8.8.7/8.8.7) with ESMTP id MAA05645
	for ; Mon, 8 Dec 1997 12:10:01 -0500
Received: from mirkwood.dummy.home (root@anx1p7.fys.ruu.nl [131.211.33.96])
	by max.fys.ruu.nl (8.8.7/8.8.7/hjm) with ESMTP id SAA18254
	for ; Mon, 8 Dec 1997 18:04:15 +0100 (MET)
Message-Id: 
From: jr@petz.han.de (Joerg Rade)
Subject: Re: TTY changes to 2.1.65
Date: Thu, 4 Dec 1997 21:06:59 +0100 (MET)
In-Reply-To: from "Rik van Riel" at Nov 27, 97 01:56:49 pm
Content-Type: text
ReSent-To: linux-mm
ReSent-Message-ID: 
Sender: owner-linux-mm@kvack.org
To: Rik van Riel
List-ID: 

Hi Rik,

> Send Linux memory-management wishes to me: I'm currently looking
> for something to hack...

How about something like garbage collection for interpreted languages,
e.g. GNU Smalltalk or Java? As Paul Wilson points out in the article
below, Linux would be a suitable platform.

grtnx -j|g
-- 
Joerg Rade       | How could I know what I say |  jr@petz.han.de
Birkenstr. 32    | before I hear what I think? |  +49 511 9887497
D-30171 Hannover +-----------------------------+  S: Schlaegerstr.

----8<----

From: wilson@cs.utexas.edu (Paul Wilson)
Newsgroups: comp.arch,comp.lang.misc,comp.lang.smalltalk
Subject: hardware (and OS) support for memory management (was Re: The Architect's Trap)
Date: 21 Jul 1997 17:58:01 -0500
Organization: CS Dept, University of Texas at Austin
Lines: 150
Message-ID: <5r0php$97b$1@roar.cs.utexas.edu>
References: <5m4oqi$eer@darkstar.ucsc.edu> <33C98CAD.41C6@iil.intel.com> <5qodsl$gt1@bcarh8ab.bnr.ca> <33CFD939.446B9B3D@eng.adaptec.com>
NNTP-Posting-Host: roar.cs.utexas.edu
Ident-User: wilson

In article <33CFD939.446B9B3D@eng.adaptec.com>, Greg Gritton wrote:

>Hi,
>
>It is important to know the whole story.
>The SOAR chip, described in "What Price Smalltalk?" by Ungar and
>Patterson (IEEE Computer, January 1987, pages 67-74), contains a
>number of innovative features designed to run Smalltalk faster
>compared to a simpler RISC processor of the time, such as MIPS.
>After investigating the speed increases derived from various
>features, they found that many of the features weren't worthwhile.
>However, they found other features to be very worthwhile. The
>valuable features included register windows and trapping arithmetic
>instructions.

This isn't the whole story. As I recall, Ungar concluded that 3 features
were worthwhile (register windows, trapping arithmetic instructions, and
hardware support for a generational GC write barrier)---but since then,
the value of those features has been called into question, too.

Register windows are regarded by many as unnecessary, and were
originally motivated by statistics from code generated by compilers that
are now obsolete.

Tagged arithmetic instructions can be useful, but alternatives exist,
including having word loads trap when the address is not word-aligned.
(You can fiddle the tag representations to ensure that the tag is
stripped out by an immediate instruction-stream constant, and an
unaligned load exception is generated if in fact it wasn't the right
tag.) At least one commercial Lisp system for the SPARC didn't use the
special SPARC tags because the implementors wanted three-bit tags,
rather than the hardware-supported two.

The GC write barrier is now usually implemented in software, which often
has better performance than Ungar's original write barrier did with
hardware support. (Ungar's minimal hardware support was fast in the
common case stressed by the Smalltalk macro benchmarks, but is slow in
cases that can be common in some other programs. Ungar et al. adopted a
card-marking write barrier that's implemented purely in software, and is
very fast.)
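To make the card-marking idea concrete, here is a minimal C sketch of
the software-only write barrier described above. All names, the card
size, and the heap layout are illustrative assumptions, not any
particular collector's implementation: the heap is divided into
fixed-size "cards", and every pointer store also sets one byte in a
card table, so the collector later scans only dirty cards for
old-to-young pointers.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical card-marking write barrier sketch.
 * 512-byte cards; one byte of card table per card.
 * The barrier is just a store, a shift, and a byte store --
 * no hardware support needed. */

#define CARD_SHIFT 9                    /* log2(512-byte cards) */
#define HEAP_SIZE  (1 << 20)
#define NCARDS     (HEAP_SIZE >> CARD_SHIFT)

static uint8_t heap[HEAP_SIZE];
static uint8_t card_table[NCARDS];      /* 1 = card is dirty */

static inline void write_barrier(void **slot, void *new_value)
{
    *slot = new_value;
    /* Dirty the card containing the updated slot. */
    size_t card = ((uintptr_t)slot - (uintptr_t)heap) >> CARD_SHIFT;
    card_table[card] = 1;
}
```

The point of this scheme is that the common case (the store itself) is
only a few instructions, and the cost of finding modified regions is
deferred to GC time, when the collector walks the small card table
instead of the whole heap.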
Interestingly, Ungar and his students have done an excellent job of
developing software-only techniques for compiling OOP languages, notably
the Self compilers (first with Craig Chambers and later with Urs
Hoelzle) and Chambers' Vortex compiler for the Cecil language (and other
languages).

When it comes to GC support, please don't go designing special-purpose
hardware without reading my GC survey (the long version is available
from our web site, ). Different kinds of GC's benefit from different
kinds of write barriers, and it's unclear what's worth putting in
hardware, if anything.

My own vote for "most worthwhile hardware for memory management" is
sub-page protections in the TLB, e.g., 1-KB independently-protectable
units within 4KB or 8KB pages, like the ARM 6xx series. And VERY FAST
traps. And please don't make the pages bigger. (And get the OS kernel
people to support querying which pages are dirty, and VERY FAST
USER-LEVEL TRAPS.)

These features are desirable for a lot of things, including:

  1. persistence (pointer swizzling at page fault time),
  2. checkpointing dirty pages (for fault-tolerance, process migration,
     persistence, and time-travel debugging),
  3. distributed virtual memory,
  4. GC read and write barriers,
  5. redzoning to detect array overruns in debugging mallocs and buffer
     overruns in networking software,
  6. compressed caching virtual memory,
  7. adaptive clustering of VM pages,
  8. overlapping protection domains (e.g., in single-address-space OS's,
     or runtime systems with protection and sharing, or for fast-path
     communications software that avoids unnecessary copies),
  9. pagewise incremental translation of code from bytecodes (or
     existing ISA's) to VLIW or whatever,
 10. incremental changing of data formats for sharing data across
     multicomputers (switching endianness, pointer size, etc.) or
     networks,
 11. memory tracing and profiling,
 12. incremental materialization of database query results in a
     representation normal programs can handle,
 13.
copy-on-write, lazy messaging, etc.,
 14. remote paging against distributed RAM caches,

and I'm sure lots of other stuff. (Papers on some of these topics are
available from our web site too.)

I view the TLB as a very general-purpose piece of parallel hardware.
You can really make it do lots of tricks for you to implement fancy
memory abstractions cheaply. There are lots of things for which a TLB
can give you zero-cost checking in the common case, and trap to software
in the uncommon cases.

If you're going to do anything more radical than a nicer TLB (and
unaligned load trapping), you might want to think really hard about how
to expose some of the memory-management hardware in ways that can serve
multiple purposes. For example, people at Wisconsin (I believe) messed
with the ECC bits on some machine or other (CM-5?) to cause fine-grained
traps on word accesses. The Symbolics machines checked extra tag bits
when doing cache lookups, to implement some forms of tagged memory with
zero time cost and fairly small hardware cost.

I'd be very interested in hardware that let you program a little PLA
that did the checking of the cache tag RAM, so that you could use it for
ECC if you wanted, or fine-grained coherency, or any of a zillion other
things. The basic idea would be to use a little more hardware to get a
lot more flexibility in what your TLB, cache dictionaries, etc. are
already doing. I'd think it'd be worth putting a programmable state
machine in a PLA, with excruciatingly fast traps to something like PAL
code to sort out the uncommon cases. (I seem to recall seeing something
like this idea for cache consistency protocols. It has advantages both
in flexibility and in the ability to improve the coherence logic after
the hardware is built.) I'm not a CPU design expert, though, so I'm not
clear on how hairy this would be.

For it to be generally useful, it would also be important to get the OS
guys on board.
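Several of the uses listed above (dirty-page checkpointing, GC
barriers, persistence) reduce to the same user-level trick: protect
pages, catch the fault, record it, unprotect, and let the store retry.
Here is a minimal C sketch of that pattern using POSIX mprotect() and
a SIGSEGV handler; it assumes a system (e.g., Linux) where returning
from the handler after fixing the protection retries the faulting
store. The names and region layout are illustrative, and a production
version would worry about async-signal-safety and alternate signal
stacks.

```c
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Sketch of a user-level page-granularity write barrier:
 * pages start read-only; the first store to a page traps,
 * the handler marks it dirty and unprotects it. This is the
 * "thousands of cycles per trap" path the article complains about. */

#define REGION_PAGES 16

static uint8_t *region;
static size_t page_size;
static volatile sig_atomic_t dirty[REGION_PAGES];

static void on_fault(int sig, siginfo_t *si, void *ctx)
{
    uintptr_t addr = (uintptr_t)si->si_addr;
    uintptr_t base = (uintptr_t)region;
    (void)sig; (void)ctx;
    if (addr >= base && addr < base + REGION_PAGES * page_size) {
        size_t page = (addr - base) / page_size;
        dirty[page] = 1;
        /* Unprotect just this page; the faulting store is retried. */
        mprotect(region + page * page_size, page_size,
                 PROT_READ | PROT_WRITE);
    } else {
        _exit(1);  /* a genuine segfault elsewhere */
    }
}

static void barrier_init(void)
{
    struct sigaction sa;
    page_size = (size_t)sysconf(_SC_PAGESIZE);
    region = mmap(NULL, REGION_PAGES * page_size, PROT_READ,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED)
        exit(1);
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);
}
```

After barrier_init(), the first write to each page costs one trap and
one mprotect() call; subsequent writes to that page are free. The
per-trap kernel overhead is exactly what makes "VERY FAST USER-LEVEL
TRAPS" valuable, and sub-page protection units would shrink the
granularity at which this bookkeeping happens.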
Some OS's (e.g., Solaris) have laughably slow VM trap handling as it is,
even when the hardware is fine. (How can you spend so many thousands of
cycles in the kernel to figure out that the user has a handler for an
access-protection trap? Hyeesh.) This is definitely due in part to the
lack of benchmarks that stress these aspects of a system. Some OS's
(e.g., Linux) are several times faster than some others (e.g., Solaris),
and some kernels (e.g., L4) are much faster still. This doesn't show up
in SPECmarks.

I think it's time that everybody realized that the TLB they already have
can support much more sophisticated software, and do it efficiently---
and then started banging on their OS vendors to realize that TLB's
aren't just for disk paging anymore. Concurrently, TLB's should get
better, to support finer-grained uses than current TLB's can.

-- 
| Paul R. Wilson, Comp. Sci. Dept., U of Texas @ Austin (wilson@cs.utexas.edu)
| Papers on memory allocators, garbage collection, memory hierarchies,
| persistence and Scheme interpreters and compilers available via ftp from
| ftp.cs.utexas.edu, in pub/garbage (or http://www.cs.utexas.edu/users/wilson/)