* Re: RFC: design for new VM
[not found] <87256934.0072FA16.00@d53mta04h.boulder.ibm.com>
@ 2000-08-08 0:36 ` Gerrit.Huizenga
0 siblings, 0 replies; 46+ messages in thread
From: Gerrit.Huizenga @ 2000-08-08 0:36 UTC (permalink / raw)
To: chucklever; +Cc: linux-mm, linux-kernel, Linus Torvalds
Hi Chuck,
> 1. kswapd runs in the background and wakes up every so often to handle
> the corner cases that smooth bursty memory request workloads. it executes
> the same code that is invoked from the kernel's memory allocator to
> reclaim pages.
yep... We do the same, although primarily through RSS management and our
pageout daemon (separate from swapout).
One possible difference - dirty pages are scheduled for asynchronous
flush to disk and then moved to the end of the free list after IO
is complete. If the process faults on that page, either before it is
paged out or afterwards, it can be "reclaimed" either from the dirty
list or the free list, without re-reading from disk. The pageout daemon
runs when the dirty list reaches a tuneable size, and the pageout daemon
shrinks the list to a tuneable size, moving all written pages to the
free list.
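In rough pseudo-C (purely illustrative -- this is not our DYNIX/ptx
source, and all the names are made up), the fault-time reclaim looks
something like:

    /* Illustrative sketch only -- not DYNIX/ptx source.  A faulted-on
     * page can be "reclaimed" from the dirty list (pageout IO not done
     * yet) or from the free list (IO done, contents still intact)
     * without re-reading it from disk.                                 */
    struct page {
        int dirty;           /* still waiting for its pageout IO        */
        int contents_valid;  /* frame not yet reused for something else */
    };

    static struct page *reclaim_on_fault(struct page *p)
    {
        if (p->dirty)
            return p;        /* pull it back off the dirty list         */
        if (p->contents_valid)
            return p;        /* pull it back off the free list, no IO   */
        return 0;            /* frame was reused: re-read from disk     */
    }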
In many ways, similar to what Rik is proposing, although I don't see any
"fast reclaim" capability. Also, the method by which pages are aged
is quite different (global phys memory scan vs. processes maintaining
their own LRU set). Having a list of prime candidates to flush makes
the kswapd/pageout overhead lower than using a global clock hand, but
the global clock hand *may* perform better global optimisation
of page aging.
> 2. i agree with you that when the system exhausts memory, it hits a hard
> knee; it would be better to soften this. however, the VM system is
> designed to optimize the case where the system has enough memory. in
> other words, it is designed to avoid unnecessary work when there is no
> need to reclaim memory. this design was optimized for a desktop workload,
> like the scheduler or ext2 "async" mode. if i can paraphrase other
> comments i've heard on these lists, it epitomizes a basic design
> philosophy: "to optimize the common case gains the most performance
> advantage."
This works fine until I have a stable load on my system and then
start {Netscape, StarOffice, VMware, etc.} which then causes IO for
demand paging of the executable, as well as paging/swapping activity
to make room for the piggish footprints of these bigger applications.
This is where it might help to pre-write dirty pages when the system
is more idle, without fully returning those pages to the free list.
> can a soft-knee swapping algorithm be demonstrated that doesn't impact the
> performance of applications running on a system that hasn't exhausted its
> memory?
>
> - Chuck Lever
Our VM doesn't exhibit a strong knee, but its method of avoiding that
is again the flexing RSS management. Inactive processes tend to shrink
to their working footprint, larger processes tend to grow to expand
their footprint but still self-manage within the limits of available
memory. I think it is possible to soften the knee on a per-workload
basis, and that's probably a spot for some tuneables. E.g. when to
flush dirty old pages, how many to flush, and I think Rik has already
talked about having those tunables.
Despite the fact that our systems have been primarily deployed for
a single workload type (databases), we still have found that (the
right!) VM tuneables can have an enormous impact on performance. I
think the same will be much more true of an OS like Linux which tries
to be many things to all people.
gerrit
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/
^ permalink raw reply [flat|nested] 46+ messages in thread
[parent not found: <87256934.0078DADB.00@d53mta03h.boulder.ibm.com>]
* Re: RFC: design for new VM
[not found] <87256934.0078DADB.00@d53mta03h.boulder.ibm.com>
@ 2000-08-08 0:48 ` Gerrit.Huizenga
2000-08-08 15:21 ` Rik van Riel
0 siblings, 1 reply; 46+ messages in thread
From: Gerrit.Huizenga @ 2000-08-08 0:48 UTC (permalink / raw)
To: Rik van Riel; +Cc: chucklever, linux-mm, linux-kernel, Linus Torvalds
> On Mon, 7 Aug 2000, Rik van Riel wrote:
> The idea is that the memory_pressure variable indicates how
> much page stealing is going on (on average) so every time
> kswapd wakes up it knows how much pages to steal. That way
> it should (if we're "lucky") free enough pages to get us
> along until the next time kswapd wakes up.
Seems like you could signal kswapd when either the page fault
rate increases or the rate of (memory allocations / memory
frees) hits a tuneable(?) ratio (I hate relying on luck, simply
because so much luck is bad ;-)
> About NUMA scalability: we'll have different memory pools
> per NUMA node. So if you have a 32-node, 64GB NUMA machine,
> it'll partly function like 32 independant 2GB machines.
One lesson we learned early on is that anything you can
possibly do on a per-CPU basis helps both SMP and NUMA
activity. This includes memory management, scheduling,
TCP performance counters, any kind of system counters, etc.
Once you have the basic SMP hierarchy in place, adding a NUMA
hierarchy (or more than one for architectures that need it)
is much easier.
Also, is there a kswapd per pool? Or does one kswapd oversee
all of the pools (in the NUMA world, that is)?
gerrit
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/
^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: RFC: design for new VM
2000-08-08 0:48 ` Gerrit.Huizenga
@ 2000-08-08 15:21 ` Rik van Riel
0 siblings, 0 replies; 46+ messages in thread
From: Rik van Riel @ 2000-08-08 15:21 UTC (permalink / raw)
To: Gerrit.Huizenga; +Cc: chucklever, linux-mm, linux-kernel, Linus Torvalds
On Mon, 7 Aug 2000 Gerrit.Huizenga@us.ibm.com wrote:
> > On Mon, 7 Aug 2000, Rik van Riel wrote:
> > The idea is that the memory_pressure variable indicates how
> > much page stealing is going on (on average) so every time
> > kswapd wakes up it knows how much pages to steal. That way
> > it should (if we're "lucky") free enough pages to get us
> > along until the next time kswapd wakes up.
>
> Seems like you could signal kswapd when either the page fault
> rate increases or the rate of (memory allocations / memory
> frees) hits a tuneable? ratio
We will. Each page steal and each allocation will increase
the memory_pressure variable, and because of that, also the
inactive_target.
Whenever either
- one zone gets low on free memory *OR*
- all zones get more or less low on free+inactive_clean pages *OR*
- we get low on inactive pages (inactive_shortage > inactive_target/2),
THEN kswapd gets woken up immediately.
We do this both from the page allocation code and from
__find_page_nolock (which gets hit every time we reclaim
an inactive page back for its original purpose).
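Spelled out as code, the wakeup test is roughly the following (a
simplified sketch with illustrative field names, not the real
structures or the actual test we'll end up with):

    /* Simplified sketch of the wakeup conditions described above. */
    struct zone_sketch {
        long free, inactive_clean, inactive_dirty;
        long pages_low, pages_high;
    };

    static int kswapd_should_wake(struct zone_sketch *z, int nr_zones,
                                  long inactive_shortage, long inactive_target)
    {
        int i, all_low = 1;

        for (i = 0; i < nr_zones; i++) {
            if (z[i].free < z[i].pages_low)
                return 1;           /* one zone low on free memory */
            if (z[i].free + z[i].inactive_clean >= z[i].pages_high)
                all_low = 0;
        }
        if (all_low)
            return 1;               /* all zones low on free + inactive_clean */
        if (inactive_shortage > inactive_target / 2)
            return 1;               /* low on inactive pages overall */
        return 0;
    }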
> > About NUMA scalability: we'll have different memory pools
> > per NUMA node. So if you have a 32-node, 64GB NUMA machine,
> > it'll partly function like 32 independant 2GB machines.
>
> One lesson we learned early on is that anything you can
> possibly do on a per-CPU basis helps both SMP and NUMA
> activity. This includes memory management, scheduling,
> TCP performance counters, any kind of system counters, etc.
> Once you have the basic SMP hierarchy in place, adding a NUMA
> hierarchy (or more than one for architectures that need it)
> is much easier.
>
> Also, is there a kswapd per pool? Or does one kswapd oversee
> all of the pools (in the NUMA world, that is)?
Currently we have none of this, but once 2.5 is forked
off, I'll submit a patch which shuffles all variables
into per-node (per pgdat) structures.
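Roughly what I have in mind is something like this (field names purely
illustrative -- the real layout will be decided when the patch is
written):

    /* Illustrative only: which things become per node (pgdat) and
     * which stay per zone in the eventual 2.5 patch.               */
    struct page;

    struct zone_sketch {
        struct page *inactive_dirty;     /* per-zone dirty, old pages */
        struct page *inactive_clean;     /* per-zone clean, old pages */
        struct page *free_list;          /* per-zone free pages       */
    };

    typedef struct pg_data_sketch {
        struct zone_sketch     zones[3];         /* e.g. DMA / normal / highmem */
        struct page           *active_list;      /* active list is per node     */
        long                   memory_pressure;  /* per-node pressure average   */
        long                   inactive_target;
        struct pg_data_sketch *node_next;
    } pg_data_sketch_t;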
regards,
Rik
--
"What you're running that piece of shit Gnome?!?!"
-- Miguel de Icaza, UKUUG 2000
http://www.conectiva.com/ http://www.surriel.com/
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/
^ permalink raw reply [flat|nested] 46+ messages in thread
[parent not found: <8725692F.0079E22B.00@d53mta03h.boulder.ibm.com>]
* Re: RFC: design for new VM
[not found] <8725692F.0079E22B.00@d53mta03h.boulder.ibm.com>
@ 2000-08-07 17:40 ` Gerrit.Huizenga
2000-08-07 18:37 ` Matthew Wilcox
` (2 more replies)
0 siblings, 3 replies; 46+ messages in thread
From: Gerrit.Huizenga @ 2000-08-07 17:40 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-mm, linux-kernel, Linus Torvalds
Hi Rik,
I have a few comments on your RFC for VM. Some are simply
observational, some are based on our experience locally with the
development, deployment and maintenance of a VM subsystem here at IBM
NUMA-Q (formerly Sequent Computer Systems, Inc.). As you may remember,
our VM subsystem was initially designed in ~1982-1984 to operate on 30
processor SMP machines, and in roughly 1993-1995 it was updated to
support NUMA systems up to 64 processors. Our machines started with ~1
GB of physical memory, and today support up to 64 GB of physical memory
on a 32-64 processor machine. These machines run a single operating
system (DYNIX/ptx) which is derived originally from BSD 4.2, although
the VM subsystem has been completely rewritten over the years.
Along the way, we learned many things about memory latency, large
memory support, SMP & NUMA issues, some of which may be useful to
you in your current design effort.
First, and perhaps foremost, I believe your design deals almost
exclusively with page aging & page replacement algorithms, rather
than being a complete VM redesign, although feel free to correct
me if I have misconstrued that. For instance, I don't believe you
are planning to redo the 3 or 4 tier page table layering as part
of your effort, nor are you changing memory allocation routines in
any kernel-visible way. I also don't see any modifications to kernel
pools, general memory management of free pages (e.g. AVL trees vs.
linked lists), any changes to the PAE mechanism currently in use,
no reference to alternate page sizes (e.g. Intel PSE), buffer/page
cache organization, etc. I also see nothing in the design which
reduces the need for global TLB flushes across the system, which
is one area where I believe Linux is starting to suffer as CPU counts
increase. I believe a full VM redesign would tend to address all of
these issues, even if it did so in a completely modular fashion.
I also note that you intend to draw heavily from the FreeBSD
implementation. Two areas in which to be very careful here have
already been mentioned, but they are worth restating: FreeBSD
has little to no SMP experience (e.g. kernel big lock) and little
to no large memory experience. I believe Linux is actually slightly
more advanced in both of these areas, and a good redesign should
preserve and/or improve on those capabilities.
I believe that your current proposed aging mechanism, while perhaps
a positive refinement of what currently exists, still suffers from
a fundamental problem in that you are globally managing page aging.
In both large memory systems and in SMP systems, scalability is
greatly enhanced if major capabilities like page aging can in some
way be localized. One mechanism might be to use something like
per-CPU zones from which private pages are typically allocated
and to which they are freed. This, in conjunction with good scheduler
affinity, maximizes the benefits of any CPU L1/L2 cache. Another
mechanism, and the one that we chose in our operating system, was to
use modified process resident set sizes as the mechanism for page
management. The basic modifications are to make the RSS tuneable
system wide as well as per process. The RSS size "flexes" based on
available memory and a process's page fault frequency (PFF). Frequent
page faults force the RSS to increase, infrequent page faults cause a
process's resident size to shrink. When memory pressure mounts, the
running process manages itself a little more aggressively; processes
which have "flexed" their resident set size beyond their system or per
process recommended maxima are among the first to lose pages. And,
when pressure cannot be addressed by RSS management, swapping starts.
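A very rough sketch of the PFF feedback, with invented constants and
names (again, this is not our actual code, just the shape of the idea):

    /* Sketch of PFF-driven RSS flexing; thresholds are made up. */
    struct proc_vm {
        long rss_limit;   /* flexing per-process resident set limit      */
        long rss_max;     /* system or per-process recommended maximum   */
        long faults;      /* page faults seen in the last sample period  */
    };

    #define PFF_HIGH 32   /* faulting hard: let the RSS grow             */
    #define PFF_LOW   2   /* barely faulting: shrink toward the footprint */

    static void flex_rss(struct proc_vm *p, int memory_tight)
    {
        if (p->faults > PFF_HIGH)
            p->rss_limit++;
        else if (p->faults < PFF_LOW && p->rss_limit > 0)
            p->rss_limit--;

        /* under pressure, processes flexed beyond their recommended
         * maximum are the first to lose pages; swapping starts only
         * when this still isn't enough                                */
        if (memory_tight && p->rss_limit > p->rss_max)
            p->rss_limit = p->rss_max;

        p->faults = 0;    /* start a new sample period */
    }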
Another fundamental flaw I see with both the current page aging mechanism
and the proposed mechanism is that workloads which exhaust memory pay
no penalty at all until memory is full. Then there is a sharp spike
in the amount of (slow) IO as pages are flushed, processes are swapped,
etc. There is no apparent smoothing of spikes, such as increasing the
rate of IO as the rate of memory pressure increases. With the exception
of laptops, most machines can sustain a small amount of background
asynchronous IO without affecting performance (laptops may want IO
batched to maximize battery life). I would propose that as memory
pressure increases, paging/swapping IO should increase somewhat
proportionally. This provides some smoothing for the bursty nature of
most single user or small ISP workloads. I believe database-style
loads on larger machines would also benefit.
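Something along these lines, in illustrative arithmetic only:

    /* Illustrative only: background pageout IO that rises roughly in
     * proportion to how far below target free memory has fallen,
     * instead of doing nothing until the free list is empty.          */
    static long pages_to_flush(long free_pages, long free_target,
                               long max_background_io)
    {
        long shortfall = free_target - free_pages;

        if (shortfall <= 0)
            return 0;                    /* no pressure, no background IO */
        if (shortfall > free_target)
            shortfall = free_target;     /* clamp */
        return max_background_io * shortfall / free_target;
    }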
Your current design does not address SMP locking at all. I would
suggest that a single VM lock would provide reasonable scaleability
up to about 16 processors, depending on page size, memory size, processor
speed, and the ratio of processor speed to memory bandwidth. One
method for stretching that lock is to use zoned, per-processor (or
per-node) data for local page allocations whenever possible. Then
local allocations can use minimal locking (need only to protect from
memory allocations in interrupt code). Further, the layout of memory
in a bitmaped, power of 2 sized "buddy system" can speed allocations,
reducing the amount of time during which a critical lock needs to be
held. AVL trees will perform similarly well, with the exception that
a resource bitmap tends to be easier on TLB entries and processor
cache. A bitmaped allocator may also be useful in more efficiently
allocating pages of variable sizes on a CPU which supports variable
sized pages in hardware.
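As a toy illustration of why a bitmap is cheap on cache and TLB (this
is not a real allocator, just the lookup idea):

    /* Toy sketch: one bit per page, a set bit means "free".  Testing a
     * whole 32-page run takes one word load, which is what makes a
     * bitmap easy on the cache compared with walking scattered list
     * nodes.  Not production code.                                    */
    #include <stdint.h>

    #define NR_PAGES (1 << 20)              /* e.g. 4GB of 4K pages    */
    static uint32_t free_map[NR_PAGES / 32];

    static long find_free_block(int order)  /* block = 2^order pages   */
    {
        long pages = 1L << order;
        long i, bit;

        for (i = 0; i < NR_PAGES; i += pages) {   /* blocks are aligned */
            int ok = 1;
            for (bit = i; bit < i + pages; bit += 32) {
                uint32_t mask = pages >= 32
                    ? 0xffffffffu
                    : ((1u << pages) - 1) << (bit % 32);
                if ((free_map[bit / 32] & mask) != mask) {
                    ok = 0;
                    break;
                }
            }
            if (ok)
                return i;                    /* index of first free page */
        }
        return -1;                           /* nothing big enough found */
    }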
Also, I note that your filesys->flush() mechanism utilizes a call
per page. This is an interesting capability, although I'd question
the processor efficiency of a page granularity here. On large memory
systems, with large processes starting (e.g. Netscape, StarOffice, or
possibly a database client), it seems like a callback to a filesystem
which said something like flush("I must have at least 10 pages from
you", "and I'd really like 100 pages") might be a better way to
use this advisory capability. You've already pointed out that a
specific page may be requested but other pages may be freed instead;
this may be a more explicit way to code the policy
you really want.
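In interface terms, something like this hypothetical shape (names
invented purely to make the suggestion concrete):

    /* Hypothetical shape only -- not proposed code.  The VM asks the
     * filesystem for a quantity of pages instead of calling back once
     * per page.                                                        */
    struct flush_request {
        unsigned long needed;   /* "I must have at least this many pages" */
        unsigned long wanted;   /* "and I'd really like this many"        */
    };

    struct flush_target_sketch {
        /* returns how many pages the filesystem actually freed */
        unsigned long (*flush)(void *fs_private, struct flush_request *req);
    };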
It would also be interesting to review the data structure you intend
to use in terms of cache line layout, as well as look at the algorithms
which use those structures with an eye towards minimizing page & cache
hits for both SMP *and* single processor efficiency.
Hope this is of some help,
Gerrit Huizenga
IBM NUMA-Q (nee' Sequent)
Gerrit.Huizenga@us.ibm.com
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/
^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: RFC: design for new VM
2000-08-07 17:40 ` Gerrit.Huizenga
@ 2000-08-07 18:37 ` Matthew Wilcox
2000-08-07 20:55 ` Chuck Lever
2000-08-08 3:26 ` David Gould
2 siblings, 0 replies; 46+ messages in thread
From: Matthew Wilcox @ 2000-08-07 18:37 UTC (permalink / raw)
To: Gerrit.Huizenga; +Cc: Rik van Riel, linux-mm, linux-kernel, Linus Torvalds
On Mon, Aug 07, 2000 at 10:40:52AM -0700, Gerrit.Huizenga@us.ibm.com wrote:
> Also, I note that your filesys->flush() mechanism utilizes a call
> per page. This is an interesting capability, although I'd question
> the processor efficiency of a page granularity here. On large memory
> systems, with large processes starting (e.g. Netscape, StarOffice, or
> possible a database client), it seems like a callback to a filesystem
> which said something like flush("I must have at least 10 pages from
> you", "and I'd really like 100 pages") might be a better way to
> use this advisory capability. You've already pointed out that you
> may request that a specific page might be requested but other pages
> may be freed; this may be a more explicit way to code the policy
> you really want.
i had a little argument with Rik about this. his PoV is that the
filesystem should know nothing about which pages are aged and are ready
to be sent to disc. so what he wants is the filesystem to be able to say
`no, you can't flush that page'.
--
Revolutions do not require corporate support.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/
^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: RFC: design for new VM
2000-08-07 17:40 ` Gerrit.Huizenga
2000-08-07 18:37 ` Matthew Wilcox
@ 2000-08-07 20:55 ` Chuck Lever
2000-08-07 21:59 ` Rik van Riel
2000-08-08 3:26 ` David Gould
2 siblings, 1 reply; 46+ messages in thread
From: Chuck Lever @ 2000-08-07 20:55 UTC (permalink / raw)
To: Gerrit.Huizenga; +Cc: linux-mm, linux-kernel, Linus Torvalds
hi gerrit-
good to see you on the list.
On Mon, 7 Aug 2000 Gerrit.Huizenga@us.ibm.com wrote:
> Another fundamental flaw I see with both the current page aging mechanism
> and the proposed mechanism is that workloads which exhaust memory pay
> no penalty at all until memory is full. Then there is a sharp spike
> in the amount of (slow) IO as pages are flushed, processes are swapped,
> etc. There is no apparent smoothing of spikes, such as increasing the
> rate of IO as the rate of memory pressure increases. With the exception
> of laptops, most machines can sustain a small amount of background
> asynchronous IO without affecting performance (laptops may want IO
> batched to maximize battery life). I would propose that as memory
> pressure increases, paging/swapping IO should increase somewhat
> proportionally. This provides some smoothing for the bursty nature of
> most single user or small ISP workloads. I believe databases style
> loads on larger machines would also benefit.
2 comments here.
1. kswapd runs in the background and wakes up every so often to handle
the corner cases that smooth bursty memory request workloads. it executes
the same code that is invoked from the kernel's memory allocator to
reclaim pages.
2. i agree with you that when the system exhausts memory, it hits a hard
knee; it would be better to soften this. however, the VM system is
designed to optimize the case where the system has enough memory. in
other words, it is designed to avoid unnecessary work when there is no
need to reclaim memory. this design was optimized for a desktop workload,
like the scheduler or ext2 "async" mode. if i can paraphrase other
comments i've heard on these lists, it epitomizes a basic design
philosophy: "to optimize the common case gains the most performance
advantage."
can a soft-knee swapping algorithm be demonstrated that doesn't impact the
performance of applications running on a system that hasn't exhausted its
memory?
- Chuck Lever
--
corporate: <chuckl@netscape.com>
personal: <chucklever@bigfoot.com>
The Linux Scalability project:
http://www.citi.umich.edu/projects/linux-scalability/
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: RFC: design for new VM
2000-08-07 20:55 ` Chuck Lever
@ 2000-08-07 21:59 ` Rik van Riel
0 siblings, 0 replies; 46+ messages in thread
From: Rik van Riel @ 2000-08-07 21:59 UTC (permalink / raw)
To: chucklever; +Cc: Gerrit.Huizenga, linux-mm, linux-kernel, Linus Torvalds
On Mon, 7 Aug 2000, Chuck Lever wrote:
> On Mon, 7 Aug 2000 Gerrit.Huizenga@us.ibm.com wrote:
> > Another fundamental flaw I see with both the current page aging mechanism
> > and the proposed mechanism is that workloads which exhaust memory pay
> > no penalty at all until memory is full. Then there is a sharp spike
> > in the amount of (slow) IO as pages are flushed, processes are swapped,
> > etc. There is no apparent smoothing of spikes, such as increasing the
> > rate of IO as the rate of memory pressure increases. With the exception
> > of laptops, most machines can sustain a small amount of background
> > asynchronous IO without affecting performance (laptops may want IO
> > batched to maximize battery life). I would propose that as memory
> > pressure increases, paging/swapping IO should increase somewhat
> > proportionally. This provides some smoothing for the bursty nature of
> > most single user or small ISP workloads. I believe databases style
> > loads on larger machines would also benefit.
>
> 2 comments here.
>
> 1. kswapd runs in the background and wakes up every so often to handle
> the corner cases that smooth bursty memory request workloads. it executes
> the same code that is invoked from the kernel's memory allocator to
> reclaim pages.
*nod*
The idea is that the memory_pressure variable indicates how
much page stealing is going on (on average) so every time
kswapd wakes up it knows how many pages to steal. That way
it should (if we're "lucky") free enough pages to get us
along until the next time kswapd wakes up.
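In outline, something like this (a sketch only, not the actual patch;
the call that reclaims a page is elided):

    /* Sketch of the idea above: memory_pressure tracks the recent page
     * steal rate, and kswapd tries to free about that many pages each
     * time it wakes up.                                                */
    static long memory_pressure;

    /* bumped from the allocation / page steal / page free paths */
    static void note_page_steal(void) { memory_pressure++; }
    static void note_page_free(void)  { if (memory_pressure > 0) memory_pressure--; }

    static void kswapd_once(void)
    {
        long target = memory_pressure;   /* ~ one interval's worth of steals */

        while (target-- > 0)
            ;                            /* a call to free one page goes here */

        memory_pressure -= memory_pressure >> 6;   /* slow decay of the average */
    }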
> 2. i agree with you that when the system exhausts memory, it
> hits a hard knee; it would be better to soften this.
The memory_pressure variable is there to ease this. If the load
is more or less bursty, but constant on a somewhat longer timescale
(say one minute), then we'll average the inactive_target to
somewhere between one and two seconds worth of page steals.
> can a soft-knee swapping algorithm be demonstrated that doesn't
> impact the performance of applications running on a system that
> hasn't exhausted its memory?
The algorithm we're using (dynamic inactive target w/
aggressively trying to meet that target) will eat disk
bandwidth in the case of one application filling memory
really fast but not swapping, but since the data is
kept in memory, it shouldn't be a very big performance
penalty in most cases.
About NUMA scalability: we'll have different memory pools
per NUMA node. So if you have a 32-node, 64GB NUMA machine,
it'll partly function like 32 independent 2GB machines.
We'll have to find a solution for the pagecache_lock (how do
we make this more scalable?), but the pagecache_lru_lock, the
memory queues/lists and kswapd will be per _node_.
regards,
Rik
--
"What you're running that piece of shit Gnome?!?!"
-- Miguel de Icaza, UKUUG 2000
http://www.conectiva.com/ http://www.surriel.com/
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: RFC: design for new VM
2000-08-07 17:40 ` Gerrit.Huizenga
2000-08-07 18:37 ` Matthew Wilcox
2000-08-07 20:55 ` Chuck Lever
@ 2000-08-08 3:26 ` David Gould
2000-08-08 5:54 ` Kanoj Sarcar
2 siblings, 1 reply; 46+ messages in thread
From: David Gould @ 2000-08-08 3:26 UTC (permalink / raw)
To: Gerrit.Huizenga; +Cc: Rik van Riel, linux-mm, linux-kernel, Linus Torvalds
On Mon, Aug 07, 2000 at 10:40:52AM -0700, Gerrit.Huizenga@us.ibm.com wrote:
...
> ... Another mechanism,
> and the one that we chose in our operating system, was to use a modified
> process resident set sizes as the machanism for page management. The
> basic modifications are to make the RSS tuneable system wide as well
> as per process. The RSS size "flexes" based on available memory and
> a processes page fault frequency (PFF). Frequent page faults force the
> RSS to increase, infrequent page faults cause a processes resident size
> to shrink. When memory pressure mounts, the running process manages
> itself a little more agressively; processes which have "flexed"
> their resident set size beyond their system or per process recommended
> maxima are among the first to lose pages. And, when pressure can not
> be addressed to RSS management, swapping starts.
Hmmm, the vm discussion and the lack of good documentation on vm systems
has sent me back to reread my old "VMS Internals and Data Structures" book,
at least for historical perspective. The above description of per process
RSS size adjustment controlled by page fault rate sounds quite similar to the
scheme in VMS.
Basically in VMS, processes page against themselves, not against the system
as a whole. A process grows or shrinks based on its recent pagefault
rate which is configurable with upper and lower targets. This happens
more or less continously. In addition the system has global goals for free
and dirty pages and in response to memory pressure will start cleaning pages,
(via a page writer task), or if need be, stealing pages from processes or
even swapping whole processes (via swapper task).
I am probably making a hash of describing this, and of course VMS is not the
last word by any means, but the system was very tunable, and had specific
explicit mechanisms to attain many of the goals of vm system. As such it
is an instructive example if only to point out the problems to be solved,
and at least one way to solve them. If you wish a real description there
is always the "big black book" by Kennah and Bates (IIRC), which has
about 150 pages just on the vm. For a short summary, I found a couple
of web pages about it:
http://cctr.umkc.edu/vms/72final/6491/6491pro_002.html#memory_chap
http://cctr.umkc.edu/vms/72final/6491/6491pro_003.html
I hope someone finds this useful...
-dg
--
David Gould dg@suse.com
SuSE, Inc., 580 2cd St. #210, Oakland, CA 94607 510.628.3380
"I sense a disturbance in the source" -- Alan Cox
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: RFC: design for new VM
2000-08-08 3:26 ` David Gould
@ 2000-08-08 5:54 ` Kanoj Sarcar
2000-08-08 7:15 ` David Gould
0 siblings, 1 reply; 46+ messages in thread
From: Kanoj Sarcar @ 2000-08-08 5:54 UTC (permalink / raw)
To: David Gould
Cc: Gerrit.Huizenga, Rik van Riel, linux-mm, linux-kernel, Linus Torvalds
>
> Hmmm, the vm discussion and the lack of good documentation on vm systems
> has sent me back to reread my old "VMS Internals and Data Structures" book,
I have been stressing the importance of documenting what people do
under Documentation/vm/*. Thinking I would provide an example, I
created two new files there, at least one of which was quickly outdated
by related changes ...
It would probably help documentation if Linus asked for that along
with patches which considerably change current algorithms. Trust me,
I have had to go back and look at documentation three weeks after
I submitted a patch ... that's all it takes to forget why something
was done one way, rather than another ...
Kanoj
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: RFC: design for new VM
2000-08-08 5:54 ` Kanoj Sarcar
@ 2000-08-08 7:15 ` David Gould
0 siblings, 0 replies; 46+ messages in thread
From: David Gould @ 2000-08-08 7:15 UTC (permalink / raw)
To: Kanoj Sarcar; +Cc: linux-mm
On Mon, Aug 07, 2000 at 10:54:43PM -0700, Kanoj Sarcar wrote:
> >
> > Hmmm, the vm discussion and the lack of good documentation on vm systems
> > has sent me back to reread my old "VMS Internals and Data Structures" book,
>
> I have been stressing the importance of documenting what people do
> under Documentation/vm/*. Thinking I would provide an example, I
> created two new files there, at least one of which was quickly outdated
> by related changes ...
>
> It would probably help documentation if Linus asked for that along
> with patches which considerably change current algorithms. Trust me,
> I have had to go back and look at documentations three weeks after
> I submitted a patch ... thats all it takes to forget why something
> was done one way, rather than another ...
>
> Kanoj
Yes, this would be good. Of course, getting documentation to track programs
is sort of an old and apparently insoluble problem. I like the Extreme
Programming approach a bit, because XP makes it clear that there is _no_
documentation other than the code. Worst case, we are where we are now, best
case, the code is more expressive of intent...
But I think the lack of documentation I meant was the lack of available
literature on how this stuff is supposed to work.
-dg
--
David Gould dg@suse.com
SuSE, Inc., 580 2cd St. #210, Oakland, CA 94607 510.628.3380
"I sense a disturbance in the source" -- Alan Cox
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: RFC: design for new VM
@ 2000-08-04 13:52 Mark_H_Johnson
0 siblings, 0 replies; 46+ messages in thread
From: Mark_H_Johnson @ 2000-08-04 13:52 UTC (permalink / raw)
To: riel; +Cc: linux-kernel, linux-mm, Linus Torvalds
I've read through this [and about 25 follow up messages] and have the
following [LONG] set of questions, comments, and concerns:
1. Can someone clearly express what we are trying to fix?
Is it the "process killing" behavior, the "kswapd runs at 100%" behavior,
or what. The two that I mentioned have been side effects of not having free
pages available [though in some cases, there IS plenty of backing store in
memory mapped files or in the swap partitions]. I cannot map what I read
from Rik's message [nor the follow up] to fixing any specific problems. The
statements made under "goals" fits the closest to what I am asking for, but
something similar to that should be the goal of the existing VM system, not
just the new one.
2. How do we know we have succeeded in fixing these problems?
Will we "declare success" if and only if 95% of all memory accesses refer
to pages that are in the resident set of the active process AND if
system overhead is <5% for a set of test cases? Can you characterize the
current performance of 2.2.16, 2.4-testX, and FreeBSD in those terms?
3. By setting a clear goal such as the hit rate & overhead identified
above, you can clearly tie the design to meeting those goals. I've
read the previous messages on physical page scanning vs. per process scans
- it is asserted that physical scans are faster. Good. But if a per process
scan improves the hit rate more than the overhead penalty, it can be better
to do this on a per process basis. Please show us how this new design will
work to meet such a goal.
4. As a system administrator, I know the kind of applications we will be
running. Other SA's know their load too. Give us a FEW adjustments to the
new VM system to tune it. As a developer of large real time applications,
we have two basic loads that are quite different:
a. Software developers doing coding, unit test, and some system testing
on workstations - X server and display, non real time, may be running heavy
swapping loads to run a load far bigger than the machine has memory for.
b. Delivered loads that have most of the physical memory locked, want -
no, demand - low latency (<1msec) since my fastest task runs at 80hz
(12.5msec), with high CPU loading (50-80% for hours), high network traffic,
and little or no I/O to the disk while real time is active.
I seriously doubt you can satisfy varied loads without providing some means
to adjust (i) resident set sizes, (ii) size of free & dirty lists, (iii)
limits on CPU time spent in VM reclamation, (iv) aging parameters, (v)
scanning rates, and so on. Yes - I can rebuild the kernel to do this, but
an interface through /proc or other configuration mechanism would be
better.
5. I have a few "smart applications" that know what their future memory
references will be. To use an example, the "out the window" visual display
for a flight simulator is based on the current terrain around the airplane.
You can predict the next regions needed based on the current air speed,
orientation, and current terrain profile. Can we allow for per process
paging algorithms in the new VM design [to identify pages to take into or
out of the current resident set]? This has been implemented in operating
systems before - I first saw this in the late 70's. For OS's that do not
provide such a mechanism, we end up doing complicated non-blocking I/O to
disk files. This could be implemented as:
a. address in the per process structure to indicate a paging handler is
available
b. system call to query & set that address, as well as a system call to
preload pages [essential for real time performance]
c. handler is called when its time to trim or adjust the resident set
d. handler is called with a map of current memory & request to replace "X"
pages.
e. result from handler is list of pages to remove and list of pages to add
to resident set [with net "X" pages removed or replaced].
f. kernel routines make the adjustments, schedule preload, etc.
I do not expect such a capability in 2.4 [even if a new VM is rolled out in
2.4]
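To make the shape of such an interface concrete, here is a strawman in
C (every name is invented for illustration; this is not a proposal for
the actual system call or structure names):

    /* Strawman only: the rough shape of the per-process handler from
     * steps a-f above.                                                 */
    struct rss_snapshot {
        unsigned long  nr_pages;
        unsigned long *resident;    /* pages currently in the resident set */
    };

    struct rss_decision {
        unsigned long  nr_remove, nr_add;
        unsigned long *remove;      /* pages the application will give up  */
        unsigned long *add;         /* pages it wants preloaded instead    */
    };

    /* (c)-(e): the kernel calls this when it is time to trim or adjust
     * the resident set; the handler returns 0 and fills in a decision,
     * which the kernel then applies and schedules preloads for (f).     */
    typedef int (*paging_handler_t)(const struct rss_snapshot *current_set,
                                    unsigned long must_release,
                                    struct rss_decision *out);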
6. I do not see any mention of how we handle "read once" data [example is
grep -ir xxx /], "SMP safety", or "locked memory". Perhaps a few "use
cases" to define the situations that the VM system is expected to handle
are needed. Then the design can relate to those & explain how it will work.
Here are a few examples:
a. heavy paging to a memory mapped file [mmap02?]
b. web serving of static [or dynamic] content [1000's of hits per second]
c. running Netscape on a small (32M) system
d. large system w/ or w/o NUMA
e. static load with large regions of locked memory [my real time example
above]
f. kernel build
g. same operations in UP, and SMP
h. deleting a large memory mapped file while it is being flushed to disk
[can you abort the flush operation & delete the file immediately?]
i. forcing synchronization of in memory data to disk
j. the floppy disk example [do I/O while drive is running to save energy
or overall time]
From this list, we should be able to specify what "good enough" is (paging
rates, overhead) for each situation.
7. Take into consideration relative performance of CPU cache, CPU speed,
memory access times, disk transfer times, in the algorithms. This relates
directly to a performance goal such as the one I suggested in #2. I can see
conditions where I have a relatively fast CPU, fast memory, but a NFS
mounted disk . The floppy case mentioned is similar. In that case - it
should be better to keep a steady flow of dirty pages going to that disk.
Other systems will have different situations. Determining this in run time
would be great. User settable parameters through /proc would be OK.
Please take these kind of issues into consideration in the new design.
Thanks.
--Mark H Johnson
<mailto:Mark_H_Johnson@raytheon.com>
Rik van Riel <riel@conectiva.com.br>
To: linux-mm@kvack.org
cc: linux-kernel@vger.rutgers.edu, Linus Torvalds <torvalds@transmeta.com>
(bcc: Mark H Johnson/RTS/Raytheon/US)
08/02/00 05:08 PM
Subject: RFC: design for new VM
Hi,
here is a (rough) draft of the design for the new VM, as
discussed at UKUUG and OLS. The design is heavily based
on the FreeBSD VM subsystem - a proven design - with some
tweaks where we think things can be improved. Some of the
ideas in this design are not fully developed, but none of
those "new" ideas are essential to the basic design.
The design is based around the following ideas:
- center-balanced page aging, using
- multiple lists to balance the aging
- a dynamic inactive target to adjust
the balance to memory pressure
- physical page based aging, to avoid the "artifacts"
of virtual page scanning
- separated page aging and dirty page flushing
- kupdate flushing "old" data
- kflushd syncing out dirty inactive pages
- as long as there are enough (dirty) inactive pages,
never mess up aging by searching for clean active
pages ... even if we have to wait for disk IO to
finish
- very light background aging under all circumstances, to
avoid half-hour old referenced bits hanging around
Center-balanced page aging:
- goals
- always know which pages to replace next
- don't spend too much overhead aging pages
- do the right thing when the working set is
big but swapping is very very light (or none)
- always keep the working set in memory in
favour of use-once cache
- page aging almost like in 2.0, only on a physical page basis
- page->age starts at PAGE_AGE_START for new pages
- if (referenced(page)) page->age += PAGE_AGE_ADV;
- else page->age is made smaller (linear or exponential?)
- if page->age == 0, move the page to the inactive list
- NEW IDEA: age pages with a lower page age more often
- data structures (page lists)
- active list
- per node/pgdat
- contains pages with page->age > 0
- pages may be mapped into processes
- scanned and aged whenever we are short
on free + inactive pages
- maybe multiple lists for different ages,
to be better resistant against streaming IO
(and for lower overhead)
- inactive_dirty list
- per zone
- contains dirty, old pages (page->age == 0)
- pages are not mapped in any process
- inactive_clean list
- per zone
- contains clean, old pages
- can be reused by __alloc_pages, like free pages
- pages are not mapped in any process
- free list
- per zone
- contains pages with no useful data
- we want to keep a few (dozen) of these around for
recursive allocations
- other data structures
- int memory_pressure
- on page allocation or reclaim, memory_pressure++
- on page freeing, memory_pressure-- (keep it >= 0, though)
- decayed on a regular basis (eg. every second x -= x>>6)
- used to determine inactive_target
- inactive_target == one (two?) second(s) worth of memory_pressure,
which is the amount of page reclaims we'll do in one second
- free + inactive_clean >= zone->pages_high
- free + inactive_clean + inactive_dirty >= zone->pages_high \
+ one_second_of_memory_pressure * (zone_size / memory_size)
- inactive_target will be limited to some sane maximum
(like, num_physpages / 4)
The idea is that when we have enough old (inactive + free)
pages, we will NEVER move pages from the active list to the
inactive lists. We do that because we'd rather wait for some
IO completion than evict the wrong page.
Kflushd / bdflush will have the honourable task of syncing
the pages in the inactive_dirty list to disk before they
become an issue. We'll run balance_dirty over the set of
free + inactive_clean + inactive_dirty AND we'll try to
keep free+inactive_clean > pages_high .. failing either of
these conditions will cause bdflush to kick into action and
sync some pages to disk.
If memory_pressure is high and we're doing a lot of dirty
disk writes, the bdflush percentage will kick in and we'll
be doing extra-aggressive cleaning. In that case bdflush
will automatically become more aggressive the more page
replacement is going on, which is a good thing.
Physical page based page aging
In the new VM we'll need to do physical page based page aging
for a number of reasons. Ben LaHaise said he already has code
to do this and it's "dead easy", so I take it this part of the
code won't be much of a problem.
The reasons we need to do aging on a physical page are:
- avoid the virtual address based aging "artifacts"
- more efficient, since we'll only scan what we need
to scan (especially when we'll test the idea of
aging pages with a low age more often than pages
we know to be in the working set)
- more direct feedback loop, so less chance of
screwing up the page aging balance
IO clustering
IO clustering is not done by the VM code, but nicely abstracted
away into a page->mapping->flush(page) callback. This means that:
- each filesystem (and swap) can implement their own, isolated
IO clustering scheme
- (in 2.5) we'll no longer have the buffer head list, but a list
of pages to be written back to disk, this means doing stuff like
delayed allocation (allocate on flush) or kiobuf based extents
is fairly trivial to do
Misc
Page aging and flushing are completely separated in this
scheme. We'll never end up aging and freeing a "wrong" clean
page because we're waiting for IO completion of old and
to-be-freed pages.
Write throttling comes quite naturally in this scheme. If we
have too many dirty inactive pages we'll write throttle. We
don't have to take dirty active pages into account since those
are no candidate for freeing anyway. Under light write loads
we will never write throttle (good) and under heavy write
loads the inactive_target will be bigger and write throttling
is more likely to kick in.
Some background page aging will always be done by the system.
We need to do this to clear away referenced bits every once in
a while. If we don't do this we can end up in the situation where,
once memory pressure kicks in, pages which haven't been referenced
in half an hour still have their referenced bit set and we have no
way of distinguishing between newly referenced pages and ancient
pages we really want to free. (I believe this is one of the causes
of the "freeze" we can sometimes see in current kernels)
Over the next weeks (months?) I'll be working on implementing the
new VM subsystem for Linux, together with various other people
(Andrea Arcangeli??, Ben LaHaise, Juan Quintela, Stephen Tweedie).
I hope to have it ready in time for 2.5.0, but if the code turns
out to be significantly more stable under load than the current
2.4 code I won't hesitate to submit it for 2.4.bignum...
regards,
Rik
--
"What you're running that piece of shit Gnome?!?!"
-- Miguel de Icaza, UKUUG 2000
http://www.conectiva.com/ http://www.surriel.com/
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/
^ permalink raw reply	[flat|nested] 46+ messages in thread
* RFC: design for new VM
@ 2000-08-02 22:08 Rik van Riel
2000-08-03 7:19 ` Chris Wedgwood
` (2 more replies)
0 siblings, 3 replies; 46+ messages in thread
From: Rik van Riel @ 2000-08-02 22:08 UTC (permalink / raw)
To: linux-mm; +Cc: linux-kernel, Linus Torvalds
Hi,
here is a (rough) draft of the design for the new VM, as
discussed at UKUUG and OLS. The design is heavily based
on the FreeBSD VM subsystem - a proven design - with some
tweaks where we think things can be improved. Some of the
ideas in this design are not fully developed, but none of
those "new" ideas are essential to the basic design.
The design is based around the following ideas:
- center-balanced page aging, using
- multiple lists to balance the aging
- a dynamic inactive target to adjust
the balance to memory pressure
- physical page based aging, to avoid the "artifacts"
of virtual page scanning
- separated page aging and dirty page flushing
- kupdate flushing "old" data
- kflushd syncing out dirty inactive pages
- as long as there are enough (dirty) inactive pages,
never mess up aging by searching for clean active
pages ... even if we have to wait for disk IO to
finish
- very light background aging under all circumstances, to
avoid half-hour old referenced bits hanging around
Center-balanced page aging:
- goals
- always know which pages to replace next
- don't spend too much overhead aging pages
- do the right thing when the working set is
big but swapping is very very light (or none)
- always keep the working set in memory in
favour of use-once cache
- page aging almost like in 2.0, only on a physical page basis
- page->age starts at PAGE_AGE_START for new pages
- if (referenced(page)) page->age += PAGE_AGE_ADV;
- else page->age is made smaller (linear or exponential?)
- if page->age == 0, move the page to the inactive list
- NEW IDEA: age pages with a lower page age more often (see the sketch below)
- data structures (page lists)
- active list
- per node/pgdat
- contains pages with page->age > 0
- pages may be mapped into processes
- scanned and aged whenever we are short
on free + inactive pages
- maybe multiple lists for different ages,
to be better resistant against streaming IO
(and for lower overhead)
- inactive_dirty list
- per zone
- contains dirty, old pages (page->age == 0)
- pages are not mapped in any process
- inactive_clean list
- per zone
- contains clean, old pages
- can be reused by __alloc_pages, like free pages
- pages are not mapped in any process
- free list
- per zone
- contains pages with no useful data
- we want to keep a few (dozen) of these around for
recursive allocations
- other data structures
- int memory_pressure
- on page allocation or reclaim, memory_pressure++
- on page freeing, memory_pressure-- (keep it >= 0, though)
- decayed on a regular basis (eg. every second x -= x>>6)
- used to determine inactive_target
- inactive_target == one (two?) second(s) worth of memory_pressure,
which is the amount of page reclaims we'll do in one second
- free + inactive_clean >= zone->pages_high
- free + inactive_clean + inactive_dirty >= zone->pages_high \
+ one_second_of_memory_pressure * (zone_size / memory_size)
- inactive_target will be limited to some sane maximum
(like, num_physpages / 4)
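To make the page->age and memory_pressure arithmetic above a bit more
concrete, here is a rough sketch (the constants and the choice between
linear and exponential down-aging are still open, as noted in the list):

    /* Sketch of the aging and target arithmetic from the list above;
     * PAGE_AGE_* values are placeholders.                              */
    #define PAGE_AGE_START  2
    #define PAGE_AGE_ADV    3
    #define PAGE_AGE_MAX   64

    static void age_page(int *age, int referenced)
    {
        if (referenced) {
            *age += PAGE_AGE_ADV;
            if (*age > PAGE_AGE_MAX)
                *age = PAGE_AGE_MAX;
        } else if (*age > 0) {
            *age /= 2;          /* exponential variant; "--" is the linear one */
        }
        /* when *age reaches 0 the page moves to the inactive_dirty list */
    }

    /* inactive_target: about one second's worth of recent page reclaims,
     * capped at a sane maximum                                           */
    static long inactive_target(long memory_pressure, long num_physpages)
    {
        long target = memory_pressure;   /* already decayed: x -= x >> 6 each second */

        if (target > num_physpages / 4)
            target = num_physpages / 4;
        return target;
    }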
The idea is that when we have enough old (inactive + free)
pages, we will NEVER move pages from the active list to the
inactive lists. We do that because we'd rather wait for some
IO completion than evict the wrong page.
Kflushd / bdflush will have the honourable task of syncing
the pages in the inactive_dirty list to disk before they
become an issue. We'll run balance_dirty over the set of
free + inactive_clean + inactive_dirty AND we'll try to
keep free+inactive_clean > pages_high .. failing either of
these conditions will cause bdflush to kick into action and
sync some pages to disk.
If memory_pressure is high and we're doing a lot of dirty
disk writes, the bdflush percentage will kick in and we'll
be doing extra-aggressive cleaning. In that case bdflush
will automatically become more aggressive the more page
replacement is going on, which is a good thing.
Physical page based page aging
In the new VM we'll need to do physical page based page aging
for a number of reasons. Ben LaHaise said he already has code
to do this and it's "dead easy", so I take it this part of the
code won't be much of a problem.
The reasons we need to do aging on a physical page are:
- avoid the virtual address based aging "artifacts"
- more efficient, since we'll only scan what we need
to scan (especially when we'll test the idea of
aging pages with a low age more often than pages
we know to be in the working set)
- more direct feedback loop, so less chance of
screwing up the page aging balance
IO clustering
IO clustering is not done by the VM code, but nicely abstracted
away into a page->mapping->flush(page) callback. This means that:
- each filesystem (and swap) can implement their own, isolated
IO clustering scheme
- (in 2.5) we'll no longer have the buffer head list, but a list
of pages to be written back to disk, this means doing stuff like
delayed allocation (allocate on flush) or kiobuf based extents
is fairly trivial to do
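In code terms the abstraction is roughly this (a sketch only; the real
operations table will look different):

    /* Sketch of the page->mapping->flush() abstraction; illustrative
     * only.                                                            */
    struct page;
    struct address_space;

    struct address_space_operations_sketch {
        /* each filesystem (and the swap code) supplies its own flush();
         * it is free to cluster neighbouring dirty pages into a single
         * request, or delay allocation, without the VM knowing          */
        int (*flush)(struct address_space *mapping, struct page *page);
    };

    struct address_space {
        struct address_space_operations_sketch *a_ops;
    };

    /* all the VM itself ever says is "write this page back somehow" */
    static int vm_flush_one(struct address_space *mapping, struct page *page)
    {
        return mapping->a_ops->flush(mapping, page);
    }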
Misc
Page aging and flushing are completely separated in this
scheme. We'll never end up aging and freeing a "wrong" clean
page because we're waiting for IO completion of old and
to-be-freed pages.
Write throttling comes quite naturally in this scheme. If we
have too many dirty inactive pages we'll write throttle. We
don't have to take dirty active pages into account since those
are no candidate for freeing anyway. Under light write loads
we will never write throttle (good) and under heavy write
loads the inactive_target will be bigger and write throttling
is more likely to kick in.
Some background page aging will always be done by the system.
We need to do this to clear away referenced bits every once in
a while. If we don't do this we can end up in the situation where,
once memory pressure kicks in, pages which haven't been referenced
in half an hour still have their referenced bit set and we have no
way of distinguishing between newly referenced pages and ancient
pages we really want to free. (I believe this is one of the causes
of the "freeze" we can sometimes see in current kernels)
Over the next weeks (months?) I'll be working on implementing the
new VM subsystem for Linux, together with various other people
(Andrea Arcangeli??, Ben LaHaise, Juan Quintela, Stephen Tweedie).
I hope to have it ready in time for 2.5.0, but if the code turns
out to be significantly more stable under load than the current
2.4 code I won't hesitate to submit it for 2.4.bignum...
regards,
Rik
--
"What you're running that piece of shit Gnome?!?!"
-- Miguel de Icaza, UKUUG 2000
http://www.conectiva.com/ http://www.surriel.com/
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/
^ permalink raw reply [flat|nested] 46+ messages in thread* Re: RFC: design for new VM
2000-08-02 22:08 Rik van Riel
@ 2000-08-03 7:19 ` Chris Wedgwood
2000-08-03 16:01 ` Rik van Riel
2000-08-03 18:27 ` lamont
2000-08-03 18:05 ` Linus Torvalds
2000-08-03 19:26 ` Roger Larsson
2 siblings, 2 replies; 46+ messages in thread
From: Chris Wedgwood @ 2000-08-03 7:19 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-mm, linux-kernel, Linus Torvalds
On Wed, Aug 02, 2000 at 07:08:52PM -0300, Rik van Riel wrote:
here is a (rough) draft of the design for the new VM, as
discussed at UKUUG and OLS. The design is heavily based
on the FreeBSD VM subsystem - a proven design - with some
tweaks where we think things can be improved.
Can the differences between your system and what FreeBSD has be
isolated or contained -- I ask this because the FreeBSD VM works
_very_ well compared to recent linux kernels; if/when the new system
is implement it would nice to know if performance differences are
tuning related or because of 'tweaks'.
--cw
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/
^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: RFC: design for new VM
2000-08-03 7:19 ` Chris Wedgwood
@ 2000-08-03 16:01 ` Rik van Riel
2000-08-04 15:41 ` Matthew Dillon
2000-08-05 22:48 ` Theodore Y. Ts'o
2000-08-03 18:27 ` lamont
1 sibling, 2 replies; 46+ messages in thread
From: Rik van Riel @ 2000-08-03 16:01 UTC (permalink / raw)
To: Chris Wedgwood; +Cc: linux-mm, linux-kernel, Linus Torvalds, Matthew Dillon
On Thu, 3 Aug 2000, Chris Wedgwood wrote:
> On Wed, Aug 02, 2000 at 07:08:52PM -0300, Rik van Riel wrote:
>
> here is a (rough) draft of the design for the new VM, as
> discussed at UKUUG and OLS. The design is heavily based
> on the FreeBSD VM subsystem - a proven design - with some
> tweaks where we think things can be improved.
>
> Can the differences between your system and what FreeBSD has be
> isolated or contained
You're right, the differences between FreeBSD VM and the new
Linux VM should be clearly indicated.
> I ask this because the FreeBSD VM works _very_ well compared to
> recent linux kernels; if/when the new system is implement it
> would nice to know if performance differences are tuning related
> or because of 'tweaks'.
Indeed. The amount of documentation (books? nah..) on VM
is so sparse that it would be good to have both systems
properly documented. That would fill a void in CS theory
and documentation that was painfully there while I was
trying to find useful information to help with the design
of the new Linux VM...
regards,
Rik
--
"What you're running that piece of shit Gnome?!?!"
-- Miguel de Icaza, UKUUG 2000
http://www.conectiva.com/ http://www.surriel.com/
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/
^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: RFC: design for new VM
2000-08-03 16:01 ` Rik van Riel
@ 2000-08-04 15:41 ` Matthew Dillon
2000-08-04 17:49 ` Linus Torvalds
2000-08-05 22:48 ` Theodore Y. Ts'o
1 sibling, 1 reply; 46+ messages in thread
From: Matthew Dillon @ 2000-08-04 15:41 UTC (permalink / raw)
To: Rik van Riel; +Cc: Chris Wedgwood, linux-mm, linux-kernel, Linus Torvalds
Three or four times in the last year I've gotten emails from
people looking for 'VM documentation' or 'books they could read'.
I couldn't find a blessed thing! Oh, sure, there are papers strewn
about, but most are very focused on single aspects of a VM design.
I have yet to find anything that covers the whole thing. I've written
up an occasional 'summary piece' for FreeBSD, e.g. the Jan 2000 Daemon
News article, but that really isn't adequate.
The new Linux VM design looks exciting! I will be paying close
attention to your progress with an eye towards reworking some of
FreeBSD's code. Except for one or two eyesores (1) the FreeBSD code is
algorithmically sound, but pieces of the implementation are rather
messy from years of patching. When I first started working on it
the existing crew had a big bent towards patching rather than
rewriting and I had to really push to get some of my rewrites
through. The patching had reached the limits of the original
code-base's flexibility.
note(1) - the one that came up just last week was the O(N) nature
of the FreeBSD VM maps (linux uses an AVL tree here). These work
fine for 95% of the apps out there but turn into a sludgepile for
things like malloc debuggers and distributed shared memory systems
which want to mprotect() on a page-by-page basis. The second eyesore
is the lack of physically shared page table segments for 'standard'
processes. At the moment, it's an all (rfork/RFMEM/clone) or nothing
(fork) deal. Physical segment sharing outside of clone is something
Linux could use too; I don't think it does it either. It's not easy to
do right.
-Matt
Matthew Dillon
<dillon@backplane.com>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/
^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: RFC: design for new VM
2000-08-04 15:41 ` Matthew Dillon
@ 2000-08-04 17:49 ` Linus Torvalds
2000-08-04 23:51 ` Matthew Dillon
0 siblings, 1 reply; 46+ messages in thread
From: Linus Torvalds @ 2000-08-04 17:49 UTC (permalink / raw)
To: Matthew Dillon; +Cc: Rik van Riel, Chris Wedgwood, linux-mm, linux-kernel
On Fri, 4 Aug 2000, Matthew Dillon wrote:
>
> The second eyesore
> is the lack of physically shared page table segments for 'standard'
> processes. At the moment, it's an all (rfork/RFMEM/clone) or nothing
> (fork) deal. Physical segment sharing outside of clone is something
> Linux could use to, I don't think it does it either. It's not easy to
> do right.
It's probably impossible to do right. Basically, if you do it, you do it
wrong.
As far as I can tell, you basically screw yourself on the TLB and locking
if you ever try to implement this. And frankly I don't see how you could
avoid getting screwed.
There are architecture-specific special cases, of course. On ia64, the
page table is not really one page table, it's a number of pretty much
independent page tables, and it would be possible to extend the notion of
fork vs clone to be a per-page-table thing (ie the single-bit thing would
become a multi-bit thing, and the single "struct mm_struct" would become
an array of independent mm's).
You could do similar tricks on x86 by virtually splitting up the page
directory into independent (fixed-size) pieces - this is similar to what
the PAE stuff does in hardware, after all. So you could have (for example)
each process be quartered up into four address spaces with the top two
address bits being the address space sub-ID.
Quite frankly, it tends to be a nightmare to do that. It's also
unportable: it works on architectures that either support it natively
(like the ia64 that has the split page tables because of how it covers
large VM areas) or by "faking" the split on regular page tables. But it
does _not_ work very well at all on CPU's where the native page table is
actually a hash (old sparc, ppc, and the "other mode" in IA64). Unless the
hash happens to have some of the high bits map into a VM ID (which is
common, but not really something you can depend on).
And even when it "works" by emulation, you can't share the TLB contents
anyway. Again, it can be possible on a per-architecture basis (if the
different regions can have different ASI's - ia64 again does this, and I
think it originally comes from the 64-bit PA-RISC VM stuff). But it's one
of those bad ideas that if people start depending on it, it simply won't
work that well on some architectures. And one of the beauties of UNIX is
that it truly is fairly architecture-neutral.
And that's just the page table handling. The SMP locking for all this
looks even worse - you can't share a per-mm lock like with the clone()
thing, so you have to create some other locking mechanism.
I'd be interested to hear if you have some great idea (ie "oh, if you look
at it _this_ way all your concerns go away"), but I suspect you have only
looked at it from 10,000 feet and thought "that would be a cool thing".
And I suspect it ends up being anything _but_ cool once actually
implemented.
Linus
* Re: RFC: design for new VM
2000-08-04 17:49 ` Linus Torvalds
@ 2000-08-04 23:51 ` Matthew Dillon
2000-08-05 0:03 ` Linus Torvalds
0 siblings, 1 reply; 46+ messages in thread
From: Matthew Dillon @ 2000-08-04 23:51 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Rik van Riel, Chris Wedgwood, linux-mm, linux-kernel
:> (fork) deal. Physical segment sharing outside of clone is something
:> Linux could use too; I don't think it does it either. It's not easy to
:> do right.
:
:It's probably impossible to do right. Basically, if you do it, you do it
:wrong.
:
:As far as I can tell, you basically screw yourself on the TLB and locking
:if you ever try to implement this. And frankly I don't see how you could
:avoid getting screwed.
:
:There are architecture-specific special cases, of course. On ia64, the
:..
I spent a weekend a few months ago trying to implement page table
sharing in FreeBSD -- and gave up, but it left me with the feeling
that it should be possible to do without polluting the general VM
architecture.
For IA32, what it comes down to is that the page table generated by
any segment-aligned mmap() (segment == 4MB) made by two processes
should be shareable, simply by sharing the page directory entry (and thus
the physical page representing 4MB worth of mappings). This would be
restricted to MAP_SHARED mappings with the same protections, but the two
processes would not have to map the segments at the same VM address, they
need only be segment-aligned.
This would be a transparent optimization, wholly invisible to the process,
something that would be optionally implemented in the machine-dependent
part of the VM code (with general support in the machine-independent
part for the concept). If the process did anything to create a mapping
mismatch, such as call mprotect(), the shared page table would be split.
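To make the alignment precondition concrete, a small user-space sketch
(illustrative only; the helper name and the reserve-then-MAP_FIXED trick
are assumptions, not anything from the mail). Two processes each calling
this on the same file with the same protections end up at different
virtual addresses, but both addresses are 4MB-aligned, which is all the
proposed page-directory-entry sharing would need.

#include <stdint.h>
#include <sys/mman.h>

#define SEGMENT (4UL << 20)	/* 4MB: the span covered by one IA32 pde */

/*
 * Map 'len' bytes of 'fd' MAP_SHARED at a segment-aligned address:
 * reserve len + SEGMENT bytes, then place the real mapping with
 * MAP_FIXED at the first aligned address inside the reservation.
 * (The unused slack is left reserved, for brevity.)
 */
static void *map_segment_aligned(int fd, size_t len)
{
	char *raw = mmap(NULL, len + SEGMENT, PROT_NONE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (raw == MAP_FAILED)
		return MAP_FAILED;

	uintptr_t aligned = ((uintptr_t)raw + SEGMENT - 1) & ~(SEGMENT - 1);
	return mmap((void *)aligned, len, PROT_READ | PROT_WRITE,
		    MAP_SHARED | MAP_FIXED, fd, 0);
}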
The problem being solved for FreeBSD is actually quite serious -- due to
FreeBSD's tracking of individual page table entries, being able to share
a page table would radically reduce the amount of tracking information
required for any large shared areas (shared libraries, large shared file
mappings, large sysv shared memory mappings). For Linux the problem is
relatively minor - Linux would save considerable page table memory.
Linux is still reasonably scalable without the optimization while
FreeBSD currently falls on its face for truly huge shared mappings
(e.g. 300 processes all mapping a shared 1GB memory area, aka Oracle 8i).
(Linux falls on its face for other reasons, mainly the fact that it
maps all of physical memory into KVM in order to manage it).
I think the loss of MP locking for this situation is outweighed by the
benefit of a huge reduction in page faults -- rather than see 300
processes each take a page fault on the same page, only the first process
would and the pte would already be in place when the others got to it.
When it comes right down to it, page faults on shared data sets are not
really an issue for MP scalability.
In any case, this is a 'dream' for me for FreeBSD right now. It's a very
difficult problem to solve.
-Matt
* Re: RFC: design for new VM
2000-08-04 23:51 ` Matthew Dillon
@ 2000-08-05 0:03 ` Linus Torvalds
2000-08-05 1:52 ` Matthew Dillon
0 siblings, 1 reply; 46+ messages in thread
From: Linus Torvalds @ 2000-08-05 0:03 UTC (permalink / raw)
To: Matthew Dillon; +Cc: Rik van Riel, Chris Wedgwood, linux-mm, linux-kernel
On Fri, 4 Aug 2000, Matthew Dillon wrote:
> :
> :There are architecture-specific special cases, of course. On ia64, the
> :..
>
> I spent a weekend a few months ago trying to implement page table
> sharing in FreeBSD -- and gave up, but it left me with the feeling
> that it should be possible to do without polluting the general VM
> architecture.
>
> For IA32, what it comes down to is that the page table generated by
> any segment-aligned mmap() (segment == 4MB) made by two processes
> should be shareable, simply by sharing the page directory entry (and thus
> the physical page representing 4MB worth of mappings). This would be
> restricted to MAP_SHARED mappings with the same protections, but the two
> processes would not have to map the segments at the same VM address, they
> need only be segment-aligned.
I agree that from a page table standpoint you should be correct.
I don't think that the other issues are as easily resolved, though.
Especially with address space ID's on other architectures it can get
_really_ interesting to do TLB invalidates correctly to other CPU's etc
(you need to keep track of who shares parts of your page tables etc).
> This would be a transparent optimization, wholly invisible to the process,
> something that would be optionally implemented in the machine-dependent
> part of the VM code (with general support in the machine-independent
> part for the concept). If the process did anything to create a mapping
> mismatch, such as call mprotect(), the shared page table would be split.
Right. But what about the TLB?
It's not a problem on the x86, because the x86 doesn't have ASN's anyway.
But for it to be a valid notion, I feel that it should be able to be
portable too.
You have to have some page table locking mechanism for SMP eventually: I
think you miss some of the problems because the current FreeBSD SMP stuff
is mostly still "big kernel lock" (outdated info?), and you'll end up
kicking yourself in a big way when you have the 300 processes sharing the
same lock for that region..
(Not that I think you'd necessarily have much contention on the lock - the
problem tends to be more in the logistics of keeping track of the locks of
partial VM regions etc).
> (Linux falls on its face for other reasons, mainly the fact that it
> maps all of physical memory into KVM in order to manage it).
Not true any more.. Trying to map 64GB of RAM convinced us otherwise ;)
> I think the loss of MP locking for this situation is outweighed by the
> benefit of a huge reduction in page faults -- rather than see 300
> processes each take a page fault on the same page, only the first process
> would and the pte would already be in place when the others got to it.
> When it comes right down to it, page faults on shared data sets are not
> really an issue for MP scalability.
I think you'll find that there are all these small details that just
cannot be solved cleanly. Do you want to be stuck with a x86-only
solution?
That said, I cannot honestly say that I have tried very hard to come up
with solutions. I just have this feeling that it's a dark ugly hole that I
wouldn't want to go down..
Linus
* Re: RFC: design for new VM
2000-08-05 0:03 ` Linus Torvalds
@ 2000-08-05 1:52 ` Matthew Dillon
2000-08-05 1:09 ` Matthew Wilcox
` (2 more replies)
0 siblings, 3 replies; 46+ messages in thread
From: Matthew Dillon @ 2000-08-05 1:52 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Rik van Riel, Chris Wedgwood, linux-mm, linux-kernel
:I agree that from a page table standpoint you should be correct.
:
:I don't think that the other issues are as easily resolved, though.
:Especially with address space ID's on other architectures it can get
:_really_ interesting to do TLB invalidates correctly to other CPU's etc
:(you need to keep track of who shares parts of your page tables etc).
:
:...
:> mismatch, such as call mprotect(), the shared page table would be split.
:
:Right. But what about the TLB?
I'm not advocating trying to share TLB entries, that would be
a disaster. I'm contemplating just the physical page table structure.
e.g. if you mmap() a 1GB file shared (or private read-only) into 300
independant processes, it should be possible to share all the meta-data
required to support that mapping except for the TLB entries themselves.
ASNs shouldn't make a difference... presumably the tags on the TLB
entries are added on after the metadata lookup. I'm also not advocating
attempting to share intermediate 'partial' in-memory TLB caches (hash
tables or other structures). Those are typically fixed in size,
per-cpu, and would not be impacted by scale.
:You have to have some page table locking mechanism for SMP eventually: I
:think you miss some of the problems because the current FreeBSD SMP stuff
:is mostly still "big kernel lock" (outdated info?), and you'll end up
:kicking yourself in a big way when you have the 300 processes sharing the
:same lock for that region..
If it were a long-held lock I'd worry, but if it's a lock on a pte
I don't think it can hurt. After all, even with separate page tables
if 300 processes fault on the same backing file offset you are going
to hit a bottleneck with MP locking anyway, just at a deeper level
(the filesystem rather than the VM system). The BSDI folks did a lot
of testing with their fine-grained MP implementation and found that
putting a global lock around the entire VM system had absolutely no
impact on MP performance.
:> (Linux falls on its face for other reasons, mainly the fact that it
:> maps all of physical memory into KVM in order to manage it).
:
:Not true any more.. Trying to map 64GB of RAM convinced us otherwise ;)
Oh, that's cool! I don't think anyone in FreeBSDland has bothered with
large-memory (> 4GB) memory configurations, there doesn't seem to be
much demand for such a thing on IA32.
:> I think the loss of MP locking for this situation is outweighed by the
:> benefit of a huge reduction in page faults -- rather than see 300
:> processes each take a page fault on the same page, only the first process
:> would and the pte would already be in place when the others got to it.
:> When it comes right down to it, page faults on shared data sets are not
:> really an issue for MP scalability.
:
:I think you'll find that there are all these small details that just
:cannot be solved cleanly. Do you want to be stuck with a x86-only
:solution?
:
:That said, I cannot honestly say that I have tried very hard to come up
:with solutions. I just have this feeling that it's a dark ugly hole that I
:wouldn't want to go down..
:
: Linus
Well, I don't think this is x86-specific. Or, that is, I don't think it
would pollute the machine-independent code. FreeBSD has virtually no
notion of 'page tables' outside the i386-specific VM files... it doesn't
use page tables (or two-level page-like tables... is Linux still using
those?) to store meta information at all in the higher levels of the
kernel. It uses architecture-independent VM objects and vm_map_entry
structures for that. Physical page tables on FreeBSD are
throw-away-at-any-time entities. The actual implementation of the
'page table' in the IA32 sense occurs entirely in the machine-dependent
subdirectory for IA32.
A page-table sharing mechanism would have to implement the knowledge --
the 'potential' for sharing at a higher level (the vm_map_entry
structure), but it would be up to the machine-dependent VM code to
implement any actual sharing given that knowledge. So while the specific
implementation for IA32 is definitely machine-specific, it would have
no effect on other OS ports (of course, we have only one other
working port at the moment, to the alpha, but you get the idea).
-Matt
Matthew Dillon
<dillon@backplane.com>
* Re: RFC: design for new VM
2000-08-05 1:52 ` Matthew Dillon
@ 2000-08-05 1:09 ` Matthew Wilcox
2000-08-05 2:05 ` Linus Torvalds
2000-08-05 2:17 ` Alexander Viro
2 siblings, 0 replies; 46+ messages in thread
From: Matthew Wilcox @ 2000-08-05 1:09 UTC (permalink / raw)
To: Matthew Dillon
Cc: Linus Torvalds, Rik van Riel, Chris Wedgwood, linux-mm, linux-kernel
On Fri, Aug 04, 2000 at 06:52:16PM -0700, Matthew Dillon wrote:
> :Not true any more.. Trying to map 64GB of RAM convinced us otherwise ;)
>
> Oh, that's cool! I don't think anyone in FreeBSDland has bothered with
> large-memory (> 4GB) memory configurations, there doesn't seem to be
> much demand for such a thing on IA32.
you need to talk to Oracle or SAP :-)
--
Revolutions do not require corporate support.
* Re: RFC: design for new VM
2000-08-05 1:52 ` Matthew Dillon
2000-08-05 1:09 ` Matthew Wilcox
@ 2000-08-05 2:05 ` Linus Torvalds
2000-08-05 2:17 ` Alexander Viro
2 siblings, 0 replies; 46+ messages in thread
From: Linus Torvalds @ 2000-08-05 2:05 UTC (permalink / raw)
To: Matthew Dillon; +Cc: Rik van Riel, Chris Wedgwood, linux-mm, linux-kernel
On Fri, 4 Aug 2000, Matthew Dillon wrote:
> :
> :Right. But what about the TLB?
>
> I'm not advocating trying to share TLB entries, that would be
> a disaster.
You might have to, if the machine has a virtually mapped cache..
Ugh. That gets too ugly to even contemplate, actually. Just forget the
idea.
> If it were a long-held lock I'd worry, but if it's a lock on a pte
> I don't think it can hurt. After all, even with separate page tables
> if 300 processes fault on the same backing file offset you are going
> to hit a bottleneck with MP locking anyway, just at a deeper level
> (the filesystem rather than the VM system). The BSDI folks did a lot
> of testing with their fine-grained MP implementation and found that
> putting a global lock around the entire VM system had absolutely no
> impact on MP performance.
Hmm.. That may be load-dependent, but I know it wasn't true for Linux. The
kernel lock for things like brk() were some of the worst offenders, and
people worked hard on making mmap() and friends not need the BKL exactly
because it showed up very clearly in the lock profiles.
> :> (Linux falls on its face for other reasons, mainly the fact that it
> :> maps all of physical memory into KVM in order to manage it).
> :
> :Not true any more.. Trying to map 64GB of RAM convinced us otherwise ;)
>
> Oh, that's cool! I don't think anyone in FreeBSDland has bothered with
> large-memory (> 4GB) memory configurations, there doesn't seem to be
> much demand for such a thing on IA32.
Not normally no. Linux didn't start seeing the requirement until last year
or so, when running big databases and big benchmarks just required it
because the working set was so big. "dbench" with a lot of clients etc.
Now, whether such a working set is realistic or not is another issue, of
course. 64GB isn't as much memory as it used to be, though, and we
couldn't have beaten the Mindcraft NT numbers without large memory
support.
> Well, I don't think this is x86-specific. Or, that is, I don't think it
> would pollute the machine-independent code. FreeBSD has virtually no
> notion of 'page tables' outside the i386-specific VM files... it doesn't
> use page tables (or two-level page-like tables... is Linux still using
> those?) to store meta information at all in the higher levels of the
> kernel. It uses architecture-independent VM objects and vm_map_entry
> structures for that. Physical page tables on FreeBSD are
> throw-away-at-any-time entities. The actual implementation of the
> 'page table' in the IA32 sense occurs entirely in the machine-dependent
> subdirectory for IA32.
It's not the page tables themselves I worry about, but all the meta-data
synchronization requirements. But hey. Go wild, prove me wrong.
Linus
* Re: RFC: design for new VM
2000-08-05 1:52 ` Matthew Dillon
2000-08-05 1:09 ` Matthew Wilcox
2000-08-05 2:05 ` Linus Torvalds
@ 2000-08-05 2:17 ` Alexander Viro
2000-08-07 17:55 ` Matthew Dillon
2 siblings, 1 reply; 46+ messages in thread
From: Alexander Viro @ 2000-08-05 2:17 UTC (permalink / raw)
To: Matthew Dillon
Cc: Linus Torvalds, Rik van Riel, Chris Wedgwood, linux-mm, linux-kernel
On Fri, 4 Aug 2000, Matthew Dillon wrote:
> :You have to have some page table locking mechanism for SMP eventually: I
> :think you miss some of the problems because the current FreeBSD SMP stuff
> :is mostly still "big kernel lock" (outdated info?), and you'll end up
> :kicking yourself in a big way when you have the 300 processes sharing the
> :same lock for that region..
>
> If it were a long-held lock I'd worry, but if it's a lock on a pte
> I don't think it can hurt. After all, even with separate page tables
> if 300 processes fault on the same backing file offset you are going
> to hit a bottleneck with MP locking anyway, just at a deeper level
> (the filesystem rather than the VM system).
Erm... I'm not sure about that - for one thing, you are not caching
results of bmap(). We do. And our VFS is BKL-free, so contention really
hits only on the VOP_BALLOC() level (that can be fixed too, but that's
another story).
* Re: RFC: design for new VM
2000-08-05 2:17 ` Alexander Viro
@ 2000-08-07 17:55 ` Matthew Dillon
0 siblings, 0 replies; 46+ messages in thread
From: Matthew Dillon @ 2000-08-07 17:55 UTC (permalink / raw)
To: Alexander Viro
Cc: Linus Torvalds, Rik van Riel, Chris Wedgwood, linux-mm, linux-kernel
:> if 300 processes fault on the same backing file offset you are going
:> to hit a bottleneck with MP locking anyway, just at a deeper level
:> (the filesystem rather than the VM system).
:
:Erm... I'm not sure about that - for one thing, you are not caching
:results of bmap(). We do. And our VFS is BKL-free, so contention really
:hits only on the VOP_BALLOC() level (that can be fixed too, but that's
:another story).
Well... actually, a side effect of the FreeBSD buffer cache is to
cache BMAP translations.
What we do do, which kinda kills the cacheability aspects of balloc
in some cases, is VOP_REALLOC() -- that is, the FFS filesystem will
reallocate blocks to implement on-the-fly defragmentation. This
typically occurs on writes. It works *very* well. This is a feature
that actually used to be in FFS a few years ago but had to be turned off
due to bugs. The bugs were fixed about a year ago and realloc was turned
on by default in 4.x.
-Matt
Matthew Dillon
<dillon@backplane.com>
* Re: RFC: design for new VM
2000-08-03 16:01 ` Rik van Riel
2000-08-04 15:41 ` Matthew Dillon
@ 2000-08-05 22:48 ` Theodore Y. Ts'o
1 sibling, 0 replies; 46+ messages in thread
From: Theodore Y. Ts'o @ 2000-08-05 22:48 UTC (permalink / raw)
To: Rik van Riel; +Cc: Chris Wedgwood, linux-mm, linux-kernel, Matthew Dillon
You're right, the differences between FreeBSD VM and the new
Linux VM should be clearly indicated.
> I ask this because the FreeBSD VM works _very_ well compared to
> recent linux kernels; if/when the new system is implement it
> would nice to know if performance differences are tuning related
> or because of 'tweaks'.
Indeed. The amount of documentation (books? nah..) on VM
is so sparse that it would be good to have both systems
properly documented. That would fill a void in CS theory
and documentation that was painfully there while I was
trying to find useful information to help with the design
of the new Linux VM...
... and you know, once written, it would make a *wonderful* paper to
present at Freenix or for ALS.... (speaking as someone who has been on
program committees for both conferences :-)
- Ted
* Re: RFC: design for new VM
2000-08-03 7:19 ` Chris Wedgwood
2000-08-03 16:01 ` Rik van Riel
@ 2000-08-03 18:27 ` lamont
2000-08-03 18:34 ` Linus Torvalds
1 sibling, 1 reply; 46+ messages in thread
From: lamont @ 2000-08-03 18:27 UTC (permalink / raw)
To: Chris Wedgwood; +Cc: Rik van Riel, linux-mm, linux-kernel, Linus Torvalds
CONFIG_VM_FREEBSD_ME_HARDER would be a nice kernel option to have, if
possible. Then drop it iff the tweaks are proven over time to work
better.
On Thu, 3 Aug 2000, Chris Wedgwood wrote:
> On Wed, Aug 02, 2000 at 07:08:52PM -0300, Rik van Riel wrote:
>
> here is a (rough) draft of the design for the new VM, as
> discussed at UKUUG and OLS. The design is heavily based
> on the FreeBSD VM subsystem - a proven design - with some
> tweaks where we think things can be improved.
>
> Can the differences between your system and what FreeBSD has be
> isolated or contained -- I ask this because the FreeBSD VM works
> _very_ well compared to recent linux kernels; if/when the new system
> is implement it would nice to know if performance differences are
> tuning related or because of 'tweaks'.
>
>
>
> --cw
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.rutgers.edu
> Please read the FAQ at http://www.tux.org/lkml/
>
* Re: RFC: design for new VM
2000-08-03 18:27 ` lamont
@ 2000-08-03 18:34 ` Linus Torvalds
2000-08-03 19:11 ` Chris Wedgwood
2000-08-03 19:32 ` Rik van Riel
0 siblings, 2 replies; 46+ messages in thread
From: Linus Torvalds @ 2000-08-03 18:34 UTC (permalink / raw)
To: lamont; +Cc: Chris Wedgwood, Rik van Riel, linux-mm, linux-kernel
On Thu, 3 Aug 2000 lamont@icopyright.com wrote:
>
> CONFIG_VM_FREEBSD_ME_HARDER would be a nice kernel option to have, if
> possible. Then drop it iff the tweaks are proven over time to work
> better.
One problem is/may be the basic setup. Does FreeBSD have the notion of
things like high memory etc? Different memory pools for NUMA? Things like
that..
Linus
* Re: RFC: design for new VM
2000-08-03 18:34 ` Linus Torvalds
@ 2000-08-03 19:11 ` Chris Wedgwood
2000-08-03 21:04 ` Benjamin C.R. LaHaise
2000-08-03 19:32 ` Rik van Riel
1 sibling, 1 reply; 46+ messages in thread
From: Chris Wedgwood @ 2000-08-03 19:11 UTC (permalink / raw)
To: Linus Torvalds; +Cc: lamont, Rik van Riel, linux-mm, linux-kernel
No, I don't think it does -- so for people running <= 1GB of RAM
perhaps there should be a compile-time option to not have all this
additional stuff Linux will require?
--cw
* Re: RFC: design for new VM
2000-08-03 19:11 ` Chris Wedgwood
@ 2000-08-03 21:04 ` Benjamin C.R. LaHaise
0 siblings, 0 replies; 46+ messages in thread
From: Benjamin C.R. LaHaise @ 2000-08-03 21:04 UTC (permalink / raw)
To: Chris Wedgwood
Cc: Linus Torvalds, lamont, Rik van Riel, linux-mm, linux-kernel
Please don't make this kind of head-stuck-in-sand argument before you've
had a chance to test the code. If anything, choosing the correct page for
replacement is *more* important on a 4MB 386 where disks are typically
1/20th the speed of a desktop.
-ben
On Fri, 4 Aug 2000, Chris Wedgwood wrote:
> No, I don't think it does -- so for people running <= 1GB of RAM
> perhaps there should be a compile-time option to not have all this
> additional stuff Linux will require?
>
>
> --cw
* Re: RFC: design for new VM
2000-08-03 18:34 ` Linus Torvalds
2000-08-03 19:11 ` Chris Wedgwood
@ 2000-08-03 19:32 ` Rik van Riel
1 sibling, 0 replies; 46+ messages in thread
From: Rik van Riel @ 2000-08-03 19:32 UTC (permalink / raw)
To: Linus Torvalds; +Cc: lamont, Chris Wedgwood, linux-mm, linux-kernel
On Thu, 3 Aug 2000, Linus Torvalds wrote:
> On Thu, 3 Aug 2000 lamont@icopyright.com wrote:
> >
> > CONFIG_VM_FREEBSD_ME_HARDER would be a nice kernel option to have, if
> > possible. Then drop it iff the tweaks are proven over time to work
> > better.
>
> On eproblem is/may be the basic setup. Does FreeBSD have the
> notion of things like high memory etc? Different memory pools
> for NUMA? Things like that..
That's basically a minor issue. The FreeBSD page replacement
code (or rather the slightly modified one) can just be glued
on top of that.
If the code isn't modular enough to do that it wouldn't be
maintainable anyway.
Rik
--
"What you're running that piece of shit Gnome?!?!"
-- Miguel de Icaza, UKUUG 2000
http://www.conectiva.com/ http://www.surriel.com/
* Re: RFC: design for new VM
2000-08-02 22:08 Rik van Riel
2000-08-03 7:19 ` Chris Wedgwood
@ 2000-08-03 18:05 ` Linus Torvalds
2000-08-03 18:50 ` Rik van Riel
` (4 more replies)
2000-08-03 19:26 ` Roger Larsson
2 siblings, 5 replies; 46+ messages in thread
From: Linus Torvalds @ 2000-08-03 18:05 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-mm, linux-kernel
On Wed, 2 Aug 2000, Rik van Riel wrote:
>
> [Linus: I'd really like to hear some comments from you on this idea]
I am completely and utterly baffled on why you think that the multi-list
approach would help balancing.
Every single indication we have ever had is that balancing gets _harder_
when you have multiple sources of pages, not easier.
As far as I can tell, the only advantage of multiple lists compared to the
current one is to avoid overhead in walking extra pages, no?
And yet you claim that you see no way to fix the current VM behaviour.
This is illogical, and sounds like complete crap to me.
Why don't you just do it with the current scheme (the only thing needed to
be added to the current scheme being the aging, which we've had before),
and prove that the _balancing_ works. If you can prove that the balancing
works but that we spend unnecessary time in scanning the pages, then
you've proven that the basic VM stuff is right, and then the multiple
queues becomes a performance optimization.
Yet you seem to sell the "multiple queues" idea as some fundamental
change. I don't see that. Please explain what makes your ideas so
radically different?
> The design is based around the following ideas:
> - center-balanced page aging, using
> - multiple lists to balance the aging
> - a dynamic inactive target to adjust
> the balance to memory pressure
> - physical page based aging, to avoid the "artifacts"
> of virtual page scanning
> - separated page aging and dirty page flushing
> - kupdate flushing "old" data
> - kflushd syncing out dirty inactive pages
> - as long as there are enough (dirty) inactive pages,
> never mess up aging by searching for clean active
> pages ... even if we have to wait for disk IO to
> finish
> - very light background aging under all circumstances, to
> avoid half-hour old referenced bits hanging around
As far as I can tell, the above is _exactly_ equivalent to having one
single list, and multiple "scan-points" on that list.
A "scan-point" is actually very easy to implement: anybody at all who
needs to scan the list can just include his own "anchor-page": a "struct
page" that is purely local to that particular scanner, and that
nobody else will touch because it has an artificially elevated usage count
(and because there is actually no real page associated with that virtual
"struct page" the page count will obviosly never decrease ;).
Then, each scanner just advances its own anchor-page around the list, and
does whatever it is that the scanner is designed to do on the page it
advances over. So "bdflush" would do
	..
	lock_list();
	struct page *page = advance(&bdflush_entry);
	if (page->buffer) {
		/* dirty buffers: start write-out on this page */
		get_page(page);
		unlock_list();
		flush_page(page);
		continue;
	}
	unlock_list();
	..

while the page ager would do

	lock_list();
	/* note: the ager advances its _own_ anchor, not bdflush's */
	struct page *page = advance(&page_ager_entry);
	page->age = page->age >> 1;
	if (PageReferenced(page))
		page->age += PAGE_AGE_REF;
	unlock_list();
etc.. Basically, you can have any number of virtual "clocks" on a single
list.
No radical changes necessary. This is something we can easily add to
2.4.x.
The reason I'm unconvinced about multiple lists is basically:
- they are inflexible. Each list has a meaning, and a page cannot easily
be on more than one list. It's really hard to implement overlapping
meanings: you get exponential expansion of combinations, and everybody
has to be aware of them.
For example, imagine that the definition of "dirty" might be different
for different filesystems. Imagine that you have a filesystem with its
own specific "walk the pages to flush out stuff", with special logic
that is unique to that filesystem ("you cannot write out this page
until you've done 'Y' or whatever). This is hard to do with your
approach. It is trivial to do with the single-list approach above.
More realistic (?) example: starting write-back of pages is very
different from waiting on locked pages. We may want to have a "dirty
but not yet started" list, and a "write-out started but not completed"
locked list. Right now we use the same "clock" for them (the head of
the LRU queue with some ugly heuristic to decide whether we want to
wait on anything).
But we potentially really want to have separate logic for this: we want
to have a background "start writeout" that goes on all the time, and
then we want to have a separate "start waiting" clock that uses
different principles on which point in the list to _wait_ on stuff.
This is what we used to have in the old buffer.c code (the 2.0 code
that Alan likes). And it was _horrible_ to have separate lists, because
in fact pages can be both dirty and locked and they really should have
been on both lists etc..
- in contrast, scan-points (without LRU, but instead working on the basis
of the age of the page - which is logically equivalent) offer the
potential for specialized scanners. You could have "statistics
gathering robots" that you add dynamically. Or you could have
per-device flush deamons.
For example, imagine a common problem with floppies: we have a timeout
for the floppy motor because it's costly to start them up again. And
they are removable. A perfect floppy driver would notice when it is
idle, and instead of turning off the motor it might decide to scan for
dirty pages for the floppy on the (correct) assumption that it would be
nice to have them all written back instead of turning off the motor and
making the floppy look idle.
With a per-device "dirty list" (which you can test out with a page
scanner implementation to see if it ends up really improving floppy
behaviour) you could essentially have a guarantee: whenever the floppy
motor is turned off, the filesystem on that floppy is synced.
Test implementation: floppy daemon that walks the list and turns off
the engine only after having walked it without having seen any dirty
blocks.
In the end, maybe you realize that you _really_ don't want a dirty list
at all. You want _multiple_ dirty lists, one per device.
And that's really my point. I think you're too eager to rewrite things,
and not interested enough in verifying that it's the right thing. Which
I think you can do with the current one-list thing easily enough.
- In the end, even if you don't need the extra flexibility of multiple
clocks, splitting them up into separate lists doesn't change behaviour,
it's "only" a CPU time optimization.
Which may well be worth it, don't get me wrong. But I don't see why you
tout this as being something radically needed in order to get better VM
behaviour. Sure, multiple lists avoids the unnecessary walking over
pages that we don't care about for some particular clock. And they may
well end up being worth it for that reason. But it's not a very good
way of doing prototyping of the actual _behaviour_ of the lists.
To make a long story short, I'd rather see a proof-of-concept thing. And I
distrust your notion that "we can't do it with the current setup, we'll
have to implement something radically different".
Basically, IF you think that your newly designed VM should work, then you
should be able to prototype and prove it easily enough with the current
one.
I'm personally of the opinion that people see that page aging etc is hard,
so they try to explain the current failures by claiming that it needs a
completely different approach. And in the end, I don't see what's so
radically different about it - it's just a re-organization. And as far as
I can see it is pretty much logically equivalent to just minor tweaks of
the current one.
(The _big_ change is actually the addition of a proper "age" field. THAT
is conceptually a very different approach to the matter. I agree 100% with
that, and the reason I don't get all that excited about it is just that we
_have_ done page aging before, and we dropped it for probably bad reasons,
and adding it back should not be that big of a deal. Probably less than 50
lines of diff).
Read Dilbert about the effectiveness of (and reasons for) re-
organizations.
Linus
* Re: RFC: design for new VM
2000-08-03 18:05 ` Linus Torvalds
@ 2000-08-03 18:50 ` Rik van Riel
2000-08-03 20:22 ` Linus Torvalds
2000-08-03 19:00 ` Richard B. Johnson
` (3 subsequent siblings)
4 siblings, 1 reply; 46+ messages in thread
From: Rik van Riel @ 2000-08-03 18:50 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-mm, linux-kernel
On Thu, 3 Aug 2000, Linus Torvalds wrote:
> On Wed, 2 Aug 2000, Rik van Riel wrote:
> >
> > [Linus: I'd really like to hear some comments from you on this idea]
>
> I am completely and utterly baffled on why you think that the
> multi-list approach would help balancing.
>
> Every single indication we have ever had is that balancing gets
> _harder_ when you have multiple sources of pages, not easier.
The lists are not at all dependent on where the pages come
from. The lists are dependent on the *page age*. This almost
sounds like you didn't read my mail... ;(
> As far as I can tell, the only advantage of multiple lists
> compared to the current one is to avoid overhead in walking
> extra pages, no?
NO. We need different queues so waiting for pages to be flushed
to disk doesn't screw up page aging of the other pages (the ones
we absolutely do not want to evict from memory yet).
That the inactive list is split into two lists has nothing to
do with page aging or balancing. We just do that to make it
easier to kick bdflush and to have the information available
we need for eg. write throttling.
> Why don't you just do it with the current scheme (the only thing
> needed to be added to the current scheme being the aging, which
> we've had before), and prove that the _balancing_ works.
In the current scheme we don't have enough information available
to do proper balancing.
> Yet you seem to sell the "multiple queues" idea as some fundamental
> change. I don't see that. Please explain what makes your ideas so
> radically different?
Having multiple queues instantly gives us the information we need
to do balancing. Having just one queue inevitably means we end up
doing page aging while waiting for already old pages to be flushed
to disk and we'll end up evicting the *wrong* pages from memory.
> As far as I can tell, the above is _exactly_ equivalent to
> having one single list, and multiple "scan-points" on that list.
More or less, yes. Except that the scan points still don't give us
the information we need to decide if we need to age more not-old
pages or if we simply have a large amount of dirty old pages and
we need to wait for them to be synced to disk.
> bdflush
>
> ..
> lock_list();
> struct page *page = advance(&bdflush_entry);
> if (page->buffer) {
> get_page(page);
> unlock_list();
> flush_page(page);
> continue;
> }
> unlock_list();
> ..
This is absolute CRAP. Have you read the discussions about the
page->mapping->flush(page) callback?
In 2.5 we'll be dealing with journaling filesystems, filesystems
with delayed allocation (flush on allocate) and various other
things you do not want the VM subsystem to know about.
We want to have 2 lists of dirty pages (that the VM subsystem
knows about) in the system:
- inactive_dirty
- active_writeback (works like the current bufferhead list)
Kupdate will _ask the filesystem_ (or swap subsystem) if a
certain page could be flushed to disk. If the subsystem called
has opportunities to do IO clustering, it can do so. If the page
is a pinned page of a journaling filesystem and cannot be flushed
yet, the filesystem will not flush it (but flush something else
instead, because it knows there is memory pressure).
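The shape of that callback might look roughly like this (a hypothetical
sketch; the structure and signature are assumptions, not the real
address_space_operations): the VM hands an old dirty page to its owning
subsystem, which may cluster IO or pick a different page if this one
cannot be written yet.

struct page;
struct address_space;

struct address_space_operations {
	/* Return 0 if write-out was started for 'page' or for a substitute
	 * chosen by the filesystem (journal-pinned pages, clustering, ...). */
	int (*flush)(struct page *page);
};

struct address_space {
	struct address_space_operations *a_ops;
};

/* Hypothetical helper on the VM side, for an inactive dirty page. */
static int vm_try_flush(struct address_space *mapping, struct page *page)
{
	if (mapping && mapping->a_ops && mapping->a_ops->flush)
		return mapping->a_ops->flush(page);	/* page->mapping->flush(page) */
	return -1;	/* anonymous page: fall back to swap write-out */
}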
> The reason I'm unconvinced about multiple lists is basically:
>
> - they are inflexible. Each list has a meaning, and a page cannot easily
> be on more than one list.
Until you figure out a way for pages to have multiple page ages
at the same time, I don't see how this is relevant.
> For example, imagine that the definition of "dirty" might be different
> for different filesystems. Imagine that you have a filesystem with its
> own specific "walk the pages to flush out stuff", with special logic
> that is unique to that filesystem ("you cannot write out this page
> until you've done 'Y' or whatever). This is hard to do with your
> approach. It is trivial to do with the single-list approach above.
That has absolutely nothing to do with it. The VM subsystem cares
about _page replacement_. Flushing pages is done by kindly asking
the filesystem if it could flush something (preferably this page).
Littering the VM subsystem with filesystem knowledge and having page
replacement fucked up by that is simply not the way to go. At least,
not if you want to have code that can actually be maintained by
anybody. Especially when the dirty bit means something different to
different filesystems ...
> More realistic (?) example: starting write-back of pages is very
> different from waiting on locked pages. We may want to have a "dirty
> but not yet started" list, and a "write-out started but not completed"
> locked list. Right now we use the same "clock" for them (the head of
> the LRU queue with some ugly heuristic to decide whether we want to
> wait on anything).
>
> But we potentially really want to have separate logic for this: we want
Gosh, so now you are proposing the multi-queue idea you flamed
into the ground one page up?
> - in contrast, scan-points (withour LRU, but instead working on the basis
> of the age of the page - which is logically equivalent) offer the
> potential for specialized scanners. You could have "statistics
> gathering robots" that you add dynamically. Or you could have
> per-device flush deamons.
We could still have those with the multiqueue code. Just have the
per-filesystem flush daemon walk the inactive_dirty and
active_writeback list.
Per-device flush daemons are, unfortunately(?), impossible when
you're dealing with allocate-on-flush filesystems.
> Bascially, IF you think that your newly designed VM should work,
> then you should be able to prototype and prove it easily enough
> with the current one.
The current one doesn't give us the information we need to
balance the different activities (keeping page aging at the
right pace, flushing out old dirty pages, write throttling)
with each other.
If there was any hope that the current VM would be a good
enough basis to work from I would have done that. In fact,
I tried this for the last 6 months and horribly failed.
Other people have also tried (and failed). I'd be surprised
if you could do better, but it sure would be a pleasant
surprise...
> (The _big_ change is actually the addition of a proper "age"
> field. THAT is conceptually a very different approach to the
> matter. I agree 100% with that,
While page aging is a fairly major part, it is certainly NOT
the big issue here...
The big issues are:
- separate page aging and page flushing, so lingering dirty
pages don't fuck up page aging
- organise the VM in such a way that we actually have the
information available we need for balancing the different
VM activities
- abstract away dirty page flushing in such a way that we
give filesystems (and swap) the opportunity for their own
optimisations
regards,
Rik
--
"What you're running that piece of shit Gnome?!?!"
-- Miguel de Icaza, UKUUG 2000
http://www.conectiva.com/ http://www.surriel.com/
* Re: RFC: design for new VM
2000-08-03 18:50 ` Rik van Riel
@ 2000-08-03 20:22 ` Linus Torvalds
2000-08-03 22:05 ` Rik van Riel
0 siblings, 1 reply; 46+ messages in thread
From: Linus Torvalds @ 2000-08-03 20:22 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-mm, linux-kernel
On Thu, 3 Aug 2000, Rik van Riel wrote:
>
> The lists are not at all dependant on where the pages come
> from. The lists are dependant on the *page age*. This almost
> sounds like you didn't read my mail... ;(
I did read the email. And I understand that. And that's exactly why I
think a single-list is equivalent (because your lists basically act simply
as "caches" of the page age).
> NO. We need different queues so waiting for pages to be flushed
> to disk doesn't screw up page aging of the other pages (the ones
> we absolutely do not want to evict from memory yet).
Ehh.. Did you read _my_ mail?
Go back. Read it. Realize that your "multiple queues" is nothing more than
"cached information". They do not change _behaviour_ at all. They only
change the amount of CPU-time you need to parse it.
Your arguments do not seem to address this issue at all.
In my mailbox I have an email from you as of yesterday (or the day before)
which says:
- I will not try to balance the current MM because it is not doable
And I don't see that your suggestion is fundamentally adding anything but
a CPU timesaver.
Basically, answer me this _simple_ question: what _behavioural_
differences do you claim multiple queues have? Ignore CPU usage for now.
I'm claiming they are just a cache.
And you claim that the current MM cannot be balanced, but your new one
can.
Please reconcile these two things for me.
Linus
* Re: RFC: design for new VM
2000-08-03 20:22 ` Linus Torvalds
@ 2000-08-03 22:05 ` Rik van Riel
2000-08-03 22:19 ` Linus Torvalds
0 siblings, 1 reply; 46+ messages in thread
From: Rik van Riel @ 2000-08-03 22:05 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-mm, linux-kernel
On Thu, 3 Aug 2000, Linus Torvalds wrote:
> On Thu, 3 Aug 2000, Rik van Riel wrote:
> >
> > The lists are not at all dependant on where the pages come
> > from. The lists are dependant on the *page age*. This almost
> > sounds like you didn't read my mail... ;(
>
> I did read the email. And I understand that. And that's exactly
> why I think a single-list is equivalent (because your lists
> basically act simply as "caches" of the page age).
If you add "with statistics about how many pages of age 0 there
are" this is indeed the case.
> > NO. We need different queues so waiting for pages to be flushed
> > to disk doesn't screw up page aging of the other pages (the ones
> > we absolutely do not want to evict from memory yet).
>
> Go back. Read it. Realize that your "multiple queues" is nothing
> more than "cached information". They do not change _behaviour_
> at all. They only change the amount of CPU-time you need to
> parse it.
If the information is cached somewhere else, then this is indeed
the case. My point is that we need to know how many pages with
page->age==0 we have, so we can know if we need to scan memory
and age more pages or if we should simply wait a bit until the
currently old pages are flushed to disk and ready to be reused.
> Basically, answer me this _simple_ question: what _behavioural_
> differences do you claim multiple queues have? Ignore CPU usage
> for now.
>
> I'm claiming they are just a cache.
>
> And you claim that the current MM cannot be balanced, but your
> new one can.
I agree that we could cache the information about how many pages
of different ages and different dirty state we have in memory in
a different way.
We could have one single queue, as you wrote, and a number of
counters. Basically we'd need a counter for the number of old
(age==0) clean pages and one for the old dirty pages.
Then we'd have multiple functions. Kflushd and kupdate would
flush out the old dirty pages, __alloc_pages would walk the
list to reclaim the old clean pages and we'd have a separate
page aging function that only walks the list when we're short
on free + inactive_dirty + inactive_clean pages.
That would give us the same behaviour as the plan I wrote.
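A sketch of that decision logic with the queue state reduced to counters
(variable names are illustrative, not actual kernel symbols):

extern int nr_free_pages;
extern int nr_old_clean_pages;	/* age == 0, instantly reclaimable    */
extern int nr_old_dirty_pages;	/* age == 0, waiting to be flushed    */
extern int inactive_target;	/* how many reclaimable pages we want */

enum vm_action { VM_DO_NOTHING, VM_WAIT_FOR_FLUSH, VM_AGE_MORE_PAGES };

static enum vm_action balance(void)
{
	int reclaimable = nr_free_pages + nr_old_clean_pages;

	if (reclaimable >= inactive_target)
		return VM_DO_NOTHING;
	if (reclaimable + nr_old_dirty_pages >= inactive_target)
		return VM_WAIT_FOR_FLUSH;	/* let kflushd/kupdate catch up */
	return VM_AGE_MORE_PAGES;		/* genuinely short: scan and age */
}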
What I fail to see is why this would be preferable to a code
base where all the different pages are neatly separated and
we don't have N+1 functions that are all scanning the same
list, special-casing out each other's pages and searching
the list for their own special pages...
regards,
Rik
--
"What you're running that piece of shit Gnome?!?!"
-- Miguel de Icaza, UKUUG 2000
http://www.conectiva.com/ http://www.surriel.com/
* Re: RFC: design for new VM
2000-08-03 22:05 ` Rik van Riel
@ 2000-08-03 22:19 ` Linus Torvalds
0 siblings, 0 replies; 46+ messages in thread
From: Linus Torvalds @ 2000-08-03 22:19 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-mm, linux-kernel
[ Ok, we agree on the basics ]
On Thu, 3 Aug 2000, Rik van Riel wrote:
>
> What I fail to see is why this would be preferable to a code
> base where all the different pages are neatly separated and
> we don't have N+1 functions that are all scanning the same
> list, special-casing out each other's pages and searching
> the list for their own special pages...
I disagree just with the "all improved, radically new, 50% more for the
same price" ad-campaign I've seen.
I don't like the fact that you said that you don't want to worry about
2.4.x because you don't think it can be fixed as it stands. I think
that's a cop-out and dishonest. I think I've explained why.
I could fully imagine doing even multi-lists in 2.4.x. I think performance
bugs are secondary to stability bugs, but hey, if the patch is clean and
straightforward and fixes a performance bug, I would not hesitate to apply
it. It may be that going to multi-lists actually is easier just because of
some things being more explicit. Fine.
But stop the ad-campaign. We get too many biased ads for presidents-to-be
already, no need to take that approach to technical issues. We need to fix
the VM balancing, we don't need to sell it to people with buzz-words.
Linus
* Re: RFC: design for new VM
2000-08-03 18:05 ` Linus Torvalds
2000-08-03 18:50 ` Rik van Riel
@ 2000-08-03 19:00 ` Richard B. Johnson
2000-08-03 19:29 ` Rik van Riel
2000-08-03 20:23 ` Linus Torvalds
2000-08-03 19:37 ` Ingo Oeser
` (2 subsequent siblings)
4 siblings, 2 replies; 46+ messages in thread
From: Richard B. Johnson @ 2000-08-03 19:00 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Rik van Riel, linux-mm, linux-kernel
On Thu, 3 Aug 2000, Linus Torvalds wrote:
>
> Read Dilbert about the effectiveness of (and reasons for) re-
> organizations.
>
> Linus
Reasons for:
.... cats in a litter box. They instinctively shuffle things around
to conceal what they have done..."
Cheers,
Dick Johnson
Penguin : Linux version 2.2.15 on an i686 machine (797.90 BogoMips).
"Memory is like gasoline. You use it up when you are running. Of
course you get it all back when you reboot..."; Actual explanation
obtained from the Micro$oft help desk.
* Re: RFC: design for new VM
2000-08-03 19:00 ` Richard B. Johnson
@ 2000-08-03 19:29 ` Rik van Riel
2000-08-03 20:23 ` Linus Torvalds
1 sibling, 0 replies; 46+ messages in thread
From: Rik van Riel @ 2000-08-03 19:29 UTC (permalink / raw)
To: Richard B. Johnson; +Cc: Linus Torvalds, linux-mm, linux-kernel
On Thu, 3 Aug 2000, Richard B. Johnson wrote:
> On Thu, 3 Aug 2000, Linus Torvalds wrote:
> >
> > Read Dilbert about the effectiveness of (and reasons for) re-
> > organizations.
>
> Reasons for:
> .... cats in a litter box. They instinctively shuffle things
> around to conceal what they have done..."
<flamebait>
Ermmm, in this case it was _Linus_ who replied after
reading a small part of the email and then cunningly
hid the rest ;)
</flamebait>
Rik
--
"What you're running that piece of shit Gnome?!?!"
-- Miguel de Icaza, UKUUG 2000
http://www.conectiva.com/ http://www.surriel.com/
* Re: RFC: design for new VM
2000-08-03 19:00 ` Richard B. Johnson
2000-08-03 19:29 ` Rik van Riel
@ 2000-08-03 20:23 ` Linus Torvalds
1 sibling, 0 replies; 46+ messages in thread
From: Linus Torvalds @ 2000-08-03 20:23 UTC (permalink / raw)
To: Richard B. Johnson; +Cc: Rik van Riel, linux-mm, linux-kernel
On Thu, 3 Aug 2000, Richard B. Johnson wrote:
> On Thu, 3 Aug 2000, Linus Torvalds wrote:
> >
> > Read Dilbert about the effectiveness of (and reasons for) re-
> > organizations.
> >
> > Linus
>
> Reasons for:
> .... cats in a litter box. They instinctively shuffle things around
> to conceal what they have done..."
Right.
This is my argument. I see much noise, I do not see what is so
fundamentally different.
And because I don't see what's so fundamentally different, I don't see a
reason to really believe that it's a magic bullet like Rik claims.
Rik, do you see my argument now?
Linus
* Re: RFC: design for new VM
2000-08-03 18:05 ` Linus Torvalds
2000-08-03 18:50 ` Rik van Riel
2000-08-03 19:00 ` Richard B. Johnson
@ 2000-08-03 19:37 ` Ingo Oeser
2000-08-03 20:40 ` Linus Torvalds
2000-08-04 2:33 ` David Gould
2000-08-16 15:10 ` Stephen C. Tweedie
4 siblings, 1 reply; 46+ messages in thread
From: Ingo Oeser @ 2000-08-03 19:37 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Rik van Riel, linux-mm, linux-kernel
On Thu, Aug 03, 2000 at 11:05:47AM -0700, Linus Torvalds wrote:
> As far as I can tell, the only advantage of multiple lists compared to the
> current one is to avoid overhead in walking extra pages, no?
[...]
> As far as I can tell, the above is _exactly_ equivalent to having one
> single list, and multiple "scan-points" on that list.
[...]
3 keywords:
- reordering of the list breaks _all_ scanpoints
- wraparound inside the scanner breaks ordering or it should
store its starting point globally
- state transitions _require_ reordering, which will affect
all scanners
conclusions:
- scanners can only run exclusively (spinlock()ed), one at a
time, if they can ever reorder the list, until they reach
their success or wrap point
- scanners that don't reorder the list have to be run under
the guarantee that the list will _never_ change until they
reach their wrap point or succeed for now
Isn't this really bad for performance? It would imply a lot of
waiting, but I haven't measured this ;-)
With the multiple list approach we can skip pages easily and
avoid contention and stuck scanners (waiting for the list_lock to
become free).
Even your headache with the "purpose" of the lists might get
addressed, if you consider adding a queue in between for the
special state you need (like "dirty_but_not_really_list" ;-)).
The only wish _I_ have is having portal functions for _all_ state
transitions, which can be used as entry points for future
extensions, which should continue adding portal functions for
their own transitions.
Practical example: *Nobody* was able to tell me where we stop
accessing a swapped out page (so it can be encrypted) and
where we start accessing a swapped in page (so it has to be
decrypted).
Would be no problem (nor a question ;-)) with portal functions
for this important state transition.
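For what it's worth, a portal for that particular transition could look
something like this (purely hypothetical names and hooks, nothing that
exists in any tree): every page heading to or returning from swap funnels
through one function, giving an extension such as swap encryption a
single, obvious place to hook.

struct page;

/* Optional hooks an extension may register (NULL when unused). */
static void (*swapout_hook)(struct page *page);	/* e.g. encrypt page data */
static void (*swapin_hook)(struct page *page);	/* e.g. decrypt page data */

extern int swap_write_page(struct page *page);	/* assumed low-level write-out */
extern int swap_read_page(struct page *page);	/* assumed low-level read-in   */

int portal_swap_out(struct page *page)
{
	/* last point at which the page contents are touched before disk */
	if (swapout_hook)
		swapout_hook(page);
	return swap_write_page(page);
}

int portal_swap_in(struct page *page)
{
	int err = swap_read_page(page);

	/* first point at which the contents are visible again after disk */
	if (!err && swapin_hook)
		swapin_hook(page);
	return err;
}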
PS: Maybe I didn't get your point with the "scan-points"
approach.
Regards
Ingo Oeser
--
Feel the power of the penguin - run linux@your.pc
<esc>:x
* Re: RFC: design for new VM
2000-08-03 19:37 ` Ingo Oeser
@ 2000-08-03 20:40 ` Linus Torvalds
2000-08-03 21:56 ` Ingo Oeser
0 siblings, 1 reply; 46+ messages in thread
From: Linus Torvalds @ 2000-08-03 20:40 UTC (permalink / raw)
To: Ingo Oeser; +Cc: Rik van Riel, linux-mm, linux-kernel
On Thu, 3 Aug 2000, Ingo Oeser wrote:
> On Thu, Aug 03, 2000 at 11:05:47AM -0700, Linus Torvalds wrote:
> > As far as I can tell, the only advantage of multiple lists compared to the
> > current one is to avoid overhead in walking extra pages, no?
>
> [...]
>
> > As far as I can tell, the above is _exactly_ equivalent to having one
> > single list, and multiple "scan-points" on that list.
>
> [...]
>
> 3 keywords:
>
> - reordering of the list breaks _all_ scanpoints
No.
Think about it.
The separate lists are _completely_ independent. That's why they are
separate, after all. So quite provably they do not interact with each
other, no?
So tell me, why would an algorithm that works on a single list, and on
that single list re-orders only those entries that would be on the
private lists, act any differently?
> - wraparound inside the scanner breaks ordering, or it should
> store its starting point globally
No. Read the email again. It uses markers: essentially virtual entries in
the list that simply get ignored by the other scanners.
> - state transitions _require_ reordering, which will affect
> all scanners
NO.
All your arguments are wrong.
Think about it _another_ way instead:
- the "multiple lists" case is provably a sub-case of the "one list,
scanners only care about their type of entries".
- the "one list" _allows_ for (but does not require) "mixing metaphors",
ie a scanner _can_ see and _can_ modify an entry that wouldn't be on
"its list".
> - scanners can only run exclusively (spinlock()ed), one at a
> time, if they can ever reorder the list, until they reach
> their temporary success point or wrap point
No. I guess you didn't understand what the "virtual page" anchor was all
about. It's adding an entry to the list that nobody uses (it could be
marked by an explicit flag in page->flags, if you will - it can be easier
thinking about it that way, although it is not required if there are other
heuristics that just make the marker something that other scanners don't
touch).
It's akin to the head of the list - except a page list doesn't actually
need to have a head at all - _any_ of these virtual pages act as anchors
for the list.
In its purest case you can think of the list as multiple independent
lists. But you can also allow the entries to interact if you wish.
And that's my beef with this: I can see a direct mapping from the multiple
list case to the single list case. Which means that the multiple list case
simply _cannot_ do something that the single-list case couldn't do.
(The reverse is also true: the single list can have the list entries
interact. That's logically equivalent to the case of the multi-list
implementation moving an entry from one list to another)
So a single list is basically equivalent to multi-list, as long as the
decisions to move and re-order entries are equivalent.
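To make the "anchor page" idea concrete, here is a minimal sketch (just
an illustration: PG_scan_anchor, the Page*ScanAnchor helpers and the
shared page_list are made-up names, and it assumes the pages are linked
through a list_head in struct page):

    /* A scan-point: a struct page with no real memory behind it. */
    static struct page scan_anchor;

    static void init_scan_point(struct list_head *page_list)
    {
        atomic_set(&scan_anchor.count, 1);     /* artificially elevated */
        SetPageScanAnchor(&scan_anchor);       /* made-up page->flags bit */
        list_add(&scan_anchor.lru, page_list);
    }

    /* Return the next real page after our anchor and move the anchor
     * forward; anchors belonging to other scanners are simply skipped. */
    static struct page *scan_point_next(struct list_head *page_list)
    {
        struct list_head *cur = scan_anchor.lru.next;

        while (cur != &scan_anchor.lru) {
            struct page *page;

            if (cur == page_list) {            /* skip the list head */
                cur = cur->next;
                continue;
            }
            page = list_entry(cur, struct page, lru);
            if (!PageScanAnchor(page)) {
                list_del(&scan_anchor.lru);
                list_add(&scan_anchor.lru, cur);  /* park the anchor here */
                return page;
            }
            cur = cur->next;
        }
        return NULL;                           /* wrapped all the way around */
    }

Each scanner gets its own anchor, so none of them ever has to know where
the others are.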
> Isn't this really bad for performance? It would imply a lot of
> waiting, but I haven't measured this ;-)
Not waiting. The multi-lists have the advantage of caching the state of a
page, and I see why we may want to go to multi-lists. I do not see why Rik
claims that multi-lists introduce anything _new_. That's my beef.
> With the multiple list approach we can skip pages easily and
> avoid contention and stuck scanners (waiting for the list_lock to
> become free).
The multi-list scanners will probably have multiple spinlocks, and that's
nice. But they will also have to move entries from one list to another,
which can be deadlock country etc (think of one CPU that wants to move
from the free list to the in-use list and another CPU that does the
reverse).
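(The usual way out of that is a fixed lock order - a sketch with a
made-up per-list type, assuming the page is linked via page->lru:)

    /* Made-up type: the per-list lock plus the list itself. */
    struct page_list {
        spinlock_t lock;
        struct list_head pages;
    };

    /* Move a page between two different lists without ABBA deadlock:
     * always take the two locks in address order. */
    static void move_page(struct page *page,
                          struct page_list *from, struct page_list *to)
    {
        struct page_list *first = from < to ? from : to;
        struct page_list *second = from < to ? to : from;

        spin_lock(&first->lock);
        spin_lock(&second->lock);
        list_del(&page->lru);
        list_add(&page->lru, &to->pages);
        spin_unlock(&second->lock);
        spin_unlock(&first->lock);
    }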
But again, I claim that multi-lists are a CPU optimization, not a
"behaviour" optimization. Yet everybody seems to claim that multi-lists
will help balance the VM better - implying that they have fundamentally
different _behaviour_. Which is not true, as far as I can tell.
Let me re-iterate: I'm not arguing against multi-lists. I'm arguing about
people being apparently dishonest and saying that the multi-lists are
somehow able to do things that the current VM wouldn't be able to do.
Linus
* Re: RFC: design for new VM
2000-08-03 20:40 ` Linus Torvalds
@ 2000-08-03 21:56 ` Ingo Oeser
2000-08-03 22:12 ` Linus Torvalds
0 siblings, 1 reply; 46+ messages in thread
From: Ingo Oeser @ 2000-08-03 21:56 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Rik van Riel, linux-mm, linux-kernel
On Thu, Aug 03, 2000 at 01:40:59PM -0700, Linus Torvalds wrote:
> > - state transitions _require_ reordering, which will affect
> > all scanners
>
> NO.
>
> All your arguments are wrong.
Hmm, so I think I was using wrong assumptions then...
I assumed all lists we talk about are circular and doubly linked
(either your single list or Rik's state lists).
I also assumed your markers are nothing but a normal element of
the list that is just skipped, and doesn't cause a wraparound of
each of the scanners.
What happens if one scanner decides to remove an element and
insert it elsewhere (to achieve its special ordering)?
Or are all elements only touched, and the ordering is only changed
by removing in the middle and appending only to either the head or
tail of this list?
> Think about it _another_ way instead:
> - the "multiple lists" case is provably a sub-case of the "one list,
> scanners only care about their type of entries".
Got this concept (I think).
> - the "one list" _allows_ for (but does not require) "mixing metaphors",
> ie a scanner _can_ see and _can_ modify an entry that wouldn't be on
> "its list".
That's what I would like to avoid. I don't like the idea of
multiple "states" per page. I would like to scan all pages that
are *guaranteed* to have a special state and catch their
transitions. I prefer clean automata design for this.
To get back to my encrypted swap example:
- I only have to catch the transition to "inactive_dirty" for
encryption (if the page is considered for real swap) and
mark it "PG_Encrypted".
- I only have to catch the transition to "active" and only
have to check for "PG_Encrypted", decrypt and clear this
flag.
- Or I use a new list "encrypted" and do a transition from
"encrypted" to "active" and from "inactive_dirty" to "encrypted",
including the right hook points in the VM, which would be more like
adding a layer instead of creating a kludge.
I still couldn't figure out how to do it for the kernels
floating around, since I don't get a clean state transition
diagram :-(
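Just to show the shape of what I mean by "portal functions" (every name
below is made up, it's only a sketch):

    /* All transitions into "inactive_dirty" funnel through one portal... */
    static void portal_make_inactive_dirty(struct page *page)
    {
        if (vm_encrypt_swap) {                 /* made-up tunable */
            encrypt_page(page);                /* made-up hook */
            SetPageEncrypted(page);            /* made-up page->flags bit */
        }
        add_page_to_inactive_dirty_list(page); /* whatever the VM ends up using */
    }

    /* ...and all transitions back to "active" through another. */
    static void portal_make_active(struct page *page)
    {
        if (PageEncrypted(page)) {
            decrypt_page(page);
            ClearPageEncrypted(page);
        }
        add_page_to_active_list(page);
    }

Extensions like encrypted swap would then only have to hook the portals,
not hunt for every place that touches the lists.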
> In its purest case you can think of the list as multiple independent
> lists. But you can also allow the entries to interact if you wish.
> And that's my beef with this: I can see a direct mapping from the multiple
> list case to the single list case. Which means that the multiple list case
> simply _cannot_ do something that the single-list case couldn't do.
Agreed. There is just a bit more atomicity between the scanners,
that's all, I think. And of course states are exclusive instead of
possibly inclusive.
> (The reverse is also true: the single list can have the list entries
> interact. That's logically equivalent to the case of the multi-list
> implementation moving an entry from one list to another)
>
> So a single list is basically equivalent to multi-list, as long as the
> decisions to move and re-order entries are equivalent.
Agreed.
> Let me re-iterate: I'm not arguing against multi-lists. I'm arguing about
> people being apparently dishonest and saying that the multi-lists are
> somehow able to do things that the current VM wouldn't be able to do.
Got that.
It's the features that multiple lists *lack* that make them
attractive to _my_ eyes. You are the one who has the last
word; I just want to make sure you've seen all the implications,
and I'm probably just being silly to assume you didn't ;-)
Regards
Ingo Oeser
--
Feel the power of the penguin - run linux@your.pc
<esc>:x
* Re: RFC: design for new VM
2000-08-03 21:56 ` Ingo Oeser
@ 2000-08-03 22:12 ` Linus Torvalds
0 siblings, 0 replies; 46+ messages in thread
From: Linus Torvalds @ 2000-08-03 22:12 UTC (permalink / raw)
To: Ingo Oeser; +Cc: Rik van Riel, linux-mm, linux-kernel
On Thu, 3 Aug 2000, Ingo Oeser wrote:
>
> I also assumed your markers are nothing but a normal element of
> the list that is just skipped, and doesn't cause a wraparound of
> each of the scanners.
Right.
Think of them as invisible.
> What happens if one scanner decides to remove an element and
> insert it elsewhere (to achieve its special ordering)?
Nothing, as far as the other scanners are aware, as they won't even look
at that element anyway (assuming they work the same way as a multi-list
scanner would work).
See?
One list is equivalent to multiple lists, assuming the scanners honour the
same logic as a multi-list scanner would (ie ignore entries that they
aren't designed for).
> > Think about it _another_ way instead:
> > - the "multiple lists" case is provably a sub-case of the "one list,
> > scanners only care about their type of entries".
>
> Got this concept (I think).
>
> > - the "one list" _allows_ for (but does not require) "mixing metaphors",
> ie a scanner _can_ see and _can_ modify an entry that wouldn't be on
> "its list".
>
> That's what I would like to avoid. I don't like the idea of
> multiple "states" per page. I would like to scan all pages that
> are *guaranteed* to have a special state and catch their
> transitions. I prefer clean automata design for this.
I would tend to agree with you. It's much easier to think about the
problems when you don't start "mixing" behaviour.
And getting a more explicit state transition may well be a good thing.
However, considering that right now we do not have that explicit code, I'd
hate to add it and require it to be 100% correct for 2.4.x. See?
And I dislike the mental dishonesty of claiming that multiple lists are
somehow different.
> > And that's my beef with this: I can see a direct mapping from the multiple
> > list case to the single list case. Which means that the multiple list case
> > simply _cannot_ do something that the single-list case couldn't do.
>
> Agreed. There is just a bit more atomicity between the scanners,
> that's all, I think. And of course states are exclusive instead of
> possibly inclusive.
I do like the notion of having stricter rules, and that is a huge bonus
for multi-lists.
But one downside of multi-lists is that we've had problems with them in
the past. fs/buffer.c used to use them even more than it does now, and it
was a breeding ground of bugs. fs/buffer.c got cleaned up, and the current
multi-list stuff is not at all that horrible any more, so multi-lists
aren't necessarily evil.
> > Let me re-iterate: I'm not arguing against multi-lists. I'm arguing about
> > people being apparently dishonest and saying that the multi-lists are
> > somehow able to do things that the current VM wouldn't be able to do.
>
> Got that.
>
> It's the features that multiple lists *lack* that make them
> attractive to _my_ eyes.
Oh, I can agree with that. Discipline can be good for you.
Linus
* Re: RFC: design for new VM
2000-08-03 18:05 ` Linus Torvalds
` (2 preceding siblings ...)
2000-08-03 19:37 ` Ingo Oeser
@ 2000-08-04 2:33 ` David Gould
2000-08-16 15:10 ` Stephen C. Tweedie
4 siblings, 0 replies; 46+ messages in thread
From: David Gould @ 2000-08-04 2:33 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Rik van Riel, linux-mm, linux-kernel
On Thu, Aug 03, 2000 at 11:05:47AM -0700, Linus Torvalds wrote:
...
> As far as I can tell, the above is _exactly_ equivalent to having one
> single list, and multiple "scan-points" on that list.
>
> A "scan-point" is actually very easy to implement: anybody at all who
> needs to scan the list can just include his own "anchor-page": a "struct
> page_struct" that is purely local to that particular scanner, and that
> nobody else will touch because it has an artificially elevated usage count
> (and because there is actually no real page associated with that virtual
> "struct page" the page count will obviosly never decrease ;).
I have seen this done in other contexts, where there was a single more or
less LRU list, with different regions, mainly, a "wash" region which
cleaned dirty pages. Regions were just pointers into the list.
Bad ASCII art concept drawing ('U' is used, 'D' is dirty,
'W' is washing, 'C' is clean):
[new]->U-U-U-U-D-U-D-U-U-D-U-U-D-D-U-*-W-W-W-W-W-W-W-W-*-C-C-C-C-C->[old]
       |<--------- active --------->|<----- wash ----->|<---- free ---->|
                                     ^washptr ("size of wash" is a tunable)
Basically when a page aged into the "wash" section of the list, it would be
cleaned and moved on to the clean section. This was done either on demand
by tasks trying to find free pages, or by a pagecleaner task. Tunables were
the size of the wash, the size goals for the free section, the I/O rate, how
far on-demand tasks would scan after finding a page, how aggressive the
pagecleaner was, etc.
It seemed to work ok, and the code was not too horrible.
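The pagecleaner side was conceptually just this (a sketch from memory,
all helper names made up):

    /* Keep the wash section down to its tuneable size. */
    static void pagecleaner(void)
    {
        while (wash_section_size() > wash_target) {
            struct page *page = oldest_wash_page();

            if (page_is_dirty(page))
                start_async_write(page);      /* clean it; it stays in the wash */
            else
                move_to_clean_section(page);  /* now reusable, like a free page */
        }
    }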
> etc.. Basically, you can have any number of virtual "clocks" on a single
> list.
Yes.
> For example, imagine a common problem with floppies: we have a timeout
> for the floppy motor because it's costly to start them up again. And
> they are removable. A perfect floppy driver would notice when it is
> idle, and instead of turning off the motor it might decide to scan for
> dirty pages for the floppy on the (correct) assumption that it would be
> nice to have them all written back instead of turning off the motor and
> making the floppy look idle.
This would be a big win for laptops. Instead of turning off flushing, or
flushing too often, just piggy back all the flushing onto time when the
drive was already spun up anyway. Happy result, less power and noise, and
more safety.
-dg
--
David Gould dg@suse.com
SuSE, Inc., 580 2cd St. #210, Oakland, CA 94607 510.628.3380
"I sense a disturbance in the source" -- Alan Cox
* Re: RFC: design for new VM
2000-08-03 18:05 ` Linus Torvalds
` (3 preceding siblings ...)
2000-08-04 2:33 ` David Gould
@ 2000-08-16 15:10 ` Stephen C. Tweedie
4 siblings, 0 replies; 46+ messages in thread
From: Stephen C. Tweedie @ 2000-08-16 15:10 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Rik van Riel, linux-mm, linux-kernel, Stephen Tweedie
Hi,
I'm coming to this late -- I've been getting drenched in the Scottish
rain up in Orkney and Skye for the past couple of weeks.
On Thu, Aug 03, 2000 at 11:05:47AM -0700, Linus Torvalds wrote:
> Why don't you just do it with the current scheme (the only thing needed to
> be added to the current scheme being the aging, which we've had before),
> and prove that the _balancing_ works. If you can prove that the balancing
> works but that we spend unnecessary time in scanning the pages, then
> you've proven that the basic VM stuff is right, and then the multiple
> queues becomes a performance optimization.
>
> Yet you seem to sell the "multiple queues" idea as some fundamental
> change. I don't see that. Please explain what makes your ideas so
> radically different?
> As far as I can tell, the above is _exactly_ equivalent to having one
> single list, and multiple "scan-points" on that list.
I've been talking with Rik about some of the other requirements that
filesystems have of the VM. What came out of it was a strong
impression that the VM is currently confusing too many different
tasks.
We have the following tasks:
* Aging of pages (maintaining the information about which pages are
good to reclaim)
* Reclaiming of pages (doesn't necessarily have to be done until
the free list gets low, even if we are still doing aging)
* Write-back of dirty pages when under memory pressure (including
swap write)
* Write-behind of dirty buffers on timeout
* Flow-control in the VM --- preventing aggressive processes from
consuming all free pages to the detriment of the rest of the
system, or from filling all of memory with dirty, non-reclaimable
pages
Rik's design specified that page aging --- the location of pages
suitable for freeing --- was to be done on a physical basis, using
something similar to 2.0's page walk (or FreeBSD's physical clock).
That scan doesn't have to walk lists: its main interaction with the
lists would be to populate the list of pages suitable for reclaim.
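In other words, something along these lines (only a sketch: it assumes a
page->age field as in the draft, and page_referenced(), page_unused()
and add_page_to_inactive_list() are made-up helpers):

    static unsigned long clock_hand;

    /* Walk mem_map with a clock hand; the page lists are only touched
     * to feed reclaim candidates into the inactive list. */
    static void age_some_pages(int count)
    {
        while (count--) {
            struct page *page = mem_map + clock_hand;

            clock_hand = (clock_hand + 1) % max_mapnr;
            if (PageReserved(page))
                continue;
            if (page_referenced(page))         /* test-and-clear referenced bits */
                page->age += PAGE_AGE_ADV;
            else if (page->age)
                page->age--;
            if (page->age == 0 && page_unused(page))
                add_page_to_inactive_list(page);
        }
    }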
Once page aging is cleaned up in that way, we don't have to worry
overly about the scan order for pages on the page lists. It's much
like the buffer cache --- we can use the buffer cache locked and dirty
lists to simplify the tracking of dirty data in the buffer cache
without having to worry about how those lists interact with the buffer
reclaim scan, since the reclaim scan is done using a completely
different scanning mechanism (the page list).
The presence of multiple queues isn't the Radically Different feature
of Rik's outline. The fact that aging is independent of those queues
is.
Cheers,
Stephen
* Re: RFC: design for new VM
2000-08-02 22:08 Rik van Riel
2000-08-03 7:19 ` Chris Wedgwood
2000-08-03 18:05 ` Linus Torvalds
@ 2000-08-03 19:26 ` Roger Larsson
2000-08-03 21:50 ` Rik van Riel
2 siblings, 1 reply; 46+ messages in thread
From: Roger Larsson @ 2000-08-03 19:26 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-mm, Scott F. Kaplan
Hi,
My comments (IMHO).
Rik van Riel wrote:
>
> Hi,
>
> here is a (rough) draft of the design for the new VM, as
> discussed at UKUUG and OLS. The design is heavily based
> on the FreeBSD VM subsystem - a proven design - with some
> tweaks where we think things can be improved. Some of the
> ideas in this design are not fully developed, but none of
> those "new" ideas are essential to the basic design.
>
> The design is based around the following ideas:
> - center-balanced page aging, using
> - multiple lists to balance the aging
> - a dynamic inactive target to adjust
> the balance to memory pressure
> - physical page based aging, to avoid the "artifacts"
> of virtual page scanning
> - separated page aging and dirty page flushing
> - kupdate flushing "old" data
> - kflushd syncing out dirty inactive pages
> - as long as there are enough (dirty) inactive pages,
> never mess up aging by searching for clean active
> pages ... even if we have to wait for disk IO to
> finish
> - very light background aging under all circumstances, to
> avoid half-hour old referenced bits hanging around
>
> Center-balanced page aging:
>
> - goals
> - always know which pages to replace next
> - don't spend too much overhead aging pages
> - do the right thing when the working set is
> big but swapping is very very light (or none)
> - always keep the working set in memory in
> favour of use-once cache
>
> - page aging almost like in 2.0, only on a physical page basis
> - page->age starts at PAGE_AGE_START for new pages
> - if (referenced(page)) page->age += PAGE_AGE_ADV;
> - else page->age is made smaller (linear or exponential?)
> - if page->age == 0, move the page to the inactive list
> - NEW IDEA: age pages with a lower page age
>
> - data structures (page lists)
> - active list
> - per node/pgdat
> - contains pages with page->age > 0
> - pages may be mapped into processes
> - scanned and aged whenever we are short
> on free + inactive pages
> - maybe multiple lists for different ages,
> to be better resistant against streaming IO
> (and for lower overhead)
Does this really need to be a list? Since most pages should
be on this list can't it be virtual - pages on no other list
are on active list. All pages are scanned all the time...
> - inactive_dirty list
> - per zone
> - contains dirty, old pages (page->age == 0)
> - pages are not mapped in any process
> - inactive_clean list
> - per zone
> - contains clean, old pages
> - can be reused by __alloc_pages, like free pages
> - pages are not mapped in any process
What will happen to pages on these lists if a page gets referenced?
* Move them back to the active list? Then it is hard to know how
many freeable pages there really are...
> - free list
> - per zone
> - contains pages with no useful data
> - we want to keep a few (dozen) of these around for
> recursive allocations
>
> - other data structures
> - int memory_pressure
> - on page allocation or reclaim, memory_pressure++
> - on page freeing, memory_pressure-- (keep it >= 0, though)
> - decayed on a regular basis (eg. every second x -= x>>6)
> - used to determine inactive_target
> - inactive_target == one (two?) second(s) worth of memory_pressure,
> which is the amount of page reclaims we'll do in one second
> - free + inactive_clean >= zone->pages_high
> - free + inactive_clean + inactive_dirty >= zone->pages_high \
> + one_second_of_memory_pressure * (zone_size / memory_size)
One of the most interesting aspects (IMHO) of Scott F. Kaplan's
"Compressed Cache
and Virtual Memory Simulation" was the use of VM time instead of wall
time.
One second could be too long of a reaction time - relative to X
allocations/sec etc.
> - inactive_target will be limited to some sane maximum
> (like, num_physpages / 4)
Question: Why is this needed?
Answer: Because high memory_pressure can only exist momentarily and can
pollute our statistics.
> The idea is that when we have enough old (inactive + free)
> pages, we will NEVER move pages from the active list to the
> inactive lists. We do that because we'd rather wait for some
> IO completion than evict the wrong page.
>
So, will the scanning stop then??? And referenced bits build up.
Or will there be pages with age == 0 on the active list?
(This is one of the reasons VM time is nice as a time base for ageing -
when little happens, time goes slower.)
This contradicts "very light background ageing" earlier.
> Kflushd / bdflush will have the honourable task of syncing
> the pages in the inactive_dirty list to disk before they
> become an issue. We'll run balance_dirty over the set of
> free + inactive_clean + inactive_dirty AND we'll try to
> keep free+inactive_clean > pages_high .. failing either of
> these conditions will cause bdflush to kick into action and
> sync some pages to disk.
>
> If memory_pressure is high and we're doing a lot of dirty
> disk writes, the bdflush percentage will kick in and we'll
> be doing extra-agressive cleaning. In that case bdflush
> will automatically become more agressive the more page
> replacement is going on, which is a good thing.
I think that one of the omissions in Kaplan's report is the
time it takes to clean dirty pages. (Or have I missed
something... Need to select the pages earlier)
>
> Physical page based page aging
>
> In the new VM we'll need to do physical page based page aging
> for a number of reasons. Ben LaHaise said he already has code
> to do this and it's "dead easy", so I take it this part of the
> code won't be much of a problem.
>
> The reasons we need to do aging on a physical page are:
> - avoid the virtual address based aging "artifacts"
> - more efficient, since we'll only scan what we need
> to scan (especially when we'll test the idea of
> aging pages with a low age more often than pages
> we know to be in the working set)
> - more direct feedback loop, so less chance of
> screwing up the page aging balance
Nod.
>
> IO clustering
>
> IO clustering is not done by the VM code, but nicely abstracted
> away into a page->mapping->flush(page) callback. This means that:
> - each filesystem (and swap) can implement their own, isolated
> IO clustering scheme
> - (in 2.5) we'll no longer have the buffer head list, but a list
> of pages to be written back to disk, this means doing stuff like
> delayed allocation (allocate on flush) or kiobuf based extents
> is fairly trivial to do
>
Nod.
> Misc
>
> Page aging and flushing are completely separated in this
> scheme. We'll never end up aging and freeing a "wrong" clean
> page because we're waiting for IO completion of old and
> to-be-freed pages.
>
Is page ageing modification of LRU enough?
In many cases it will probably behave worse than plain LRU
(slower phase adaptions).
The access pattern diagrams in Kaplan's report are very
enlightening...
/RogerL
--
Home page:
http://www.norran.net/nra02596/
* Re: RFC: design for new VM
2000-08-03 19:26 ` Roger Larsson
@ 2000-08-03 21:50 ` Rik van Riel
2000-08-03 22:28 ` Roger Larsson
0 siblings, 1 reply; 46+ messages in thread
From: Rik van Riel @ 2000-08-03 21:50 UTC (permalink / raw)
To: Roger Larsson; +Cc: linux-mm, Scott F. Kaplan
On Thu, 3 Aug 2000, Roger Larsson wrote:
> > - data structures (page lists)
> > - active list
> > - per node/pgdat
> > - contains pages with page->age > 0
> > - pages may be mapped into processes
> > - scanned and aged whenever we are short
> > on free + inactive pages
> > - maybe multiple lists for different ages,
> > to be better resistant against streaming IO
> > (and for lower overhead)
>
> Does this really need to be a list? Since most pages should
> be on this list can't it be virtual - pages on no other list
> are on active list. All pages are scanned all the time...
It doesn't have to be a list per se, but since we have the
list head in the page struct anyway we might as well make
it one.
> > - inactive_dirty list
> > - per zone
> > - contains dirty, old pages (page->age == 0)
> > - pages are not mapped in any process
> > - inactive_clean list
> > - per zone
> > - contains clean, old pages
> > - can be reused by __alloc_pages, like free pages
> > - pages are not mapped in any process
>
> What will happen to pages on these lists if a page gets referenced?
> * Move them back to the active list? Then it is hard to know how
> many freeable pages there really are...
Indeed, we will move such a page back to the active list.
"Luckily" the inactive pages are not mapped, so we have to
locate them through find_page_nolock() and friends, which
allows us to move the page back to the active list, adjust
statistics and maybe even wake up kswapd as needed.
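Roughly like this (a sketch only; PageInactive(), the list helpers and
the shortage test are made up):

    /* Wherever find_page_nolock() and friends hand back a page: */
    if (page && PageInactive(page)) {
        del_page_from_inactive_list(page);
        add_page_to_active_list(page);
        page->age = PAGE_AGE_START;
        if (inactive_shortage())               /* made-up check */
            wakeup_kswapd();                   /* made-up wakeup */
    }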
> > - other data structures
> > - int memory_pressure
> > - on page allocation or reclaim, memory_pressure++
> > - on page freeing, memory_pressure-- (keep it >= 0, though)
> > - decayed on a regular basis (eg. every second x -= x>>6)
> > - used to determine inactive_target
> > - inactive_target == one (two?) second(s) worth of memory_pressure,
> > which is the amount of page reclaims we'll do in one second
> > - free + inactive_clean >= zone->pages_high
> > - free + inactive_clean + inactive_dirty >= zone->pages_high \
> > + one_second_of_memory_pressure * (zone_size / memory_size)
>
> One of the most interesting aspects (IMHO) of Scott F. Kaplan's
> "Compressed Cache and Virtual Memory Simulation" was the use of
> VM time instead of wall time. One second could be too long of a
> reaction time - relative to X allocations/sec etc.
It's just the inactive target. Trying to keep one second of
unmapped pages with page->age==0 around is mainly done to:
- make sure we can flush all of them on time
- put an "appropriate" amount of pressure on the
pages in the active list, so page aging is smoothed
out a little bit
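In code form the bookkeeping is tiny (names as in the draft):

    /* on page allocation or reclaim: */
    memory_pressure++;
    /* on page freeing: */
    if (memory_pressure > 0)
        memory_pressure--;
    /* once per second: decay, then recompute the inactive target */
    memory_pressure -= memory_pressure >> 6;
    inactive_target = memory_pressure;
    if (inactive_target > num_physpages / 4)
        inactive_target = num_physpages / 4;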
> > - inactive_target will be limited to some sane maximum
> > (like, num_physpages / 4)
>
> Question: Why is this needed?
> Answer: Because high memory_pressure can only exist momentarily
> and can pollute our statistics.
Indeed. Imagine Netscape starting on a 32MB machine. 10MB
allocated within the second, but there's no way we want the
inactive list to grow to that size...
> > The idea is that when we have enough old (inactive + free)
> > pages, we will NEVER move pages from the active list to the
> > inactive lists. We do that because we'd rather wait for some
> > IO completion than evict the wrong page.
>
> So, will the scanning stop then??? And referenced bits build up.
> Or will there be pages with age == 0 on the active list?
Active scanning goes on only when we have a shortage of
inactive pages. Also, when we aren't scanning, no page's
age will magically change to 0 ;)
> This contradicts "very light background ageing" earlier.
Nope. If the system does no scanning of pages for some
time (say 1 minute), we will simply scan some fraction
of the inactive list. That way we can guarantee that
we'll not have OLD referenced bits lingering around and
messing up page aging when we start running out of memory.
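(i.e. something like this in kswapd - the helper and the timestamp are
made up:)

    /* very light background aging, even without memory pressure */
    if (time_after(jiffies, last_background_aging + 60*HZ)) {
        age_a_small_fraction_of_pages();       /* made-up helper */
        last_background_aging = jiffies;
    }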
> > If memory_pressure is high and we're doing a lot of dirty
> > disk writes, the bdflush percentage will kick in and we'll
> > be doing extra-agressive cleaning. In that case bdflush
> > will automatically become more agressive the more page
> > replacement is going on, which is a good thing.
>
> I think that one of the omissions in Kaplan's report is the
> time it takes to clean dirty pages. (Or have I missed
> something... Need to select the pages earlier)
Page replacement (selecting which page to replace) should always
be independent from page flushing. You can make pretty decent
decisions on which page(s) to free, and the last thing you want
is to have them messed up by page flushing.
> > Misc
> >
> > Page aging and flushing are completely separated in this
> > scheme. We'll never end up aging and freeing a "wrong" clean
> > page because we're waiting for IO completion of old and
> > to-be-freed pages.
>
> Is page ageing modification of LRU enough?
It seems to work fine for FreeBSD. Also, we can always change
the "aging" of the active pages with something else. The
system is modular enough that we can do that.
> In many cases it will probably behave worse than plain LRU
> (slower phase adaptions).
We can change that by using exponential decay for the page
age, or by using some different aging technique...
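The difference is basically one line in the aging step (PAGE_AGE_ADV as
in the draft, the rest is just a sketch):

    if (page_referenced(page)) {
        /* the page was used: age it "up" in both schemes */
        page->age += PAGE_AGE_ADV;
    } else {
        /* linear decay ... */
        if (page->age > 0)
            page->age--;
        /* ... or exponential decay instead: page->age >>= 1; */
    }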
> The access pattern diagrams in Kaplan's report are very
> enlightening...
They are very interesting indeed, but I miss one very
common workload in their report. A lot of systems do
(multimedia) streaming IO these days, where a lot of
data passes through the cache quickly, but all of the
data is only touched once (or maybe twice).
regards,
Rik
--
"What you're running that piece of shit Gnome?!?!"
-- Miguel de Icaza, UKUUG 2000
http://www.conectiva.com/ http://www.surriel.com/
* Re: RFC: design for new VM
2000-08-03 21:50 ` Rik van Riel
@ 2000-08-03 22:28 ` Roger Larsson
0 siblings, 0 replies; 46+ messages in thread
From: Roger Larsson @ 2000-08-03 22:28 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-mm, Scott F. Kaplan
Rik van Riel wrote:
>
> On Thu, 3 Aug 2000, Roger Larsson wrote:
>
> > > - data structures (page lists)
> > > - active list
> > > - per node/pgdat
> > > - contains pages with page->age > 0
> > > - pages may be mapped into processes
> > > - scanned and aged whenever we are short
> > > on free + inactive pages
> > > - maybe multiple lists for different ages,
> > > to be better resistant against streaming IO
> > > (and for lower overhead)
> >
> > Does this really need to be a list? Since most pages should
> > be on this list can't it be virtual - pages on no other list
> > are on active list. All pages are scanned all the time...
>
> It doesn't have to be a list per se, but since we have the
> list head in the page struct anyway we might as well make
> it one.
>
If we do not want to increase the size of the page struct,
a union could be added instead - age info _or_ actual list
(a page is not on any list unless its age is zero).
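Something like this (illustration only, not the real struct page):

    struct toy_page {
        unsigned long flags;
        atomic_t count;
        union {
            int age;               /* meaningful while the page is active */
            struct list_head lru;  /* used only on the inactive/free lists */
        } u;
    };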
> > > - inactive_dirty list
> > > - per zone
> > > - contains dirty, old pages (page->age == 0)
> > > - pages are not mapped in any process
> > > - inactive_clean list
> > > - per zone
> > > - contains clean, old pages
> > > - can be reused by __alloc_pages, like free pages
> > > - pages are not mapped in any process
> >
> > What will happen to pages on these lists if a page gets referenced?
> > * Move them back to the active list? Then it is hard to know how
> > many freeable pages there really are...
>
> Indeed, we will move such a page back to the active list.
> "Luckily" the inactive pages are not mapped, so we have to
> locate them through find_page_nolock() and friends, which
> allows us to move the page back to the active list, adjust
> statistics and maybe even wake up kswapd as needed.
>
> > > - other data structures
> > > - int memory_pressure
> > > - on page allocation or reclaim, memory_pressure++
> > > - on page freeing, memory_pressure-- (keep it >= 0, though)
> > > - decayed on a regular basis (eg. every second x -= x>>6)
> > > - used to determine inactive_target
> > > - inactive_target == one (two?) second(s) worth of memory_pressure,
> > > which is the amount of page reclaims we'll do in one second
> > > - free + inactive_clean >= zone->pages_high
> > > - free + inactive_clean + inactive_dirty >= zone->pages_high \
> > > + one_second_of_memory_pressure * (zone_size / memory_size)
> >
> > One of the most interesting aspects (IMHO) of Scott F. Kaplan's
> > "Compressed Cache and Virtual Memory Simulation" was the use of
> > VM time instead of wall time. One second could be too long of a
> > reaction time - relative to X allocations/sec etc.
>
> It's just the inactive target. Trying to keep one second of
> unmapped pages with page->age==0 around is mainly done to:
> - make sure we can flush all of them on time
> - put an "appropriate" amount of pressure on the
> pages in the active list, so page aging is smoothed
> out a little bit
>
Yes, but why one second? Why not 1/10 second, one jiffie, ...
> > > If memory_pressure is high and we're doing a lot of dirty
> > > disk writes, the bdflush percentage will kick in and we'll
> > > be doing extra-agressive cleaning. In that case bdflush
> > > will automatically become more agressive the more page
> > > replacement is going on, which is a good thing.
> >
> > I think that one of the omissions in Kaplan's report is the
> > time it takes to clean dirty pages. (Or have I missed
> > something... Need to select the pages earlier)
>
> Page replacement (selecting which page to replace) should always
> be independent from page flushing. You can make pretty decent
> decisions on which page(s) to free, and the last thing you want
> is to have them messed up by page flushing.
>
> > > Misc
> > >
> > > Page aging and flushing are completely separated in this
> > > scheme. We'll never end up aging and freeing a "wrong" clean
> > > page because we're waiting for IO completion of old and
> > > to-be-freed pages.
> >
> > Is page ageing modification of LRU enough?
>
> It seems to work fine for FreeBSD. Also, we can always change
> the "aging" of the active pages with something else. The
> system is modular enough that we can do that.
>
> > In many cases it will probably behave worse than plain LRU
> > (slower phase adaptions).
>
> We can change that by using exponential decay for the page
> age, or by using some different aging technique...
>
Another research subject...
> > The access pattern diagrams in Kaplan's report are very
> > enlightening...
>
> They are very interesting indeed, but I miss one very
> common workload in their report. A lot of systems do
> (multimedia) streaming IO these days, where a lot of
> data passes through the cache quickly, but all of the
> data is only touched once (or maybe twice).
And at the same time browsing www with netscape...
/RogerL
--
Home page:
http://www.norran.net/nra02596/
end of thread, other threads:[~2000-08-16 15:10 UTC | newest]
Thread overview: 46+ messages
-- links below jump to the message on this page --
[not found] <87256934.0072FA16.00@d53mta04h.boulder.ibm.com>
2000-08-08 0:36 ` RFC: design for new VM Gerrit.Huizenga
[not found] <87256934.0078DADB.00@d53mta03h.boulder.ibm.com>
2000-08-08 0:48 ` Gerrit.Huizenga
2000-08-08 15:21 ` Rik van Riel
[not found] <8725692F.0079E22B.00@d53mta03h.boulder.ibm.com>
2000-08-07 17:40 ` Gerrit.Huizenga
2000-08-07 18:37 ` Matthew Wilcox
2000-08-07 20:55 ` Chuck Lever
2000-08-07 21:59 ` Rik van Riel
2000-08-08 3:26 ` David Gould
2000-08-08 5:54 ` Kanoj Sarcar
2000-08-08 7:15 ` David Gould
2000-08-04 13:52 Mark_H_Johnson
-- strict thread matches above, loose matches on Subject: below --
2000-08-02 22:08 Rik van Riel
2000-08-03 7:19 ` Chris Wedgwood
2000-08-03 16:01 ` Rik van Riel
2000-08-04 15:41 ` Matthew Dillon
2000-08-04 17:49 ` Linus Torvalds
2000-08-04 23:51 ` Matthew Dillon
2000-08-05 0:03 ` Linus Torvalds
2000-08-05 1:52 ` Matthew Dillon
2000-08-05 1:09 ` Matthew Wilcox
2000-08-05 2:05 ` Linus Torvalds
2000-08-05 2:17 ` Alexander Viro
2000-08-07 17:55 ` Matthew Dillon
2000-08-05 22:48 ` Theodore Y. Ts'o
2000-08-03 18:27 ` lamont
2000-08-03 18:34 ` Linus Torvalds
2000-08-03 19:11 ` Chris Wedgwood
2000-08-03 21:04 ` Benjamin C.R. LaHaise
2000-08-03 19:32 ` Rik van Riel
2000-08-03 18:05 ` Linus Torvalds
2000-08-03 18:50 ` Rik van Riel
2000-08-03 20:22 ` Linus Torvalds
2000-08-03 22:05 ` Rik van Riel
2000-08-03 22:19 ` Linus Torvalds
2000-08-03 19:00 ` Richard B. Johnson
2000-08-03 19:29 ` Rik van Riel
2000-08-03 20:23 ` Linus Torvalds
2000-08-03 19:37 ` Ingo Oeser
2000-08-03 20:40 ` Linus Torvalds
2000-08-03 21:56 ` Ingo Oeser
2000-08-03 22:12 ` Linus Torvalds
2000-08-04 2:33 ` David Gould
2000-08-16 15:10 ` Stephen C. Tweedie
2000-08-03 19:26 ` Roger Larsson
2000-08-03 21:50 ` Rik van Riel
2000-08-03 22:28 ` Roger Larsson