linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* on load control / process swapping
@ 2001-05-07 21:16 Rik van Riel
  2001-05-07 22:50 ` Matt Dillon
  2001-05-08 12:25 ` Scott F. Kaplan
  0 siblings, 2 replies; 38+ messages in thread
From: Rik van Riel @ 2001-05-07 21:16 UTC (permalink / raw)
  To: arch; +Cc: linux-mm, Matt Dillon, sfkaplan

Hi,

after staring at the code for a long long time, I finally
figured out exactly why FreeBSD's load control code (the
process swapping in vm_glue.c) can never work in many
scenarios.

In short, the process suspension / wake up code only does
load control in the sense that system load is reduced, but
absolutely no effort is made to ensure that individual
programs can run without thrashing. This, of course, kind of
defeats the purpose of doing load control in the first place.


To see this situation in some more detail, lets first look
at how the current process suspension code has evolved over
time.  Early paging Unixes, including earlier BSDs, had a
rate-limited clock algorithm for the pageout code, where
the VM subsystem would only scan (and page) memory out at
a rate of fastscan pages per second.

Whenever the paging system wasn't able to keep up, free
memory would get below a certain threshold and memory load
control (in the form of process suspension) kicked in.  As
soon as free memory (averaged over a few seconds) got over
this threshold, processes get swapped in again.  Because of
the exact "speed limit" for the paging code, this would give
a slow rotation of memory-resident progesses at a paging rate
well below the thashing threshold.


More modern Unixes, like FreeBSD, NetBSD or Linux, however,
don't have the artificial speed limit on pageout.  This means
the pageout code can go on freeing memory until well beyond
the trashing point of the system.  It also means that the
amount of free memory is no longer any indication of whether
the system is thrashing or not.

Add to that the fact that the classical load control in BSD
resumes a suspended task whenever the system is above the
(now not very meaningful) free memory threshold, regardless
of whether the resident tasks have had the opportunity to
make any progress ... which of course only encourages more
thrashing instead of letting the system work itself out of
the overload situation.


Any solution will have to address the following points:

1) allow the resident processes to stay resident long
   enough to make progess
2) make sure the resident processes aren't thrashing,
   that is, don't let new processes back in memory if
   none of the currently resident processes is "ready"
   to be suspended
3) have a mechanism to detect thrashing in a VM
   subsystem which isn't rate-limited  (hard?)

and, for extra brownie points:
4) fairness, small processes can be paged in and out
   faster, so we can suspend&resume them faster; this
   has the side effect of leaving the proverbial root
   shell more usable
5) make sure already resident processes cannot create
   a situation that'll keep the swapped out tasks out
   of memory forever ... but don't kill performance either,
   since bad performance means we cannot get out of the
   bad situation we're in


Points 1), 2) and 4) are relatively easy to address by simply
keeping resident tasks unswappable for a long enough time that
they've been able to do real work in an environment where
3) indicates we're not thrashing.


3) is the hard part. We know we're not thrashing when we don't
have ongoing page faults all the time, but (say) only 50% of the
time. However, I still have no idea to determine when we _are_
thrashing, since a system which always has 10 ongoing page faults
may still be functioning without thrashing...  This is the part
where I cannot hand a ready solution but where we have to figure
out a solution together.

(and it's also the reason I cannot "send a patch" ... I know the
current scheme cannot possibly work all the time, I understand why,
but I just don't have a solution to the problem ... yet)

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 38+ messages in thread
* RE: on load control / process swapping
@ 2001-05-16 15:17 Charles Randall
  0 siblings, 0 replies; 38+ messages in thread
From: Charles Randall @ 2001-05-16 15:17 UTC (permalink / raw)
  To: 'Matt Dillon', Roger Larsson
  Cc: Rik van Riel, arch, linux-mm, sfkaplan

On a related note, we have a process (currently on Solaris, but possibly
moving to FreeBSD) that reads a 26 GB file just once for a database load. On
Solaris, we use the directio() function call to tell the filesystem to
bypass the buffer cache for this file descriptor.

>From the Solaris directio() man page,

     DIRECTIO_ON
             The system behaves as though the application is  not
             going  to reuse the file data in the near future. In
             other words, the file data  is  not  cached  in  the
             system's memory pages.

We found that without this, Solaris was aggressively trying to cache the
huge input file at the expense of database load performance (but we knew
that we'd never access it again). For some applications this is a huge win
(random I/O on a file much larger than memory seems to be another case).

Would there be an advantage to having a similar feature in FreeBSD (if not
already present)?

-Charles

-----Original Message-----
From: Matt Dillon [mailto:dillon@earth.backplane.com]
Sent: Tuesday, May 15, 2001 6:17 PM
To: Roger Larsson
Cc: Rik van Riel; arch@FreeBSD.ORG; linux-mm@kvack.org;
sfkaplan@cs.amherst.edu
Subject: Re: on load control / process swapping



:Are the heuristics persistent? 
:Or will the first use after  boot use the rough prediction? 
:For how long time will the heuristic stick? Suppose it is suddenly used in
:a slightly different way. Like two sequential readers instead of one...
:
:/RogerL
:Roger Larsson
:Skelleftea
:Sweden

    It's based on the VM page cache, so its adaptive over time.  I wouldn't
    call it persistent, it is nothing more then a simple heuristic that
    'normally' throws a page away but 'sometimes' caches it.  In otherwords,
    you lose some performance on the frontend in order to gain some later
    on.  If you loop through a file enough times, most of the file
    winds up getting cached.  It's still experimental so it is only
    lightly tied into the system.  It seems to work, though, so at some
    point in the future I'll probably try to put some significant prediction
    in.  But as I said, it's a very difficult thing to predict.  You can't
    just put your foot down and say 'I'll cache X amount of file Y'.  That
    doesn't work at all.

						-Matt


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 38+ messages in thread
* Re: RE: on load control / process swapping
  2001-05-16 15:17 Charles Randall
@ 2001-05-16 17:14 Matt Dillon
  2001-05-16 17:41 ` Rik van Riel
  -1 siblings, 1 reply; 38+ messages in thread
From: Matt Dillon @ 2001-05-16 17:14 UTC (permalink / raw)
  To: Charles Randall; +Cc: Roger Larsson, Rik van Riel, arch, linux-mm, sfkaplan

    We've talked about implementing O_DIRECT.  I think it's a good idea.

    In regards to the particular case of scanning a huge multi-gigabyte
    file, FreeBSD has a sequential detection heuristic which does a
    pretty good job preventing cache blow-aways by depressing the priority
    of the data as it is read or written.  FreeBSD will still try to cache
    a good chunk, but it won't sacrifice all available memory.  If you
    access the data via the VM system, through mmap, you get even more 
    control through the madvise() syscall.

						-Matt

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 38+ messages in thread
[parent not found: <OF5A705983.9566DA96-ON86256A50.00630512@hou.us.ray.com>]

end of thread, other threads:[~2001-05-19  2:56 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-05-07 21:16 on load control / process swapping Rik van Riel
2001-05-07 22:50 ` Matt Dillon
2001-05-07 23:35   ` Rik van Riel
2001-05-08  0:56     ` Matt Dillon
2001-05-12 14:23       ` Rik van Riel
2001-05-12 17:21         ` Matt Dillon
2001-05-12 21:17           ` Rik van Riel
2001-05-12 23:58         ` Matt Dillon
2001-05-13 17:22           ` Rik van Riel
2001-05-15  6:38             ` Terry Lambert
2001-05-15 13:39               ` Cy Schubert - ITSD Open Systems Group
2001-05-15 15:31               ` Rik van Riel
2001-05-15 17:24               ` Matt Dillon
2001-05-15 23:55                 ` Roger Larsson
2001-05-16  0:16                   ` Matt Dillon
2001-05-16  4:22                     ` Kernel Debugger Amarnath Jolad
2001-05-16  7:58                       ` Kris Kennaway
2001-05-16 11:42                       ` Martin Frey
2001-05-16 12:04                         ` R.Oehler
2001-05-16  8:23                 ` on load control / process swapping Terry Lambert
2001-05-16 17:26                   ` Matt Dillon
2001-05-08 20:52   ` Kirk McKusick
2001-05-09  0:18     ` Matt Dillon
2001-05-09  2:07       ` Peter Jeremy
2001-05-09 19:41         ` Matt Dillon
2001-05-12 14:28       ` Rik van Riel
2001-05-08 12:25 ` Scott F. Kaplan
2001-05-16 15:17 Charles Randall
2001-05-16 17:14 Matt Dillon
2001-05-16 17:41 ` Rik van Riel
2001-05-16 17:54   ` Matt Dillon
2001-05-18  5:58     ` Terry Lambert
2001-05-18  6:20       ` Matt Dillon
2001-05-18 10:00         ` Andrew Reilly
2001-05-18 13:49         ` Jonathan Morton
2001-05-19  2:18           ` Rik van Riel
2001-05-19  2:56             ` Jonathan Morton
2001-05-16 17:57   ` Alfred Perlstein
2001-05-16 18:01     ` Matt Dillon
2001-05-16 18:10       ` Alfred Perlstein
     [not found] <OF5A705983.9566DA96-ON86256A50.00630512@hou.us.ray.com>
2001-05-18 20:13 ` Jonathan Morton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox