* Re: Large-footprint processes in a batch-processing-like scenario
From: Ping Huang @ 2003-04-22 17:24 UTC (permalink / raw)
To: linux-mm
I received only one reply, from <wli@holomorphy.com>, who CC'ed this
mailing list, so there is no need for me to provide a "summary" as
promised.

I would still be interested in any ideas that people might have for
tuning the throughput of my workload, short of implementing general
load control for the Linux kernel from scratch.
--
Ping Huang <pshuang@alum.mit.edu>; info: http://web.mit.edu/pshuang/.plan
Disclaimer: unless explicitly otherwise stated, my
statements represent my personal viewpoints only.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: aart@kvack.org
* Re: Large-footprint processes in a batch-processing-like scenario
From: Benjamin LaHaise @ 2003-04-22 18:01 UTC (permalink / raw)
To: Ping Huang; +Cc: linux-mm
On Tue, Apr 22, 2003 at 01:24:12PM -0400, Ping Huang wrote:
> I received only one reply from <wli@holomorphy.com>, who CC'ed this
> email list, so there is no need to provide a "summary" as promised.
>
> I would still be interested in any ideas that people might have for
> tuning the throughput of my workload, short of implementing general
> load control for the Linux kernel from scratch.
In the systems I've used and heard about, people tend to limit the load at
another level where more intelligent scheduling decisions can be made. In
other cases people have run multiple jobs on clusters that swap in order to
get better throughput on the large matrix operations which already exceed
the size of memory.
All told, the best implementation is probably one that lives in user space
and simply does a kill -STOP and -CONT on competing jobs. Any additional
policy could then be added to the configuration by the admin at run time,
unlike with a kernel implementation.
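A minimal user-space sketch of that approach (the PIDs and the time slice
below are hypothetical, and a real controller would also handle job
completion and errors) might look like:

```python
import itertools
import os
import signal
import time

def schedule_order(pids, slots):
    """Pure round-robin: which job owns each successive time slice."""
    cycle = itertools.cycle(pids)
    return [next(cycle) for _ in range(slots)]

def run_round_robin(pids, slice_seconds, slots):
    """Suspend all competing jobs, then wake them one at a time."""
    for pid in pids:
        os.kill(pid, signal.SIGSTOP)      # park every job first
    for pid in schedule_order(pids, slots):
        os.kill(pid, signal.SIGCONT)      # wake the chosen job
        time.sleep(slice_seconds)         # let it run its slice
        os.kill(pid, signal.SIGSTOP)      # park it again
```

For the workload discussed in this thread, slice_seconds would be on the
order of an hour, so the swap-in cost is amortized over a long run.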
-ben
* Re: Large-footprint processes in a batch-processing-like scenario
From: William Lee Irwin III @ 2003-04-23 3:15 UTC (permalink / raw)
To: Benjamin LaHaise; +Cc: Ping Huang, linux-mm
On Tue, Apr 22, 2003 at 01:24:12PM -0400, Ping Huang wrote:
>> I received only one reply from <wli@holomorphy.com>, who CC'ed this
>> email list, so there is no need to provide a "summary" as promised.
>> I would still be interested in any ideas that people might have for
>> tuning the throughput of my workload, short of implementing general
>> load control for the Linux kernel from scratch.
On Tue, Apr 22, 2003 at 02:01:46PM -0400, Benjamin LaHaise wrote:
> In the systems I've used and heard about, people tend to limit the load at
> another level where more intelligent scheduling decisions can be made. In
> other cases people have run multiple jobs on clusters that swap in order to
> get better throughput on the large matrix operations which already exceed
> the size of memory.
> All told, the best implementation is probably one that is in user space and
> simply does a kill -STOP and -CONT on jobs which are competing. Any
> additional policy could then be added to the configuration by the admin at
> run time, unlike a kernel implementation.
There were some issues mentioned that had to do with swap fragmentation
and poor page replacement behavior in the presence of random access
patterns, and it sounds like he's already doing kill -STOP and -CONT
from this:
On Fri, Apr 18, 2003 at 07:05:46PM -0400, Ping Huang wrote:
> - In practice, if I start all 5 application instances on a single 3GB
> PC, and signal instances 2-5 to go to sleep, and let instance 1 run
> for an hour, then signal instance 1 to go to sleep and signal
> instance 2 to wake up, the Linux kernel will page in instance 2's
> 2GB working set, but rather slowly. The application's memory access
> patterns are close enough to being random that Linux is essentially
> paging in its working set randomly, and this is resulting in very
> slow page-in rates compared to the 25MB/sec. bandwidth rate.
> Instead of being bandwidth limited, the observed paging behavior in
> this case seems disk seek limited. Increasing the value of
I don't know what other people's requirements are. What was asked was
this:
On Fri, Apr 18, 2003 at 07:05:46PM -0400, Ping Huang wrote:
> I'm trying to figure out if there is an efficient way to coerce the
> Linux kernel to effectively swap (not demand-page) between multiple
> processes which will not all fit together into physical memory. I'd
and I gave what I thought would be enough information to do it with. It
really sounded like he was pointing directly at load control from that.
It also sounds like he's in a forced overcommitment scenario from other
parts of the post.
-- wli
* Re: Large-footprint processes in a batch-processing-like scenario
From: William Lee Irwin III @ 2003-04-18 23:50 UTC (permalink / raw)
To: Ping Huang; +Cc: linux-mm
On Fri, Apr 18, 2003 at 07:05:46PM -0400, Ping Huang wrote:
> I'm trying to figure out if there is an efficient way to coerce the
> Linux kernel to effectively swap (not demand-page) between multiple
> processes which will not all fit together into physical memory. I'd
> be interested in people's comments about how they would expect the
> Linux VM subsystem to behave for the workload described below, what
> kernels might do better vs. others, and how I might tune for system
> throughput for this kind of application load.
This is generally known as load control. Linux has not yet implemented
this. Most of the other comments are over-specific. Essentially you
need the policy to effectively RR the large app instances with some
notion of how many are simultaneously runnable.
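As a rough illustration of "some notion of how many are simultaneously
runnable" (all sizes below are hypothetical), a load controller might
admit jobs only while their combined footprint fits in physical memory,
rotating the rest:

```python
def admit_jobs(footprints_mb, mem_mb):
    """Greedily admit jobs while their combined memory footprint fits
    in physical memory; the rest stay suspended until rotated in."""
    admitted, used = [], 0
    for job, size in footprints_mb.items():
        if used + size <= mem_mb:
            admitted.append(job)
            used += size
    return admitted
```

With 2GB instances on a 3GB machine, only one instance is admitted per
rotation, which matches the behavior described later in this thread.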
Carr (1981) describes load control policies tailored to the traditional
algorithms like clock scanning, WS, etc. They aren't directly applicable
to Linux but should give some notion of what code doing it is trying to
achieve. It's long out of print, so you may very well have to hunt for
it at a library (ILL?). The more traditional UNIX implementations (e.g.
FBSD) all do this and are probably good for the mechanical details.
It really boils down to a scheduling problem, so various queueing
tidbits can be applied with some changes to how they're phrased.
-- wli
* Large-footprint processes in a batch-processing-like scenario
From: Ping Huang @ 2003-04-18 23:05 UTC (permalink / raw)
To: linux-mm
I'm trying to figure out if there is an efficient way to coerce the
Linux kernel to effectively swap (not demand-page) between multiple
processes which will not all fit together into physical memory. I'd
be interested in people's comments about how they would expect the
Linux VM subsystem to behave for the workload described below, what
kernels might do better vs. others, and how I might tune for system
throughput for this kind of application load.
As a first cut, please reply back to me, I'll collate all replies, and
summarize back to the email list, where people may then discuss
everybody else's contributions and comments.
If this email list is inappropriate for such discussion, I apologize
and look forward to suggestions for better-suited discussion forums.
I skimmed through the thread listings for the past several months'
worth of messages in the archives before sending this email.
- Hardware: dual Athlon PC's with 3GB physical memory & 15GB of swap
configured (striped across multiple disks).
- Software: Linux 2.4.18 SMP kernels (otherwise running RedHat 7.2).
- I have 5 separate instances of the same Java application (using the
Sun 1.4.1 JVM), each of which needs lots of memory, so the JVM is
started with options to allow 1.8GB of Java heap. Each application
has about half a dozen Java threads (translating into Linux
processes), only one of which is really doing significant amounts of
work. Although the application code is the same, each of the 5
instances is working on a different partition of the same overall
problem. Very succinctly, each application instance polls a central
Oracle database for its "events" and then processes the events in
chronological order. The applications have a very high
initialization startup cost; it takes roughly 30 minutes for an
application instance to start up. There's a high shutdown cost as
well, also about 30 minutes. (The high memory consumption and the
30-minute startup and shutdown times are because each application
instance maintains an immense amount of state.) But it's easy for
me to tell the application instance to stop what it's doing and
sleep, and then later, tell it to start working again.
- Unfortunately, 5 such application instances which are so large
certainly cannot fit into the 3GB of physical memory I have
available in their entirety and at the same time. Each instance's
memory access pattern is such that the working set of pages for the
instance includes pretty much the entire 1.8GB Java heap, especially
when full Java heap garbage collection occurs. So I can effectively
only actively run one instance at a time on a 3GB PC; trying to
actively run two instances simply results in massive thrashing.
- For throughput efficiency reasons I cannot simply start up
application instance 1, let it do some work (e.g., for an hour),
then shut it down (have it exit completely), then start up instance
2, let it do some work, etc. With a startup cost of 30 minutes and
shutdown cost of 30 minutes, this would result in only 50% of
elapsed clock time being spent doing productive work. If I could
afford a different (probably 64-bit) hardware platform which
physically accommodated enough RAM, I could run all 5 instances at
once, have everything fit in physical memory, and the world would be
hunky-dory. But I cannot afford such a platform; and for deployment
practicality reasons, I can't just use 5 separate PCs each with 3GB
of memory, running only one application instance on each PC.
- The behavior I would like but I don't think I can get (though I'd
love to be wrong) is pure process *swapping* as opposed to demand
paging. If the multiple Linux processes associated with each
instance have a cumulative virtual memory footprint of 2.0GB (since
the 1.8GB Java heap and much of the other memory allocated are
shared between the different Java threads within an application
instance, but each thread has some Java-thread/Linux-process private
pages), and I have disks capable of sustained 25MB/sec. large
read-write I/O, then in theory the OS could swap out all the
processes associated with application instance 1 in about 80 seconds
(25MB/sec. * 82 seconds > 2048MB). The OS could then swap in all
the processes associated with instance 2 also in about 80 seconds.
So if I let each application instance work for about an hour, the
overhead of swapping processes entirely to switch between
application instances would be about 5% of clock time wasted (160
seconds wasted every 3600 seconds). That's pretty reasonable.
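The arithmetic in that estimate can be checked directly:

```python
footprint_mb = 2048          # ~2.0GB cumulative footprint per instance
bandwidth_mb_s = 25          # sustained sequential disk bandwidth
quantum_s = 3600             # one hour of work per instance

swap_time_s = footprint_mb / bandwidth_mb_s        # ~82 seconds each way
overhead = 2 * swap_time_s / quantum_s             # swap out + swap in
print(round(swap_time_s), round(overhead * 100, 1))  # 82 4.6
```

That is, roughly 82 seconds per full swap in each direction, for an
overhead of a bit under 5% of each one-hour quantum.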
- In practice, if I start all 5 application instances on a single 3GB
PC, and signal instances 2-5 to go to sleep, and let instance 1 run
for an hour, then signal instance 1 to go to sleep and signal
instance 2 to wake up, the Linux kernel will page in instance 2's
2GB working set, but rather slowly. The application's memory access
patterns are close enough to being random that Linux is essentially
paging in its working set randomly, and this is resulting in very
slow page-in rates compared to the 25MB/sec. bandwidth rate.
Instead of being bandwidth limited, the observed paging behavior in
this case seems disk seek limited. Increasing the value of
/proc/sys/vm/page-cluster doesn't seem to help. The application
instance may spend about half an hour (out of its 1 hour work time
"quantum") during which the CPUs are often nearly 100% idle while
the disks are working madly.
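For context on that tunable: page-cluster sets the number of pages the
kernel reads from swap per I/O as a power of two, so the cluster size
(assuming 4KB pages) is:

```python
def swap_cluster_bytes(page_cluster, page_size=4096):
    """Bytes the kernel reads per swap-in cluster: 2**page_cluster pages."""
    return (2 ** page_cluster) * page_size

# The common default of 3 means 8 pages (32KB) per swap read:
print(swap_cluster_bytes(3) // 1024)  # 32

# On a live system, the current value can be read from procfs:
# with open("/proc/sys/vm/page-cluster") as f:
#     print(swap_cluster_bytes(int(f.read())))
```

With a truly random access pattern, even large clusters mostly pull in
pages that aren't needed next, which is consistent with the tunable not
helping here.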
- Using the Linux ptrace() system call to let me touch another
process's virtual memory address space through /proc/$PID/mem, I'm
able to get Linux to page in a process with a more predictable
memory access pattern (linear rather than pseudo-random). This
seems to help page-in rates significantly. To switch from
application instance 1 to instance 2, I now tell instance 1 to go to
sleep, then run my process memory toucher program to touch one of
the processes for instance 2, and then tell instance 2 to wake up.
The overhead of switching is cut down to about 3 minutes, but over
time, it slowly takes longer and longer (in one run, switching is
now taking about 5 minutes, although I'm not sure whether this
will continue to grow without bound). My guess is that Linux
initially does a good job of grouping pages which are adjacent in a
process's virtual memory address space such that they are adjacent
in swap space as well, which allows pre-fetch (based on the value
2^"page-cluster"?) to reduce the number of I/O operations and the
number of disk seeks necessary when I touch the process's virtual
address space linearly. But over time, fragmentation occurs and
pages adjacent in a process's virtual memory address space become
separated from each other in swap space.
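A sketch of such a toucher, assuming a Linux /proc layout (the
PTRACE_ATTACH step that 2.4 kernels require before another process's
/proc/$PID/mem can be read is elided, as is error handling):

```python
import re

_RANGE = re.compile(r"^([0-9a-f]+)-([0-9a-f]+) ")

def address_ranges(maps_text):
    """Parse /proc/$PID/maps output into (start, end) address pairs."""
    ranges = []
    for line in maps_text.splitlines():
        m = _RANGE.match(line)
        if m:
            ranges.append((int(m.group(1), 16), int(m.group(2), 16)))
    return ranges

def touch_linearly(pid, chunk=1 << 20):
    """Read a stopped process's memory in address order so the kernel
    pages it in sequentially rather than at random."""
    with open("/proc/%d/maps" % pid) as f:
        ranges = address_ranges(f.read())
    with open("/proc/%d/mem" % pid, "rb") as mem:
        for start, end in ranges:
            mem.seek(start)
            for offset in range(start, end, chunk):
                mem.read(min(chunk, end - offset))  # each read faults pages in
```

Reading in address order only helps to the extent that virtually
adjacent pages are also adjacent in swap, which is exactly the
fragmentation effect described above.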
Thoughts?