* Re: Thread implementations...
@ 1998-06-30 19:30 Larry McVoy
1998-07-01 8:50 ` Stephen C. Tweedie
0 siblings, 1 reply; 26+ messages in thread
From: Larry McVoy @ 1998-06-30 19:30 UTC (permalink / raw)
To: Stephen C. Tweedie
Cc: Eric W. Biederman, Christoph Rohland, linux-kernel, linux-mm
: Not for very large files: the forget-behind is absolutely critical in
: that case.
SunOS' local file system, UFS, implements the following alg for forget
behind on all types of accesses (SunOS has a unified page cache, all
accesses are mmap based, read/write are implemented by the kernel doing
an mmap and then a bcopy):
    if ((free_memory < we_will_start_paging_soon) &&
        (offset is clust_size multiple) &&
        (offset > small_file) &&
        (access is sequential)) {
            free_behind(vp, offset - clust_size, clust_size);
    }
in the ufs_getpage() code.
I'll admit this was a hack, but it had some nice attributes that you might
want to consider:
1) it was nice that I/O took care of itself. The pageout daemon is
pretty costly (Stephen, we talked about this at Linux Expo - this
is why I want a pageout daemon that works on files, not on pages).
2) Small files aren't worth the trouble and aren't the cause of the
trouble.
3) Random access frequently wants caching and randoms are expensive
to bring in.
4) I/O is freed in large chunks, not a page at a time. It's about
as costly to bring in one page as bring in 64-256K these days.
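
The free-behind test described above can be sketched in plain C. This is a minimal sketch under stated assumptions, not the SunOS code: the struct names, the threshold values, and the way "sequential" is detected (the access starts where the previous one ended) are all hypothetical, and the function only reports which range would be freed rather than freeing anything.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical tunables: the names follow the pseudocode, the values are made up. */
#define CLUST_SIZE   (64u * 1024u)   /* free-behind unit, in bytes */
#define SMALL_FILE   (256u * 1024u)  /* offsets at or below this are left alone */
#define PAGING_SOON  1024u           /* free-page threshold, in pages */

struct fb_state {
    size_t free_pages;   /* current free memory, in pages */
    size_t last_offset;  /* offset of the previous access */
    size_t last_len;     /* length of the previous access */
};

struct fb_range {
    size_t start;
    size_t len;          /* 0 means "nothing to free" */
};

/* Decide whether the cluster behind `offset` should be dropped. */
static struct fb_range free_behind_range(struct fb_state *st,
                                         size_t offset, size_t len)
{
    struct fb_range r = { 0, 0 };
    /* Assumption: sequential means this access starts where the last ended. */
    bool sequential = (offset == st->last_offset + st->last_len);

    if (st->free_pages < PAGING_SOON &&
        offset % CLUST_SIZE == 0 &&
        offset > SMALL_FILE &&
        sequential) {
        r.start = offset - CLUST_SIZE;
        r.len = CLUST_SIZE;
    }
    st->last_offset = offset;
    st->last_len = len;
    return r;
}
```

Reading a file in 8K steps under memory pressure, this fires once per 64K cluster after the first 256K, which matches point 4: memory is released in large chunks rather than a page at a time.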
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Thread implementations...
1998-06-30 19:30 Thread implementations Larry McVoy
@ 1998-07-01 8:50 ` Stephen C. Tweedie
1998-07-03 15:21 ` Rik van Riel
0 siblings, 1 reply; 26+ messages in thread
From: Stephen C. Tweedie @ 1998-07-01 8:50 UTC (permalink / raw)
To: Larry McVoy
Cc: Stephen C. Tweedie, Eric W. Biederman, Christoph Rohland, linux-kernel, linux-mm

Hi,

On Tue, 30 Jun 1998 12:30:45 -0700, lm@bitmover.com (Larry McVoy) said:

>     if ((free_memory < we_will_start_paging_soon) &&
>         (offset is clust_size multiple) &&
>         (offset > small_file) &&
>         (access is sequential)) {
>             free_behind(vp, offset - clust_size, clust_size);
>     }

Looks entirely reasonable. I've been thinking of something very similar
but just a little more complex, so that we can also cleanly handle the
case of sequential mmap()ed reads, both of mapped files and potentially
of anonymous datasets. The difference there is that if we are dealing
with tiled data, then we may need to allow a larger window between the
current pagein cursor and the forget-behind cursor. Again, if we just
unmap the pages and place them on a high-priority reuse queue, then
getting the guess wrong just results in a minor fault unless we do
actually reuse the memory before accessing the data again.

> 1) it was nice that I/O took care of itself. The pageout daemon is
> pretty costly (Stephen, we talked about this at Linux Expo - this
> is why I want a pageout daemon that works on files, not on pages).

Yes, and Ingo and I have been talking about ways of doing it.

> 2) Small files aren't worth the trouble and aren't the cause of the
> trouble.

Small files benefit from a similar scheme. For small
sequentially-accessed files, as they age, we want to remove the entire
file from cache at once. Repopulating a sequential file's fragmented
cache is expensive anyway, so it may in fact be _cheaper_ to do this
than to just throw out one page at a time.

As long as we have the concept of a virtual extent, where we define that
extent as the natural readahead pattern for the workload, then we want
to uncache the same units we readahead. That's normally sequential
clusters, but if we have things like Ingo's random swap stats-based
prediction logic, then we can use exactly the same extent concept there
too.

--Stephen
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Thread implementations...
1998-07-01 8:50 ` Stephen C. Tweedie
@ 1998-07-03 15:21 ` Rik van Riel
1998-07-03 20:05 ` Stephen C. Tweedie
0 siblings, 1 reply; 26+ messages in thread
From: Rik van Riel @ 1998-07-03 15:21 UTC (permalink / raw)
To: Stephen C. Tweedie
Cc: Linux MM

On Wed, 1 Jul 1998, Stephen C. Tweedie wrote:

> sequential clusters, but if we have things like Ingo's random swap
> stats-based prediction logic, then we can use exactly the same extent
> concept there too.

Hmm, it appears this was the legendary swap readahead code I
was looking for a while ago :)

But, ehhh, just what _is_ this random swap stats-based prediction
algorithm, and how far from implementation is it?

(and if it isn't implemented yet, what should I do to make
it implemented; swapin readahead is very wanted on my
memory-starved box...)

Rik.
+-------------------------------------------------------------------+
| Linux memory management tour guide.        H.H.vanRiel@phys.uu.nl |
| Scouting Vries cubscout leader.      http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+
--
This is a majordomo managed list. To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Thread implementations...
1998-07-03 15:21 ` Rik van Riel
@ 1998-07-03 20:05 ` Stephen C. Tweedie
1998-07-03 20:36 ` Rik van Riel
0 siblings, 1 reply; 26+ messages in thread
From: Stephen C. Tweedie @ 1998-07-03 20:05 UTC (permalink / raw)
To: Rik van Riel
Cc: Stephen C. Tweedie, Linux MM

Hi,

On Fri, 3 Jul 1998 17:21:51 +0200 (CEST), Rik van Riel
<H.H.vanRiel@phys.uu.nl> said:

> On Wed, 1 Jul 1998, Stephen C. Tweedie wrote:
>> sequential clusters, but if we have things like Ingo's random swap
>> stats-based prediction logic, then we can use exactly the same extent
>> concept there too.

> Hmm, it appears this was the legendary swap readahead code I
> was looking for a while ago :)

> But, ehhh, just what _is_ this random swap stats-based prediction
> algorithm,

It's a per-swap-page readahead predictor which observes the access
patterns for vmas.

> and how far from implementation is it?

It is implemented. It is not in the main kernels, nor does it take
advantage of the potential for swap readahead in the 2.1.86+ kernels.

--Stephen
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Thread implementations...
1998-07-03 20:05 ` Stephen C. Tweedie
@ 1998-07-03 20:36 ` Rik van Riel
1998-07-04 16:37 ` Stephen C. Tweedie
0 siblings, 1 reply; 26+ messages in thread
From: Rik van Riel @ 1998-07-03 20:36 UTC (permalink / raw)
To: Stephen C. Tweedie
Cc: Linux MM

On Fri, 3 Jul 1998, Stephen C. Tweedie wrote:
> On Fri, 3 Jul 1998 17:21:51 +0200 (CEST), Rik van Riel
> <H.H.vanRiel@phys.uu.nl> said:
>
> > But, ehhh, just what _is_ this random swap stats-based prediction
> > algorithm,
> It's a per-swap-page readahead predictor which observes the access
> patterns for vmas.
>
> > and how far from implementation is it?
> It is implemented. It is not in the main kernels, nor does it take
> advantage of the potential for swap readahead in the 2.1.86+ kernels.

Then where is it? It would be great to test and it
would make an excellent link with description for
the Linux MM homepage...

Besides, I'm currently somewhat memory starved and I
would really like to test and possibly improve or
integrate this piece of code with the main kernel.

I know it's too late for inclusion now, but I'm willing
to keep the patch up-to-date with the kernel up to the
date of inclusion.

Rik.
+-------------------------------------------------------------------+
| Linux memory management tour guide.        H.H.vanRiel@phys.uu.nl |
| Scouting Vries cubscout leader.      http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Thread implementations...
1998-07-03 20:36 ` Rik van Riel
@ 1998-07-04 16:37 ` Stephen C. Tweedie
0 siblings, 0 replies; 26+ messages in thread
From: Stephen C. Tweedie @ 1998-07-04 16:37 UTC (permalink / raw)
To: Rik van Riel
Cc: Stephen C. Tweedie, Linux MM

Hi,

On Fri, 3 Jul 1998 22:36:06 +0200 (CEST), Rik van Riel
<H.H.vanRiel@phys.uu.nl> said:

> Then where is it? It would be great to test and it
> would make an excellent link with description for
> the Linux MM homepage...

You'd have to ask Ingo for it.

--Stephen
^ permalink raw reply [flat|nested] 26+ messages in thread
[parent not found: <199806240915.TAA09504@vindaloo.atnf.CSIRO.AU>]
[parent not found: <Pine.LNX.3.96dg4.980624025515.26983E-100000@twinlark.arctic.org>]
[parent not found: <199806241213.WAA10661@vindaloo.atnf.CSIRO.AU>]
* Re: Thread implementations...
[not found] ` <199806241213.WAA10661@vindaloo.atnf.CSIRO.AU>
@ 1998-06-24 22:00 ` Eric W. Biederman
1998-06-24 23:41 ` Richard Gooch
1998-06-25 4:12 ` Dean Gaudet
0 siblings, 2 replies; 26+ messages in thread
From: Eric W. Biederman @ 1998-06-24 22:00 UTC (permalink / raw)
To: Richard Gooch
Cc: Dean Gaudet, linux-kernel, linux-mm

>>>>> "RG" == Richard Gooch <Richard.Gooch@atnf.CSIRO.AU> writes:

RG> If we get madvise(2) right, we don't need sendfile(2), correct?

It looks like it from here. As far as madvise goes, I think we need
to implement madvise(2) as:

enum madvise_strategy {
        MADV_NORMAL,
        MADV_RANDOM,
        MADV_SEQUENTIAL,
        MADV_WILLNEED,
        MADV_DONTNEED,
};
struct madvise_struct {
        caddr_t addr;
        size_t size;
        size_t strategy;
};
int sys_madvise(struct madvise_struct *, int count);

With madvise(3) following the traditional format with only one
advisement can be done easily. The reason I suggest multiple
arguments is that for apps that have random but predictable access
patterns will want to use MADV_WILLNEED & MADV_DONTNEED to an optimum
swapping algorithm. And for that you will probably need multiple
address ranges.

The clustering community has a similar syscall implemented for
programs whose working set size exceeds available memory. Except it
has strategy hardwired to MADV_WILLNEED.

However someone needs to look at actual programs to see which form is
more practical to implement in the kernel.

Of course all I know about madvise I just read in the kernel source so
I may be totally off...

Eric
^ permalink raw reply [flat|nested] 26+ messages in thread
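
The batched form proposed above can be approximated in userspace on top of the traditional three-argument call. This is a minimal sketch, assuming a system that provides madvise(2) with the usual `(addr, len, advice)` signature; `struct advice_range` and `madvise_batch` are hypothetical names standing in for the proposed `madvise_struct` and `sys_madvise`, not an existing API.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* expose madvise() and MAP_ANONYMOUS on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical userspace counterpart of the proposed madvise_struct. */
struct advice_range {
    void  *addr;    /* page-aligned start of the range */
    size_t len;     /* length in bytes */
    int    advice;  /* one of the MADV_* values */
};

/*
 * Apply each piece of advice in order. Returns the number of ranges the
 * kernel accepted; stops at the first failure and leaves errno as set
 * by madvise().
 */
static int madvise_batch(const struct advice_range *v, int count)
{
    int i;

    for (i = 0; i < count; i++) {
        if (madvise(v[i].addr, v[i].len, v[i].advice) != 0)
            break;
    }
    return i;
}
```

A wrapper like this costs one system call per range, which is exactly the overhead a vectored kernel interface would avoid; it only illustrates the calling convention.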
* Re: Thread implementations...
1998-06-24 22:00 ` Eric W. Biederman
@ 1998-06-24 23:41 ` Richard Gooch
1998-06-25 4:45 ` Eric W. Biederman
1998-06-25 4:12 ` Dean Gaudet
1 sibling, 1 reply; 26+ messages in thread
From: Richard Gooch @ 1998-06-24 23:41 UTC (permalink / raw)
To: Eric W. Biederman
Cc: linux-kernel, linux-mm

Eric W. Biederman writes:
> >>>>> "RG" == Richard Gooch <Richard.Gooch@atnf.CSIRO.AU> writes:
>
> RG> If we get madvise(2) right, we don't need sendfile(2), correct?
>
> It looks like it from here. As far as madvise goes, I think we need
> to implement madvise(2) as:
>
> enum madvise_strategy {
>         MADV_NORMAL,
>         MADV_RANDOM,
>         MADV_SEQUENTIAL,
>         MADV_WILLNEED,
>         MADV_DONTNEED,
> };
> struct madvise_struct {
>         caddr_t addr;
>         size_t size;
>         size_t strategy;
> };
> int sys_madvise(struct madvise_struct *, int count);
>
> With madvise(3) following the traditional format with only one
               ^
Don't you mean 2?

> advisement can be done easily. The reason I suggest multiple
> arguments is that for apps that have random but predictable access
> patterns will want to use MADV_WILLNEED & MADV_DONTNEED to an optimum
> swapping algorithm.

I'm not aware of madvise() being a POSIX standard. I've appended the
man page from alpha_OSF1, which looks reasonable. It would be nice to
be compatible with something.

Regards,

Richard....
===============================================================================
madvise(2)                                                          madvise(2)

NAME
  madvise - Advise the system of the expected paging behavior of a process

SYNOPSIS
  #include <sys/types.h>
  #include <sys/mman.h>
  int madvise ( caddr_t addr, size_t len, int behav );

PARAMETERS
  addr      Specifies the address of the region to which the advice refers.
  len       Specifies the length in bytes of the region specified by the
            addr parameter.
  behav     Specifies the behavior of the region. The following values for
            the behav parameter are defined in the sys/mman.h header file:

            MADV_NORMAL
                      No further special treatment
            MADV_RANDOM
                      Expect random page references
            MADV_SEQUENTIAL
                      Expect sequential references
            MADV_WILLNEED
                      Will need these pages
            MADV_DONTNEED
                      Do not need these pages
                      The system will free any resident pages that are
                      allocated to the region. All modifications will be
                      lost and any swapped out pages will be discarded.
                      Subsequent access to the region will result in a
                      zero-fill-on-demand fault as though it is being
                      accessed for the first time. Reserved swap space is
                      not affected by this call.
            MADV_SPACEAVAIL
                      Ensure that resources are reserved

DESCRIPTION
  The madvise() function permits a process to advise the system about its
  expected future behavior in referencing a mapped file or shared memory
  region.

NOTES
  Only a few values of the behav parameter values are operational on
  Digital UNIX systems. Non-operational values cause the system to always
  return success (zero).

RETURN VALUES
  Upon successful completion, the madvise() function returns zero.
  Otherwise, -1 is returned and errno is set to indicate the error.

ERRORS
  If the madvise() function fails, errno may be set to one of the
  following values:

  [EINVAL]  The behav parameter is invalid.
  [ENOSPC]  The behav parameter specifies MADV_SPACEAVAIL and resources can
            not be reserved.

RELATED INFORMATION
  Functions: mmap(2)
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Thread implementations...
1998-06-24 23:41 ` Richard Gooch
@ 1998-06-25 4:45 ` Eric W. Biederman
1998-06-25 17:14 ` Todd Larason
1998-06-26 7:53 ` Christoph Rohland
0 siblings, 2 replies; 26+ messages in thread
From: Eric W. Biederman @ 1998-06-25 4:45 UTC (permalink / raw)
To: Richard Gooch
Cc: linux-kernel, linux-mm

>>>>> "RG" == Richard Gooch <Richard.Gooch@atnf.CSIRO.AU> writes:

RG> Eric W. Biederman writes:
>> >>>>> "RG" == Richard Gooch <Richard.Gooch@atnf.CSIRO.AU> writes:
>> With madvise(3) following the traditional format with only one
RG>             ^
RG> Don't you mean 2?

My suggestion:
madvise(2)(struct madvise_struct *, int number_of_structs);
madvise(3)(caddr_t addr, size_t len, size_t strategy);
madvise(3) being in libc...

>> advisement can be done easily. The reason I suggest multiple
>> arguments is that for apps that have random but predictable access
>> patterns will want to use MADV_WILLNEED & MADV_DONTNEED to an optimum
>> swapping algorithm.

RG> I'm not aware of madvise() being a POSIX standard. I've appended the
RG> man page from alpha_OSF1, which looks reasonable. It would be nice to
RG> be compatible with something.

According to the kernel source it is available on: the alpha, mips,
and sparc. And the mips code thinks there is a posix version
somewhere.

Does someone have the Sun/sparc man page? Besides what is in the
kernel source I mean.

> MADV_WILLNEED

This needs to start an asynchronous pagein if necessary.

> MADV_DONTNEED
> Do not need these pages
> The system will free any resident pages that are allocated
> to the region. All modifications will be lost and any
> swapped out pages will be discarded. Subsequent access to
> the region will result in a zero-fill-on-demand fault as
> though it is being accessed for the first time. Reserved
> swap space is not affected by this call.

This one is broken, for 3 reasons.
1) madvise should only give advice.
2) This can be done with mmap(start, len, PROT..., MAP_ANON, -1, 0)
3) There is a more reasonable interpretation from IRIX:

   MADV_DONTNEED informs the system that the address range from addr to
                 addr + len will likely not be referenced in the near
                 future. The memory to which the indicated addresses are
                 mapped will be the first to be reclaimed when memory is
                 needed by the system.

Which means that with a smart programmer you can implement the optimal
swapping algorithm for your process with MADV_DONTNEED and
MADV_WILLNEED and be relatively portable.

Of course MADV_SEQUENTIAL should handle the case of sending a file out
a socket, for a userspace sendfile.

> MADV_SPACEAVAIL
> Ensure that resources are reserved

This one also does more than advise and for that reason I don't like
it.

Anyhow this looks like something to keep in mind for 2.3. Currently I
have too many projects in the air to do more than think the interface
through. The mapping type could easily be stored in the vma as a hint
though. Perhaps it could be ready for 2.2 but I couldn't do it.

Eric
^ permalink raw reply [flat|nested] 26+ messages in thread
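
Point 2 above, resetting a range to zero-fill-on-demand by mapping fresh anonymous memory over it, can be sketched as follows. This is a minimal sketch with assumptions spelled out: `reset_to_zero` is a hypothetical helper name, `MAP_ANONYMOUS` is used as the Linux spelling of `MAP_ANON`, and `MAP_PRIVATE | MAP_FIXED` plus read/write protection are filled in for the flags the message elides. It applies only to private anonymous memory, not to shared pages.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* expose MAP_ANONYMOUS on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Replace [start, start + len) with fresh zero-filled anonymous pages.
 * start must be page-aligned. MAP_FIXED (an assumption, see above)
 * makes the new mapping land exactly on top of the old one, so the old
 * contents are discarded. Returns 0 on success, -1 on failure.
 */
static int reset_to_zero(void *start, size_t len)
{
    void *p = mmap(start, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

    return p == MAP_FAILED ? -1 : 0;
}
```

Unlike advice, this call can fail and its result has to be checked, which is the distinction the message draws between hinting and doing.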
* Re: Thread implementations...
1998-06-25 4:45 ` Eric W. Biederman
@ 1998-06-25 17:14 ` Todd Larason
1998-06-26 7:53 ` Christoph Rohland
1 sibling, 0 replies; 26+ messages in thread
From: Todd Larason @ 1998-06-25 17:14 UTC (permalink / raw)
To: linux-kernel, linux-mm

On Wed, Jun 24, 1998 at 11:45:52PM -0500, Eric W. Biederman wrote:
> >>>>> "RG" == Richard Gooch <Richard.Gooch@atnf.CSIRO.AU> writes:
>
> RG> Eric W. Biederman writes:
> >> >>>>> "RG" == Richard Gooch <Richard.Gooch@atnf.CSIRO.AU> writes:
>
> Does someone have the Sun/sparc man page?

C Library Functions                                              madvise(3)

NAME
     madvise - provide advice to VM system

SYNOPSIS
     #include <sys/types.h>
     #include <sys/mman.h>
     int madvise(caddr_t addr, size_t len, int advice);

DESCRIPTION
     madvise() advises the kernel that a region of user mapped
     memory in the range [addr, addr + len) will be accessed
     following a type of pattern. The kernel uses this information
     to optimize the procedure for manipulating and maintaining
     the resources associated with the specified mapping range.

     Values for advice are defined in <sys/mman.h> as:

     #define MADV_NORMAL     0x0 /* No further special treatment */
     #define MADV_RANDOM     0x1 /* Expect random page references */
     #define MADV_SEQUENTIAL 0x2 /* Expect sequential page references */
     #define MADV_WILLNEED   0x3 /* Will need these pages */
     #define MADV_DONTNEED   0x4 /* Don't need these pages */

     MADV_NORMAL
          The default system characteristic where accessing memory
          within the address range causes the system to read data
          from the mapped file. The kernel reads all data from
          files into pages which are retained for a period of time
          as a "cache." System pages can be a scarce resource, so
          the kernel steals pages from other mappings when needed.
          This is a likely occurrence, but adversely affects system
          performance only if a large amount of memory is accessed.

     MADV_RANDOM
          Tells the kernel to read in a minimum amount of data from
          a mapped file on any single particular access. If
          MADV_NORMAL is in effect when an address of a mapped file
          is accessed, the system tries to read in as much data
          from the file as reasonable, in anticipation of other
          accesses within a certain locality.

     MADV_SEQUENTIAL
          Tells the system that addresses in this range are likely
          to be accessed only once, so the system will free the
          resources mapping the address range as quickly as
          possible. This is used in the cat(1) and cp(1) utilities.

     MADV_WILLNEED
          Tells the system that a certain address range is
          definitely needed so the kernel will start reading the
          specified range into memory. This can benefit programs
          wanting to minimize the time needed to access memory the
          first time, as the kernel would need to read in from the
          file.

     MADV_DONTNEED
          Tells the kernel that the specified address range is no
          longer needed, so the system starts to free the resources
          associated with the address range.

     madvise() should be used by programs with specific knowledge
     of their access patterns over a memory object, such as a
     mapped file, to increase system performance.

RETURN VALUES
     madvise() returns:
     0    on success.
     -1   on failure and sets errno to indicate the error.

ERRORS
     EINVAL    addr is not a multiple of the page size as returned
               by sysconf(3C). The length of the specified address
               range is less than or equal to 0, or the advice was
               invalid.
     EIO       An I/O error occurred while reading from or writing
               to the file system.
     ENOMEM    Addresses in the range [addr, addr + len) are outside
               the valid range for the address space of a process,
               or specify one or more pages that are not mapped.
     ESTALE    Stale nfs file handle.

ATTRIBUTES
     See attributes(5) for descriptions of the following attributes:

      __________________________________
     | ATTRIBUTE TYPE | ATTRIBUTE VALUE |
     |________________|_________________|
     | MT-Level       | MT-Safe         |
     |________________|_________________|

SEE ALSO
     cat(1), cp(1), mmap(2), sysconf(3C), attributes(5)

SunOS 5.6                 Last change: 29 Dec 1996

No mention of conforming to any standard here.

HP-UX 10.20's manpage claims conformance with AES and SVID3. It defines a
MADV_SPACEAVAIL behavior too, but notes that it isn't implemented.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Thread implementations...
1998-06-25 4:45 ` Eric W. Biederman
1998-06-25 17:14 ` Todd Larason
@ 1998-06-26 7:53 ` Christoph Rohland
1998-06-26 14:16 ` Eric W. Biederman
1 sibling, 1 reply; 26+ messages in thread
From: Christoph Rohland @ 1998-06-26 7:53 UTC (permalink / raw)
To: Eric W. Biederman
Cc: linux-kernel, linux-mm

ebiederm+eric@npwt.net (Eric W. Biederman) writes:

> > MADV_DONTNEED
> > Do not need these pages
>
> > The system will free any resident pages that are allocated
> > to the region. All modifications will be lost and any
> > swapped out pages will be discarded. Subsequent access to
> > the region will result in a zero-fill-on-demand fault as
> > though it is being accessed for the first time. Reserved
> > swap space is not affected by this call.
>
> This one is broken, for 3 reasons.
> 1) madvise should only give advice.
> 2) This can be done with mmap(start, len, PROT..., MAP_ANON, -1, 0)
> 3) There is a more reasonable interpretation from IRIX:
>
> MADV_DONTNEED informs the system that the address range from addr to
>               addr + len will likely not be referenced in the near
>               future. The memory to which the indicated addresses are
>               mapped will be the first to be reclaimed when memory is
>               needed by the system.

I do not agree:

1) why should madvise only advise. O.K. it is a naming thing, but I
   think you can find more terms which went far from the original
   meaning.
2) Would not work on shared pages.
3) Why is IRIX more reasonable than any other implementation?

The functionality described in the OSF manpage greatly helps
transactional programs, which use loads of memory for single
transactions. I do not know if it should be done with madvise, but
there is at least one OS which thinks it is the right place and I
would look for this functionality exactly there.

Cheers
Christoph

--
#include <stddisclaimer.h>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Thread implementations...
1998-06-26 7:53 ` Christoph Rohland
@ 1998-06-26 14:16 ` Eric W. Biederman
1998-06-29 10:19 ` Stephen C. Tweedie
0 siblings, 1 reply; 26+ messages in thread
From: Eric W. Biederman @ 1998-06-26 14:16 UTC (permalink / raw)
To: Christoph Rohland
Cc: Eric W. Biederman, linux-kernel, linux-mm

>>>>> "CR" == Christoph Rohland <hans-christoph.rohland@sap-ag.de> writes:

CR> I do not agree:

CR> 1) why should madvise only advise. O.K. it is a naming thing, but I
CR> think you can find more terms which went far from the original
CR> meaning.

Because if it only advises, you can ignore it and return success.
If it does more than advise you have to do much more error checking
and error handling.

If it turns out we want to give lots of advice in one syscall,
instead of just one piece of advice, this could be important.

CR> 2) Would not work on shared pages.

Not perfectly. That does appear to be the Achilles heel currently of
madvise. Multiple users of the same memory.

CR> 3) Why is IRIX more reasonable than any other implementation?

Well, IRIX also syncs with the Sun man page and my intuition.
I am thinking in terms of swapping hints, and specific functionality
doesn't fit into that category.

CR> The functionality described in the OSF manpage greatly help
CR> transactional programs, which use loads of memory for single
CR> transactions. I do not know if it should be done with madvise, but
CR> there is at least one OS which thinks it is the right place and I
CR> would look for this functionality exactly there.

I hadn't considered the transaction case. In fact I haven't
considered most cases. That's partly why I'm still talking.

But still there are other more portable methods to achieve a memory
reset, as I mentioned earlier. And there isn't another even semi
portable method to achieve swapping hints.

Eric
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Thread implementations...
1998-06-26 14:16 ` Eric W. Biederman
@ 1998-06-29 10:19 ` Stephen C. Tweedie
1998-06-30 6:19 ` Eric W. Biederman
0 siblings, 1 reply; 26+ messages in thread
From: Stephen C. Tweedie @ 1998-06-29 10:19 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Christoph Rohland, linux-kernel, linux-mm

Hi,

On 26 Jun 1998 09:16:14 -0500, ebiederm+eric@npwt.net (Eric
W. Biederman) said:

>>>>>> "CR" == Christoph Rohland <hans-christoph.rohland@sap-ag.de> writes:
CR> 1) why should madvise only advise.

> Because if it only advises, you can ignore it and return success.
> If it does more than advise you have to do much more error checking
> and error handling.

Not necessarily; even if we do take immediate action on the advice,
within the madvise system call, we don't have to do any extra layers of
error handling. It's more a case of "Please try to do this now / OK, I
tried."

CR> 2) Would not work on shared pages.

> Not perfectly. That does appear to be the Achilles heel currently of
> madvise. Multiple users of the same memory.

Again, madvise is the application telling us that it KNOWS what the
access pattern is. If the app is wrong, and the page is shared, big
deal; throw away the advice, it was duff. :)

--Stephen
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Thread implementations...
1998-06-29 10:19 ` Stephen C. Tweedie
@ 1998-06-30 6:19 ` Eric W. Biederman
1998-06-30 13:10 ` Stephen C. Tweedie
0 siblings, 1 reply; 26+ messages in thread
From: Eric W. Biederman @ 1998-06-30 6:19 UTC (permalink / raw)
To: Stephen C. Tweedie
Cc: Christoph Rohland, linux-kernel, linux-mm

>>>>> "ST" == Stephen C Tweedie <sct@dcs.ed.ac.uk> writes:

ST> Hi,
ST> On 26 Jun 1998 09:16:14 -0500, ebiederm+eric@npwt.net (Eric
ST> W. Biederman) said:

>>>>>>> "CR" == Christoph Rohland <hans-christoph.rohland@sap-ag.de> writes:
CR> 1) why should madvise only advise.

>> Because if it only advises, you can ignore it and return success.
>> If it does more than advise you have to do much more error checking
>> and error handling.

ST> Not necessarily; even if we do take immediate action on the advice,
ST> within the madvise system call, we don't have to do any extra layers of
ST> error handling. It's more a case of "Please try to do this now / OK, I
ST> tried."

The semantics of one or two of the implementation-specific madvise
options were much more like mlock... And for that you need extra
error checking to confirm that success occurred.

The "try this now" approach I see as a totally appropriate implementation.

CR> 2) Would not work on shared pages.

>> Not perfectly. That does appear to be the Achilles heel currently of
>> madvise. Multiple users of the same memory.

ST> Again, madvise is the application telling us that it KNOWS what the
ST> access pattern is. If the app is wrong, and the page is shared, big
ST> deal; throw away the advice, it was duff. :)

Again the case was: I have a multithreaded web server serving up
files. The web server mmaps each file, and calls madvise(file_start,
file_len, MADV_SEQUENTIAL). The trick is that it may be serving the
same file to two different clients simultaneously.

MADV_SEQUENTIAL implies readahead, and forget behind, but for a simple
process.

The forget behind is tricky and difficult to get right, but if we
concentrate on aggressive readahead (in this we will probably be
o.k.) And some readahead we already have implemented in filemap_nopage.
Getting it general for the whole mm layer could be fun but it is
certainly doable.

Though at the moment putting hint information in the vm_area_struct,
and keeping the implementation in the nopage functions, sounds like the
way to go.

Eric
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Thread implementations...
1998-06-30 6:19 ` Eric W. Biederman
@ 1998-06-30 13:10 ` Stephen C. Tweedie
1998-06-30 19:35 ` Dean Gaudet
0 siblings, 1 reply; 26+ messages in thread
From: Stephen C. Tweedie @ 1998-06-30 13:10 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Stephen C. Tweedie, Christoph Rohland, linux-kernel, linux-mm

Hi,

On 30 Jun 1998 01:19:18 -0500, ebiederm+eric@npwt.net (Eric
W. Biederman) said:

> Again the case was: I have a multithreaded web server serving up
> files. The web server mmaps each file, and calls madvise(file_start,
> file_len, MADV_SEQUENTIAL). The trick is that it may be serving the
> same file to two different clients simultaneously.

The actual sharing is not a problem; the cache is already safe against
that even when doing readahead.

> MADV_SEQUENTIAL implies readahead, and forget behind, but for a simple
> process.

Yep, the forget behind is the important stuff to get right, but all we
need to do there is to unmap the pages from the process's address space:
we don't need to actually flush the page cache. As long as the page
cache can find these pages quickly if it needs to reuse the memory for
something else, then there's no reason to actually forget the data there
and then.

> The forget behind is tricky and difficult to get right, but if we
> concentrate on aggressive readahead (in this we will probably be
> o.k.)

Not for very large files: the forget-behind is absolutely critical in
that case.

--Stephen
^ permalink raw reply [flat|nested] 26+ messages in thread
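
The web-server case discussed above (map a file, advise sequential access, push it out a descriptor, forget what has already been sent) can be sketched from userspace. This is a minimal sketch, not kernel code and not anything specified in the thread: `send_mapped` and its `window` parameter are hypothetical, `window` is assumed to be a multiple of the page size, and it relies on the Linux behaviour of `MADV_DONTNEED` on a file-backed mapping, which drops the pages from the caller's address space while the file data itself remains intact.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* expose madvise(), mkstemp() and pread() on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Copy file_len bytes from in_fd to out_fd through a read-only mapping,
 * releasing each window once it has been written. Advice failures are
 * ignored on purpose: they are hints, not requirements.
 * Returns the number of bytes written, or -1 on error.
 */
static ssize_t send_mapped(int out_fd, int in_fd, size_t file_len, size_t window)
{
    if (file_len == 0)
        return 0;

    char *base = mmap(NULL, file_len, PROT_READ, MAP_SHARED, in_fd, 0);
    if (base == MAP_FAILED)
        return -1;
    (void)madvise(base, file_len, MADV_SEQUENTIAL);

    size_t done = 0;
    while (done < file_len) {
        size_t chunk = file_len - done < window ? file_len - done : window;
        size_t sent = 0;

        while (sent < chunk) {
            ssize_t n = write(out_fd, base + done + sent, chunk - sent);
            if (n <= 0) {              /* error; EINTR is not retried here */
                munmap(base, file_len);
                return -1;
            }
            sent += (size_t)n;
        }
        /* Forget-behind: this window will not be touched again. */
        (void)madvise(base + done, chunk, MADV_DONTNEED);
        done += chunk;
    }
    munmap(base, file_len);
    return (ssize_t)done;
}
```

Two threads serving the same file would each hold their own mapping here, so one thread dropping its window does not invalidate the other's view of the data.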
* Re: Thread implementations...
1998-06-30 13:10 ` Stephen C. Tweedie
@ 1998-06-30 19:35 ` Dean Gaudet
1998-07-01 9:09 ` Stephen C. Tweedie
0 siblings, 1 reply; 26+ messages in thread
From: Dean Gaudet @ 1998-06-30 19:35 UTC (permalink / raw)
To: Stephen C. Tweedie
Cc: Eric W. Biederman, Christoph Rohland, linux-kernel, linux-mm

On Tue, 30 Jun 1998, Stephen C. Tweedie wrote:

> Not for very large files: the forget-behind is absolutely critical in
> that case.

I dunno why you're thinking of unmapping pages though... isn't an mmap
cache the best way to amortize the extra cost of mmap()ing? In that case
you don't want the forget-behind pages to be unmapped. But you do want
them to be dropped from memory when appropriate.

Another thought re: sendfile. The network layer could hint to sendfile as
to the speed of the socket it's delivering to. With that hint and some
suitable queueing theory someone should be able to get a nifty little
algorithm that will "synchronize" sockets as much as possible without
noticeable delays to the user. By "synchronize" I mean getting them going
from the same, or nearby pages. That way on larger than memory data sets
the kernel can sacrifice some latency on a few connections in order to
improve the total throughput. I won't pretend to have a good heuristic
for it ;)

applications: multimedia servers -- audio/video streaming. These boxes
can be limited by disk bandwidth because their data sets are typically
much larger than RAM.

Dean
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Thread implementations...
1998-06-30 19:35 ` Dean Gaudet
@ 1998-07-01 9:09 ` Stephen C. Tweedie
0 siblings, 0 replies; 26+ messages in thread
From: Stephen C. Tweedie @ 1998-07-01 9:09 UTC
To: Dean Gaudet
Cc: Stephen C. Tweedie, Eric W. Biederman, Christoph Rohland, linux-kernel, linux-mm
Hi,
On Tue, 30 Jun 1998 12:35:35 -0700 (PDT), Dean Gaudet
<dgaudet-list-linux-kernel@arctic.org> said:
> On Tue, 30 Jun 1998, Stephen C. Tweedie wrote:
>> Not for very large files: the forget-behind is absolutely critical in
>> that case.
> I dunno why you're thinking of unmapping pages though... But you do
> want them to be dropped from memory when appropriate.
We want to *physically* unmap them from the page tables. You can't
evict the pages from cache if they are still physically mapped!
--Stephen

* Re: Thread implementations...
1998-06-24 22:00 ` Eric W. Biederman
1998-06-24 23:41 ` Richard Gooch
@ 1998-06-25 4:12 ` Dean Gaudet
1998-06-25 3:53 ` Richard Gooch
1998-06-25 4:56 ` Eric W. Biederman
1 sibling, 2 replies; 26+ messages in thread
From: Dean Gaudet @ 1998-06-25 4:12 UTC
To: Eric W. Biederman
Cc: Richard Gooch, linux-kernel, linux-mm
On 24 Jun 1998, Eric W. Biederman wrote:
> >>>>> "RG" == Richard Gooch <Richard.Gooch@atnf.CSIRO.AU> writes:
>
> RG> If we get madvise(2) right, we don't need sendfile(2), correct?
>
> It looks like it from here. As far as madvise goes, I think we need
> to implement madvise(2) as:
... note that mmap() requires a bunch of kernel structures set up to map
things into the program's memory space... when in reality the program
doesn't care at all about the bytes. (And then there's process address
space limitations...) sendfile() and such don't have these problems, and
it may be far more simple to implement sendfile() than it would be to put
all the hints and such into the mm layer to get mmap() performance up to
the same level.
Dean

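To make Dean's contrast concrete, the following sketch copies a file with sendfile(2) and never maps it into the process. It assumes the interface found on present-day Linux (declared in <sys/sendfile.h>), which is not necessarily the interface that was under discussion in 1998; the function name send_whole_file is hypothetical.

```c
#include <errno.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: push all of in_fd to out_fd without mapping the
 * data into user space.  Returns bytes sent, or -1 on error. */
ssize_t send_whole_file(int in_fd, int out_fd)
{
    struct stat st;
    if (fstat(in_fd, &st) < 0)
        return -1;

    off_t offset = 0;           /* sendfile() advances this for us */
    while (offset < st.st_size) {
        ssize_t n = sendfile(out_fd, in_fd, &offset,
                             (size_t)(st.st_size - offset));
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted: retry from current offset */
            return -1;
        }
        if (n == 0)
            break;              /* file shrank underneath us */
    }
    return (ssize_t)offset;
}
```

No page tables are set up for the file contents here, which is the cost Dean is pointing at; the readahead and forget-behind policy questions raised elsewhere in the thread still apply inside the kernel.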
* Re: Thread implementations...
1998-06-25 4:12 ` Dean Gaudet
@ 1998-06-25 3:53 ` Richard Gooch
1998-06-25 11:32 ` Stephen C. Tweedie
1998-06-25 4:56 ` Eric W. Biederman
1 sibling, 1 reply; 26+ messages in thread
From: Richard Gooch @ 1998-06-25 3:53 UTC
To: Dean Gaudet
Cc: Eric W. Biederman, linux-kernel, linux-mm
Dean Gaudet writes:
> On 24 Jun 1998, Eric W. Biederman wrote:
>
> > >>>>> "RG" == Richard Gooch <Richard.Gooch@atnf.CSIRO.AU> writes:
> >
> > RG> If we get madvise(2) right, we don't need sendfile(2), correct?
> >
> > It looks like it from here. As far as madvise goes, I think we need
> > to implement madvise(2) as:
>
> ... note that mmap() requires a bunch of kernel structures set up to map
> things into the program's memory space... when in reality the program
> doesn't care at all about the bytes. (And then there's process address
> space limitations...) sendfile() and such don't have these problems, and
> it may be far more simple to implement sendfile() than it would be to put
> all the hints and such into the mm layer to get mmap() performance up to
> the same level.
This may be true, but my point is that we *need* a decent madvise(2)
implementation. It will be of use to a greater range of applications than
sendfile(2).
Regards,
Richard....

* Re: Thread implementations...
1998-06-25 3:53 ` Richard Gooch
@ 1998-06-25 11:32 ` Stephen C. Tweedie
1998-06-25 21:24 ` Chris Wedgwood
1998-06-25 22:16 ` Richard Gooch
0 siblings, 2 replies; 26+ messages in thread
From: Stephen C. Tweedie @ 1998-06-25 11:32 UTC
To: Richard Gooch
Cc: Dean Gaudet, Eric W. Biederman, linux-kernel, linux-mm
Hi,
On Thu, 25 Jun 1998 13:53:36 +1000, Richard Gooch
<Richard.Gooch@atnf.CSIRO.AU> said:
> This may be true, but my point is that we *need* a decent madvise(2)
> implementation. It will be of use to a greater range of applications than
> sendfile(2).
Not necessarily; we may be able to detect a lot of the relevant access
patterns ourselves. Ingo has had a swap prediction algorithm for a
while, and we talked at Usenix about a number of other things we can do
to tune vm performance automatically. 2.3 ought to be a great deal
better. madvise() may still have merit, but we really ought to be
aiming at making the vm system as self-tuning as possible.
--Stephen

* Re: Thread implementations...
1998-06-25 11:32 ` Stephen C. Tweedie
@ 1998-06-25 21:24 ` Chris Wedgwood
1998-06-25 22:16 ` Richard Gooch
1 sibling, 0 replies; 26+ messages in thread
From: Chris Wedgwood @ 1998-06-25 21:24 UTC
To: Stephen C. Tweedie, Richard Gooch
Cc: Dean Gaudet, Eric W. Biederman, linux-kernel, linux-mm
> Not necessarily; we may be able to detect a lot of the relevant access
> patterns ourselves. Ingo has had a swap prediction algorithm for a
> while, and we talked at Usenix about a number of other things we can do
> to tune vm performance automatically. 2.3 ought to be a great deal
> better. madvise() may still have merit, but we really ought to be
> aiming at making the vm system as self-tuning as possible.
madvise(2) will _always_ have some uses. Large database applications and
stuff can know in advance how to tune mmap regions and stuff. The kernel
will always be second guessing here, and making suboptimal decisions,
whereas the application can and probably does know better.
The same argument also applies to raw devices (but let's not start that
thread again).
-Chris

* Re: Thread implementations...
1998-06-25 11:32 ` Stephen C. Tweedie
1998-06-25 21:24 ` Chris Wedgwood
@ 1998-06-25 22:16 ` Richard Gooch
1 sibling, 0 replies; 26+ messages in thread
From: Richard Gooch @ 1998-06-25 22:16 UTC
To: Stephen C. Tweedie
Cc: linux-kernel, linux-mm
Stephen C. Tweedie writes:
> Hi,
>
> On Thu, 25 Jun 1998 13:53:36 +1000, Richard Gooch
> <Richard.Gooch@atnf.CSIRO.AU> said:
>
> > This may be true, but my point is that we *need* a decent madvise(2)
> > implementation. It will be of use to a greater range of applications than
> > sendfile(2).
>
> Not necessarily; we may be able to detect a lot of the relevant access
> patterns ourselves. Ingo has had a swap prediction algorithm for a
> while, and we talked at Usenix about a number of other things we can do
> to tune vm performance automatically. 2.3 ought to be a great deal
> better. madvise() may still have merit, but we really ought to be
> aiming at making the vm system as self-tuning as possible.
Including when I access my tiled data?
Regards,
Richard....

* Re: Thread implementations...
1998-06-25 4:12 ` Dean Gaudet
1998-06-25 3:53 ` Richard Gooch
@ 1998-06-25 4:56 ` Eric W. Biederman
1998-06-25 11:35 ` Stephen C. Tweedie
1 sibling, 1 reply; 26+ messages in thread
From: Eric W. Biederman @ 1998-06-25 4:56 UTC
To: Dean Gaudet
Cc: Eric W. Biederman, Richard Gooch, linux-kernel, linux-mm
>>>>> "DG" == Dean Gaudet <dgaudet-list-linux-kernel@arctic.org> writes:
DG> On 24 Jun 1998, Eric W. Biederman wrote:
>> >>>>> "RG" == Richard Gooch <Richard.Gooch@atnf.CSIRO.AU> writes:
>> RG> If we get madvise(2) right, we don't need sendfile(2), correct?
>>
>> It looks like it from here. As far as madvise goes, I think we need
>> to implement madvise(2) as:
DG> ... note that mmap() requires a bunch of kernel structures set up to map
DG> things into the program's memory space... when in reality the program
DG> doesn't care at all about the bytes. (And then there's process address
DG> space limitations...) sendfile() and such don't have these problems, and
DG> it may be far more simple to implement sendfile() than it would be to put
DG> all the hints and such into the mm layer to get mmap() performance up to
DG> the same level.
mmap, madvise(SEQUENTIAL), write
is easy to implement. The mmap layer already does readahead, all we
do is tell it not to be so conservative.
Meanwhile to write sendfile, you need to do all of the same work (except
the page tables) without an interface to do it with. madvise looks
simpler from here.
Eric

* Re: Thread implementations...
1998-06-25 4:56 ` Eric W. Biederman
@ 1998-06-25 11:35 ` Stephen C. Tweedie
1998-06-25 20:31 ` Dean Gaudet
1998-06-30 6:40 ` Eric W. Biederman
0 siblings, 2 replies; 26+ messages in thread
From: Stephen C. Tweedie @ 1998-06-25 11:35 UTC
To: Eric W. Biederman
Cc: Dean Gaudet, Richard Gooch, linux-kernel, linux-mm
Hi,
On 24 Jun 1998 23:56:28 -0500, ebiederm+eric@npwt.net (Eric
W. Biederman) said:
> mmap, madvise(SEQUENTIAL), write
> is easy to implement. The mmap layer already does readahead, all we
> do is tell it not to be so conservative.
Swap readahead is also now possible. However, madvise(SEQUENTIAL) needs
to do much more than this; it needs to aggressively track what region of
the vma is being actively used, and to unmap those areas no longer in
use. (They can remain in cache until the memory is needed for something
else, of course.) The madvise is only going to be important if the
whole file / vma does not fit into memory, so having advice that a piece
of memory not recently accessed is unlikely to be accessed again until
the next sequential pass is going to be very valuable. It will prevent
us from having to swap out more useful stuff.
--Stephen

* Re: Thread implementations...
1998-06-25 11:35 ` Stephen C. Tweedie
@ 1998-06-25 20:31 ` Dean Gaudet
1998-06-30 6:40 ` Eric W. Biederman
1 sibling, 0 replies; 26+ messages in thread
From: Dean Gaudet @ 1998-06-25 20:31 UTC
To: Stephen C. Tweedie
Cc: Eric W. Biederman, Richard Gooch, linux-kernel, linux-mm
On Thu, 25 Jun 1998, Stephen C. Tweedie wrote:
> Hi,
>
> On 24 Jun 1998 23:56:28 -0500, ebiederm+eric@npwt.net (Eric
> W. Biederman) said:
>
> > mmap, madvise(SEQUENTIAL), write
> > is easy to implement. The mmap layer already does readahead, all we
> > do is tell it not to be so conservative.
>
> Swap readahead is also now possible. However, madvise(SEQUENTIAL) needs
> to do much more than this; it needs to aggressively track what region of
> the vma is being actively used, and to unmap those areas no longer in
> use.
Remember it's *regions* not just a region. An http/ftp server sends the
same file over and over and over. There are many cursors moving
sequentially within the same file. A threaded http/ftp server will have
a single mmap, and multiple users of that mmap.
Dean

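Dean's point about several cursors sharing one mapping can be illustrated with a small check. This is one conservative policy invented for illustration, not something proposed in the thread: a cluster is forgotten only if no reader is inside it or within `window` bytes of reaching it. The names may_forget, cursors and window are all hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical policy check for a mapping shared by n sequential readers.
 * The cluster [start, start + clust_size) may be forgotten only if no
 * cursor is inside it or within `window` bytes before it.
 * Offsets are assumed small enough that start + clust_size cannot
 * overflow. */
bool may_forget(const size_t *cursors, size_t n,
                size_t start, size_t clust_size, size_t window)
{
    size_t lo = start > window ? start - window : 0;
    size_t hi = start + clust_size;

    for (size_t i = 0; i < n; i++) {
        /* A reader in [lo, hi) is using the cluster or about to. */
        if (cursors[i] >= lo && cursors[i] < hi)
            return false;
    }
    return true;
}
```

A reader that has just finished the cluster sits at `start + clust_size` and therefore does not block the release; a slower reader trailing close behind does.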
* Re: Thread implementations...
1998-06-25 11:35 ` Stephen C. Tweedie
1998-06-25 20:31 ` Dean Gaudet
@ 1998-06-30 6:40 ` Eric W. Biederman
1 sibling, 0 replies; 26+ messages in thread
From: Eric W. Biederman @ 1998-06-30 6:40 UTC
To: Stephen C. Tweedie
Cc: Dean Gaudet, Richard Gooch, linux-kernel, linux-mm
>>>>> "ST" == Stephen C Tweedie <sct@dcs.ed.ac.uk> writes:
ST> Hi,
ST> On 24 Jun 1998 23:56:28 -0500, ebiederm+eric@npwt.net (Eric
ST> W. Biederman) said:
>> mmap, madvise(SEQUENTIAL), write
>> is easy to implement. The mmap layer already does readahead, all we
>> do is tell it not to be so conservative.
ST> Swap readahead is also now possible. However, madvise(SEQUENTIAL) needs
ST> to do much more than this;
In the long term I agree.
We can get a close approximation to the proper behavior by simply doing
aggressive readahead. This is doable now, and should work in the
presence of multiple readers.
ST> it needs to aggressively track what region of
ST> the vma is being actively used, and to unmap those areas no longer in
ST> use. (They can remain in cache until the memory is needed for something
ST> else, of course.) The madvise is only going to be important if the
ST> whole file / vma does not fit into memory,
Actually it will be important if the whole working set of data (which in
a web server would be _all_ of its files) is too large to fit into
memory. Each file / vma may fit in fine.
ST> so having advice that a piece
ST> of memory not recently accessed is unlikely to be accessed again until
ST> the next sequential pass is going to be very valuable. It will prevent
ST> us from having to swap out more useful stuff.
Agreed.
Eric

Thread overview: 26+ messages
1998-06-30 19:30 Thread implementations Larry McVoy
1998-07-01 8:50 ` Stephen C. Tweedie
1998-07-03 15:21 ` Rik van Riel
1998-07-03 20:05 ` Stephen C. Tweedie
1998-07-03 20:36 ` Rik van Riel
1998-07-04 16:37 ` Stephen C. Tweedie
[not found] <199806240915.TAA09504@vindaloo.atnf.CSIRO.AU>
[not found] ` <Pine.LNX.3.96dg4.980624025515.26983E-100000@twinlark.arctic.org>
[not found] ` <199806241213.WAA10661@vindaloo.atnf.CSIRO.AU>
1998-06-24 22:00 ` Eric W. Biederman
1998-06-24 23:41 ` Richard Gooch
1998-06-25 4:45 ` Eric W. Biederman
1998-06-25 17:14 ` Todd Larason
1998-06-26 7:53 ` Christoph Rohland
1998-06-26 14:16 ` Eric W. Biederman
1998-06-29 10:19 ` Stephen C. Tweedie
1998-06-30 6:19 ` Eric W. Biederman
1998-06-30 13:10 ` Stephen C. Tweedie
1998-06-30 19:35 ` Dean Gaudet
1998-07-01 9:09 ` Stephen C. Tweedie
1998-06-25 4:12 ` Dean Gaudet
1998-06-25 3:53 ` Richard Gooch
1998-06-25 11:32 ` Stephen C. Tweedie
1998-06-25 21:24 ` Chris Wedgwood
1998-06-25 22:16 ` Richard Gooch
1998-06-25 4:56 ` Eric W. Biederman
1998-06-25 11:35 ` Stephen C. Tweedie
1998-06-25 20:31 ` Dean Gaudet
1998-06-30 6:40 ` Eric W. Biederman