* PTE chaining, kswapd and swapin readahead

From: Rik van Riel @ 1998-06-16 22:10 UTC
To: Linux MM

Hi,

In the PTE chaining discussion/patches a while ago, I saw that kswapd
was changed so that it scans memory in physical order instead of
walking the pagetables.

This has the advantage of deallocating memory in physically adjacent
chunks, which will be nice while we still have the primitive buddy
allocator we're using now.

However, it will be a major performance bottleneck when we get around
to implementing the zone allocator and swapin readahead. This is
because we don't need physical deallocation with the zone allocator,
and because swapin readahead is just an awful lot faster when the
pages are contiguous in swap.

I write this to let the PTE people (Stephen and Ben) know that they
probably shouldn't remove the pagetable walking routines from kswapd...

Rik.
+-------------------------------------------------------------------+
| Linux memory management tour guide.        H.H.vanRiel@phys.uu.nl |
| Scouting Vries cubscout leader.      http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+
* Re: PTE chaining, kswapd and swapin readahead

From: Eric W. Biederman @ 1998-06-17 9:24 UTC
To: Rik van Riel
Cc: Linux MM

>>>>> "RR" == Rik van Riel <H.H.vanRiel@phys.uu.nl> writes:

RR> Hi,

RR> In the PTE chaining discussion/patches a while ago, I saw
RR> that kswapd was changed so that it scans memory
RR> in physical order instead of walking the pagetables.

RR> This has the advantage of deallocating memory in physically
RR> adjacent chunks, which will be nice while we still have the
RR> primitive buddy allocator we're using now.

It also has the advantage that shared pages are only scanned once, and
empty address space needn't be scanned.

RR> However, it will be a major performance bottleneck when we
RR> get around to implementing the zone allocator and swapin
RR> readahead. This is because we don't need physical deallocation
RR> with the zone allocator, and because swapin readahead is just
RR> an awful lot faster when the pages are contiguous in swap.

Just what is your zone allocator? I have a few ideas based on the
name, but my ideas don't seem to jibe with your description.
This part about not needing physically contiguous memory is really
puzzling.

RR> I write this to let the PTE people (Stephen and Ben) know
RR> that they probably shouldn't remove the pagetable walking
RR> routines from kswapd...

If we get around to using a true LRU algorithm, we aren't too likely
to swap out pages that are adjacent in the address space... though I
can see the advantage for pages of the same age.

Also, for swapin readahead the only effective strategy I know of is to
implement a kernel system call that says "I'm going to be accessing
this chunk of my address space soon". The clustering people have
already implemented a system call of this nature for their own use.
It would probably be a good idea to do something similar...

Eric
* Re: PTE chaining, kswapd and swapin readahead

From: Rik van Riel @ 1998-06-17 16:03 UTC
To: Eric W. Biederman
Cc: Linux MM

On 17 Jun 1998, Eric W. Biederman wrote:
> >>>>> "RR" == Rik van Riel <H.H.vanRiel@phys.uu.nl> writes:
>
> RR> This has the advantage of deallocating memory in physically
> RR> adjacent chunks, which will be nice while we still have the
> RR> primitive buddy allocator we're using now.
>
> It also has the advantage that shared pages are only scanned once, and
> empty address space needn't be scanned.

OK, this is a _very_ big advantage which I overlooked...

> Just what is your zone allocator? I have a few ideas based on the
> name, but my ideas don't seem to jibe with your description.
> This part about not needing physically contiguous memory is really
> puzzling.

Well, the idea is to divide memory into different areas (of 32 to 256
pages in size, depending on the amount of main memory) for different
uses. There are 3 different uses:
- process pages, buffers and page cache
- pagetables and small (order 0, 1 and maybe 2) SLAB allocations
- large SLAB allocations (order 2, 3, 4 and 5)

On large memory machines (>= 128 MB) we might even split the SLAB
areas into 3 types...

Allocation is always done in the fullest area. We keep track of this
by hashing the areas onto doubly linked lists, one for each of perhaps
8 different degrees of 'fullness'. When an area gets fuller than its
queue is meant to be, it 'promotes' one level up and is added to the
_tail_ of the queue above. When an area gets emptier than its queue is
supposed to be, it gets added to the _head_ of the queue below. This
way the emptier areas get emptier and the fullest areas get fuller,
and we can force-free an area (with PTE chaining) when we're short of
memory.

Inside the user areas, we can simply use a linked list to mark free
pages. Alternatively, we can keep the bookkeeping in a separate area
of memory; this has the advantage that we don't have to reread a page
when it's needed shortly after we swapped it out. Then we can simply
use a bitmap and a slightly optimized function.

For the SLAB areas, where we use different allocation sizes, we could
use a simple buddy allocator. Because the SLAB data is usually either
long-lived or _very_ short-lived, and because we use only a few
different sizes in one area, the buddy allocator could actually work
here. Maybe we want the SLAB allocator to give us a hint on whether it
needs the memory for a long or a short period, and use separate
areas...

There's no code yet, because I'm having a lot of trouble switching to
glibc _and_ keeping PPP working :(  Maybe later this month.

> RR> I write this to let the PTE people (Stephen and Ben) know
> RR> that they probably shouldn't remove the pagetable walking
> RR> routines from kswapd...
>
> If we get around to using a true LRU algorithm, we aren't too likely
> to swap out pages that are adjacent in the address space... though I
> can see the advantage for pages of the same age.

True LRU swapping might actually be a disadvantage. The way we do
things now (walking process address space) can result in a much higher
I/O bandwidth to/from the swapping device.

> Also, for swapin readahead the only effective strategy I know of is to
> implement a kernel system call that says "I'm going to be accessing

There are more possibilities.
One of them is to use the same readahead tactic that is being used for
mmap() readahead. To do this, we'll either have to rewrite the mmap()
stuff, or we can piggyback on the mmap() code by writing a vnode
system for normal memory areas.

The vnode system is probably easier, since it would also allow an easy
implementation of shared memory and easier tracking of memory movement
(since we never lose track of pages). This vnode system would also
make it possible to turn off memory overcommitment (something that a
lot of people have requested) and do some other nice tricks...

Rik.
+-------------------------------------------------------------------+
| Linux memory management tour guide.        H.H.vanRiel@phys.uu.nl |
| Scouting Vries cubscout leader.      http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+
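A minimal user-space sketch of the fullness-queue bookkeeping Rik
describes above. The structure layout, the 64-page area size, the
eight queues and the helper names are illustrative assumptions only,
not proposed kernel code.

/*
 * Areas sit on one of several doubly linked lists keyed by how full
 * they are.  When an area crosses a threshold it is promoted to the
 * tail of the queue above or demoted to the head of the queue below,
 * and allocation always starts from the fullest usable queue, so full
 * areas stay full and empty areas drain toward empty.
 */
#include <stddef.h>

#define AREA_PAGES      64         /* assumed pages per area            */
#define FULLNESS_LEVELS 8          /* assumed number of fullness queues */

struct area {
    struct area *next, *prev;      /* links within one fullness queue   */
    unsigned int used;             /* pages currently allocated         */
    int level;                     /* queue this area currently sits on */
};

static struct area *queues[FULLNESS_LEVELS];   /* index 0 = emptiest */

static int level_of(const struct area *a)
{
    return (int)(a->used * FULLNESS_LEVELS / (AREA_PAGES + 1));
}

static void unlink_area(struct area *a)
{
    if (a->prev)
        a->prev->next = a->next;
    else
        queues[a->level] = a->next;
    if (a->next)
        a->next->prev = a->prev;
    a->next = a->prev = NULL;
}

/* Promotion goes to the tail, demotion to the head, as described above. */
static void requeue(struct area *a)
{
    int new_level = level_of(a);

    if (new_level == a->level)
        return;
    unlink_area(a);
    if (new_level > a->level) {        /* promote: append at the tail */
        struct area **pp = &queues[new_level];
        while (*pp) {
            a->prev = *pp;
            pp = &(*pp)->next;
        }
        *pp = a;
    } else {                           /* demote: push at the head */
        a->next = queues[new_level];
        if (a->next)
            a->next->prev = a;
        queues[new_level] = a;
    }
    a->level = new_level;
}

/* Allocate one page from the fullest area that still has room. */
static struct area *alloc_page_area(void)
{
    for (int q = FULLNESS_LEVELS - 1; q >= 0; q--)
        for (struct area *a = queues[q]; a; a = a->next)
            if (a->used < AREA_PAGES) {
                a->used++;
                requeue(a);
                return a;
            }
    return NULL;                       /* no free page anywhere */
}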
* Re: PTE chaining, kswapd and swapin readahead

From: Eric W. Biederman @ 1998-06-18 4:06 UTC
To: Rik van Riel
Cc: Linux MM

>>>>> "RR" == Rik van Riel <H.H.vanRiel@phys.uu.nl> writes:

RR> On 17 Jun 1998, Eric W. Biederman wrote:
>> If we get around to using a true LRU algorithm, we aren't too likely
>> to swap out pages that are adjacent in the address space... though I
>> can see the advantage for pages of the same age.

RR> True LRU swapping might actually be a disadvantage. The way
RR> we do things now (walking process address space) can result
RR> in a much higher I/O bandwidth to/from the swapping device.

The truly optimal algorithm is to read in the page/pages we are going
to use next, and to remove the page we won't use for the longest
period of time. It's the same as LRU but in reverse time order. Both
of these algorithms have the important property that they avoid
Belady's anomaly: with more memory they won't cause more page faults
and more I/O.

The goal should be to reduce disk I/O, as disk bandwidth is way below
memory bandwidth. Using ``unused'' disk bandwidth for prepaging may
also help.

Note: much of the write I/O performance we achieve is because
get_swap_page() is very efficient at returning adjacent swap pages.
I don't see where location in memory makes a difference.

As to which pages should be put close to each other for read
performance, that is a different question. Files tend to be
read/written sequentially, so it is a safe bet to anticipate this one
usage pattern and capitalize on it when it is happening; if it isn't,
don't do any readahead.

We could probably add a few more likely cases to the vm system. The
only simple special cases I can think of adding are reverse sequential
access, and stack access where pages 1 2 3 4 are accessed and then 4 3
2 1 are accessed in reverse order. But for the general case it is
quite difficult to predict, and a wrong prediction could make things
worse.

>> Also, for swapin readahead the only effective strategy I know of is to
>> implement a kernel system call that says "I'm going to be accessing

The point I was hoping to make is that for programs that find
themselves swapping frequently, a non-blocking read (for mmapped
areas) can be quite effective, because in certain circumstances a
program can in fact predict which pages it will be using next. This
allows a very close approximation of the true optimal paging
algorithm, and it is fairly simple to implement.

For a really close approximation we might also want a system call that
says which pages we aren't likely to be using soon. Perhaps:

    mtouch(void *addr, int len, int when);

where ``when'' can be SOON or LATER...  Or, after looking at
http://cesdis.gsfc.nasa.gov/beowulf/software/pager.html:

    struct mtouch_tuple {
            caddr_t addr;
            size_t  extent;
            int     when;
    } prepage_list[NR_ELEMENTS];

    mtouch(prepage_list);

where prepage_list is terminated with a special element. I don't like
the suggested terminator of a NULL address (0), because there are some
programs that actually use that address... The ``when'' parameter is
my idea...

RR> There are more possibilities. One of them is to use the
RR> same readahead tactic that is being used for mmap()
RR> readahead.

Actually that sounds like a decent idea, but I doubt it will help
much. I will start on the vnodes fairly soon, after I get a kernel
pgflush daemon working.
Eric
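Since Belady's anomaly is mentioned above, here is a small
self-contained demonstration of the effect that LRU and the optimal
algorithm avoid: with FIFO replacement, the classic reference string
below causes more page faults with four frames than with three. This
is only an illustration of the textbook concept, not kernel code.

/* Count page faults for FIFO replacement with a given number of frames. */
#include <stdio.h>
#include <string.h>

static int fifo_faults(const int *refs, int n, int frames)
{
    int mem[16];
    int used = 0, next = 0, faults = 0;

    memset(mem, -1, sizeof(mem));
    for (int i = 0; i < n; i++) {
        int hit = 0;
        for (int j = 0; j < used; j++)
            if (mem[j] == refs[i])
                hit = 1;
        if (!hit) {
            faults++;
            if (used < frames) {
                mem[used++] = refs[i];
            } else {
                mem[next] = refs[i];           /* evict the oldest page */
                next = (next + 1) % frames;
            }
        }
    }
    return faults;
}

int main(void)
{
    /* The classic reference string used to show Belady's anomaly. */
    const int refs[] = { 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5 };
    const int n = sizeof(refs) / sizeof(refs[0]);

    printf("FIFO, 3 frames: %d faults\n", fifo_faults(refs, n, 3)); /* 9  */
    printf("FIFO, 4 frames: %d faults\n", fifo_faults(refs, n, 4)); /* 10 */
    return 0;
}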
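The hint interface proposed above is close in spirit to what Linux
later exposed as madvise(2) with MADV_WILLNEED and MADV_DONTNEED. The
sketch below only illustrates the idea in those terms; mtouch() in the
comments refers to the hypothetical call proposed in the message, not
an existing system call, and the file name and mapping size are
made-up example values.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t len = 16 * 1024 * 1024;       /* 16 MB example mapping   */
    int fd = open("large-data.bin", O_RDONLY); /* hypothetical input file */
    if (fd < 0) {
        perror("open");
        return EXIT_FAILURE;
    }

    char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return EXIT_FAILURE;
    }

    /* "I'm going to be accessing this chunk of my address space soon." */
    madvise(p, len / 2, MADV_WILLNEED);        /* roughly mtouch(..., SOON)  */

    /* ... work on the first half of the mapping ... */

    /* "I'm not likely to be using these pages for a while." */
    madvise(p, len / 2, MADV_DONTNEED);        /* roughly mtouch(..., LATER) */

    munmap(p, len);
    close(fd);
    return EXIT_SUCCESS;
}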
* Re: PTE chaining, kswapd and swapin readahead

From: Rik van Riel @ 1998-06-18 7:25 UTC
To: Eric W. Biederman
Cc: Linux MM

On 17 Jun 1998, Eric W. Biederman wrote:
> >>>>> "RR" == Rik van Riel <H.H.vanRiel@phys.uu.nl> writes:
>
> RR> True LRU swapping might actually be a disadvantage. The way
> RR> we do things now (walking process address space) can result
> RR> in a much higher I/O bandwidth to/from the swapping device.
>
> The goal should be to reduce disk I/O, as disk bandwidth is way below
> memory bandwidth. Using ``unused'' disk bandwidth for prepaging may
> also help.

For normal disks, most I/O is so completely dominated by _seeks_ that
transfer time can be almost completely ignored. This is why we should
focus on reducing the number of disk I/Os, not the number of blocks
transferred.

> Note: much of the write I/O performance we achieve is because
> get_swap_page() is very efficient at returning adjacent swap pages.
> I don't see where location in memory makes a difference.

It makes a difference because:
- programs usually use larger areas in one piece
- swapout clustering saves a lot of I/O (like you just said)
- when pages from a process are adjacent on disk, we can save the
  same amount of I/O on _swapin_ too
- this will result in a _much_ improved bandwidth

> We could probably add a few more likely cases to the vm system. The
> only simple special cases I can think of adding are reverse sequential
> access, and stack access where pages 1 2 3 4 are accessed and then 4 3
> 2 1 are accessed in reverse order.

Maybe that's why stacks grow down :-)  Looking at the addresses of a
shrinking stack, you'll notice that linear forward readahead is still
the best algorithm.

> >> Also, for swapin readahead the only effective strategy I know of is to
> >> implement a kernel system call that says "I'm going to be accessing
>
> The point I was hoping to make is that for programs that find
> themselves swapping frequently, a non-blocking read (for mmapped
> areas) can be quite effective.

Agreed. And combined with your other (snipped) call, it might give a
_huge_ advantage for large simulations and other programs which
implement it.

> RR> There are more possibilities. One of them is to use the
> RR> same readahead tactic that is being used for mmap()
> RR> readahead.
>
> Actually that sounds like a decent idea, but I doubt it will help
> much. I will start on the vnodes fairly soon, after I get a kernel
> pgflush daemon working.

It _does_ help very much. A lot of the perceived slowness of Linux is
the 'task switching' in X. By this I mean users selecting another
window as their foreground window, causing X and the programs to read
in huge amounts of graphics data while simultaneously swapping out
other data. By implementing the same readahead tactic as we use for
mmap()ed files, we could cut the number of I/Os by more than one
third, and probably even more.

Rik.
+-------------------------------------------------------------------+
| Linux memory management tour guide.        H.H.vanRiel@phys.uu.nl |
| Scouting Vries cubscout leader.      http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+
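A rough back-of-the-envelope check on the seek-dominated claim above,
using assumed figures typical of a late-1990s disk rather than
measurements (about 10 ms average seek plus rotational delay, about
10 MB/s sustained transfer):

    1 page (4 KB), random:      10 ms seek + ~0.4 ms transfer  = ~10.4 ms
    8 pages (32 KB), adjacent:  10 ms seek + ~3.2 ms transfer  = ~13.2 ms
    8 pages, scattered:         8 x ~10.4 ms                   = ~83 ms

Under these assumptions, clustering eight swap pages into one
contiguous request is roughly a six-fold saving, almost all of it from
avoided seeks.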