On Wed, Apr 17, 2013 at 10:56 AM, John Stultz wrote:
> LSF-MM Volatile Ranges Discussion Plans
> ========================================
>
> Just wanted to send this out to hopefully prime the discussion at
> lsf-mm tomorrow (should the schedule hold). Much of it is background
> material we won't have time to cover.
>
> First of all, this is my (John's) perspective here. Minchan may
> disagree with me on specifics, but I think it covers the desired
> behavior fairly well, and I've tried to call out the places where
> we currently don't yet agree.
>
>
> Volatile Ranges:
> ----------------
>
> The idea is from Android's ashmem feature (originally by Robert
> Love), which allows for unpinned ranges.
>
> I've been told other OSes support similar functionality
> (VM_FLAGS_PURGABLE and MEM_RESET/MEM_RESET_UNDO).
>
> It's been slow going the last six months on my part, due to lots of
> adorable SIGBABY interruptions & other work.
>
>
> Concept in general:
> -------------------
>
> Applications mark memory as volatile, allowing the kernel to purge
> that memory if and when it's needed. Applications can mark memory as
> non-volatile, and the kernel will return a value to notify them if
> the memory was purged while it was volatile.
>
>
> Use cases:
> ----------
>
> Allows for eviction of userspace caches by the kernel, which is nice
> as applications don't have to tinker with optimizing cache sizes;
> the kernel, which has the global view, will optimize it for them.
>
> Marking obscured bitmaps of rendered image data volatile. Ie: keep
> the compressed jpeg around, but mark off-screen rendered bitmaps
> volatile.
>
> Marking non-visible web-browser tabs as volatile.
>
> Lazy freeing of heap in malloc/free implementations.
>
>
> Parallel ways of thinking about it:
> -----------------------------------
>
> Similar to MADV_DONTNEED, but eviction is needs-based, not
> instantaneous. Also, applications can cancel eviction if it hasn't
> happened yet (by setting non-volatile). So it's a sort of delayed
> and cancel-able MADV_DONTNEED.
>
> Can consider it like swapping some pages to /dev/null?
>
> Rik's MADV_FREE was very similar, but with implicit NON_VOLATILE
> marking on page-dirtying.
>
>
> Two basic usage-modes:
> ----------------------
>
> 1) The application explicitly unmarks memory as volatile whenever it
>    uses it, never touching memory marked volatile.
>
>    If memory is purged, the application is notified when it marks
>    the area as non-volatile.
>
> 2) Applications may access memory marked volatile, but should they
>    access memory that was purged, they will receive SIGBUS.
>
>    On SIGBUS, the application has to mark the needed range as
>    non-volatile, regenerate or re-fetch the data, and then can
>    continue.
>
>    This is a little more optimistic, but applications need to be
>    able to handle getting a SIGBUS and fixing things up.
>
>    This second, optimistic method is desired by the Mozilla folks.
>
>
> Important Goals:
> ----------------
>
> Applications using this are likely to mark and unmark ranges
> frequently (ideally only marking the data they immediately need as
> non-volatile). This makes it necessary for these operations to be
> cheap, since applications won't volunteer their currently unused
> memory to the kernel if it adds dramatic overhead. This concern is
> lessened with the optimistic/SIGBUS usage-mode, though.
>
> Overall, we try to push costs from the mark/unmark paths to the page
> eviction side.
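
To make usage-mode 2 above concrete, here's a rough userspace sketch of
the optimistic access pattern. mvrange() is only a proposed syscall, so
the wrapper declaration, the MVRANGE_* mode constants and
regenerate_data() below are hypothetical placeholders rather than an
existing API:

/*
 * Sketch of usage-mode 2: touch possibly-volatile memory directly and
 * recover in the SIGBUS handler if the kernel purged it underneath us.
 * mvrange(), MVRANGE_* and regenerate_data() are hypothetical.
 */
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

#define MVRANGE_VOLATILE	0	/* hypothetical mode values */
#define MVRANGE_NONVOLATILE	1

extern long mvrange(void *start, size_t len, int mode, int flags,
		    int *purged);
extern void regenerate_data(void *addr, size_t len);	/* app-specific */

static sigjmp_buf purge_env;

static void sigbus_handler(int sig)
{
	siglongjmp(purge_env, 1);
}

/* Read the first byte of a cache entry that may currently be volatile. */
char read_cache_entry(char *addr, size_t len)
{
	struct sigaction sa;
	int purged = 0;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = sigbus_handler;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGBUS, &sa, NULL);

	if (sigsetjmp(purge_env, 1)) {
		/* Faulted on a purged page: pin the range and rebuild it. */
		mvrange(addr, len, MVRANGE_NONVOLATILE, 0, &purged);
		regenerate_data(addr, len);
	}

	return addr[0];	/* optimistic access; may SIGBUS if purged */
}

(sigsetjmp() with a non-zero savemask is used so SIGBUS is unblocked
again after jumping out of the handler.)
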
>
> Two basic types of volatile memory:
> -----------------------------------
>
> 1) File-based memory
>
> 2) Anonymous memory
>
>
> Volatile ranges on file memory:
> -------------------------------
>
> This allows for using volatile ranges on shared memory between
> processes.
>
> Very similar to ashmem's unpinned pages.
>
> One example: Two processes can create a large circular buffer, where
> any unused memory in that buffer is volatile. The producer marks
> memory as non-volatile, then writes to it. The consumer would read
> the data, then mark it volatile.
>
> An important distinction here is that the volatility is shared, in
> the same way the file's data is shared. It's a property of the
> file's pages, not a property of the process that marked the range as
> volatile. Thus one application can mark file data as volatile, and
> the pages could be purged from all applications mapping that data.
> And a different application could mark it as non-volatile, and that
> would keep it from being purged from all applications.
>
> For this reason, the volatility is likely best stored on the
> address_space (or otherwise connected to the address_space/inode).
>
> Another important semantic: Volatility is cleared when all fd's to
> a file are closed.
>
> There's no really good way for volatility to persist when no one
> is using a file.
>
> It could cause confusion if an application died leaving some
> file data volatile, and then had that data disappear as it was
> starting up again.
>
> No volatility across reboots!
>
> [TBD]: For the most part, volatile ranges really only make sense to
> me on tmpfs files. Mostly because the semantics of purging data on
> files are similar to hole punching, and I suspect having the
> resulting punched hole pushed out to disk would cause additional io
> and load. Partial range purging could have strange effects on the
> resulting file.
>
> [TBD]: Minchan disagrees and thinks fadvise(DONTNEED) has problems,
> as it causes immediate writeout when there's plenty of free memory
> (possibly unnecessary). Although we may defer so long that the hole
> is never punched, which may be problematic.
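
Re the circular-buffer example above, a minimal sketch of how the two
sides might drive it through the proposed file interface. fvrange(),
its constants, and fill_chunk()/process_chunk() are hypothetical
stand-ins (the syscall is only proposed), and fd is assumed to be a
tmpfs/ashmem-style file mapped at `map` in both processes:

/*
 * Producer/consumer over a shared file-backed ring: only the chunk
 * currently being produced or consumed is pinned (non-volatile);
 * everything else is left volatile for the kernel to reclaim.
 * fvrange() and FVRANGE_* are hypothetical placeholders.
 */
#include <stddef.h>
#include <sys/types.h>

#define FVRANGE_VOLATILE	0	/* hypothetical mode values */
#define FVRANGE_NONVOLATILE	1

extern long fvrange(int fd, off_t start, size_t len, int mode, int flags,
		    int *purged);
extern void fill_chunk(char *p, size_t len);	/* app-specific */
extern void process_chunk(char *p, size_t len);	/* app-specific */

/* Producer: pin a chunk before filling it. */
void produce(int fd, char *map, off_t off, size_t len)
{
	int purged = 0;

	fvrange(fd, off, len, FVRANGE_NONVOLATILE, 0, &purged);
	/* Don't care if it was purged: we're about to overwrite it. */
	fill_chunk(map + off, len);
}

/* Consumer: read a chunk, then hand it back to the kernel as volatile. */
void consume(int fd, char *map, off_t off, size_t len)
{
	process_chunk(map + off, len);
	fvrange(fd, off, len, FVRANGE_VOLATILE, 0, NULL);
}

Since the volatility lives with the file's pages, the consumer's
FVRANGE_VOLATILE call makes the chunk purgeable for both processes,
and the producer's FVRANGE_NONVOLATILE call pins it for both.
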
>
> Volatile ranges on anonymous/process memory:
> --------------------------------------------
>
> Anonymous memory is mostly un-shared between processes (except for
> copy-on-write pages).
>
> The only way to address anonymous memory is really relative to the
> process address space (it's anonymous: there's no named handle to
> it).
>
> Same semantics as described above: mark a region of process memory
> volatile, or non-volatile.
>
> Volatility is a per-process (well, per-mm_struct) state.
>
> The kernel will only purge a memory page if *all* the processes that
> map that page consider the page volatile.
>
> Important semantics: Preserve volatility over a fork, but clear
> child volatility on exec.
>
> So if a process marks a range as volatile and then forks, both
> the child and parent should see the same range as volatile.
> On memory pressure, the kernel could purge those pages, since all
> of the processes that map that page consider it volatile.
>
> If the child writes to the pages, the COW links are broken, but
> both ranges are still volatile, and can be purged until they
> are marked non-volatile or cleared.
>
> Then, like mappings and the rest of memory, volatile ranges are
> cleared on exec.
>
>
> Implementation history:
> -----------------------
>
> File-focused (John): Interval tree connected to the address_space
> w/ a global LRU of unpurged volatile ranges. Used a shrinker to
> trigger purging off the LRU. NUMA folks complained that the shrinker
> is NUMA-unaware and would cause purging on nodes not under pressure.
>
> File-focused (John): Checking volatility at page eviction time.
> Caused problems on swap-free systems, since tmpfs pages are
> anonymous and aren't aged/shrunk off the LRUs. In order to handle
> that we moved the pages to a volatile LRU list, but that causes
> volatile/non-volatile operations to be very expensive: O(n) in the
> number of pages in the range.
>
> Anon-focused (Minchan): Store volatility in the VMA. Worked well for
> anonymous ranges, but was problematic to extend to file ranges, as
> we need the volatility state to be connected with the file, not the
> process. Iterating across and splitting VMAs was somewhat costly.
>
> Anon-focused (Minchan): Store anonymous volatility in an interval
> tree off of the mm_struct. Use a global LRU of volatile ranges when
> purging ranges via a shrinker. Also hooks into normal eviction to
> make sure evicted pages are purged instead of swapped out. Very
> fast, due to quick manipulations of a single interval tree. File
> pages in ranges are ignored.
>
> Both (John): Same as above, but mostly extended so the interval tree
> of ranges can be hung off of the mm_struct OR an address_space.
> Currently functionality is partitioned so volatile ranges on files
> and on anonymous memory are created via separate syscalls
> (fvrange(fd, start, len, ...) vs mvrange(start_addr, len, ...)).
> Roughly merges the original first approach with the previous one.
>
> Both (John): Currently working on the above, further extending
> mvrange() so it can also be used to set volatility on MAP_SHARED
> file mappings in an address space. Has the problem that handling
> both file and anonymous memory types in a single call requires
> iterating over vmas, which makes the operation more expensive.
>
> [TBD]: Cost impact of mvrange() supporting mapped file pages vs dev
> confusion of it not supporting file pages.
>
>
> Current interfaces:
> -------------------
>
> Two current interfaces:
>   fvrange(fd, start_off, length, mode, flags, &purged)
>
>   mvrange(start_addr, length, mode, flags, &purged)
>
> fd/start/length:
>   Hopefully obvious :)
>
> mode:
>   VOLATILE: Sets the range as volatile. Returns the number of bytes
>   marked volatile.
>
>   NON_VOLATILE: Marks the range as non-volatile. Returns the number
>   of bytes marked non-volatile, and sets the purged value to 1 if
>   any memory in the bytes marked non-volatile was purged.
>
> flags:
>   VRANGE_FULL: On eviction, the entire range specified will be
>   purged.
>
>   VRANGE_PARTIAL: On eviction, we may purge only part of the
>   specified range.
>
>   In earlier discussions, it was deemed that if any page in a
>   volatile range was purged, we might as well purge the entire
>   range, since if we mark any portion of that range as non-volatile,
>   the application would have to regenerate the entire range. Thus
>   we might as well reduce memory pressure by purging the entire
>   range.
>
>   However, with the SIGBUS semantics, applications may be able to
>   continue accessing pages in a volatile range where one unused
>   page is purged, so we may want to avoid purging the entire range
>   to allow for optimistic continued use.
>
>   Additionally, partial purging is helpful so that we don't
>   over-react when we have slight memory pressure. As an example: if
>   we have a 64M vrange and the kernel only needs 8M, it's much
>   cheaper to free 8M now and then, when the range is later marked
>   non-volatile, re-allocate only 8M (fault + allocation +
>   zero-clearing) instead of the entire 64M.
>
> [TBD]: May consider merging flags w/ mode, ie: VOLATILE_FULL,
> VOLATILE_PARTIAL, NON_VOLATILE.
>
> [TBD]: Might be able to simplify and go with VRANGE_PARTIAL all
> the time?
>
> purged:
>   Flag that returns 1 if any pages in the range marked NON_VOLATILE
>   were purged. Is set to zero otherwise. Can be null if
>   mode==VOLATILE.
>
> [TBD]: Might consider having the value passed in be |'ed with 1?
>
> [TBD]: Might consider purged to be more of a status bitflag,
> allowing vrange(VOLATILE) calls to get some meaningful data, like
> whether memory pressure is currently going on.
>
> Return value:
>   The number of bytes marked VOLATILE or NON_VOLATILE. This is
>   necessary because, if we are to deal with setting ranges that
>   cross anonymous and file-backed pages, we have to split the
>   operation up into multiple operations against the respective
>   mm_struct or address_space, and there's a possibility that we
>   could run out of memory mid-way through an operation. If we do
>   run out of memory mid-way, we simply return the number of bytes
>   successfully marked, and we can return an error on the next
>   invocation if we hit the ENOMEM right away.
>
> [TBD]: If mvrange() doesn't affect mapped file pages, then the
> return value can be simpler.
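
Putting the mode/flags/purged/return-value semantics above together, a
usage-mode-1 caller might look roughly like the sketch below. mvrange(),
the MVRANGE_*/VRANGE_* constants and regenerate_cache() are hypothetical
stand-ins for the proposed interface; the loop is there because the call
can legitimately return having marked only part of the range (e.g. on
ENOMEM part-way through):

/*
 * Usage-mode 1: never touch memory while it is marked volatile.
 * Pin it first, check the purged flag, and rebuild if needed.
 * All mvrange()-related names are hypothetical placeholders.
 */
#include <stddef.h>

#define MVRANGE_VOLATILE	0	/* hypothetical mode values */
#define MVRANGE_NONVOLATILE	1
#define VRANGE_PARTIAL		1	/* hypothetical flag value */

extern long mvrange(void *start, size_t len, int mode, int flags,
		    int *purged);
extern void regenerate_cache(void *addr, size_t len);	/* app-specific */

/* Mark a whole range, retrying when only part of it got marked. */
static int mark_range(void *addr, size_t len, int mode, int *purged)
{
	char *p = addr;
	size_t left = len;
	int chunk_purged = 0;

	if (purged)
		*purged = 0;
	while (left) {
		long done = mvrange(p, left, mode, VRANGE_PARTIAL,
				    purged ? &chunk_purged : NULL);

		if (done <= 0)
			return -1;	/* error (or no progress) */
		if (purged && chunk_purged)
			*purged = 1;	/* accumulate across calls */
		p += done;
		left -= done;
	}
	return 0;
}

/* Pin the cache before use; rebuild it if anything was purged. */
void *use_cache(void *cache, size_t len)
{
	int purged = 0;

	if (mark_range(cache, len, MVRANGE_NONVOLATILE, &purged))
		return NULL;
	if (purged)
		regenerate_cache(cache, len);
	return cache;
}

/* Hand the cache back to the kernel when we're done with it. */
void release_cache(void *cache, size_t len)
{
	mark_range(cache, len, MVRANGE_VOLATILE, NULL);
}
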
>
> Current TODOs:
> --------------
>
> Add proper SIGBUS signaling when accessing purged file ranges.
>
> Working on handling mvrange() ranges that cross anonymous and mapped
> file regions.
>
> Handle errors mid-way through operations.
>
> Cleanups and better function names.
>
>
> [TBD] Contentious interface issues:
> -----------------------------------
>
> Does handling mvrange() calls that cross anonymous & file pages
> increase costs too much for the ebizzy workload Minchan likes?
>
>   Have to take mmap_sem and traverse vmas.
>
>   Could mvrange() on file pages not be shared in the same way as
>   in fvrange()?
>
>   Sane interface vs speed?
>
>   Minchan's idea of mvrange(VOLATILE_FILE|VOLATILE_ANON|VOLATILE_BOTH):
>
>     Avoid traversing vmas on the VOLATILE_ANON flag, regardless of
>     whether the range covers mapped file pages.
>
>     Not sure we can throw sane errors without checking vmas?
>
> Do we really need a new syscall interface?
>
>   Can we maybe go back to using madvise?
>
>   Should mvrange be prioritized over fvrange, if mvrange can create
>   volatile ranges on files?
>
> Some folks still don't like SIGBUS on accessing a purged volatile
> page, and instead want a standard zero-fill fault.
>
>   Need some way to know the page was dropped (zero is a valid data
>   value).
>
>   After marking non-volatile, it can be a zero-fill fault.
>
>
> [TBD] Contentious implementation issues:
> ----------------------------------------
>
> Still using a shrinker for purging, got early complaints from NUMA
> folks.
>
>   Can make sure we check the first page in each range and purge only
>   ranges where some page is in the zone being shrunk?
>
>   Still use the shrinker, but also use the normal page-shrinking
>   path and check for volatility there. (Swapless still needs the
>   shrinker.)
>
> Probably don't want to actually hang the vrange interval tree
> (vrange_root) off of the address_space and mm_struct.
>
>   In earlier attempts I used a hashtable to avoid this:
>   http://thread.gmane.org/gmane.linux.kernel/1278541/focus=1278542
>
>   I assume this is still a concern?
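
On the hashtable point above: a minimal sketch of what mapping an owner
(mm_struct or address_space) to its vrange_root through the kernel's
generic hashtable could look like, so neither structure needs a new
field. This is only an illustration of the idea, not the code from the
earlier patches; vrange_root is assumed from the volatile-range series,
and locking/lifetime handling is largely elided:

/*
 * Sketch: look up a vrange_root by owner pointer instead of embedding
 * it in mm_struct/address_space. Purely illustrative.
 */
#include <linux/errno.h>
#include <linux/hashtable.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct vrange_root;			/* from the vrange patches */

struct vroot_entry {
	void *owner;			/* mm_struct or address_space */
	struct vrange_root *vroot;
	struct hlist_node node;
};

static DEFINE_HASHTABLE(vroot_table, 7);	/* 128 buckets */
static DEFINE_SPINLOCK(vroot_lock);

static struct vrange_root *vroot_lookup(void *owner)
{
	struct vroot_entry *e;
	struct vrange_root *found = NULL;

	spin_lock(&vroot_lock);
	hash_for_each_possible(vroot_table, e, node, (unsigned long)owner) {
		if (e->owner == owner) {
			found = e->vroot;
			break;
		}
	}
	spin_unlock(&vroot_lock);
	return found;
}

static int vroot_insert(void *owner, struct vrange_root *vroot)
{
	struct vroot_entry *e = kmalloc(sizeof(*e), GFP_KERNEL);

	if (!e)
		return -ENOMEM;
	e->owner = owner;
	e->vroot = vroot;

	spin_lock(&vroot_lock);
	hash_add(vroot_table, &e->node, (unsigned long)owner);
	spin_unlock(&vroot_lock);
	return 0;
}
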
>
> Older non-contentious points:
> -----------------------------
>
> Coalescing of ranges: Don't do it unless the ranges overlap.
>
> Range-granular vs page-granular purging: Resolved with the
> _FULL/_PARTIAL flags.
>
>
> Other ideas/use-cases proposed:
> -------------------------------
>
> PTurner: Marking deep user-stack-frames as volatile to return that
> memory?

Great write-up John. Since there's a question mark I thought I'd add
a qualifier: I think this would be specifically useful with segmented
stacks. As we cross region boundaries we could then mark the previous
region as volatile to allow reclaim without a large re-use penalty if
the stack quickly grows again. This is a trade-off that is typically
difficult to manage.

> Dmitry Vyukov: 20-80TB allocation, marked volatile right away. Never
> marking non-volatile.
>
>   Wants zero-fill and doesn't want SIGBUS.
>
>   https://code.google.com/p/thread-sanitizer/wiki/VolatileRanges
>
>
> Misc:
> -----
>
> Previous discussion: https://lwn.net/Articles/518130/