Hi, On 01/20/2014 10:56 PM, Li Wang wrote: > > Hello, > It will be appreciated if I have a chance to discuss the fadvise > extension topic at the incoming LSF/MM summit. I am also very > interested in the topics on VFS, MM, SSD optimization as well as ext4, > xfs, ceph and so on. > In the last year, I have been involved in Ceph development, the > features done/ongoing include punch hole support, inline data > support, cephfs quota support, cephfs fuse file lock support etc, as > well as some bug fixes and performance evaluations. > > The proposal is below, comments/suggestions are welcome. > > Fadvise Extensions for Directory Level Cache Cleaning and > POSIX_FADV_NOREUSE > > 1 Motivation > > 1.1 Directory Level Cache Cleaning > > VFS relies on LRU-like page cache eviction algorithm to reclaim cache > space, since LRU is not aware of application semantics, it may > incorrectly evict going-to-be referenced pages out, resulting in severe > performance degradation due to cache thrashing, especially under high > memory pressure situation. Applications have the most semantic > knowledge, they can always do better if they are given a chance. This > motivates to endow the applications more abilities to manipulate the > vfs cache. > > Currently, Linux support file system wide cache cleaning by virtue of > proc interface 'drop-caches', but it is very coarse granularity and > was originally proposed for debugging. The other is to do file-level > page cache cleaning through 'fadvise', however, since there is no way of > determining whether a path name is in the dentry cache, simply calling > fadvise(name, DONTNEED) will very likely pollute the cache rather > than cleaning it. Even there is a cache query API available, it will > incur heavy system call overhead, especially in massive small-file > situations. This motivates to extend fadvise() to support directory > level cache cleaning. Currently, the original implementation is > available at https://lkml.org/lkml/2013/12/30/147, and received some > constructive comments. We think there are some designs need be put > under discussion, and we summarize them in Section 2.1. > > 1.2 POSIX_FADV_NOREUSE > > POSIX_FADV_NOREUSE is useful for backup and data streaming applications. > There are already some efforts on POSIX_FADV_NOREUSE implementation, > the latest seems to be https://lkml.org/lkml/2012/2/11/133. The > alternative ways can be (a) Use fadvise(DONTNEED) instead; (b) Use > container-based approach, such as setting memory.file.limit_in_bytes. > However, both (a) and (b) have limitations. (a) may impolitely destroy > other application's work set, which is not a desirable behavior; (b) is > kind of rude, and the threshold may have to be carefully tuned, > otherwise it may cause applications to start swapping or even worse. > In addition, we are not sure if it shares the same issue with (a). > This motivates to develop a simple yet efficient POSIX_FADV_NOREUSE > implementation. > > 2 Designs to be discussed > > Since these are both suggestive interfaces, the overall idea behind our > design is to minimize the modification to current MM magic, stay the > implementation as simple as possible. > > 2.1 Directory Level Cache Cleaning > > For directory level cache cleaning, fadivse(fd, DONTNEED) will clean > all the page caches as well as unreferenced dentry caches and inode > caches inside the directory fd. > > (1) For page cache cleaning, the policy in our original design is to > collect those inodes not on any LRU list into our private list for > further cleaning. However, as pointed out by Andrew and Dave, most > inodes are actually on the LRU list, hence this policy will leave many > inodes fail to be processed. And, since we want to reuse the > inode->i_lru rather than adding a new list_head field into inode, we > will encounter a problem that we can not determine whether an inode is > on superblock LRU list or on our private list. While a fadvise() caller > A is trying to collect an inode, it may happen that another fadvise() > caller B has already gathered the inode into his private LRU list, then > it will end up that A grabs inode from B's list, and the worse thing is, > the operations on B'list are not synchronized within multiple fadvise() > callers. To address this, We have two candidates, > > (a) Introduce a new inode state I_PRIVATE, indicating the inode is on a > private list. While collecting one inode into private list, the flag is > set on it, and cleared after finishing page cache invalidation. > Fadvise() caller will check the flag prior to collecting one inode into > his private list. This avoids the race between one fadvise() caller is > adding a new inode to his list and another caller is grabbing a inode > from this list. > > (b) Introduce a global list as well as a global lock. The inodes to be > manipulated are always collected into the global list, protected by the > global lock. Given the cache cleaning is not a frequent operation, the > performance impact is negligible. > > (2) For dentry cache cleaning, shrink_dcache_parent() meets most of our > demands except it does not take permission into account, the caller > should not touch the dentries and inodes which he does not own > appropriate permission. There are also two ways to perform the check, > > (a) Check if the caller has permission on parent directory, i.e, > inode_permission(dentry->d_parent->d_inode, MAY_WRITE | MAY_EXEC) > > (b) Check if the caller has permission on corresponding inode, i.e, > (inode_owner_or_capable(dentry->d_inode) || capable(CAP_SYS_ADMIN)) > > (3) For dentry cache cleaning, if dentries are freed, there seems no > easy way to walk all inodes inside a specific directory, our idea lies > in that before freeing those unreferenced dentries, gather the inodes > referenced by them into a private list, __iget() the inodes and mark > I_PRIVATE on (if the I_PRIVATE scheme is acceptable). Thereafter from > where we can still find those inodes to further free them. > > (4) For inode cache cleaning, in most situations, iput_final() will put > unreferenced inodes into superblock lru list rather than freeing them. > To free the inodes in our private list, it seems there is not a handy > API to use. The process could be, for each inode in our list, hold the > inode lock, clear I_PRIVATE, detach from list, atomic decrease its > reference count. If the reference count reaches zero, there are two > possible ways, > > (a) Introduce a new inode state I_FORCE_FREE, and mark it on, then pass > the inode into iput_final(). iput_final() is with tiny modifications to > be able to recognize the flag, who will then invoke evict() to free the > inode rather than adding it to super block LRU list. > > (b) Wrap iput_final() into __iput_final(struct inode *inode, bool > force_free), we call __iput_final(inode, TRUE), define iput_final() to > static inline __iput_final(inode, FALSE). > > 2.2 POSIX_FADV_NOREUSE Implementation > > Our key idea behind is to translate 'The application will access the > page once' into 'The access leaves no side-effect on the page'. For > current MM implementation, normal access will has side-effect on the > page accessed, i.e, it will increase the temperature of the page, > in a way of from inactive to active or from unreferenced to referenced. > Against normal access, NOREUSE is intended to tell the MM system that > the access will leave the page as it is. This can be detailed as > follows, > > (a) If a page is accessed for the first time, after NOREUSE access, it > is kept inactive and unreferenced, then it will potentially get > reclaimed soon since it has a lowest temperature, unless a later > NON-NOREUSE access increases its temperature. Here we do not > explicitly immediately free the page after access, this is for three > reasons, the first is the semantics of NOREUSE differs from DONTNEED, > NOREUSE does not mean the page should be dropped immediately; the > second is synchronously freeing the page will more or less slow down > the read performance; And the last, a near-future reference of the page > by other applications will have a chance to hit in the cache. > > (b) If a page is accessed before, in other words, it is active or > referenced, then it may belong to the work set of other applications, > and will very likely be accessed again. NOREUSE just makes a silent > access, without changing any status of the page. > > Another assumption is that file wide NOREUSE is enough to capture most > of the usages, the fine granularity of interval-level NOREUSE is not > desirable given its rare use and its implementation complexity. So this > results in the following simple NOREUSE implementation, > > (1) Introduce a new fmode FMODE_NOREUSE, set it on when calling > fadvise(NOREUSE) > > So when will this flag be cleared? Do you need clear it while setting FMODE_RANDOM, FMODE_NORMAL, FMODE_SEQ etc, like https://lkml.org/lkml/2012/2/11/13 does? (2) do_generic_file_read(): > From: > if (prev_index != index || offset != prev_offset) > mark_page_accessed(page); > To: > if ((prev_index != index || offset != prev_offset) && !(filp->f_mode & > FMODE_NOREUSE)) > mark_page_accessed(page); > There are no more than ten LOC to go. > > Cheers, > Li Wang > > > > > >