Subject: Re: [PATCHES]
From: ebiederm+eric@ccr.net (Eric W. Biederman)
Date: 23 May 1999 13:34:11 -0500
In-Reply-To: Ingo Molnar's message of "Sun, 23 May 1999 17:49:18 +0200 (CEST)"
To: Ingo Molnar
Cc: linux-mm@kvack.org

>>>>> "IM" == Ingo Molnar writes:

My current patches can be found at:

    http://www.ccr.net/ebiederm/files

in patches9.tar.gz, and soon in shmfs-0.1.011.tar.gz (which should go out
later today).  shmfs is my filesystem that resides in swap and the page
cache.  Currently it doesn't work with swapoff, but that is in progress.

IM> On 23 May 1999, Eric W. Biederman wrote:

LT> - Ingo just did the page cache / buffer cache dirty stuff, this is
LT>   going to clash quite badly with his changes I suspect.
>>
>> Interesting.  I have been telling folks I've been working on this for
>> quite a while.  I wish I'd heard about him or vice versa.

IM> i'm mainly working on these two areas:
IM>
IM> - merging ext2fs (well, this means basically any block-based
IM>   filesystem, but ext2fs is the starting point) data buffers into the
IM>   page cache.
IM>
IM> - redesigning the page cache for SMP (mainly because i was touching
IM>   things and introducing bugs anyway) [on my box the page cache is
IM>   already completely parallel on SMP, we drop the kernel lock on entry
IM>   into page-cache routines and re-lock it only if we call
IM>   filesystem-specific code or buffer-cache code. (in this patch the
IM>   ll_rw_block IO layer is being executed outside the kernel lock as
IM>   well, and appears to work quite nicely.)]

I added support for large files for much the same reason.

My summary of the patches is:

eb1 --- Allow reuse of page->buffers if you aren't the buffer cache.

eb2 --- Allow old a.out binaries to run even if we can't mmap them
        properly because their data isn't page aligned.

eb3 --- Muck with page offset.

eb4 --- Allow registration and unregistration of the functions needed by
        swapoff.  This allows a modular filesystem to reside in swap...

eb5 --- Large file support; basically this removes unused bits from all
        of the relevant interfaces.  I also begin to handle
        PAGE_CACHE_SIZE != PAGE_SIZE.

eb6 --- Introduction of struct vm_store, and associated cleanups, in
        particular get_inode_page.  vm_store is a lighter-weight
        variation on the inode struct.  It separates the vm layer from
        the vfs layer more, making things like the swap cache easier to
        build, and cleaner.  This is potentially very useful and the
        cost is low.

eb7 --- Actual patch for dirty buffers in the page cache.  I'm fairly
        well satisfied, except for generic_file_write, which I haven't
        touched.  It looks like I need two variations on
        generic_file_write at the moment:
        1) for network filesystems that can get away without filling
           the page on a partial write.
        2) for block-based filesystems that must fill the page on a
           partial write because they can't write arbitrary chunks of
           data.
        (There is a rough sketch of this split just after the list.)

eb8 --- Misc things I use, included simply for reference.
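To make the eb7 split concrete, here is a rough userspace sketch of the
two write flavours.  This is not kernel code and none of the names in it
(page_buf, fill_page_from_backing, and so on) exist anywhere; it only
models why the block-based path has to fill the page before a partial
write while the network path can skip the read:

/*
 * Userspace model only.  A block-based fs writes whole blocks back, so a
 * partial write to an uncached page must first make the whole page valid.
 * A network fs can send just the dirty byte range, so it may skip the
 * read and only remember which bytes are valid.
 */
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 4096

struct page_buf {
    unsigned char data[PAGE_SIZE];
    int uptodate;               /* whole page is valid */
};

/* stand-in for reading the page from disk or from the server */
static void fill_page_from_backing(struct page_buf *p)
{
    memset(p->data, 'D', PAGE_SIZE);    /* pretend this came from disk */
    p->uptodate = 1;
}

/* block-based flavour: must make the whole page valid first */
static void block_fs_partial_write(struct page_buf *p, size_t off,
                                   const char *src, size_t len)
{
    if (!p->uptodate && (off != 0 || len != PAGE_SIZE))
        fill_page_from_backing(p);      /* read before a partial write */
    memcpy(p->data + off, src, len);
    p->uptodate = 1;
}

/* network flavour: no read; just copy and remember the valid range */
static void net_fs_partial_write(struct page_buf *p, size_t off,
                                 const char *src, size_t len,
                                 size_t *valid_start, size_t *valid_end)
{
    memcpy(p->data + off, src, len);
    if (off < *valid_start)
        *valid_start = off;
    if (off + len > *valid_end)
        *valid_end = off + len;
}

int main(void)
{
    struct page_buf a = { .uptodate = 0 }, b = { .uptodate = 0 };
    size_t vs = PAGE_SIZE, ve = 0;

    block_fs_partial_write(&a, 100, "hello", 5);
    net_fs_partial_write(&b, 100, "hello", 5, &vs, &ve);

    printf("block fs: page uptodate=%d (had to read it first)\n", a.uptodate);
    printf("net   fs: only bytes %zu..%zu are valid, no read needed\n",
           vs, ve);
    return 0;
}

The real split would of course live inside generic_file_write() and its
callers rather than in helpers like these.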
IM> the design/implementation is this: we put dirty and clean block-device
IM> pages into the buffer-cache as well. bdflush automatically takes care
IM> of undirtying buffers (pages). I've modified ext2fs to have block
IM> allocation/freeing separated from read/write/truncate activities, and
IM> now ext2fs uses (a modified version of) generic_file_write().

Yes, the current version of generic_file_write(), which doesn't read
before writing, is interesting...

IM> This brought some fruits already: whenever we write big enough to
IM> modify a full, uncached page, we can now overwrite the page-cache page
IM> without first having to read it. (security issues taken care of as
IM> well) [the old mechanism was that we first allocated the data block,
IM> which was memset to zero deep inside ext2fs's block allocation path,
IM> then we read the block if this was a partial write, then we overwrote
IM> it with data from user-space. yuck.]

I don't think I ever traced it deep enough to see the memset, but I
follow.  It sounds like you have made some nice performance
improvements.  The first version of my stuff is aimed just at unlocking
that potential.

IM> the current state of the patch is that it's working and brings a nice
IM> performance jump on SMP boxes on disk-benchmarks and is stable even
IM> under heavy stress-testing. Also (naturally) dirty buffers show up
IM> only once, in the page cache. I've broken some things though (swapping
IM> and NFS side-effects are yet untested), i'm currently working on
IM> cleaning the impact on these things up, once it's cleaned up (today or
IM> tomorrow) i'll Cc: you as well.

Cool.  If you are really that close to done we can probably synchronize
our work and submit the result to Linus.  I think we are conceptually
orthogonal except for handling dirty data in the page cache, and that
_needs_ synchronizing.  My patches that don't depend on dirty data in
the page cache (large files, no need to support unaligned mappings,
etc.) I'm going to send in now.

IM> i didn't know about you working on this until Stephen Tweedie told me,
IM> then i quickly looked at archives and (maybe wrongly?) thought that
IM> our work does collide patch-wise but is quite orthogonal conceptually.

Except that we both handle dirty data in the page cache...  Hmm.  You
must not subscribe to linux-mm@kvack.org (it's a majordomo list).
There isn't a linux-vfs list out there anywhere, is there?

IM> I've tried to sync activities with others working in this area (Andrea
IM> for example). I completely overlooked that you are working on the
IM> block-cache side as well, would you mind outlining the basic design?

Well, I'm trying to eliminate the buffer cache...

My work on dirty pages sets up a bdflush-like mechanism on top of the
page cache, so for anything that can fit in the page cache the buffer
cache simply isn't needed.  Where the data goes when it is written
simply doesn't matter.  This means that everyone can reuse the same
mechanism for keeping track of what is dirty and what isn't.  As this
is a significant kernel tuning issue I think it's better (if possible)
to share the code between all of the filesystems.

Further, the mechanism for handling dirty buffers doesn't need to be
tied to the disk buffers.  As far as I can tell bdflush doesn't really
care that data is going to disk, except that (a) it calls ll_rw_block
directly (instead of calling through a function) and (b) it names all
of the buffers by their destination on the disk.  Something as
brilliant as that could happily sit at the page cache level and be
reusable by all filesystems.  This also allows allocation on write and
other fun tricks.

I have recovered the buffer pointer in struct page, and put it to use
as a generic pointer.  A block-based filesystem will probably use it as
a pointer to buffers, but NFS can use it for something else...
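Roughly, the shape I have in mind (again in made-up userspace C, not the
actual patch; every name here is invented) is a dirty list walked by
something bdflush-like that only knows "call the owner's writepage" and
never looks behind the generic per-page pointer:

/*
 * Userspace model only.  The flush loop just sees "this page is dirty,
 * call its owner's writepage()"; what the owner keeps behind the generic
 * per-page pointer (buffer heads, an NFS request, whatever) is its own
 * business.
 */
#include <stdio.h>
#include <stddef.h>

struct my_page;
typedef int (*writepage_fn)(struct my_page *);

struct my_page {
    unsigned long index;        /* offset in the file, like page->offset */
    int dirty;
    writepage_fn writepage;     /* supplied by the owning filesystem */
    void *private;              /* the "recovered" generic pointer */
    struct my_page *next;       /* crude dirty list */
};

/* a block fs might keep (simulated) buffer heads behind the pointer */
static int blockfs_writepage(struct my_page *p)
{
    printf("blockfs: write page %lu via buffers at %p\n",
           p->index, p->private);
    p->dirty = 0;
    return 0;
}

/* a network fs can stash something entirely different there */
static int netfs_writepage(struct my_page *p)
{
    printf("netfs: send page %lu, private request = %p\n",
           p->index, p->private);
    p->dirty = 0;
    return 0;
}

/* the bdflush-like part: walk dirty pages, call the owner, nothing else */
static void flush_dirty_pages(struct my_page *list)
{
    for (; list; list = list->next)
        if (list->dirty)
            list->writepage(list);
}

int main(void)
{
    int fake_buffers, fake_request;
    struct my_page p2 = { 7, 1, netfs_writepage, &fake_request, NULL };
    struct my_page p1 = { 3, 1, blockfs_writepage, &fake_buffers, &p2 };

    flush_dirty_pages(&p1);
    return 0;
}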
IM> one thing i saw in one of your earlier mails: an undirtification
IM> mechanism in the page cache, although i've still got to see your patch
IM> (is it available somewhere?). Originally i thought about doing it that
IM> way too, but then i avoided having the page-writeout mechanism in the
IM> page-cache, following Linus's suggestion.

I'm not familiar with that suggestion.  I do know the page writeout
mechanism shouldn't be in shrink_mmap...

IM> I'm now pretty much convinced that this is the right way: future
IM> optimizations will enable us to delay the filesystem block allocation
IM> part completely (apart from bumping free space estimation counters up
IM> and down, and thus avoiding the async 'out of free space' problem).

I've played with things a couple of different ways (so I'm not certain
what you saw).  I can already delay the filesystem block allocation,
though I currently don't do that in shmfs because last time I played
with it it wasn't working too well, mainly because I had a pretty
stupid on-demand allocator instead of one that allocates for all dirty
pages of a file at once.  What I do now is allocate blocks when they
are written to, which works great without contention.  With contention
a delayed strategy almost certainly will do better.  The other limiting
factor of my fs is that I'm using swap pages, so I'm not totally in
control of block allocation.

IM> I'd like the page cache to end up in a design where we can almost
IM> completely avoid any filesystem overhead for quickly created/destroyed
IM> and/or fully cached files. I'd like to have a very simple unaliased
IM> pagecache and no filesystem overhead on big RAM boxes. This was the
IM> original goal of the page cache as well, as far as i remember. Turning
IM> the page-cache into a buffer-cache again doesn't make much sense IMO...

An unaliased pagecache?  I'm not quite certain what you mean by that.
I don't intend to turn it into a buffer-cache, but I was thinking of
putting what must remain of the buffer-cache in a page-cache inode,
like swap is now.  And the low overhead sounds good to me too.

Right now there are a couple of paths I would like to clean up, in
particular dirtying mapped pages in the swap-out routines.  If/when all
of the filesystems are converted over I can just about do it with a
simple mark_page_dirty call, except that this doesn't handle
filesystems that want to track what is dirty at a finer granularity...
So I'll probably wind up calling something like updatepage, except that
right now updatepage has way too much overhead.

Eric

--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM, see:
http://humbolt.geo.uu.nl/Linux-MM/