From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <3D29F868.1338ACF3@zip.com.au>
Date: Mon, 08 Jul 2002 13:39:04 -0700
From: Andrew Morton
MIME-Version: 1.0
Subject: Re: scalable kmap (was Re: vm lock contention reduction)
References: <3D28042E.B93A318C@zip.com.au> <3D293E19.2AD24982@zip.com.au> <20020708080953.GC1350@dualathlon.random>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
Return-Path:
To: Andrea Arcangeli
Cc: Linus Torvalds, "Martin J. Bligh", Rik van Riel, "linux-mm@kvack.org"
List-ID:

Andrea Arcangeli wrote:
>
> ...
> > generic_file_write()
> > {
> > 	...
> > 	atomic_inc(&current->mm->dont_unmap_pages);
> >
> > 	{
> > 		volatile char dummy;
> > 		__get_user(dummy, addr);
> > 		__get_user(dummy, addr+bytes+1);
> > 	}
> > 	lock_page();
> > 	->prepare_write()
> > 	kmap_atomic()
> > 	copy_from_user()
> > 	kunmap_atomic()
> > 	->commit_write()
> > 	atomic_dec(&current->mm->dont_unmap_pages);
> > 	unlock_page()
> > }
> >
> > and over in mm/rmap.c:try_to_unmap_one(), check mm->dont_unmap_pages.
> >
> > Obviously, all this is dependent on CONFIG_HIGHMEM.
> >
> > Workable?
>
> the above pseudocode still won't work correctly,

Sure.  It's crap.  It can be used to get mlockall() for free.

> if you don't pin the
> page as Martin proposed and you only rely on its virtual mapping to stay
> there because the page can go away under you despite the
> swap_out/rmap-unmapping work, if there's a parallel thread running
> munmap+re-mmap under you. So at the very least you need the mmap_sem at
> every generic_file_write to avoid other threads to change your virtual
> address under you. And you'll basically need to make the mmap_sem
> recursive, because you have to take it before running __get_user to
> avoid races. You could easily do that using my rwsem, I made two versions
> of them, with one that supports recursion, however this is just for your
> info, I'm not suggesting to make it recursive.

I think I'll just go for pinning the damn page.
It's a spinlock and maybe three cachelines but the kernel is about to do
a 4k memcpy anyway.  And get_user_pages() doesn't show up much on O_DIRECT
profiles and it'll be a net win and we need to do SOMETHING, dammit.

> ...
> The only reason I can imagine rmap useful in todays
> hardware for all kind of vma (what the patch provides compared to what
> we have now) is to more efficiently defragment ram with an algorithm in
> the memory balancing to provide largepages more efficiently from mixed
> zones, if somebody would suggest rmap for this reason (nobody did yet)

It has been discussed.  But no action yet.

> I
> would have to agree completely that it is very useful for that, OTOH it
> seems everybody is reserving (or planning to reserve) a zone for
> largepages anyways so that we don't run into fragmentation in the first
> place. And btw - talking about largepages - we have three concurrent and
> controversial largepage implementations for linux available today, they
> all have different API, one is even shipped in production by a vendor,

What implementation do you favour?

> and while auditing the code I seen it also exports an API visible to
> userspace [ignoring the sysctl] (unlike what I was told):
>
> +#define MAP_BIGPAGE 0x40 /* bigpage mapping */
> [..]
> _trans(flags, MAP_GROWSDOWN, VM_GROWSDOWN) |
> _trans(flags, MAP_DENYWRITE, VM_DENYWRITE) |
> + _trans(flags, MAP_BIGPAGE, VM_BIGMAP) |
> _trans(flags, MAP_EXECUTABLE, VM_EXECUTABLE);
> return prot_bits | flag_bits;
> #undef _trans
>
> that's a new unofficial bitflag to mmap that any proprietary userspace
> can pass to mmap today. Other implementations of the largepage feature
> use madvise or other syscalls to tell the kernel to allocate
> largepages. At least the above won't return -EINVAL so the binaryonly
> app will work transparently on a mainline kernel, but it can eventually
> malfunction if we use 0x40 for something else in 2.5.
> So I think we
> should do something about the largepages too ASAP into 2.5 (like
> async-io).

Yup.  I don't think the -aa kernel has a large page patch, does it?
Is that something which you have time to look into?

-
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
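
For illustration only, the "pin the page" idea from this thread can be
modelled in userspace C as a reference count that an unmap path has to
check before it takes the page away. This is a toy sketch under loud
assumptions, not kernel code: toy_page, toy_pin(), toy_unpin(),
toy_try_unmap() and toy_write() are all hypothetical names invented
here, and the real get_user_pages() path involves locking and page
table walks that this model ignores entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096

/* Hypothetical stand-in for a user page; NOT a kernel structure. */
struct toy_page {
	int pin_count;			/* > 0 means "must not be unmapped" */
	int mapped;			/* 1 while the page is still present */
	char data[TOY_PAGE_SIZE];
};

static void toy_pin(struct toy_page *p)   { p->pin_count++; }
static void toy_unpin(struct toy_page *p) { p->pin_count--; }

/* Models an unmap/reclaim attempt: it refuses while the page is pinned.
 * Returns 1 if the page was unmapped, 0 if the pin blocked it. */
static int toy_try_unmap(struct toy_page *p)
{
	if (p->pin_count > 0)
		return 0;
	p->mapped = 0;
	return 1;
}

/* Models the write path: pin the source page, copy, unpin.
 * Returns the number of bytes copied, or -1 on failure. */
static int toy_write(struct toy_page *src, char *dst, size_t bytes)
{
	int unmapped;

	if (!src->mapped || bytes > TOY_PAGE_SIZE)
		return -1;

	toy_pin(src);
	/* Simulate a racing unmap in the middle of the copy window;
	 * the pin taken above makes it fail. */
	unmapped = toy_try_unmap(src);
	memcpy(dst, src->data, bytes);
	toy_unpin(src);

	return unmapped ? -1 : (int)bytes;
}
```

In this model the cost of pinning is one counter increment and one
decrement around a copy of up to 4k, which is the trade-off the message
argues is acceptable. Once the pin is dropped, toy_try_unmap() succeeds
and a later toy_write() on the same page fails.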