From mboxrd@z Thu Jan 1 00:00:00 1970 Date: Thu, 25 Sep 2008 20:38:46 +0200 From: Andrea Arcangeli Subject: RFC: race between o_direct and fork (harder to fix with get_user_page_fast) Message-ID: <20080925183846.GA6877@duo.random> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Sender: owner-linux-mm@kvack.org Return-Path: To: Nick Piggin Cc: linux-mm@kvack.org, Avi Kivity , Izik Eidus List-ID: Hi Nick, with Izik and Avi, we've been discussing on how to best make ksm work with O_DIRECT. I don't think it's an immediate priority but eventually we've to fix this as KVM is close to being able to dma from disk directly into guest physical memory without intermediate copies. Checking if pages have O_DIRECT (or similar other physical I/O) in flight is fairly easy, comparing page count with page mapcount should do the trick. The source of the problem is that this page count == page mapcount check should happen under some lock that blocks get_user_pages and get_user_page_fast. If the page is already shared, we only need to block the get_user_pages running on the 'mm' of the 'pte' that we're overwriting. So for get_user_pages it'd be enough to do the check of page count == page mapcount under the PT lock. 1) PT lock if (page_count != page_mapcount) goto fail make pte readonly PT unlock then in the final stage: 2) PT lock if (!pte_same) goto fail change pte to point to ksm page PT unlock If other tasks are starting in-flight O_DIRECT on the page we don't care, those will have to copy-on-write it before starting the O_DIRECT anyway, so there will be still no in-flight I/O on the physical page we're working on. All we care about is that get_user_pages doesn't run on our mm/PTlock between the page_count!=page_mapcount check and the mark of the pte readonly. Otherwise it won't trigger the COW and it should (for us to later notice it in pte_same!). Now with get_user_pages_fast the above PT lock isn't enough anymore to make it safe. While thinking at get_user_pages_fast I figured another worse way things can go wrong with ksm and o_direct: think a thread writing constantly to the last 512bytes of a page, while another thread read and writes to/from the first 512bytes of the page. We can lose O_DIRECT reads, the very moment we mark any pte wrprotected. Then Avi immediately pointed out this means also fork is affected by the same bug that ksm would have. So Avi just found a very longstanding bug in fork. Fork has the very same problem of ksm in marking readonly ptes that could point to pages that have O_DIRECT in flight. So this isn't a KSM problem anymore. We've to fix this longstanding bug in fork first. Then we'll think at KSM and we'll use the same locking technique to make KSM safe against O_DIRECT too The best I can think of, is to re-introduce of the brlock (possibly after making it fair). We can't use RCU as far as I can tell. No idea why brlock was removed perhaps somebody thought RCU was an equivalent replacement? RCU/SRCU can't block anything, and we've to block the get_user_page_fast in the critical section at point 1 to be safe. There's a practical limit of how much things can be delayed, for page faults (at least practically) they can't. ksm br_write_lock() if (page_count != page_mapcount) goto fail make pte readonly br_write_unlock() fork br_write_lock() if (page_count != page_mapcount) copy_page() else make pte readonly br_write_unlock() get_user_page_fast br_read_lock() walk ptes out of order w/o mmap_sem br_read_unlock() Another way of course is to take the mmap_sem in read mode around the out of order part of get_user_page_fast but that'd be invalidating the 'thread vs thread' smp scalability of get_user_page_fast. If it was just for KSM I suggested we could fix it by sigstopping (or getting out of the scheduler in some other more reliable mean) all threads that shared the 'mm' that ksm was working on. That would take care of the fast path of get_user_page_fast and the PT lock would take care of the get_user_page_fast slow path. But this schedule technique ala stop_machine surely isn't workable for fork() for performance reasons. Yet another way is as usual to use a page bitflag to serialize things at the page level. That will prevent multiple O_DIRECT reads to the same page simultaneously but it'll allow fork to wait IO completion and avoid the copy_page(). Ages ago I always wanted to keep the PG_lock for pages under O_DIRECT... We instead relied solely on page pinning which has a few advantages but it makes things like fork more complicated and harder to fix. I'm very interested to know your ideas on how to best fix fork vs o_direct! Thanks, Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org