From mboxrd@z Thu Jan 1 00:00:00 1970 Date: Mon, 23 Jun 2008 14:11:35 -0500 From: Robin Holt Subject: Re: Can get_user_pages( ,write=1, force=1, ) result in a read-only pte and _count=2? Message-ID: <20080623191135.GG10062@sgi.com> References: <20080618164158.GC10062@sgi.com> <200806190329.30622.nickpiggin@yahoo.com.au> <200806191307.04499.nickpiggin@yahoo.com.au> <20080619133809.GC10123@sgi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org Return-Path: To: Hugh Dickins Cc: Robin Holt , Nick Piggin , Ingo Molnar , Christoph Lameter , Jack Steiner , linux-mm@kvack.org List-ID: On Thu, Jun 19, 2008 at 02:49:50PM +0100, Hugh Dickins wrote: > On Thu, 19 Jun 2008, Robin Holt wrote: > > On Thu, Jun 19, 2008 at 12:09:15PM +0100, Hugh Dickins wrote: > > > > > > (I assume Robin is not forking, we do know that causes this kind > > > of problem, but he didn't mention any forking so I assume not.) > > > > There has been a fork long before this mapping was created. There was a > > hole at this location and the mapping gets established and pages populated > > following all ranks of the MPI job getting initialized. > > There's usually been a fork somewhen in the past! That's no problem. > > The fork problem comes when someone has done a get_user_pages to break > all the COWs, then another thread does a fork which writeprotects and > raises page_mapcount, so the next write from userspace breaks COW again > and writes to a different page from that which the kernel is holding. > > That one kept on coming up, but I've not heard of it again since we > added madvise MADV_DONTFORK so apps could exclude such parts of the > address space from copy_page_range. I think you still have a hole. Here is what I _think_ I was actually running into. A range of memory was exported with xpmem. This is on a sles10 kernel which has no mmu_notifier equivalent functionality. The exporting process has write faulted a range of addresses which it plans on using for a functionality validation test which verifies its results. The address range is then imported by the other MPI ranks (128 ranks total) and pages are faulted in. At this time, the system comes under severe memory pressure. The swap code makes a swap entry and replaces both process's PTE with the swap entry. XPMEM is holding an extra reference (_count) on the page. The imported now faults the page again (either read or write, does not matter, merely that it faults first). After that, the exporter read faults the address and then write faults. The read followed by write seems to be the key. At that point, the _count and _mapcount are both elevated to the point where the page will be COW'd. To verify that it was _something_ like this, I had inserted a BUG_ON when we return from get_user_pages() to verify the _mapcount is 1 or greater and the _count is 2 or greater. Additionally, I walked the process page tables at this point and verified pte_write was true. I also added a page flag (just a kludge to verify). When XPMEM exports the page, I set the page flag. In can_share_swap_page, I made the return (count == 1) && test_bit(27, &page->flags); I clearly messed something up, but it does indicate I am finally in the right neighborhood. The test program completes with success. I definitely messed up the clearing of bit 27 because the machine will no longer launch new executables after the job completes. If I reboot, I can rerun the job to successful completion again. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org