From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail6.bemta8.messagelabs.com (mail6.bemta8.messagelabs.com [216.82.243.55]) by kanga.kvack.org (Postfix) with ESMTP id 3EB5D6B0169 for ; Fri, 12 Aug 2011 18:35:31 -0400 (EDT) Received: from hpaq3.eem.corp.google.com (hpaq3.eem.corp.google.com [172.25.149.3]) by smtp-out.google.com with ESMTP id p7CMZSav017994 for ; Fri, 12 Aug 2011 15:35:28 -0700 Received: from gwb20 (gwb20.prod.google.com [10.200.2.20]) by hpaq3.eem.corp.google.com with ESMTP id p7CMZPhX012223 (version=TLSv1/SSLv3 cipher=RC4-SHA bits=128 verify=NOT) for ; Fri, 12 Aug 2011 15:35:26 -0700 Received: by gwb20 with SMTP id 20so2669786gwb.31 for ; Fri, 12 Aug 2011 15:35:25 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: References: <1312492042-13184-1-git-send-email-walken@google.com> <20110807142532.GC1823@barrios-desktop> Date: Fri, 12 Aug 2011 15:35:25 -0700 Message-ID: Subject: Re: [RFC PATCH 0/3] page count lock for simpler put_page From: Michel Lespinasse Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: Minchan Kim Cc: Andrea Arcangeli , Hugh Dickins , linux-mm@kvack.org, Andrew Morton , "Paul E. McKenney" , Rik van Riel , Mel Gorman , Johannes Weiner , KOSAKI Motohiro , Shaohua Li On Tue, Aug 9, 2011 at 3:22 PM, Minchan Kim wrote: > On Tue, Aug 9, 2011 at 8:04 PM, Michel Lespinasse wrote: >> On Sun, Aug 7, 2011 at 7:25 AM, Minchan Kim wrote: >>> On Thu, Aug 04, 2011 at 11:39:19PM -0700, Michel Lespinasse wrote: >>>> I can see that the page_cache_get_speculative comment in >>>> include/linux/pagemap.h maps out one way to prevent the issue. If >>>> thread T continually held an rcu read lock from the time it finds the >>>> pointer to P until the time it calls get_page_unless_zero on that >>>> page, AND there was a synchronize_rcu() call somewhere between the >>>> time a THP page gets allocated and the time __split_huge_page_refcount >>>> might first get called on that page, then things would be safe. >>>> However, that does not seem to be true today: I could not find a >>>> synchronize_rcu() call before __split_huge_page_refcount(), AND there >>>> are also places (such as deactivate_page() for example) that call >>>> get_page_unless_zero without being within an rcu read locked section >>>> (or holding the zone lru lock to provide exclusion against >>>> __split_huge_page_refcount). >> >> Going forward, I can see several possible solutions: >> - Use my proposed page count lock in order to avoid the race. One >> would have to convert all get_page_unless_zero() sites to use it. I >> expect the cost would be low but still measurable. > > It's not necessary to apply it on *all* get_page_unless_zero sites. > Because deactivate_page does it on file pages while THP handles only anon pages. > So the race should not a problem. But it doesn't matter what kind of page the get_page_unless_zero call site hopes to get a reference on - if it doesn't already hold a reference on the page (either directly as a reference, or if a known mapping points to that page and the page table lock is taken or interrupts are disabled in order to guarantee the mapping won't get yanked), then the page can get yanked and a THP page could show up there before the call site gets a reference. >> - Protect all get_page_unless_zero call sites with rcu read lock or >> lru lock (page_cache_get_speculative already has it, but there are >> others to consider), and add a synchronize_rcu() before splitting huge >> pages. > > I think it can't be a solution. > If we don't have any lock for protect write-side, page_count could be > unstable again while we peek page->count in > __split_huge_page_refcount after calling synchronize_rcu. > Do I miss something? The tail page count would be unstable for at most one rcu grace period after the page got allocated. This is guaranteed by making all get_page_unless_zero call sites make sure they somehow determine the page is not a THP tail page (for example because they found it in radix tree) before calling get_page_unless_zero and having an rcu read lock wrapping these two together. This is basically the protocol described in the comment for page_cache_get_speculative() in pagemap.h -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org