From mboxrd@z Thu Jan 1 00:00:00 1970
From: Christopher Lameter
Subject: Re: [PATCH 2/2 v2] sched/wait: Introduce lock breaker in wake_up_page_bit
Date: Thu, 14 Sep 2017 11:39:53 -0500 (CDT)
Message-ID:
References: <83f675ad385d67760da4b99cd95ee912ca7c0b44.1503677178.git.tim.c.chen@linux.intel.com>
 <37D7C6CF3E00A74B8858931C1DB2F077537A07E9@SHSMSX103.ccr.corp.intel.com>
 <37D7C6CF3E00A74B8858931C1DB2F077537A1C19@SHSMSX103.ccr.corp.intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Return-path:
In-Reply-To:
Sender: linux-kernel-owner@vger.kernel.org
To: Tim Chen
Cc: Linus Torvalds, "Liang, Kan", Mel Gorman, Peter Zijlstra, Ingo Molnar,
 Andi Kleen, Andrew Morton, Johannes Weiner, Jan Kara, "Eric W . Biederman",
 Davidlohr Bueso, linux-mm, Linux Kernel Mailing List
List-Id: linux-mm.kvack.org

On Wed, 13 Sep 2017, Tim Chen wrote:

> Here's what the customer think happened and is willing to tell us.
> They have a parent process that spawns off 10 children per core and
> kicked them to run. The child processes all access a common library.
> We have 384 cores so 3840 child processes running. When migration occur on
> a page in the common library, the first child that access the page will
> page fault and lock the page, with the other children also page faulting
> quickly and pile up in the page wait list, till the first child is done.

I think we need some way to avoid migration in cases like this. This is
crazy. Page migration was not written to deal with something like this.