From: Ying Han
Date: Tue, 1 May 2012 09:20:38 -0700
Subject: Re: [RFC PATCH] do_try_to_free_pages() might enter infinite loop
To: Nick Piggin
Cc: Michal Hocko, Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki,
    Rik van Riel, Minchan Kim, Hugh Dickins, KOSAKI Motohiro,
    Andrew Morton, linux-mm@kvack.org

Took Nick's old email address off the cc :)

--Ying

On Tue, May 1, 2012 at 9:18 AM, Ying Han wrote:
> On Mon, Apr 30, 2012 at 8:34 PM, Nick Piggin wrote:
>> On 25 April 2012 04:37, Ying Han wrote:
>>> On Mon, Apr 23, 2012 at 10:36 PM, Nick Piggin wrote:
>>>> On 24 April 2012 06:56, Ying Han wrote:
>>>>> This is not a patch intended to be merged at all; it is an attempt
>>>>> to understand a piece of logic in global direct reclaim.
>>>>>
>>>>> There is logic in global direct reclaim where, if reclaim fails at
>>>>> priority 0 and zone->all_unreclaimable is not set, direct reclaim
>>>>> starts over from DEF_PRIORITY. In some extreme cases we have seen
>>>>> the system hang, very likely because direct reclaim entered an
>>>>> infinite loop.
>>>>
>>>> Very likely, or definitely? Can you reproduce it? What workload?
>>>
>>> No, we don't have a workload that reproduces it yet. Everything is
>>> based on the watchdog dump file :(
>>>
>>>>>
>>>>> There have been several patches trying to fix similar issues, and
>>>>> the latest one has a good summary of all the efforts:
>>>>>
>>>>> commit 929bea7c714220fc76ce3f75bef9056477c28e74
>>>>> Author: KOSAKI Motohiro
>>>>> Date:   Thu Apr 14 15:22:12 2011 -0700
>>>>>
>>>>>     vmscan: all_unreclaimable() use zone->all_unreclaimable as a name
>>>>>
>>>>> Kosaki explained the problem caused by zone->all_unreclaimable and
>>>>> zone->pages_scanned getting out of sync, where the latter was what
>>>>> direct reclaim checked. However, after the patch the problem remains
>>>>> that zone->all_unreclaimable is set asynchronously with respect to
>>>>> whether the zone is actually reclaimable.
>>>>>
>>>>> The zone->all_unreclaimable flag is set by kswapd based on
>>>>> zone->pages_scanned via zone_reclaimable(). Is it possible to have
>>>>> zone->all_unreclaimable == false while the zone is actually
>>>>> unreclaimable?
>>>>>
>>>>> 1. While kswapd is in its reclaim priority loop, someone frees a
>>>>> page on the zone, which ends up resetting pages_scanned.
>>>>>
>>>>> 2. kswapd is frozen for whatever reason. I noticed Kosaki covered
>>>>> the hibernation case by checking oom_killer_disabled, but I am not
>>>>> sure that is everything we need to worry about. The key point is
>>>>> that direct reclaim relies on a flag which is set asynchronously by
>>>>> kswapd, and that doesn't sound safe.
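(For reference, here is roughly the machinery being described, condensed
from my reading of mm/vmscan.c around that time. This is a sketch of the
relevant pieces, with surrounding context omitted, not the exact source:)

/* kswapd side: the flag is derived from zone->pages_scanned, which any
 * page freed back to the zone resets to 0, so the flag can stay clear
 * even when nothing is really reclaimable. */
static bool zone_reclaimable(struct zone *zone)
{
        return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
}

        /* in kswapd's per-zone loop (balance_pgdat), roughly: */
        if (!zone_reclaimable(zone))
                zone->all_unreclaimable = 1;

/* direct reclaim side: used at the tail of do_try_to_free_pages() */
static bool all_unreclaimable(struct zonelist *zonelist,
                              struct scan_control *sc)
{
        struct zoneref *z;
        struct zone *zone;

        for_each_zone_zonelist_nodemask(zone, z, zonelist,
                        gfp_zone(sc->gfp_mask), sc->nodemask) {
                if (!populated_zone(zone))
                        continue;
                if (!zone->all_unreclaimable)
                        return false;
        }
        return true;
}

        /* tail of do_try_to_free_pages(), roughly: nothing was reclaimed,
         * but some zone still has all_unreclaimable == 0, so report
         * "progress is still possible" and the caller retries reclaim
         * from DEF_PRIORITY again. */
        if (sc->nr_reclaimed)
                return sc->nr_reclaimed;
        if (global_reclaim(sc) && !all_unreclaimable(zonelist, sc))
                return 1;
        return 0;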
>>>>>
>>>>> Instead of continuing to patch the problem, I am wondering why we
>>>>> have the logic "don't OOM, but keep retrying reclaim after a
>>>>> priority 0 reclaim failure" in the first place.
>>>>>
>>>>> Here is the patch that introduced the logic initially:
>>>>>
>>>>> commit 408d85441cd5a9bd6bc851d677a10c605ed8db5f
>>>>> Author: Nick Piggin
>>>>> Date:   Mon Sep 25 23:31:27 2006 -0700
>>>>>
>>>>>     [PATCH] oom: use unreclaimable info
>>>>>
>>>>> However, I couldn't find a detailed description of what problem the
>>>>> commit was trying to fix, and I wonder whether that problem still
>>>>> exists after 5 years. I would be happy if it no longer does, in
>>>>> which case we could consider reverting the initial patch.
>>>>
>>>> The problem we were having is that processes would be killed at
>>>> seemingly random points in time, under heavy swapping, but long
>>>> before all swap was used.
>>>>
>>>> The particular problem IIRC was related to testing a lot of guests
>>>> on an s390 machine. I'm ashamed to have not included more
>>>> information in the changelog -- I suspect it was probably in a small
>>>> batch of patches with a description in the introductory mail and not
>>>> properly placed into the patches :(
>>>>
>>>> There have certainly been a lot of changes in the area since then,
>>>> so I couldn't be sure what would happen if this were taken out.
>>>>
>>>> I don't think the page allocator "try harder" logic was enough to
>>>> solve the problem, and I think it was around in some form even back
>>>> then.
>>>>
>>>> The biggest problem is that it's not an exact science. It will never
>>>> do the right thing for everybody, sadly. Even if the system is able
>>>> to allocate pages at a very slow rate, that is effectively as good
>>>> as a hang for some users. Others want to be able to intervene
>>>> manually before anything is killed.
>>>>
>>>> Sorry if this isn't too helpful! Any ideas would be good. We
>>>> possibly need a way to describe these behaviours in an abstract way
>>>> (i.e., not just magic numbers), and allow the user to tune them.
>>>
>>> Thank you Nick, this is helpful. I looked up the patches you
>>> mentioned, and I can see what problem they were trying to solve at
>>> the time. However, things have changed a lot, and it is hard to tell
>>> whether the problem still exists in the current kernel. Going through
>>> them one by one, I see that each patch has either been replaced by
>>> different logic or had the same logic implemented differently.
>>>
>>> For this particular patch, we now have code that retries the page
>>> allocation before entering OOM, so I am wondering whether that would
>>> have helped the OOM situation back then.
>>
>> Well, it's not doing exactly the same thing, actually. And note that
>> the problem was not about parallel OOM-killing. The fact that page
>> reclaim made no progress the last time we called in does not actually
>> mean that it cannot make _any_ progress.
>>
>> My patch is more about detecting the latter case. I don't see
>> equivalent logic in the page allocator to replace it.
>>
>> But again: this is not a question of correct or incorrect as far as I
>> can see, simply a matter of where you define "hopeless"! I could
>> easily see the need for a way to bias that (kill quickly, medium, try
>> to never kill).
>
> That is right. We (at Google) seem to be on the other end of that
> bias: we would rather OOM kill than loop hopelessly in the reclaim
> path. Normally, if an application gets into that state, its
> performance is already terrible and it would probably rather be
> restarted :)
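(To make the "where you define hopeless" point concrete: the retry-vs-OOM
decision being discussed sits roughly here in the allocator slow path.
Condensed from my reading of mm/page_alloc.c around that time -- a sketch
with most arguments, gfp checks and error handling elided, not the exact
source:)

        /* __alloc_pages_slowpath(), heavily condensed */
rebalance:
        /* Try direct reclaim and then allocating */
        page = __alloc_pages_direct_reclaim(gfp_mask, order, ...,
                                            &did_some_progress);
        if (page)
                goto got_pg;

        /*
         * Only when reclaim reports no progress at all do we consider
         * the OOM killer. do_try_to_free_pages() returning 1 because
         * some zone still has all_unreclaimable == 0 counts as progress
         * here...
         */
        if (!did_some_progress) {
                page = __alloc_pages_may_oom(gfp_mask, order, ...);
                if (page)
                        goto got_pg;
        }

        /*
         * ...so a stale flag keeps a low-order allocation bouncing
         * between here and direct reclaim instead of ever reaching the
         * OOM killer.
         */
        pages_reclaimed += did_some_progress;
        if (should_alloc_retry(gfp_mask, order, did_some_progress,
                               pages_reclaimed)) {
                /* Wait for some write requests to complete then retry */
                wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
                goto rebalance;
        }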
>
> Unfortunately, we haven't been able to reproduce the issue with a
> synthetic workload. So far it only happens in production, with a
> particular workload, when memory runs really, really tight.
>
> The current logic seems to prefer reclaiming over going OOM, and that
> might not match every user's expectation. However, I guess it is hard
> to argue for any change, since different users have different biases,
> as you said...
>
> Thanks
>
> --Ying