From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f43.google.com (mail-pa0-f43.google.com [209.85.220.43]) by kanga.kvack.org (Postfix) with ESMTP id 9E0356B0038 for ; Fri, 6 Mar 2015 06:54:06 -0500 (EST) Received: by pabli10 with SMTP id li10so50993198pab.13 for ; Fri, 06 Mar 2015 03:54:06 -0800 (PST) Received: from www262.sakura.ne.jp (www262.sakura.ne.jp. [2001:e42:101:1:202:181:97:72]) by mx.google.com with ESMTPS id c4si14163728pdo.227.2015.03.06.03.54.04 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Fri, 06 Mar 2015 03:54:05 -0800 (PST) Subject: Re: How to handle TIF_MEMDIE stalls? From: Tetsuo Handa References: <20150223004521.GK12722@dastard> <20150228162943.GA17989@phnom.home.cmpxchg.org> <20150228164158.GE5404@thunk.org> <20150228221558.GA23028@phnom.home.cmpxchg.org> <201503012017.EAD00571.HOOJVOStMFLFQF@I-love.SAKURA.ne.jp> In-Reply-To: <201503012017.EAD00571.HOOJVOStMFLFQF@I-love.SAKURA.ne.jp> Message-Id: <201503062053.GIC34848.FOMFtOSOLFJHVQ@I-love.SAKURA.ne.jp> Date: Fri, 6 Mar 2015 20:53:53 +0900 Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-linux-mm@kvack.org List-ID: To: david@fromorbit.com Cc: hannes@cmpxchg.org, tytso@mit.edu, mhocko@suse.cz, dchinner@redhat.com, linux-mm@kvack.org, rientjes@google.com, oleg@redhat.com, akpm@linux-foundation.org, mgorman@suse.de, torvalds@linux-foundation.org, xfs@oss.sgi.com, fernando_b1@lab.ntt.co.jp Tetsuo Handa wrote: > If underestimating is tolerable, can we simply set different watermark > levels for GFP_ATOMIC / GFP_NOIO / GFP_NOFS / GFP_KERNEL allocations? > For example, > > GFP_KERNEL (or above) can fail if memory usage exceeds 95% > GFP_NOFS can fail if memory usage exceeds 97% > GFP_NOIO can fail if memory usage exceeds 98% > GFP_ATOMIC can fail if memory usage exceeds 99% > > I think that below order-0 GFP_NOIO allocation enters into retry-forever loop > when GFP_KERNEL (or above) allocation starts waiting for reclaim sounds > strange. Use of same watermark is preventing kernel worker threads from > processing workqueue. While it is legal to do blocking operation from > workqueue, being blocked forever is an exclusive occupation for workqueue; > other jobs in the workqueue get stuck. > Below experimental patch which raises zone watermark works for me. ---------- diff --git a/include/linux/sched.h b/include/linux/sched.h index 6d77432..92233e1 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1710,6 +1710,7 @@ struct task_struct { #ifdef CONFIG_DEBUG_ATOMIC_SLEEP unsigned long task_state_change; #endif + gfp_t gfp_mask; }; /* Future-safe accessor for struct task_struct's cpus_allowed. */ diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 7abfa70..1a6b830 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1810,6 +1810,12 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order, min -= min / 2; if (alloc_flags & ALLOC_HARDER) min -= min / 4; + if (min == mark) { + if (current->gfp_mask & __GFP_FS) + min <<= 1; + if (current->gfp_mask & __GFP_IO) + min <<= 1; + } #ifdef CONFIG_CMA /* If allocation can't use CMA areas don't use free CMA pages */ if (!(alloc_flags & ALLOC_CMA)) @@ -2810,6 +2816,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, .nodemask = nodemask, .migratetype = gfpflags_to_migratetype(gfp_mask), }; + gfp_t orig_gfp_mask; gfp_mask &= gfp_allowed_mask; @@ -2831,6 +2838,8 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, if (IS_ENABLED(CONFIG_CMA) && ac.migratetype == MIGRATE_MOVABLE) alloc_flags |= ALLOC_CMA; + orig_gfp_mask = current->gfp_mask; + current->gfp_mask = gfp_mask; retry_cpuset: cpuset_mems_cookie = read_mems_allowed_begin(); @@ -2873,6 +2882,7 @@ out: if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie))) goto retry_cpuset; + current->gfp_mask = orig_gfp_mask; return page; } EXPORT_SYMBOL(__alloc_pages_nodemask); ---------- Thanks again to Jonathan Corbet for writing https://lwn.net/Articles/635354/ . Is Dave Chinner's "reservations" suggestion conceptually doing the patch above? Dave's suggestion is to ask each GFP_NOFS and GFP_NOIO users to estimate how much amount of pages they need for their transaction like if (min == mark) { if (current->gfp_mask & __GFP_FS) min += atomic_read(&reservation_for_gfp_fs); if (current->gfp_mask & __GFP_IO) min += atomic_read(&reservation_for_gfp_io); } than ask the administrator to specify a static amount like if (min == mark) { if (current->gfp_mask & __GFP_FS) min += sysctl_reservation_for_gfp_fs; if (current->gfp_mask & __GFP_IO) min += sysctl_reservation_for_gfp_io; } ? The retry-forever loop will happen if underestimated, won't it? Then, how to handle it when the OOM killer missed the target (due to __GFP_FS) or the OOM killer cannot be invoked (due to !__GFP_FS)? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org