From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-wr0-f199.google.com (mail-wr0-f199.google.com [209.85.128.199])
	by kanga.kvack.org (Postfix) with ESMTP id 62E796B0517
	for ; Fri, 28 Jul 2017 06:16:25 -0400 (EDT)
Received: by mail-wr0-f199.google.com with SMTP id z36so29744048wrb.13
	for ; Fri, 28 Jul 2017 03:16:25 -0700 (PDT)
Received: from outbound-smtp04.blacknight.com (outbound-smtp04.blacknight.com. [81.17.249.35])
	by mx.google.com with ESMTPS id 10si12369049wmt.48.2017.07.28.03.16.23
	for (version=TLS1 cipher=AES128-SHA bits=128/128);
	Fri, 28 Jul 2017 03:16:23 -0700 (PDT)
Received: from mail.blacknight.com (pemlinmail04.blacknight.ie [81.17.254.17])
	by outbound-smtp04.blacknight.com (Postfix) with ESMTPS id 567E0990A4
	for ; Fri, 28 Jul 2017 10:16:23 +0000 (UTC)
Date: Fri, 28 Jul 2017 11:16:22 +0100
From: Mel Gorman
Subject: Re: [PATCH 2/6] mm, kswapd: don't reset kswapd_order prematurely
Message-ID: <20170728101622.uoirf7ryvtoddq7b@techsingularity.net>
References: <20170727160701.9245-1-vbabka@suse.cz> <20170727160701.9245-3-vbabka@suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-15
Content-Disposition: inline
In-Reply-To: <20170727160701.9245-3-vbabka@suse.cz>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Vlastimil Babka
Cc: linux-mm@kvack.org, Joonsoo Kim, David Rientjes, Michal Hocko, Johannes Weiner, Andrea Arcangeli, Rik van Riel

On Thu, Jul 27, 2017 at 06:06:57PM +0200, Vlastimil Babka wrote:
> This patch deals with a corner case found when testing kcompactd with a very
> simple testcase that first fragments memory (by creating a large shmem file and
> then punching hole in every even page) and then uses artificial order-9
> GFP_NOWAIT allocations in a loop. This is freshly after virtme-run boot in KVM
> and no other activity.
>
> What happens is that kswapd always reclaims too little to get over
> compact_gap() in kswapd_shrink_node(), so it doesn't set sc->order to 0, thus
> "goto kswapd_try_sleep" in kswapd() doesn't happen. In the next iteration of
> kswapd() loop, alloc_order and reclaim_order is read again from
> pgdat->kswapd_order, which the previous iteration has reset to 0 and there was
> no other kswapd wakeup meanwhile (the workload inserts short sleeps between
> allocations). With the working order 0, node appears balanced and
> wakeup_kcompactd() does nothing.
>

The risk with a change like this is that it introduces a kswapd that is
stuck at 100% CPU reclaiming for high-order pages. Consider, for example,
this part:

> -static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_order,
> +/*
> + * Return true if kswapd fully slept because pgdat was balanced and there was
> + * no premature wakeup.
> + */
> +static bool kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_order,
>  				unsigned int classzone_idx)
>  {
>  	long remaining = 0;
>  	DEFINE_WAIT(wait);
> +	bool ret = false;
>
>  	if (freezing(current) || kthread_should_stop())
> -		return;
> +		return false;
>
>  	prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
> ...
> @@ -3493,23 +3491,32 @@ static int kswapd(void *p)
>  	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
>  	set_freezable();
>
> -	pgdat->kswapd_order = 0;
> +	pgdat->kswapd_order = alloc_order = reclaim_order = 0;
>  	pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
>  	for ( ; ; ) {
>  		bool ret;
>
> -		alloc_order = reclaim_order = pgdat->kswapd_order;
> +		alloc_order = reclaim_order = max(alloc_order, pgdat->kswapd_order);
>  		classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx);
>
>  kswapd_try_sleep:
> -		kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order,
> -					classzone_idx);
> -
> -		/* Read the new order and classzone_idx */
> -		alloc_order = reclaim_order = pgdat->kswapd_order;
> -		classzone_idx = kswapd_classzone_idx(pgdat, 0);
> -		pgdat->kswapd_order = 0;
> -		pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
> +		if (kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order,
> +					classzone_idx)) {
> +
> +			/* Read the new order and classzone_idx */
> +			alloc_order = reclaim_order = pgdat->kswapd_order;
> +			classzone_idx = kswapd_classzone_idx(pgdat, 0);
> +			pgdat->kswapd_order = 0;
> +			pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
> +		} else {
> +			/*
> +			 * We failed to sleep, so continue on the current order
> +			 * and classzone_idx, unless they increased.
> +			 */
> +			alloc_order = max(alloc_order, pgdat->kswapd_order);
> +			reclaim_order = max(reclaim_order, pgdat->kswapd_order) ;
> +			classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx);
> +		}
>
>  		ret = try_to_freeze();
>  		if (kthread_should_stop())

kswapd_try_to_sleep() returns true only if it fully slept. Now, consider a
case where kswapd is woken for order-9, fails, and there are streaming
allocators that keep kswapd awake between the low and high watermarks.
Even though all subsequent wakeups are potentially for order-0, the false
branch above keeps kswapd at order-9. You should be very wary of keeping
kswapd awake for high-order allocations; instead, somehow defer to
kcompactd or push the work into direct reclaim.
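
To make the scenario above concrete, here is a minimal userspace sketch of
the order selection only. It is not kernel code: struct node_model,
wakeup(), next_order() and simulate_streaming() are hypothetical names used
for illustration, and the assumption that every order-0 wakeup prevents a
full sleep is made purely for the example. The "patched" flag selects
between the old read-and-reset behaviour and the true/false branches quoted
above.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the single pgdat field that matters here. */
struct node_model {
	int kswapd_order;	/* models pgdat->kswapd_order */
};

static int max_order(int a, int b)
{
	return a > b ? a : b;
}

/* Models a wakeup request: the pending order only grows until it is reset. */
static void wakeup(struct node_model *n, int order)
{
	n->kswapd_order = max_order(n->kswapd_order, order);
}

/*
 * Models the order chosen after one sleep attempt.
 * patched == false: always read and reset the pending order (old code).
 * patched == true:  read and reset only after a full sleep, otherwise keep
 *                   the larger of the current and pending order (new code).
 */
static int next_order(struct node_model *n, int cur_order, bool fully_slept,
		      bool patched)
{
	if (!patched || fully_slept) {
		int order = n->kswapd_order;

		n->kswapd_order = 0;
		return order;
	}
	return max_order(cur_order, n->kswapd_order);
}

/*
 * One order-9 wakeup after a full sleep, followed by a stream of order-0
 * wakeups that (by assumption) never let kswapd sleep fully.
 * Returns the order kswapd is working on at the end.
 */
static int simulate_streaming(bool patched, int order0_wakeups)
{
	struct node_model n = { .kswapd_order = 0 };
	int order;
	int i;

	wakeup(&n, 9);
	order = next_order(&n, 0, true, patched);

	for (i = 0; i < order0_wakeups; i++) {
		wakeup(&n, 0);
		order = next_order(&n, order, false, patched);
	}
	return order;
}
```

In this model simulate_streaming(true, 5) ends at order 9, while
simulate_streaming(false, 5) drops back to order 0 on the first order-0
wakeup. The sketch ignores watermarks, classzone_idx and the
"goto kswapd_try_sleep" path, so it only illustrates the stickiness of the
order, not the actual reclaim behaviour.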
--
Mel Gorman
SUSE Labs