From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 22 Apr 2021 11:13:34 -0600
From: Yu Zhao
To: Xing Zhengjun
Cc: akpm@linux-foundation.org,
    linux-mm@kvack.org, linux-kernel@vger.kernel.org, ying.huang@intel.com,
    tim.c.chen@linux.intel.com, Shakeel Butt, Michal Hocko, wfg@mail.ustc.edu.cn
Subject: Re: [RFC] mm/vmscan.c: avoid possible long latency caused by too_many_isolated()
References: <20210416023536.168632-1-zhengjun.xing@linux.intel.com>
 <7b7a1c09-3d16-e199-15d2-ccea906d4a66@linux.intel.com>
In-Reply-To: <7b7a1c09-3d16-e199-15d2-ccea906d4a66@linux.intel.com>

On Thu, Apr 22, 2021 at 04:36:19PM +0800, Xing Zhengjun wrote:
> Hi,
>
> In a system with very few file pages (nr_active_file + nr_inactive_file
> < 100), it is easy to reproduce "nr_isolated_file > nr_inactive_file";
> too_many_isolated() then returns true, shrink_inactive_list() enters
> msleep(100), and the long latency happens.
>
> The test case to reproduce it is very simple: allocate many huge pages
> (near the DRAM size), then free them, and repeat the same operation many
> times.
> In the test case, with very few file pages in the system (nr_active_file
> + nr_inactive_file < 100), I have dumped the numbers of
> active/inactive/isolated file pages during the whole test (see the
> attachments). In shrink_inactive_list(), too_many_isolated() very easily
> returns true, so the task enters msleep(100). In too_many_isolated(),
> sc->gfp_mask is 0x342cca (__GFP_IO and __GFP_FS are set), so the
> "inactive >>= 3" path is also taken very easily, after which
> "isolated > inactive" becomes true.
>
> So I propose setting a threshold number for the total file pages, to
> ignore systems with very few file pages and bypass the 100ms sleep
> there. It is hard to pick a perfect number for the threshold, so I just
> give "256" as an example.
>
> I would appreciate your suggestions/comments. Thanks.

Hi Zhengjun,

It seems to me that using the number of isolated pages to keep a lid on
direct reclaimers is not a good solution. We shouldn't keep going in that
direction if we really want to fix the problem, because migration can
isolate many pages too, which in turn blocks page reclaim.

Here is something that works a lot better. Please give it a try. Thanks.

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 507d216610bf2..9a09f7e76f6b8 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -951,6 +951,8 @@ typedef struct pglist_data {
 
 	/* Fields commonly accessed by the page reclaim scanner */
 
+	atomic_t nr_reclaimers;
+
 	/*
 	 * NOTE: THIS IS UNUSED IF MEMCG IS ENABLED.
 	 *
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1c080fafec396..f7278642290a6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1786,43 +1786,6 @@ int isolate_lru_page(struct page *page)
 	return ret;
 }
 
-/*
- * A direct reclaimer may isolate SWAP_CLUSTER_MAX pages from the LRU list and
- * then get rescheduled. When there are massive number of tasks doing page
- * allocation, such sleeping direct reclaimers may keep piling up on each CPU,
- * the LRU list will go small and be scanned faster than necessary, leading to
- * unnecessary swapping, thrashing and OOM.
- */
-static int too_many_isolated(struct pglist_data *pgdat, int file,
-		struct scan_control *sc)
-{
-	unsigned long inactive, isolated;
-
-	if (current_is_kswapd())
-		return 0;
-
-	if (!writeback_throttling_sane(sc))
-		return 0;
-
-	if (file) {
-		inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
-		isolated = node_page_state(pgdat, NR_ISOLATED_FILE);
-	} else {
-		inactive = node_page_state(pgdat, NR_INACTIVE_ANON);
-		isolated = node_page_state(pgdat, NR_ISOLATED_ANON);
-	}
-
-	/*
-	 * GFP_NOIO/GFP_NOFS callers are allowed to isolate more pages, so they
-	 * won't get blocked by normal direct-reclaimers, forming a circular
-	 * deadlock.
-	 */
-	if ((sc->gfp_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS))
-		inactive >>= 3;
-
-	return isolated > inactive;
-}
-
 /*
  * move_pages_to_lru() moves pages from private @list to appropriate LRU list.
  * On return, @list is reused as a list of pages to be freed by the caller.
@@ -1924,19 +1887,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	bool stalled = false;
 
-	while (unlikely(too_many_isolated(pgdat, file, sc))) {
-		if (stalled)
-			return 0;
-
-		/* wait a bit for the reclaimer. */
-		msleep(100);
-		stalled = true;
-
-		/* We are about to die and free our memory. Return now. */
-		if (fatal_signal_pending(current))
-			return SWAP_CLUSTER_MAX;
-	}
-
 	lru_add_drain();
 
 	spin_lock_irq(&lruvec->lru_lock);
@@ -3302,6 +3252,7 @@ static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
 unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 				gfp_t gfp_mask, nodemask_t *nodemask)
 {
+	int nr_cpus;
 	unsigned long nr_reclaimed;
 	struct scan_control sc = {
 		.nr_to_reclaim = SWAP_CLUSTER_MAX,
@@ -3334,8 +3285,17 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 	set_task_reclaim_state(current, &sc.reclaim_state);
 	trace_mm_vmscan_direct_reclaim_begin(order, sc.gfp_mask);
 
+	nr_cpus = current_is_kswapd() ? 0 : num_online_cpus();
+	while (nr_cpus && !atomic_add_unless(&pgdat->nr_reclaimers, 1, nr_cpus)) {
+		if (schedule_timeout_killable(HZ / 10))
+			return SWAP_CLUSTER_MAX;
+	}
+
 	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
 
+	if (nr_cpus)
+		atomic_dec(&pgdat->nr_reclaimers);
+
 	trace_mm_vmscan_direct_reclaim_end(nr_reclaimed);
 	set_task_reclaim_state(current, NULL);