Date: Fri, 23 Apr 2021 14:23:38 -0600
From: Yu Zhao
To: Xing Zhengjun
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
    ying.huang@intel.com, tim.c.chen@linux.intel.com, Shakeel Butt,
    Michal Hocko, wfg@mail.ustc.edu.cn
Subject: Re: [RFC] mm/vmscan.c: avoid possible long latency caused by too_many_isolated()
References: <20210416023536.168632-1-zhengjun.xing@linux.intel.com>
 <7b7a1c09-3d16-e199-15d2-ccea906d4a66@linux.intel.com>
 <7a0fecab-f9e1-ad39-d55e-01e574a35484@linux.intel.com>
In-Reply-To: <7a0fecab-f9e1-ad39-d55e-01e574a35484@linux.intel.com>

On Fri, Apr 23, 2021 at 02:57:07PM +0800, Xing Zhengjun wrote:
> On 4/23/2021 1:13 AM, Yu Zhao wrote:
> > On Thu, Apr 22, 2021 at 04:36:19PM +0800, Xing Zhengjun wrote:
> > > Hi,
> > >
> > > On a system with very few file pages (nr_active_file + nr_inactive_file
> > > < 100), it is easy to reproduce "nr_isolated_file > nr_inactive_file";
> > > too_many_isolated() then returns true, shrink_inactive_list() enters
> > > "msleep(100)", and long latency follows.
> > >
> > > The test case to reproduce it is very simple: allocate many huge pages
> > > (near the DRAM size), then free them, and repeat the same operation many
> > > times. For this test case, on a system with very few file pages
> > > (nr_active_file + nr_inactive_file < 100), I dumped the numbers of
> > > active/inactive/isolated file pages during the whole test (see the
> > > attachments). In shrink_inactive_list(), "too_many_isolated" very easily
> > > returns true and we enter "msleep(100)". In "too_many_isolated",
> > > sc->gfp_mask is 0x342cca ("__GFP_IO" and "__GFP_FS" are set in the mask),
> > > so it is also very easy to take the "inactive >>= 3" path, after which
> > > "isolated > inactive" becomes true.
> > >
> > > So my proposal is to set a threshold for the total number of file pages:
> > > on systems with very few file pages, skip the check and bypass the 100 ms
> > > sleep. It is hard to pick a perfect threshold, so I just give "256" as an
> > > example.
> > >
> > > I would appreciate your suggestions/comments. Thanks.
> >
> > Hi Zhengjun,
> >
> > It seems to me that using the number of isolated pages to keep a lid on
> > direct reclaimers is not a good solution. We shouldn't keep going in
> > that direction if we really want to fix the problem, because migration
> > can isolate many pages too, which in turn blocks page reclaim.
> >
> > Here is something that works a lot better. Please give it a try. Thanks.
>
> Thanks, I will try it with my test cases.

Thanks. I took care of my sloppiness from yesterday and tested the
following. It should apply cleanly and work well. Please let me know.

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 47946cec7584..48bb2b77389e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -832,6 +832,7 @@ typedef struct pglist_data {
 #endif
 
 	/* Fields commonly accessed by the page reclaim scanner */
+	atomic_t nr_reclaimers;
 
 	/*
 	 * NOTE: THIS IS UNUSED IF MEMCG IS ENABLED.
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 562e87cbd7a1..3fcdfbee89c7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1775,43 +1775,6 @@ int isolate_lru_page(struct page *page)
 	return ret;
 }
 
-/*
- * A direct reclaimer may isolate SWAP_CLUSTER_MAX pages from the LRU list and
- * then get rescheduled. When there are massive number of tasks doing page
- * allocation, such sleeping direct reclaimers may keep piling up on each CPU,
- * the LRU list will go small and be scanned faster than necessary, leading to
- * unnecessary swapping, thrashing and OOM.
- */
-static int too_many_isolated(struct pglist_data *pgdat, int file,
-		struct scan_control *sc)
-{
-	unsigned long inactive, isolated;
-
-	if (current_is_kswapd())
-		return 0;
-
-	if (!writeback_throttling_sane(sc))
-		return 0;
-
-	if (file) {
-		inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
-		isolated = node_page_state(pgdat, NR_ISOLATED_FILE);
-	} else {
-		inactive = node_page_state(pgdat, NR_INACTIVE_ANON);
-		isolated = node_page_state(pgdat, NR_ISOLATED_ANON);
-	}
-
-	/*
-	 * GFP_NOIO/GFP_NOFS callers are allowed to isolate more pages, so they
-	 * won't get blocked by normal direct-reclaimers, forming a circular
-	 * deadlock.
-	 */
-	if ((sc->gfp_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS))
-		inactive >>= 3;
-
-	return isolated > inactive;
-}
-
 /*
  * move_pages_to_lru() moves pages from private @list to appropriate LRU list.
  * On return, @list is reused as a list of pages to be freed by the caller.
@@ -1911,20 +1874,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	bool file = is_file_lru(lru);
 	enum vm_event_item item;
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
-	bool stalled = false;
-
-	while (unlikely(too_many_isolated(pgdat, file, sc))) {
-		if (stalled)
-			return 0;
-
-		/* wait a bit for the reclaimer. */
-		msleep(100);
-		stalled = true;
-
-		/* We are about to die and free our memory. Return now. */
-		if (fatal_signal_pending(current))
-			return SWAP_CLUSTER_MAX;
-	}
 
 	lru_add_drain();
 
@@ -2903,6 +2852,8 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 	unsigned long nr_soft_scanned;
 	gfp_t orig_mask;
 	pg_data_t *last_pgdat = NULL;
+	bool should_retry = false;
+	int nr_cpus = num_online_cpus();
 
 	/*
 	 * If the number of buffer_heads in the machine exceeds the maximum
@@ -2914,9 +2865,18 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 		sc->gfp_mask |= __GFP_HIGHMEM;
 		sc->reclaim_idx = gfp_zone(sc->gfp_mask);
 	}
-
+retry:
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
 					sc->reclaim_idx, sc->nodemask) {
+		/*
+		 * Shrink each node in the zonelist once. If the zonelist is
+		 * ordered by zone (not the default) then a node may be shrunk
+		 * multiple times but in that case the user prefers lower zones
+		 * being preserved.
+		 */
+		if (zone->zone_pgdat == last_pgdat)
+			continue;
+
 		/*
 		 * Take care memory controller reclaiming has small influence
 		 * to global LRU.
@@ -2941,16 +2901,28 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 				sc->compaction_ready = true;
 				continue;
 			}
+		}
 
-			/*
-			 * Shrink each node in the zonelist once. If the
-			 * zonelist is ordered by zone (not the default) then a
-			 * node may be shrunk multiple times but in that case
-			 * the user prefers lower zones being preserved.
-			 */
-			if (zone->zone_pgdat == last_pgdat)
-				continue;
+		/*
+		 * A direct reclaimer may isolate SWAP_CLUSTER_MAX pages from
+		 * the LRU list and then get rescheduled. When there are massive
+		 * number of tasks doing page allocation, such sleeping direct
+		 * reclaimers may keep piling up on each CPU, the LRU list will
+		 * go small and be scanned faster than necessary, leading to
+		 * unnecessary swapping, thrashing and OOM.
+		 */
+		VM_BUG_ON(current_is_kswapd());
 
+		if (!atomic_add_unless(&zone->zone_pgdat->nr_reclaimers, 1, nr_cpus)) {
+			should_retry = true;
+			continue;
+		}
+
+		if (last_pgdat)
+			atomic_dec(&last_pgdat->nr_reclaimers);
+		last_pgdat = zone->zone_pgdat;
+
+		if (!cgroup_reclaim(sc)) {
 			/*
 			 * This steals pages from memory cgroups over softlimit
 			 * and returns the number of reclaimed pages and
@@ -2966,13 +2938,20 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 			/* need some check for avoid more shrink_zone() */
 		}
 
-		/* See comment about same check for global reclaim above */
-		if (zone->zone_pgdat == last_pgdat)
-			continue;
-		last_pgdat = zone->zone_pgdat;
 		shrink_node(zone->zone_pgdat, sc);
 	}
 
+	if (last_pgdat)
+		atomic_dec(&last_pgdat->nr_reclaimers);
+	else if (should_retry) {
+		/* wait a bit for the reclaimer. */
+		if (!schedule_timeout_killable(HZ / 10))
+			goto retry;
+
+		/* We are about to die and free our memory. Return now. */
+		sc->nr_reclaimed += SWAP_CLUSTER_MAX;
+	}
+
 	/*
 	 * Restore to original mask to avoid the impact on the caller if we
 	 * promoted it to __GFP_HIGHMEM.
@@ -4189,6 +4168,15 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 	set_task_reclaim_state(p, &sc.reclaim_state);
 
 	if (node_pagecache_reclaimable(pgdat) > pgdat->min_unmapped_pages) {
+		int nr_cpus = num_online_cpus();
+
+		VM_BUG_ON(current_is_kswapd());
+
+		if (!atomic_add_unless(&pgdat->nr_reclaimers, 1, nr_cpus)) {
+			schedule_timeout_killable(HZ / 10);
+			goto out;
+		}
+
 		/*
 		 * Free memory by calling shrink node with increasing
 		 * priorities until we have enough memory freed.
@@ -4196,8 +4184,10 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 		do {
 			shrink_node(pgdat, &sc);
 		} while (sc.nr_reclaimed < nr_pages && --sc.priority >= 0);
-	}
 
+		atomic_dec(&pgdat->nr_reclaimers);
+	}
+out:
 	set_task_reclaim_state(p, NULL);
 	current->flags &= ~PF_SWAPWRITE;
 	memalloc_noreclaim_restore(noreclaim_flag);
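
For readers who want the gist without walking through the whole diff: instead
of throttling direct reclaim on isolated-page counts, the patch caps the number
of concurrent direct reclaimers per node at num_online_cpus() with an atomic
counter in pglist_data; a task that cannot take a slot backs off briefly and
tries again. Below is a minimal userspace sketch (C11 atomics) of that gating
pattern, not the kernel code itself; the names (node_gate, try_enter_reclaim,
leave_reclaim) and the fixed 100 ms back-off are illustrative assumptions.

/*
 * Userspace model of the per-node gate the patch adds: at most "limit"
 * reclaimers run concurrently; everyone else sleeps briefly and retries.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

struct node_gate {
	atomic_int nr_reclaimers;	/* models pgdat->nr_reclaimers */
	int limit;			/* models num_online_cpus() */
};

/* Rough equivalent of atomic_add_unless(&counter, 1, limit): take a slot
 * only if the counter is currently below the limit. */
static bool try_enter_reclaim(struct node_gate *gate)
{
	int cur = atomic_load(&gate->nr_reclaimers);

	while (cur < gate->limit) {
		if (atomic_compare_exchange_weak(&gate->nr_reclaimers,
						 &cur, cur + 1))
			return true;	/* got a reclaim slot */
		/* on failure, cur holds the latest value; recheck the limit */
	}
	return false;			/* node already saturated */
}

static void leave_reclaim(struct node_gate *gate)
{
	atomic_fetch_sub(&gate->nr_reclaimers, 1);	/* like atomic_dec() */
}

static void direct_reclaim(struct node_gate *gate)
{
	while (!try_enter_reclaim(gate))
		usleep(100 * 1000);	/* stand-in for schedule_timeout_killable(HZ / 10) */

	/* ... the real code does its shrink_node() work here ... */

	leave_reclaim(gate);
}

int main(void)
{
	struct node_gate gate = { .limit = 4 };

	atomic_init(&gate.nr_reclaimers, 0);
	direct_reclaim(&gate);
	printf("reclaimers now: %d\n", atomic_load(&gate.nr_reclaimers));
	return 0;
}

One difference from this simple loop: the patch's back-off uses
schedule_timeout_killable(), so a task with a fatal signal pending stops
waiting, claims SWAP_CLUSTER_MAX as reclaimed and lets the allocator move on.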