From: Yu Zhao <yuzhao@google.com>
To: Xing Zhengjun <zhengjun.xing@linux.intel.com>
Cc: akpm@linux-foundation.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, ying.huang@intel.com,
tim.c.chen@linux.intel.com, Shakeel Butt <shakeelb@google.com>,
Michal Hocko <mhocko@suse.com>,
wfg@mail.ustc.edu.cn
Subject: Re: [RFC] mm/vmscan.c: avoid possible long latency caused by too_many_isolated()
Date: Thu, 22 Apr 2021 11:13:34 -0600 [thread overview]
Message-ID: <YIGuvh70JbE1Cx4U@google.com> (raw)
In-Reply-To: <7b7a1c09-3d16-e199-15d2-ccea906d4a66@linux.intel.com>
On Thu, Apr 22, 2021 at 04:36:19PM +0800, Xing Zhengjun wrote:
> Hi,
>
> On a system with very few file pages (nr_active_file + nr_inactive_file
> < 100), it is easy to reproduce "nr_isolated_file > nr_inactive_file";
> too_many_isolated() then returns true, shrink_inactive_list() enters
> msleep(100), and a long latency follows.
>
> The test case to reproduce it is very simple: allocate many huge pages
> (nearly the DRAM size), then free them; repeat the same operation many
> times.
> During the whole test, on a system with very few file pages
> (nr_active_file + nr_inactive_file < 100), I dumped the numbers of
> active/inactive/isolated file pages (see the attachments). In
> shrink_inactive_list(), too_many_isolated() very easily returns true,
> and we enter msleep(100). In too_many_isolated(), sc->gfp_mask is
> 0x342cca (__GFP_IO and __GFP_FS are both set), so "inactive >>= 3" is
> applied and "isolated > inactive" easily becomes true.
>
> So I propose setting a threshold for the total number of file pages,
> below which a system with very few file pages bypasses the 100 ms
> sleep. It is hard to pick a perfect threshold, so I just give "256" as
> an example.
>
> I would appreciate your suggestions/comments. Thanks.
Hi Zhengjun,
It seems to me that using the number of isolated pages to keep a lid on
direct reclaimers is not a good solution. We shouldn't keep going in
that direction if we really want to fix the problem, because migration
can isolate many pages too, which in turn blocks page reclaim.

Here is something that works a lot better. Please give it a try. Thanks.
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 507d216610bf2..9a09f7e76f6b8 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -951,6 +951,8 @@ typedef struct pglist_data {
/* Fields commonly accessed by the page reclaim scanner */
+ atomic_t nr_reclaimers;
+
/*
* NOTE: THIS IS UNUSED IF MEMCG IS ENABLED.
*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1c080fafec396..f7278642290a6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1786,43 +1786,6 @@ int isolate_lru_page(struct page *page)
return ret;
}
-/*
- * A direct reclaimer may isolate SWAP_CLUSTER_MAX pages from the LRU list and
- * then get rescheduled. When there are massive number of tasks doing page
- * allocation, such sleeping direct reclaimers may keep piling up on each CPU,
- * the LRU list will go small and be scanned faster than necessary, leading to
- * unnecessary swapping, thrashing and OOM.
- */
-static int too_many_isolated(struct pglist_data *pgdat, int file,
- struct scan_control *sc)
-{
- unsigned long inactive, isolated;
-
- if (current_is_kswapd())
- return 0;
-
- if (!writeback_throttling_sane(sc))
- return 0;
-
- if (file) {
- inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
- isolated = node_page_state(pgdat, NR_ISOLATED_FILE);
- } else {
- inactive = node_page_state(pgdat, NR_INACTIVE_ANON);
- isolated = node_page_state(pgdat, NR_ISOLATED_ANON);
- }
-
- /*
- * GFP_NOIO/GFP_NOFS callers are allowed to isolate more pages, so they
- * won't get blocked by normal direct-reclaimers, forming a circular
- * deadlock.
- */
- if ((sc->gfp_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS))
- inactive >>= 3;
-
- return isolated > inactive;
-}
-
/*
* move_pages_to_lru() moves pages from private @list to appropriate LRU list.
* On return, @list is reused as a list of pages to be freed by the caller.
@@ -1924,19 +1887,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
struct pglist_data *pgdat = lruvec_pgdat(lruvec);
bool stalled = false;
- while (unlikely(too_many_isolated(pgdat, file, sc))) {
- if (stalled)
- return 0;
-
- /* wait a bit for the reclaimer. */
- msleep(100);
- stalled = true;
-
- /* We are about to die and free our memory. Return now. */
- if (fatal_signal_pending(current))
- return SWAP_CLUSTER_MAX;
- }
-
lru_add_drain();
spin_lock_irq(&lruvec->lru_lock);
@@ -3302,6 +3252,7 @@ static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *nodemask)
{
+ int nr_cpus;
unsigned long nr_reclaimed;
struct scan_control sc = {
.nr_to_reclaim = SWAP_CLUSTER_MAX,
@@ -3334,8 +3285,17 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
set_task_reclaim_state(current, &sc.reclaim_state);
trace_mm_vmscan_direct_reclaim_begin(order, sc.gfp_mask);
+ nr_cpus = current_is_kswapd() ? 0 : num_online_cpus();
+ while (nr_cpus && !atomic_add_unless(&pgdat->nr_reclaimers, 1, nr_cpus)) {
+ if (schedule_timeout_killable(HZ / 10))
+ return SWAP_CLUSTER_MAX;
+ }
+
nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
+ if (nr_cpus)
+ atomic_dec(&pgdat->nr_reclaimers);
+
trace_mm_vmscan_direct_reclaim_end(nr_reclaimed);
set_task_reclaim_state(current, NULL);