From: Yu Zhao <yuzhao@google.com>
Date: Fri, 30 Apr 2021 02:34:28 -0600
Subject: Re: [RFC] mm/vmscan.c: avoid possible long latency caused by too_many_isolated()
In-Reply-To: <20210416023536.168632-1-zhengjun.xing@linux.intel.com>
To: Michal Hocko
Cc: Xing Zhengjun, Andrew Morton, Linux-MM, linux-kernel,
Huang Ying, Tim Chen, Shakeel Butt, wfg@mail.ustc.edu.cn, Rik van Riel, Andrea Arcangeli

On Thu, Apr 29, 2021 at 4:00 AM Michal Hocko wrote:
>
> On Wed 28-04-21 09:05:06, Yu Zhao wrote:
> > On Wed, Apr 28, 2021 at 5:55 AM Michal Hocko wrote:
> [...]
> > > > @@ -3334,8 +3285,17 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> > > >  	set_task_reclaim_state(current, &sc.reclaim_state);
> > > >  	trace_mm_vmscan_direct_reclaim_begin(order, sc.gfp_mask);
> > > >
> > > > +	nr_cpus = current_is_kswapd() ? 0 : num_online_cpus();
> > > > +	while (nr_cpus && !atomic_add_unless(&pgdat->nr_reclaimers, 1, nr_cpus)) {
> > > > +		if (schedule_timeout_killable(HZ / 10))
> > > > +			return SWAP_CLUSTER_MAX;
> > > > +	}
> > > > +
> > > >  	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
> > > >
> > > > +	if (nr_cpus)
> > > > +		atomic_dec(&pgdat->nr_reclaimers);
> > > > +
> > > >  	trace_mm_vmscan_direct_reclaim_end(nr_reclaimed);
> > > >  	set_task_reclaim_state(current, NULL);
> > >
> > > This will surely break any memcg direct reclaim.
> >
> > Mind elaborating how it will "surely" break any memcg direct reclaim?
>
> I was wrong here. I thought this was done in a common path for all
> direct reclaimers (likely mixed up try_to_free_pages with
> do_try_to_free_pages). Sorry about the confusion.
>
> Still, I do not think that the above heuristic will work properly.
> Different reclaimers have a different reclaim target (e.g. lower zones
> and/or numa node mask) and strength (e.g. GFP_NOFS vs. GFP_KERNEL). A
> simple count based throttling would be prone to different sorts of
> priority inversions.

I see where your concern is coming from. Let's look at it from multiple
angles, and hopefully this will clear things up.

1, looking into this approach:

This approach limits the number of direct reclaimers without any bias.
It doesn't favor or disfavor anybody. IOW, everyone has an equal chance
to run, regardless of the reclaim parameters. So where does the
inversion come from?

2, comparing it with the existing code:

Both try to limit direct reclaims: one by the number of isolated pages
and the other by the number of concurrent direct reclaimers. Neither
number is correlated with any of the parameters you mentioned above,
except for the following:

too_many_isolated()
{
	...
	/*
	 * GFP_NOIO/GFP_NOFS callers are allowed to isolate more pages, so they
	 * won't get blocked by normal direct-reclaimers, forming a circular
	 * deadlock.
	 */
	if ((sc->gfp_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS))
		inactive >>= 3;
	...
}

Let's look at the commit that added the above, commit 3cf23841b4b7
("mm/vmscan.c: avoid possible deadlock caused by too_many_isolated()"):

    Date: Tue Dec 18 14:23:31 2012 -0800

    Neil found that if too_many_isolated() returns true while performing
    direct reclaim we can end up waiting for other threads to complete
    their direct reclaim. If those threads are allowed to enter the FS or
    IO to free memory, but this thread is not, then it is possible that
    those threads will be waiting on this thread and so we get a circular
    deadlock.
        some task enters direct reclaim with GFP_KERNEL
          => too_many_isolated() false
            => vmscan and run into dirty pages
              => pageout()
                => take some FS lock
                  => fs/block code does GFP_NOIO allocation
                    => enter direct reclaim again
                      => too_many_isolated() true
                        => waiting for others to progress, however the
                           other tasks may be circular waiting for the
                           FS lock..

Hmm, how could reclaim be recursive nowadays?

__alloc_pages_slowpath()
{
	...
	/* Avoid recursion of direct reclaim */
	if (current->flags & PF_MEMALLOC)
		goto nopage;

	/* Try direct reclaim and then allocating */
	page = __alloc_pages_direct_reclaim()
	...
}

Let's assume it still could. Do you remember the following commit?

commit db73ee0d4637 ("mm, vmscan: do not loop on too_many_isolated for ever")
    Date: Wed Sep 6 16:21:11 2017 -0700

If too_many_isolated() doesn't loop forever anymore, how could the above
deadlock happen? IOW, why would we need the first commit nowadays?

If you don't remember the second commit, let me jog your memory:
    Author: Michal Hocko

3, thinking abstractly:

A problem hard to solve in one domain can become a walk in the park in
another domain. This problem is a perfect example: it's difficult to
solve based on the number of isolated pages, but it becomes a lot easier
based on the number of direct reclaimers.

But there is a caveat: when we transform to a new domain, we need to
preserve the "reclaim target and strength" you mentioned. Fortunately,
there is nothing to preserve, because the existing code has none, given
that the "__GFP_IO | __GFP_FS" check in too_many_isolated() is obsolete.

Does it make sense?