Subject: Re: [PATCH v10] mm: vmscan: try to reclaim swapcache pages if no swap space
To: Chris Li
References: <20231121090624.1814733-1-liushixin2@huawei.com>
CC: Yu Zhao, Andrew Morton, Yosry Ahmed, Huang Ying, Sachin Sant, Michal Hocko, Johannes Weiner, Kefeng Wang
From: Liu Shixin
Date: Tue, 28 Nov 2023 09:59:54 +0800

On 2023/11/24 1:19, Chris Li wrote:
> Hi Shixin,
>
> On Tue, Nov 21, 2023 at 12:08 AM Liu Shixin wrote:
>> When spaces of swap devices are exhausted, only file pages can be
>> reclaimed. But there are still some swapcache pages in anon lru list.
>> This can lead to a premature out-of-memory.
>>
>> The problem is found with such step:
>>
>> Firstly, set a 9MB disk swap space, then create a cgroup with 10MB
>> memory limit, then runs an program to allocates about 15MB memory.
>>
>> The problem occurs occasionally, which may need about 100 times [1].
> Just out of my curiosity, in your usage case, how much additional
> memory in terms of pages or MB can be freed by this patch, using
> current code as base line?
My test case runs in a memory cgroup with a 10MB memory limit; the
memory that can be freed is only about 5MB.
> Does the swap cache page reclaimed in swapcache_only mode, all swap
> count drop to zero, and the only reason to stay in swap cache is to
> void page IO write if we need to swap that page out again?
Yes.
>
>> Fix it by checking number of swapcache pages in can_reclaim_anon_pages().
>> If the number is not zero, return true and set swapcache_only to 1.
>> When scan anon lru list in swapcache_only mode, non-swapcache pages will
>> be skipped to isolate in order to accelerate reclaim efficiency.
> Here you said non-swapcache will be skipped if swapcache_only == 1
>
>> However, in swapcache_only mode, the scan count still increased when scan
>> non-swapcache pages because there are large number of non-swapcache pages
>> and rare swapcache pages in swapcache_only mode, and if the non-swapcache
> Here you suggest non-swapcache pages will also be scanned even when
> swapcache_only == 1. It seems to contradict what you said above. I
> feel that I am missing something here.
Swapcache pages and non-swapcache pages both sit on the anon LRU, so
scanning anon pages also scans non-swapcache pages. In
isolate_lru_folios(), if we put non-swapcache pages on the
folios_skipped list without counting them, the scan of the anon list
keeps running until it finds enough swapcache pages, which wastes too
much time. To avoid this, we have to stop once we have scanned enough
anon pages, even if we have not isolated enough swapcache pages.
>
>> is skipped and do not count, the scan of pages in isolate_lru_folios() can
> Can you clarify which "scan of pages", are those pages swapcache pages
> or non-swapcache pages?
I mean the scan of anon pages, including both swapcache pages and
non-swapcache pages.
>
>> eventually lead to hung task, just as Sachin reported [2].
>>
>> By the way, since there are enough times of memory reclaim before OOM, it
>> is not need to isolate too much swapcache pages in one times.
>>
>> [1]. https://lore.kernel.org/lkml/CAJD7tkZAfgncV+KbKr36=eDzMnT=9dZOT0dpMWcurHLr6Do+GA@mail.gmail.com/
>> [2]. https://lore.kernel.org/linux-mm/CAJD7tkafz_2XAuqE8tGLPEcpLngewhUo=5US14PAtSM9tLBUQg@mail.gmail.com/
>>
>> Signed-off-by: Liu Shixin
>> Tested-by: Yosry Ahmed
>> Reviewed-by: "Huang, Ying"
>> Reviewed-by: Yosry Ahmed
>> ---
>> v9->v10: Use per-node swapcache suggested by Yu Zhao.
>> v8->v9: Move the swapcache check after can_demote() and refector
>>         can_reclaim_anon_pages() a bit.
>> v7->v8: Reset swapcache_only at the beginning of can_reclaim_anon_pages().
>> v6->v7: Reset swapcache_only to zero after there are swap spaces.
>> v5->v6: Fix NULL pointing derefence and hung task problem reported by Sachin.
>>
>>  mm/vmscan.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++++-
>>  1 file changed, 49 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 506f8220c5fe..1fcc94717370 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -136,6 +136,9 @@ struct scan_control {
>>  	/* Always discard instead of demoting to lower tier memory */
>>  	unsigned int no_demotion:1;
>>
>> +	/* Swap space is exhausted, only reclaim swapcache for anon LRU */
>> +	unsigned int swapcache_only:1;
>> +
>>  	/* Allocation order */
>>  	s8 order;
>>
>> @@ -308,10 +311,36 @@ static bool can_demote(int nid, struct scan_control *sc)
>>  	return true;
>>  }
>>
>> +#ifdef CONFIG_SWAP
>> +static bool can_reclaim_swapcache(struct mem_cgroup *memcg, int nid)
>> +{
>> +	struct pglist_data *pgdat = NODE_DATA(nid);
>> +	unsigned long nr_swapcache;
>> +
>> +	if (!memcg) {
>> +		nr_swapcache = node_page_state(pgdat, NR_SWAPCACHE);
>> +	} else {
>> +		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
>> +
>> +		nr_swapcache = lruvec_page_state_local(lruvec, NR_SWAPCACHE);
>> +	}
>> +
>> +	return nr_swapcache > 0;
>> +}
>> +#else
>> +static bool can_reclaim_swapcache(struct mem_cgroup *memcg, int nid)
>> +{
>> +	return false;
>> +}
>> +#endif
>> +
>>  static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
>>  					  int nid,
>>  					  struct scan_control *sc)
>>  {
>> +	if (sc)
>> +		sc->swapcache_only = 0;
>> +
> Minor nitpick. The sc->swapcache_only is first set to 0 then later set
> to 1. Better use a local variable then write to sc->swapcache_only in
> one go. If the scan_control has more than one thread accessing it, the
> threads can see the flicker of 0->1 change. I don't think that is the
> case in our current code, sc is created on stack. There are other
> minor benefits as The "if (sc) test" only needs to be done once, one
> store instruction.
>
> Chris
Thanks for your advice. If this patch is eventually used, I will revise
it.
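A minimal userspace sketch of what that revision could look like. This
is an illustration only, not the actual mm/vmscan.c code: struct
node_state and can_reclaim_anon_pages_sketch() are hypothetical
stand-ins for the kernel's swap-space, demotion and swapcache checks,
and struct scan_control is reduced to the one field discussed here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct scan_control. */
struct scan_control {
	unsigned int swapcache_only:1;
};

/* Hypothetical inputs replacing the real swap, demotion and swapcache checks. */
struct node_state {
	bool has_free_swap;
	bool can_demote;
	unsigned long nr_swapcache;
};

static bool can_reclaim_anon_pages_sketch(const struct node_state *ns,
					  struct scan_control *sc)
{
	bool swapcache_only = false;
	bool ret;

	assert(ns != NULL);

	if (ns->has_free_swap || ns->can_demote) {
		ret = true;
	} else if (ns->nr_swapcache > 0) {
		/* Only swapcache folios are left to reclaim on the anon LRU. */
		swapcache_only = true;
		ret = true;
	} else {
		ret = false;
	}

	/* One test of sc and one store, so no 0 -> 1 flicker is visible. */
	if (sc)
		sc->swapcache_only = swapcache_only;

	return ret;
}
```

With this shape the flag is written exactly once per call, on every
return path, and a NULL sc is still accepted as in the patch.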
>>  	if (memcg == NULL) {
>>  		/*
>>  		 * For non-memcg reclaim, is there
>> @@ -330,7 +359,17 @@ static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
>>  	 *
>>  	 * Can it be reclaimed from this node via demotion?
>>  	 */
>> -	return can_demote(nid, sc);
>> +	if (can_demote(nid, sc))
>> +		return true;
>> +
>> +	/* Is there any swapcache pages to reclaim in this node? */
>> +	if (can_reclaim_swapcache(memcg, nid)) {
>> +		if (sc)
>> +			sc->swapcache_only = 1;
>> +		return true;
>> +	}
>> +
>> +	return false;
>>  }
>>
>>  /*
>> @@ -1642,6 +1681,15 @@ static unsigned long isolate_lru_folios(unsigned long nr_to_scan,
>>  		 */
>>  		scan += nr_pages;
>>
>> +		/*
>> +		 * Count non-swapcache too because the swapcache pages may
>> +		 * be rare and it takes too much times here if not count
>> +		 * the non-swapcache pages.
>> +		 */
>> +		if (unlikely(sc->swapcache_only && !is_file_lru(lru) &&
>> +		    !folio_test_swapcache(folio)))
>> +			goto move;
>> +
>>  		if (!folio_test_lru(folio))
>>  			goto move;
>>  		if (!sc->may_unmap && folio_mapped(folio))
>> --
>> 2.25.1
>>
>>
> .
>
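To illustrate the scan accounting discussed above, here is a simplified
userspace model, not kernel code. The names struct folio_model and
isolate_swapcache_only() are hypothetical, every folio is assumed to be
a single page, and the LRU is modelled as a plain array. The point it
shows is that each visited folio is charged to the scan budget before
the swapcache check, so the walk ends after nr_to_scan folios even when
few of them are in the swapcache.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified folio: one page, only the swapcache flag. */
struct folio_model {
	bool in_swapcache;
};

/*
 * Walk at most nr_to_scan folios of a modelled anon LRU in
 * swapcache_only mode. Returns the number of swapcache folios
 * isolated; *nr_scanned reports how many folios were visited.
 */
static size_t isolate_swapcache_only(const struct folio_model *lru,
				     size_t nr_folios, size_t nr_to_scan,
				     size_t *nr_scanned)
{
	size_t scan = 0, taken = 0;

	assert(lru != NULL && nr_scanned != NULL);

	for (size_t i = 0; i < nr_folios && scan < nr_to_scan; i++) {
		/* Charge the folio to the budget first, as the patch does. */
		scan++;

		/* Non-swapcache folios are skipped but already counted. */
		if (!lru[i].in_swapcache)
			continue;

		taken++;
	}

	*nr_scanned = scan;
	return taken;
}
```

For an LRU of eight folios with swapcache folios only at positions 3
and 7, a budget of 4 stops after four folios with one folio isolated.
If skipped folios were not counted, the same call would keep walking
the list looking for more swapcache folios.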