From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 6 Dec 2022 20:00:14 +0100
From: Johannes Weiner
To: Ivan Babrou
Cc: Linux MM, Linux Kernel Network Developers, linux-kernel, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton, Eric Dumazet,
	"David S. Miller", Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Paolo Abeni, cgroups@vger.kernel.org, kernel-team
Subject: Re: Low TCP throughput due to vmpressure with swap enabled

On Mon, Dec 05, 2022 at 04:50:46PM -0800, Ivan Babrou wrote:
> And now I can see plenty of this:
>
> [ 108.156707][ T5175] socket pressure[2]: 4294673429
> [ 108.157050][ T5175] socket pressure[2]: 4294673429
> [ 108.157301][ T5175] socket pressure[2]: 4294673429
> [ 108.157581][ T5175] socket pressure[2]: 4294673429
> [ 108.157874][ T5175] socket pressure[2]: 4294673429
> [ 108.158254][ T5175] socket pressure[2]: 4294673429
>
> I think the first result below is to blame:
>
> $ rg '.->socket_pressure' mm
> mm/memcontrol.c
> 5280: memcg->socket_pressure = jiffies;
> 7198: memcg->socket_pressure = 0;
> 7201: memcg->socket_pressure = 1;
> 7211: memcg->socket_pressure = 0;
> 7215: memcg->socket_pressure = 1;

Hoo boy, that's a silly mistake indeed. Thanks for tracking it down.

> While we set socket_pressure to either zero or one in
> mem_cgroup_charge_skmem, it is still initialized to jiffies on memcg
> creation. Zero seems like a more appropriate starting point. With that
> change I see it working as expected with no TCP speed bumps. My
> ebpf_exporter program also looks happy and reports zero clamps in my
> brief testing.

Excellent, now this behavior makes sense.
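For anyone following along, here is a minimal userspace sketch of the
failure mode (the struct, helper and values are illustrative stand-ins,
not the real kernel code): once the field is read as a plain flag, any
nonzero leftover, such as the jiffies value stored at memcg creation,
keeps the hierarchy check returning true until the first successful
skmem charge clears it.

/* Toy model only: mirrors the names, not the real kernel types. */
#include <stdbool.h>
#include <stdio.h>

struct memcg {
	unsigned long socket_pressure;	/* was a deadline, now read as a flag */
	struct memcg *parent;
};

/* Same shape as the new hierarchy check: any nonzero value means "pressure". */
static bool under_socket_pressure(struct memcg *memcg)
{
	do {
		if (memcg->socket_pressure)
			return true;
	} while ((memcg = memcg->parent));
	return false;
}

int main(void)
{
	/* Old init path: socket_pressure = jiffies (the value from the dmesg above). */
	struct memcg stale = { .socket_pressure = 4294673429UL, .parent = NULL };
	/* Proposed init: start with the flag clear. */
	struct memcg clean = { .socket_pressure = 0, .parent = NULL };

	printf("init to jiffies -> under pressure: %d\n", under_socket_pressure(&stale));
	printf("init to zero    -> under pressure: %d\n", under_socket_pressure(&clean));
	return 0;
}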
> I also think we should downgrade socket_pressure from "unsigned long"
> to "bool", as it only holds zero and one now.

Sounds good to me! Attaching the updated patch below. If nobody has any
objections, I'll add a proper changelog, reported-bys, sign-off etc and
send it out.
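To make the diff below a little easier to review, here is roughly how
the charging path would read with the patch applied. This is
hand-reassembled from the hunks against a 6.1-era tree and is not
compile-tested, so treat it as a sketch rather than the final code:

bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
			     gfp_t gfp_mask)
{
	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
		struct page_counter *fail;

		/* cgroup1: dedicated tcpmem counter */
		if (page_counter_try_charge(&memcg->tcpmem, nr_pages, &fail)) {
			memcg->socket_pressure = false;
			return true;
		}
		memcg->socket_pressure = true;
		if (gfp_mask & __GFP_NOFAIL) {
			page_counter_charge(&memcg->tcpmem, nr_pages);
			return true;
		}
		return false;
	}

	/* cgroup2: try a regular charge first, without the NOFAIL override */
	if (try_charge(memcg, gfp_mask & ~__GFP_NOFAIL, nr_pages) == 0) {
		memcg->socket_pressure = false;
		goto success;
	}
	memcg->socket_pressure = true;
	if (gfp_mask & __GFP_NOFAIL) {
		try_charge(memcg, gfp_mask, nr_pages);
		goto success;
	}

	return false;

success:
	mod_memcg_state(memcg, MEMCG_SOCK, nr_pages);
	return true;
}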
---
 include/linux/memcontrol.h |  8 +++---
 include/linux/vmpressure.h |  7 ++---
 mm/memcontrol.c            | 20 +++++++++----
 mm/vmpressure.c            | 58 ++++++--------------------------------
 mm/vmscan.c                | 15 +---------
 5 files changed, 30 insertions(+), 78 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e1644a24009c..ef1c388be5b3 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -283,11 +283,11 @@ struct mem_cgroup {
 	atomic_long_t		memory_events[MEMCG_NR_MEMORY_EVENTS];
 	atomic_long_t		memory_events_local[MEMCG_NR_MEMORY_EVENTS];
 
-	unsigned long		socket_pressure;
+	/* Socket memory allocations have failed */
+	bool			socket_pressure;
 
 	/* Legacy tcp memory accounting */
 	bool			tcpmem_active;
-	int			tcpmem_pressure;
 
 #ifdef CONFIG_MEMCG_KMEM
 	int kmemcg_id;
@@ -1701,10 +1701,10 @@ void mem_cgroup_sk_alloc(struct sock *sk);
 void mem_cgroup_sk_free(struct sock *sk);
 static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
 {
-	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && memcg->tcpmem_pressure)
+	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && memcg->socket_pressure)
 		return true;
 	do {
-		if (time_before(jiffies, READ_ONCE(memcg->socket_pressure)))
+		if (memcg->socket_pressure)
 			return true;
 	} while ((memcg = parent_mem_cgroup(memcg)));
 	return false;
diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
index 6a2f51ebbfd3..20d93de37a17 100644
--- a/include/linux/vmpressure.h
+++ b/include/linux/vmpressure.h
@@ -11,9 +11,6 @@
 #include <linux/eventfd.h>
 
 struct vmpressure {
-	unsigned long scanned;
-	unsigned long reclaimed;
-
 	unsigned long tree_scanned;
 	unsigned long tree_reclaimed;
 	/* The lock is used to keep the scanned/reclaimed above in sync. */
@@ -30,7 +27,7 @@
 struct mem_cgroup;
 
 #ifdef CONFIG_MEMCG
-extern void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree,
+extern void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
		       unsigned long scanned, unsigned long reclaimed);
 extern void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio);
 
@@ -44,7 +41,7 @@ extern int vmpressure_register_event(struct mem_cgroup *memcg,
 extern void vmpressure_unregister_event(struct mem_cgroup *memcg,
					struct eventfd_ctx *eventfd);
 #else
-static inline void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree,
+static inline void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
			      unsigned long scanned, unsigned long reclaimed) {}
 static inline void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg,
				   int prio) {}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2d8549ae1b30..0d4b9dbe775a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5277,7 +5277,6 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 	vmpressure_init(&memcg->vmpressure);
 	INIT_LIST_HEAD(&memcg->event_list);
 	spin_lock_init(&memcg->event_list_lock);
-	memcg->socket_pressure = jiffies;
 #ifdef CONFIG_MEMCG_KMEM
 	memcg->kmemcg_id = -1;
 	INIT_LIST_HEAD(&memcg->objcg_list);
@@ -7195,10 +7194,10 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
 		struct page_counter *fail;
 
 		if (page_counter_try_charge(&memcg->tcpmem, nr_pages, &fail)) {
-			memcg->tcpmem_pressure = 0;
+			memcg->socket_pressure = false;
 			return true;
 		}
-		memcg->tcpmem_pressure = 1;
+		memcg->socket_pressure = true;
 		if (gfp_mask & __GFP_NOFAIL) {
 			page_counter_charge(&memcg->tcpmem, nr_pages);
 			return true;
@@ -7206,12 +7205,21 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
 		return false;
 	}
 
-	if (try_charge(memcg, gfp_mask, nr_pages) == 0) {
-		mod_memcg_state(memcg, MEMCG_SOCK, nr_pages);
-		return true;
+	if (try_charge(memcg, gfp_mask & ~__GFP_NOFAIL, nr_pages) == 0) {
+		memcg->socket_pressure = false;
+		goto success;
+	}
+	memcg->socket_pressure = true;
+	if (gfp_mask & __GFP_NOFAIL) {
+		try_charge(memcg, gfp_mask, nr_pages);
+		goto success;
 	}
 
 	return false;
+
+success:
+	mod_memcg_state(memcg, MEMCG_SOCK, nr_pages);
+	return true;
 }
 
 /**
diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index b52644771cc4..4cec90711cf4 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -219,7 +219,6 @@ static void vmpressure_work_fn(struct work_struct *work)
  * vmpressure() - Account memory pressure through scanned/reclaimed ratio
  * @gfp:	reclaimer's gfp mask
  * @memcg:	cgroup memory controller handle
- * @tree:	legacy subtree mode
  * @scanned:	number of pages scanned
  * @reclaimed:	number of pages reclaimed
  *
@@ -227,16 +226,9 @@ static void vmpressure_work_fn(struct work_struct *work)
  * "instantaneous" memory pressure (scanned/reclaimed ratio). The raw
  * pressure index is then further refined and averaged over time.
  *
- * If @tree is set, vmpressure is in traditional userspace reporting
- * mode: @memcg is considered the pressure root and userspace is
- * notified of the entire subtree's reclaim efficiency.
- *
- * If @tree is not set, reclaim efficiency is recorded for @memcg, and
- * only in-kernel users are notified.
- *
  * This function does not return any value.
  */
-void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree,
+void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
		unsigned long scanned, unsigned long reclaimed)
 {
 	struct vmpressure *vmpr;
@@ -271,46 +263,14 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree,
 	if (!scanned)
 		return;
 
-	if (tree) {
-		spin_lock(&vmpr->sr_lock);
-		scanned = vmpr->tree_scanned += scanned;
-		vmpr->tree_reclaimed += reclaimed;
-		spin_unlock(&vmpr->sr_lock);
-
-		if (scanned < vmpressure_win)
-			return;
-		schedule_work(&vmpr->work);
-	} else {
-		enum vmpressure_levels level;
-
-		/* For now, no users for root-level efficiency */
-		if (!memcg || mem_cgroup_is_root(memcg))
-			return;
-
-		spin_lock(&vmpr->sr_lock);
-		scanned = vmpr->scanned += scanned;
-		reclaimed = vmpr->reclaimed += reclaimed;
-		if (scanned < vmpressure_win) {
-			spin_unlock(&vmpr->sr_lock);
-			return;
-		}
-		vmpr->scanned = vmpr->reclaimed = 0;
-		spin_unlock(&vmpr->sr_lock);
+	spin_lock(&vmpr->sr_lock);
+	scanned = vmpr->tree_scanned += scanned;
+	vmpr->tree_reclaimed += reclaimed;
+	spin_unlock(&vmpr->sr_lock);
 
-		level = vmpressure_calc_level(scanned, reclaimed);
-
-		if (level > VMPRESSURE_LOW) {
-			/*
-			 * Let the socket buffer allocator know that
-			 * we are having trouble reclaiming LRU pages.
-			 *
-			 * For hysteresis keep the pressure state
-			 * asserted for a second in which subsequent
-			 * pressure events can occur.
-			 */
-			WRITE_ONCE(memcg->socket_pressure, jiffies + HZ);
-		}
-	}
+	if (scanned < vmpressure_win)
+		return;
+	schedule_work(&vmpr->work);
 }
 
 /**
@@ -340,7 +300,7 @@ void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
	 * to the vmpressure() basically means that we signal 'critical'
	 * level.
	 */
-	vmpressure(gfp, memcg, true, vmpressure_win, 0);
+	vmpressure(gfp, memcg, vmpressure_win, 0);
 }
 
 #define MAX_VMPRESSURE_ARGS_LEN	(strlen("critical") + strlen("hierarchy") + 2)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 04d8b88e5216..d348366d58d4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6035,8 +6035,6 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 	memcg = mem_cgroup_iter(target_memcg, NULL, NULL);
 	do {
 		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
-		unsigned long reclaimed;
-		unsigned long scanned;
 
		/*
		 * This loop can become CPU-bound when target memcgs
@@ -6068,20 +6066,9 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
			memcg_memory_event(memcg, MEMCG_LOW);
		}
 
-		reclaimed = sc->nr_reclaimed;
-		scanned = sc->nr_scanned;
-
 		shrink_lruvec(lruvec, sc);
-
 		shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
			    sc->priority);
-
-		/* Record the group's reclaim efficiency */
-		if (!sc->proactive)
-			vmpressure(sc->gfp_mask, memcg, false,
-				   sc->nr_scanned - scanned,
-				   sc->nr_reclaimed - reclaimed);
-
	} while ((memcg = mem_cgroup_iter(target_memcg, memcg, NULL)));
 }
 
@@ -6111,7 +6098,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 
	/* Record the subtree's reclaim efficiency */
	if (!sc->proactive)
-		vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true,
+		vmpressure(sc->gfp_mask, sc->target_mem_cgroup,
			   sc->nr_scanned - nr_scanned,
			   sc->nr_reclaimed - nr_reclaimed);
 
-- 
2.38.1