From: Eric Dumazet
Date: Tue, 22 Nov 2022 10:01:16 -0800
Subject: Re: Low TCP throughput due to vmpressure with swap enabled
To: Ivan Babrou
Cc: Linux MM, Linux Kernel Network Developers, linux-kernel,
 Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
 Muchun Song, Andrew Morton, "David S. Miller", Hideaki YOSHIFUJI,
 David Ahern, Jakub Kicinski, Paolo Abeni, cgroups@vger.kernel.org,
 kernel-team

On Mon, Nov 21, 2022 at 4:53 PM Ivan Babrou wrote:
>
> Hello,
>
> We have
> observed a negative TCP throughput behavior from the following commit:
>
> * 8e8ae645249b mm: memcontrol: hook up vmpressure to socket pressure
>
> It landed back in 2016 in v4.5, so it's not exactly a new issue.
>
> The crux of the issue is that in some cases with swap present the
> workload can be unfairly throttled in terms of TCP throughput.

I guess defining 'fairness' in such a scenario is nearly impossible.

Have you tried changing /proc/sys/net/ipv4/tcp_rmem (and/or tcp_wmem)?
Defaults are quite conservative.
If for your workload you want to ensure a minimum amount of memory per
TCP socket, that might be good enough.

Of course, if your proxy has to deal with millions of concurrent TCP
sockets, I fear this is not an option.

>
> I am able to reproduce this issue in a VM locally on v6.1-rc6 with 8
> GiB of RAM with zram enabled.
>
> The setup is fairly simple:
>
> 1. Run the following go proxy in one cgroup (it has some memory
> ballast to simulate useful memory usage):
>
> * https://gist.github.com/bobrik/2c1a8a19b921fefe22caac21fda1be82
>
> sudo systemd-run --scope -p MemoryLimit=6G go run main.go
>
> 2. Run the following fio config in another cgroup to simulate mmapped
> page cache usage:
>
> [global]
> size=8g
> bs=256k
> iodepth=256
> direct=0
> ioengine=mmap
> group_reporting
> time_based
> runtime=86400
> numjobs=8
> name=randread
> rw=randread
>
> [job1]
> filename=derp
>
> sudo systemd-run --scope fio randread.fio
>
> 3. Run curl to request a large file via proxy:
>
> curl -o /dev/null http://localhost:4444
>
> 4. Observe low throughput. The numbers here are dependent on your
> location, but in my VM the throughput drops from 60MB/s to 10MB/s
> depending on whether fio is running or not.
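
A minimal sketch of the tcp_rmem suggestion above, in case it helps with
testing. tcp_rmem holds three byte counts (min, default, max). The helper
name raise_rmem_min and the 65536-byte minimum are hypothetical, chosen
only for illustration; the script prints a candidate setting and changes
nothing on the system.

```shell
#!/bin/sh
# Sketch only: compute a tcp_rmem triple with a larger minimum.
# raise_rmem_min is a hypothetical helper, not an existing tool.

# raise_rmem_min NEW_MIN MIN DEFAULT MAX
# Prints "min default max" with the minimum replaced by NEW_MIN.
# DEFAULT and MAX are bumped up if they would fall below NEW_MIN.
raise_rmem_min() (
    new_min=$1
    def=$3
    max=$4
    if [ "$def" -lt "$new_min" ]; then
        def=$new_min
    fi
    if [ "$max" -lt "$new_min" ]; then
        max=$new_min
    fi
    echo "$new_min $def $max"
)

# Only touch /proc if it is readable (Linux hosts); otherwise do nothing.
if [ -r /proc/sys/net/ipv4/tcp_rmem ]; then
    current=$(cat /proc/sys/net/ipv4/tcp_rmem)
    echo "current tcp_rmem:  $current"
    # Word splitting of $current is intended: it expands to three fields.
    # 65536 is an assumed example value, not a recommendation.
    proposed=$(raise_rmem_min 65536 $current)
    echo "proposed tcp_rmem: $proposed"
    echo "to apply: sysctl -w net.ipv4.tcp_rmem=\"$proposed\""
fi
```

The same approach would apply to tcp_wmem. Whether a larger minimum is
enough to offset the throttling described in this thread would have to be
measured on the affected workload.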
>
> I can see that this happens because of the commit I mentioned with
> some perf tracing:
>
> sudo perf probe --add 'vmpressure:48 memcg->css.cgroup->kn->id scanned vmpr_scanned=vmpr->scanned reclaimed vmpr_reclaimed=vmpr->reclaimed'
> sudo perf probe --add 'vmpressure:72 memcg->css.cgroup->kn->id'
>
> I can record the probes above during curl runtime:
>
> sudo perf record -a -e probe:vmpressure_L48,probe:vmpressure_L72 -- sleep 5
>
> Line 48 allows me to observe scanned and reclaimed page counters, line
> 72 is the actual throttling.
>
> Here's an example trace showing my go proxy cgroup:
>
> kswapd0 89 [002] 2351.221995: probe:vmpressure_L48: (ffffffed2639dd90) id=0xf23 scanned=0x140 vmpr_scanned=0x0 reclaimed=0x0 vmpr_reclaimed=0x0
> kswapd0 89 [007] 2351.333407: probe:vmpressure_L48: (ffffffed2639dd90) id=0xf23 scanned=0x2b3 vmpr_scanned=0x140 reclaimed=0x0 vmpr_reclaimed=0x0
> kswapd0 89 [007] 2351.333408: probe:vmpressure_L72: (ffffffed2639de2c) id=0xf23
>
> We scanned lots of pages, but weren't able to reclaim anything.
>
> When throttling happens, it's in tcp_prune_queue, where rcv_ssthresh
> (TCP window clamp) is set to 4 x advmss:
>
> * https://elixir.bootlin.com/linux/v5.15.76/source/net/ipv4/tcp_input.c#L5373
>
> else if (tcp_under_memory_pressure(sk))
>         tp->rcv_ssthresh = min(tp->rcv_ssthresh, 4U * tp->advmss);
>
> I can see plenty of memory available in both my go proxy cgroup and in
> the system in general:
>
> $ free -h
>               total        used        free      shared  buff/cache   available
> Mem:          7.8Gi       4.3Gi       104Mi       0.0Ki       3.3Gi       3.3Gi
> Swap:          11Gi       242Mi        11Gi
>
> It just so happens that all of the memory is hot and is not eligible
> to be reclaimed. Since swap is enabled, the memory is still eligible
> to be scanned. If swap is disabled, then my go proxy is not eligible
> for scanning anymore (all memory is anonymous, nowhere to reclaim it),
> so the whole issue goes away.
>
> Punishing well behaving programs like that doesn't seem fair.
> We saw
> production metals with 200GB page cache out of 384GB of RAM, where a
> well behaved proxy with 60GB of RAM + 15GB of swap is throttled like
> that. The fact that it only happens with swap makes it extra weird.
>
> I'm not really sure what to do with this. From our end we'll probably
> just pass cgroup.memory=nosocket in cmdline to disable this behavior
> altogether, since it's not like we're running out of TCP memory (and
> we can deal with that better if it ever comes to that). There should
> probably be a better general case solution.

Probably :)

>
> I don't know how widespread this issue can be. You need a fair amount
> of page cache pressure to try to go to anonymous memory for reclaim to
> trigger this.
>
> Either way, this seems like a bit of a landmine.
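
To put the tcp_prune_queue clamp quoted above into numbers, here is a
rough sketch of the arithmetic. It assumes advmss = 1460 (a common value
on a 1500-byte MTU IPv4 path), a starting rcv_ssthresh of 65535 bytes and
a 10 ms round trip; all three are illustrative assumptions, not
measurements from this report. The helper names are hypothetical.

```shell
#!/bin/sh
# Sketch only: arithmetic behind
#   tp->rcv_ssthresh = min(tp->rcv_ssthresh, 4U * tp->advmss);
# All input values below are assumed for illustration.

# clamped_ssthresh CURRENT ADVMSS
# Prints min(CURRENT, 4 * ADVMSS) in bytes.
clamped_ssthresh() (
    current=$1
    limit=$((4 * $2))
    if [ "$current" -lt "$limit" ]; then
        echo "$current"
    else
        echo "$limit"
    fi
)

# window_bound_bps WINDOW_BYTES RTT_MS
# Prints the window/RTT ceiling in bytes per second (integer math):
# a sender can have at most one receive window in flight per round trip.
window_bound_bps() (
    echo $(( $1 * 1000 / $2 ))
)

ssthresh=$(clamped_ssthresh 65535 1460)
echo "clamped rcv_ssthresh: $ssthresh bytes"
echo "ceiling at 10 ms RTT: $(window_bound_bps "$ssthresh" 10) bytes/s"
```

With these inputs the clamp comes out at 5840 bytes, so if the advertised
window ends up near that value, a single connection is bounded at roughly
584000 bytes/s on a 10 ms path. The real figure depends on the actual
advmss, RTT and how often the clamp is re-applied.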