From: Ivan Babrou <ivan@cloudflare.com>
Date: Mon, 21 Nov 2022 16:53:43 -0800
Subject: Low TCP throughput due to vmpressure with swap enabled
To: Linux MM <linux-mm@kvack.org>
Cc: Linux Kernel Network Developers, linux-kernel, Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton, Eric Dumazet, "David S. Miller", Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski, Paolo Abeni, cgroups@vger.kernel.org, kernel-team
Miller" , Hideaki YOSHIFUJI , David Ahern , Jakub Kicinski , Paolo Abeni , cgroups@vger.kernel.org, kernel-team Content-Type: text/plain; charset="UTF-8" ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1669078434; a=rsa-sha256; cv=none; b=tDNezhzmQv7RVB5NCjw7VzMUE0rdkP68odIdpJsYaX6GP9iSR+HlYRqc5KhMFBl3CphmIo WtIrsJumkky0xYmYszRLAyDkFDXc3+eIeSoBoelLAEV0TOuCpeoeeM3XHhOO5h1WPXR6yf ydpB2LKupC4FoLRz++gHa1KzNq6dDBs= ARC-Authentication-Results: i=1; imf30.hostedemail.com; dkim=pass header.d=cloudflare.com header.s=google header.b=K7uGD5oz; spf=pass (imf30.hostedemail.com: domain of ivan@cloudflare.com designates 209.85.219.171 as permitted sender) smtp.mailfrom=ivan@cloudflare.com; dmarc=pass (policy=reject) header.from=cloudflare.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1669078434; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:content-transfer-encoding:in-reply-to: references:dkim-signature; bh=O6WUOw9T0DV19s/c2TGTrCQrdf4F5EyxA3HYag+nfl0=; b=Skvd6MpQLvMO7hgEA3WKuQ/VPHCAcA7vWQ8nqeS1UxY77VokprrkgG8LhSfQUuHZBlKml5 Ewoh9V1XOc0ZE0ocl19yFdtu+WtP6u6ijx0+XsI0Ih0TK83JrGxEAYNcDhby0TGKxpNRZa TvXiB2gLyRnbkvxj+zJ/oxjVDXa/8Mo= X-Rspam-User: X-Stat-Signature: nnq7px83zrkm8mi96m6y74zjececfctu X-Rspamd-Queue-Id: BA67680008 Authentication-Results: imf30.hostedemail.com; dkim=pass header.d=cloudflare.com header.s=google header.b=K7uGD5oz; spf=pass (imf30.hostedemail.com: domain of ivan@cloudflare.com designates 209.85.219.171 as permitted sender) smtp.mailfrom=ivan@cloudflare.com; dmarc=pass (policy=reject) header.from=cloudflare.com X-Rspamd-Server: rspam07 X-HE-Tag: 1669078434-129052 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Hello, We have observed a negative TCP throughput behavior from the following commit: * 8e8ae645249b mm: memcontrol: hook up vmpressure to socket pressure It landed back in 2016 in v4.5, so it's not exactly a new issue. The crux of the issue is that in some cases with swap present the workload can be unfairly throttled in terms of TCP throughput. I am able to reproduce this issue in a VM locally on v6.1-rc6 with 8 GiB of RAM with zram enabled. The setup is fairly simple: 1. Run the following go proxy in one cgroup (it has some memory ballast to simulate useful memory usage): * https://gist.github.com/bobrik/2c1a8a19b921fefe22caac21fda1be82 sudo systemd-run --scope -p MemoryLimit=6G go run main.go 2. Run the following fio config in another cgroup to simulate mmapped page cache usage: [global] size=8g bs=256k iodepth=256 direct=0 ioengine=mmap group_reporting time_based runtime=86400 numjobs=8 name=randread rw=randread [job1] filename=derp sudo systemd-run --scope fio randread.fio 3. Run curl to request a large file via proxy: curl -o /dev/null http://localhost:4444 4. Observe low throughput. The numbers here are dependent on your location, but in my VM the throughput drops from 60MB/s to 10MB/s depending on whether fio is running or not. 
I can see that this happens because of the commit I mentioned with some
perf tracing:

sudo perf probe --add 'vmpressure:48 memcg->css.cgroup->kn->id scanned vmpr_scanned=vmpr->scanned reclaimed vmpr_reclaimed=vmpr->reclaimed'
sudo perf probe --add 'vmpressure:72 memcg->css.cgroup->kn->id'

I can record the probes above during the curl run:

sudo perf record -a -e probe:vmpressure_L48,probe:vmpressure_L72 -- sleep 5

Line 48 allows me to observe scanned and reclaimed page counters, line 72
is the actual throttling. Here's an example trace showing my go proxy cgroup:

kswapd0 89 [002] 2351.221995: probe:vmpressure_L48: (ffffffed2639dd90) id=0xf23 scanned=0x140 vmpr_scanned=0x0 reclaimed=0x0 vmpr_reclaimed=0x0
kswapd0 89 [007] 2351.333407: probe:vmpressure_L48: (ffffffed2639dd90) id=0xf23 scanned=0x2b3 vmpr_scanned=0x140 reclaimed=0x0 vmpr_reclaimed=0x0
kswapd0 89 [007] 2351.333408: probe:vmpressure_L72: (ffffffed2639de2c) id=0xf23

We scanned lots of pages, but weren't able to reclaim anything.

When throttling happens, it's in tcp_prune_queue, where rcv_ssthresh
(the TCP window clamp) is set to 4 x advmss (how tcp_under_memory_pressure()
ends up true for the cgroup is sketched at the end of this mail):

* https://elixir.bootlin.com/linux/v5.15.76/source/net/ipv4/tcp_input.c#L5373

else if (tcp_under_memory_pressure(sk))
	tp->rcv_ssthresh = min(tp->rcv_ssthresh, 4U * tp->advmss);

I can see plenty of memory available in both my go proxy cgroup and in
the system in general:

$ free -h
               total        used        free      shared  buff/cache   available
Mem:           7.8Gi       4.3Gi       104Mi       0.0Ki       3.3Gi       3.3Gi
Swap:           11Gi       242Mi        11Gi

It just so happens that all of the memory is hot and is not eligible to
be reclaimed. Since swap is enabled, the memory is still eligible to be
scanned. If swap is disabled, then my go proxy is not eligible for
scanning at all (all of its memory is anonymous, so there is nothing to
reclaim), and the whole issue goes away.

Punishing well-behaved programs like that doesn't seem fair. We saw
production metals with 200GB of page cache out of 384GB of RAM, where a
well-behaved proxy using 60GB of RAM + 15GB of swap is throttled like
that. The fact that it only happens with swap makes it extra weird.

I'm not really sure what to do about this. On our end we'll probably
just pass cgroup.memory=nosocket on the kernel cmdline to disable this
behavior altogether, since it's not like we're running out of TCP memory
(and we can deal with that better if it ever comes to that). There
should probably be a better general-case solution.

I don't know how widespread this issue can be. You need a fair amount of
page cache pressure pushing reclaim into anonymous memory to trigger it.
Either way, it seems like a bit of a landmine.
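For completeness, here is my reading of how the socket side consumes that
signal, which is also why cgroup.memory=nosocket makes the problem go
away. This is paraphrased and abbreviated from include/net/tcp.h and
include/linux/memcontrol.h (v6.1), so the details may differ slightly:

/* include/net/tcp.h (abbreviated) */
static inline bool tcp_under_memory_pressure(const struct sock *sk)
{
	/*
	 * mem_cgroup_sockets_enabled is the static branch that
	 * cgroup.memory=nosocket leaves disabled
	 */
	if (mem_cgroup_sockets_enabled && sk->sk_memcg &&
	    mem_cgroup_under_socket_pressure(sk->sk_memcg))
		return true;

	return READ_ONCE(tcp_memory_pressure);
}

/* include/linux/memcontrol.h (abbreviated) */
static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
{
	do {
		/* socket_pressure is what vmpressure() set to jiffies + HZ */
		if (time_before(jiffies, READ_ONCE(memcg->socket_pressure)))
			return true;
	} while ((memcg = parent_mem_cgroup(memcg)));

	return false;
}

So any socket in the affected cgroup (or one of its descendants) is
treated as being under memory pressure for a full second after each such
reclaim round, regardless of how much TCP memory is actually in use.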