Message-ID: <852211c6-0b55-4bdd-8799-90e1f0c002c1@gmail.com>
Date: Wed, 30 Oct 2024 20:25:27 +0000
Subject: Re: [PATCH RFC] mm: mitigate large folios usage and swap thrashing for nearly full memcg
From: Usama Arif <usamaarif642@gmail.com>
To: Yosry Ahmed
Cc: Barry Song <21cnbao@gmail.com>, akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Barry Song, Kanchana P Sridhar, David Hildenbrand, Baolin Wang, Chris Li, "Huang, Ying", Kairui Song, Ryan Roberts, Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song
References: <20241027001444.3233-1-21cnbao@gmail.com> <33c5d5ca-7bc4-49dc-b1c7-39f814962ae0@gmail.com>
On 30/10/2024 19:51, Yosry Ahmed wrote:
> [..]
>>> My second point about the mitigation is as follows: For a system (or
>>> memcg) under severe memory pressure, especially one without hardware TLB
>>> optimization, is enabling mTHP always the right choice? Since mTHP operates at
>>> a larger granularity, some internal fragmentation is unavoidable, regardless
>>> of optimization. Could the mitigation code help in automatically tuning
>>> this fragmentation?
>>>
>>
>> I agree with the point that always enabling mTHP is not the right thing to do
>> on all platforms. I also think it might be the case that enabling mTHP
>> is a good thing for some workloads, but enabling mTHP swapin along with
>> it might not be.
>>
>> As you said, when you have apps switching between foreground and background
>> in Android, it probably makes sense to have large folio swapping, as you
>> want to bring in all the pages from the background app as quickly as possible.
>> You also get all the TLB optimizations and the smaller LRU overhead after
>> you have brought in all the pages.
>> The Linux kernel build test doesn't really get to benefit from the TLB
>> optimization and smaller LRU overhead, as the pages are probably very
>> short-lived. So I think it doesn't show the benefit of large folio swapin
>> properly, and large folio swapin should probably be disabled for this kind
>> of workload, even though mTHP should be enabled.
>>
>> I am not sure that the approach we are trying in this patch is the right way:
>> - This patch makes it a memcg issue, but you could have memcg disabled and
>> then the mitigation being tried here won't apply.
>
> Is the problem reproducible without memcg? I imagine only if the
> entire system is under memory pressure. I guess we would want the same
> "mitigation" either way.
>

What would be a good open source benchmark/workload to test without limiting
memory in memcg? For the kernel build test, I can only get zswap activity to
happen if I build in a cgroup and limit memory.max.

I can just run large folio zswapin in production and see, but that will take
me a few days. TBH, running in prod is a much better test, and if there isn't
any sort of thrashing, then maybe it's not really an issue? I believe Barry
doesn't see an issue on Android phones (but please correct me if I am wrong),
and if there isn't an issue in Meta production as well, that is a good data
point for servers too. And maybe a kernel build in a 4G memcg is not a good
test.

>> - Instead of this being a large folio swapin issue, is it more of a readahead
>> issue? If we zswap (without the large folio swapin series) and change the window
>> to 1 in swap_vma_readahead, we might see an improvement in Linux kernel build time
>> when cgroup memory is limited, as readahead would probably cause swap thrashing as
>> well.
>
> I think large folio swapin would make the problem worse anyway. I am
> also not sure if the readahead window adjusts on memory pressure or
> not.
>

The readahead window doesn't look at memory pressure. So maybe the same thing
is being seen here as there would be in swapin_readahead? Maybe if we check
kernel build test performance in a 4G memcg with the diff below, it might get
better?

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 4669f29cf555..9e196e1e6885 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -809,7 +809,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
 	pgoff_t ilx;
 	bool page_allocated;
 
-	win = swap_vma_ra_win(vmf, &start, &end);
+	win = 1;
 	if (win == 1)
 		goto skip;

>> - Instead of looking at cgroup margin, maybe we should try and look at
>> the rate of change of workingset_restore_anon? This might be a lot more
>> complicated to do, but it is probably the right metric to determine swap
>> thrashing. It also means that this could be used in both the synchronous
>> swapcache skipping path and the swapin_readahead path.
>> (Thanks Johannes for suggesting this)
>>
>> With large folio swapin, I do see the large improvement when considering only
>> swapin performance and latency, in the same way as you saw in zram.
>> Maybe the right short-term approach is to have
>> /sys/kernel/mm/transparent_hugepage/swapin
>> and have that disabled by default to avoid regressions.
>> If the workload owner sees a benefit, they can enable it.
>> I can add this when sending the next version of large folio zswapin if that
>> makes sense?
>
> I would honestly prefer we avoid this if possible. It's always easy to
> just put features behind knobs, and then users have the toil of
> figuring out if/when they can use it, or just give up. We should find
> a way to avoid the thrashing due to hitting the memcg limit (or being
> under global memory pressure); it seems like something the kernel
> should be able to do on its own.
>
>> Longer term, I can try and have a look at whether we can do something with
>> workingset_restore_anon to improve things.
>
> I am not a big fan of this, mainly because reading a stat from the
> kernel puts us in a situation where we have to choose between:
> - Doing a memcg stats flush in the kernel, which is something we are
> trying to move away from due to various problems we have been running
> into.
> - Using potentially stale stats (up to 2s), which may be fine but is
> suboptimal at best. We may have blips of thrashing due to stale stats
> not showing the refaults.
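
For reference, below is a rough userspace sketch (not kernel code) of what I
mean by watching the rate of change of workingset_restore_anon: it samples the
counter from a cgroup v2 memory.stat once a second and prints the delta. The
cgroup path is made up and would need to point at the memcg under test. A
sustained high rate with the large folio swapin series applied, that drops
without it, would point at thrashing from the larger swapin granularity.

/*
 * Sketch: sample workingset_restore_anon from a cgroup v2 memory.stat
 * and report restores/second as a crude swap-thrashing signal.
 * STAT_PATH is an assumption; point it at the memcg under test.
 */
#include <stdio.h>
#include <unistd.h>

#define STAT_PATH "/sys/fs/cgroup/test/memory.stat"

static unsigned long long read_restore_anon(void)
{
	FILE *f = fopen(STAT_PATH, "r");
	char line[256];
	unsigned long long val = 0;

	if (!f)
		return 0;
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "workingset_restore_anon %llu", &val) == 1)
			break;
	}
	fclose(f);
	return val;
}

int main(void)
{
	unsigned long long prev = read_restore_anon();

	for (;;) {
		sleep(1);
		unsigned long long cur = read_restore_anon();

		/* restores per second since the last sample */
		printf("workingset_restore_anon rate: %llu/s\n", cur - prev);
		prev = cur;
	}
	return 0;
}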