Message-ID: <852211c6-0b55-4bdd-8799-90e1f0c002c1@gmail.com>
Date: Wed, 30 Oct 2024 20:25:27 +0000
Subject: Re: [PATCH RFC] mm: mitigate large folios usage and swap thrashing for nearly full memcg
From: Usama Arif <usamaarif642@gmail.com>
To: Yosry Ahmed
Cc: Barry Song <21cnbao@gmail.com>, akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Barry Song, Kanchana P Sridhar, David Hildenbrand, Baolin Wang, Chris Li, "Huang, Ying", Kairui Song, Ryan Roberts, Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song
References: <20241027001444.3233-1-21cnbao@gmail.com> <33c5d5ca-7bc4-49dc-b1c7-39f814962ae0@gmail.com>
On 30/10/2024 19:51, Yosry Ahmed wrote:
> [..]
>>> My second point about the mitigation is as follows: For a system (or
>>> memcg) under severe memory pressure, especially one without hardware TLB
>>> optimization, is enabling mTHP always the right choice? Since mTHP operates at
>>> a larger granularity, some internal fragmentation is unavoidable, regardless
>>> of optimization. Could the mitigation code help in automatically tuning
>>> this fragmentation?
>>>
>>
>> I agree with the point that always enabling mTHP is not the right thing to do
>> on all platforms. I also think it might be the case that enabling mTHP
>> is a good thing for some workloads, but enabling mTHP swapin along with
>> it might not be.
>>
>> As you said, when you have apps switching between foreground and background
>> in Android, it probably makes sense to have large folio swapping, as you
>> want to bring in all the pages from the background app as quickly as possible.
>> You also get all the TLB optimizations and the smaller LRU overhead after
>> you have brought in all the pages.
>> The Linux kernel build test doesn't really get to benefit from the TLB
>> optimization and smaller LRU overhead, as the pages are probably very
>> short-lived. So I think it doesn't show the benefit of large folio swapin
>> properly, and large folio swapin should probably be disabled for this kind
>> of workload, even though mTHP should be enabled.
>>
>> I am not sure that the approach we are trying in this patch is the right way:
>> - This patch makes it a memcg issue, but you could have memcg disabled and
>> then the mitigation being tried here won't apply.
>
> Is the problem reproducible without memcg? I imagine only if the
> entire system is under memory pressure. I guess we would want the same
> "mitigation" either way.
>

What would be a good open source benchmark/workload to test without limiting
memory in memcg? For the kernel build test, I can only get zswap activity to
happen if I build in a cgroup and limit memory.max.

I can just run large folio zswapin in production and see, but that will take
me a few days. TBH, running in prod is a much better test, and if there isn't
any sort of thrashing, then maybe it's not really an issue? I believe Barry
doesn't see an issue on Android phones (but please correct me if I am wrong),
and if there isn't an issue in Meta production as well, that is a good data
point for servers too. And maybe a kernel build in a 4G memcg is not a good
test.

>> - Instead of this being a large folio swapin issue, is it more of a readahead
>> issue? If we zswap (without the large folio swapin series) and change the window
>> to 1 in swap_vma_readahead, we might see an improvement in Linux kernel build time
>> when cgroup memory is limited, as readahead would probably cause swap thrashing as
>> well.
>
> I think large folio swapin would make the problem worse anyway. I am
> also not sure if the readahead window adjusts on memory pressure or
> not.
>

The readahead window doesn't look at memory pressure. So maybe the same thing
is being seen here as there would be in swapin_readahead? Maybe if we check
kernel build test performance in a 4G memcg with the diff below, it might get
better?

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 4669f29cf555..9e196e1e6885 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -809,7 +809,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
 	pgoff_t ilx;
 	bool page_allocated;
 
-	win = swap_vma_ra_win(vmf, &start, &end);
+	win = 1;
 	if (win == 1)
 		goto skip;

>> - Instead of looking at cgroup margin, maybe we should try and look at
>> the rate of change of workingset_restore_anon? This might be a lot more
>> complicated to do, but it is probably the right metric to determine swap
>> thrashing. It also means that this could be used in both the synchronous
>> swapcache skipping path and the swapin_readahead path.
>> (Thanks Johannes for suggesting this)
>>
>> With large folio swapin, I do see the large improvement when considering only
>> swapin performance and latency, in the same way as you saw in zram.
>> Maybe the right short-term approach is to have
>> /sys/kernel/mm/transparent_hugepage/swapin
>> and have that disabled by default to avoid regressions.
>> If the workload owner sees a benefit, they can enable it.
>> I can add this when sending the next version of large folio zswapin if that
>> makes sense?
>
> I would honestly prefer we avoid this if possible. It's always easy to
> just put features behind knobs, and then users have the toil of
> figuring out if/when they can use it, or just give up. We should find
> a way to avoid the thrashing due to hitting the memcg limit (or being
> under global memory pressure); it seems like something the kernel
> should be able to do on its own.
>
>> Longer term, I can try and have a look at whether we can do something with
>> workingset_restore_anon to improve things.
>
> I am not a big fan of this, mainly because reading a stat from the
> kernel puts us in a situation where we have to choose between:
> - Doing a memcg stats flush in the kernel, which is something we are
> trying to move away from due to various problems we have been running
> into.
> - Using potentially stale stats (up to 2s), which may be fine but is
> suboptimal at best. We may have blips of thrashing due to stale stats
> not showing the refaults.
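
For reference, below is a rough userspace sketch (not kernel code) of what I
mean by watching the rate of change of workingset_restore_anon: it samples the
counter from a cgroup v2 memory.stat once a second and prints the delta. The
cgroup path is made up and would need to point at the memcg under test. A
sustained high rate with the large folio swapin series applied, that drops
without it, would point at thrashing from the larger swapin granularity.

/*
 * Sketch: sample workingset_restore_anon from a cgroup v2 memory.stat
 * and report restores/second as a crude swap-thrashing signal.
 * STAT_PATH is an assumption; point it at the memcg under test.
 */
#include <stdio.h>
#include <unistd.h>

#define STAT_PATH "/sys/fs/cgroup/test/memory.stat"

static unsigned long long read_restore_anon(void)
{
	FILE *f = fopen(STAT_PATH, "r");
	char line[256];
	unsigned long long val = 0;

	if (!f)
		return 0;
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "workingset_restore_anon %llu", &val) == 1)
			break;
	}
	fclose(f);
	return val;
}

int main(void)
{
	unsigned long long prev = read_restore_anon();

	for (;;) {
		sleep(1);
		unsigned long long cur = read_restore_anon();

		/* restores per second since the last sample */
		printf("workingset_restore_anon rate: %llu/s\n", cur - prev);
		prev = cur;
	}
	return 0;
}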