Date: Wed, 30 Oct 2024 21:13:16 +0000
From: Usama Arif <usamaarif642@gmail.com>
Subject: Re: [PATCH RFC] mm: mitigate large folios usage and swap thrashing for nearly full memcg
To: Yosry Ahmed
Cc: Barry Song <21cnbao@gmail.com>, akpm@linux-foundation.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, Barry Song, Kanchana P Sridhar, David Hildenbrand,
 Baolin Wang, Chris Li, "Huang, Ying", Kairui Song, Ryan Roberts, Johannes Weiner,
 Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song
References: <20241027001444.3233-1-21cnbao@gmail.com>
 <33c5d5ca-7bc4-49dc-b1c7-39f814962ae0@gmail.com>
 <852211c6-0b55-4bdd-8799-90e1f0c002c1@gmail.com>

On 30/10/2024 21:01, Yosry Ahmed wrote:
> On Wed, Oct 30, 2024 at 1:25 PM Usama Arif wrote:
>>
>> On 30/10/2024 19:51, Yosry Ahmed wrote:
>>> [..]
>>>>> My second point about the mitigation is as follows: for a system (or
>>>>> memcg) under severe memory pressure, especially one without hardware TLB
>>>>> optimization, is enabling mTHP always the right choice? Since mTHP operates at
>>>>> a larger granularity, some internal fragmentation is unavoidable, regardless
>>>>> of optimization. Could the mitigation code help in automatically tuning
>>>>> this fragmentation?
>>>>>
>>>>
>>>> I agree with the point that always enabling mTHP is not the right thing to do
>>>> on all platforms. I also think it might be the case that enabling mTHP
>>>> is a good thing for some workloads, but enabling mTHP swapin along with
>>>> it might not be.
>>>>
>>>> As you said, when you have apps switching between foreground and background
>>>> on Android, it probably makes sense to have large folio swapping, as you
>>>> want to bring in all the pages from the background app as quickly as possible,
>>>> and you also get all the TLB optimizations and smaller LRU overhead once
>>>> you have brought in all the pages.
>>>> The Linux kernel build test doesn't really get to benefit from the TLB optimization
>>>> and smaller LRU overhead, as the pages are probably very short-lived. So I
>>>> think it doesn't show the benefit of large folio swapin properly, and
>>>> large folio swapin should probably be disabled for this kind of workload,
>>>> even though mTHP should be enabled.
>>>>
>>>> I am not sure that the approach we are trying in this patch is the right way:
>>>> - This patch makes it a memcg issue, but you could have memcg disabled, and
>>>>   then the mitigation being tried here won't apply.
>>>
>>> Is the problem reproducible without memcg? I imagine only if the
>>> entire system is under memory pressure. I guess we would want the same
>>> "mitigation" either way.
>>>
>> What would be a good open-source benchmark/workload to test without limiting memory
>> in memcg?
>> For the kernel build test, I can only get zswap activity to happen if I build
>> in a cgroup and limit memory.max.
>
> You mean a benchmark that puts the entire system under memory
> pressure? I am not sure, it ultimately depends on the size of memory
> you have, among other factors.
>
> What if you run the kernel build test in a VM? Then you can limit its
> size like a memcg, although you'd probably need to leave more room
> because the entire guest OS will also be subject to the same limit.
>

I had tried this, but the variance in the time/zswap numbers was very high, much
higher than the AMD numbers I posted in reply to Barry, so I found it very
difficult to make a comparison.

>>
>> I can just run zswap large folio zswapin in production and see, but that will take me a few
>> days.
>> tbh, running in prod is a much better test, and if there isn't any sort of thrashing,
>> then maybe it's not really an issue? I believe Barry doesn't see an issue on Android
>> phones (but please correct me if I am wrong), and if there isn't an issue in Meta
>> production either, that is a good data point for servers as well. And maybe
>> a kernel build in a 4G memcg is not a good test.
>
> If there is a regression in the kernel build, this means some
> workloads may be affected, even if Meta's prod isn't. I understand
> that the benchmark is not very representative of real-world workloads,
> but in this instance I think the thrashing problem surfaced by the
> benchmark is real.
>
>>
>>>> - Instead of this being a large folio swapin issue, is it more of a readahead
>>>>   issue? If we zswap (without the large folio swapin series) and change the window
>>>>   to 1 in swap_vma_readahead, we might see an improvement in Linux kernel build time
>>>>   when cgroup memory is limited, as readahead would probably cause swap thrashing as
>>>>   well.
>>>
>>> I think large folio swapin would make the problem worse anyway. I am
>>> also not sure if the readahead window adjusts on memory pressure or
>>> not.
>>>
>> The readahead window doesn't look at memory pressure. So maybe the same thing is
>> being seen here as there would be in swapin_readahead?
>
> Maybe readahead is not as aggressive in general as large folio
> swapins? Looking at swap_vma_ra_win(), it seems like the maximum order
> of the window is the smaller of page_cluster (2 or 3) and
> SWAP_RA_ORDER_CEILING (5).

Yes, I was seeing 8-page swapins (order 3) when testing, so it might be similar to
enabling 32K mTHP?

>
> Also, readahead will swap in 4K folios AFAICT, so we don't need a
> contiguous allocation like large folio swapin. So that could be
> another factor why readahead may not reproduce the problem.
>
>> Maybe if we check kernel build test
>> performance in a 4G memcg with the below diff, it might get better?
>
> I think you can use the page_cluster tunable to do this at runtime.
>
>>
>> diff --git a/mm/swap_state.c b/mm/swap_state.c
>> index 4669f29cf555..9e196e1e6885 100644
>> --- a/mm/swap_state.c
>> +++ b/mm/swap_state.c
>> @@ -809,7 +809,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
>>  	pgoff_t ilx;
>>  	bool page_allocated;
>>
>> -	win = swap_vma_ra_win(vmf, &start, &end);
>> +	win = 1;
>>  	if (win == 1)
>>  		goto skip;
>>
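
For reference, here is a rough userspace sketch of the window clamp described above
(sketch_max_ra_win is a hypothetical helper written only for illustration, not the
kernel's swap_vma_ra_win()): the VMA readahead window is bounded by the smaller of
page_cluster and SWAP_RA_ORDER_CEILING, so with the default page_cluster of 3 the
window tops out at 8 pages (order 3), roughly the footprint of a 32K mTHP swapin.

#include <stdio.h>

#define SWAP_RA_ORDER_CEILING	5	/* ceiling discussed in the thread */

/* hypothetical helper for illustration; not kernel code */
static unsigned int sketch_max_ra_win(unsigned int page_cluster)
{
	unsigned int order = page_cluster < SWAP_RA_ORDER_CEILING ?
			     page_cluster : SWAP_RA_ORDER_CEILING;

	return 1u << order;	/* maximum readahead window, in pages */
}

int main(void)
{
	/* vm.page-cluster defaults to 2 or 3; 0 disables swap readahead */
	for (unsigned int pc = 0; pc <= 5; pc++)
		printf("page_cluster=%u -> max window %u pages\n",
		       pc, sketch_max_ra_win(pc));
	return 0;
}

As noted above, the effect of the diff can also be approximated at runtime by
lowering the vm.page-cluster sysctl rather than hard-coding win = 1; setting it to
0 disables swap readahead entirely.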