From: Barry Song <21cnbao@gmail.com>
Date: Thu, 31 Oct 2024 09:48:13 +1300
Subject: Re: [PATCH RFC] mm: mitigate large folios usage and swap thrashing for nearly full memcg
To: Usama Arif
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Barry Song, Kanchana P Sridhar, David Hildenbrand, Baolin Wang, Chris Li, Yosry Ahmed, "Huang, Ying", Kairui Song, Ryan Roberts, Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song
On Thu, Oct 31, 2024 at 9:41 AM Usama Arif wrote:
>
> On 30/10/2024 20:27, Barry Song wrote:
> > On Thu, Oct 31, 2024 at 3:51 AM Usama Arif wrote:
> >>
> >> On 28/10/2024 22:03, Barry Song wrote:
> >>> On Mon, Oct 28, 2024 at 8:07 PM Usama Arif wrote:
> >>>>
> >>>> On 27/10/2024 01:14, Barry Song wrote:
> >>>>> From: Barry Song
> >>>>>
> >>>>> Always using mTHP in a memcg, even when it is at full capacity, might
> >>>>> not be the best option. Consider a system that uses only small folios:
> >>>>> after each reclamation, a process has at least SWAP_CLUSTER_MAX pages
> >>>>> of buffer space before it can initiate the next reclamation. However,
> >>>>> large folios can quickly fill this space, rapidly bringing the memcg
> >>>>> back to full capacity, even though some portions of the large folios
> >>>>> may not be immediately needed or used by the process.
> >>>>>
> >>>>> Usama and Kanchana identified a regression when building the kernel in
> >>>>> a memcg with memory.max set to a small value while enabling large
> >>>>> folio swap-in support on zswap[1].
> >>>>>
> >>>>> The issue arises from an edge case where the memory cgroup remains
> >>>>> nearly full most of the time. Consequently, bringing in mTHP can
> >>>>> quickly cause a memcg overflow, triggering a swap-out. The subsequent
> >>>>> swap-in then recreates the overflow, resulting in a repetitive cycle.
> >>>>>
> >>>>> We need a mechanism to stop the cup from overflowing continuously.
> >>>>> One potential solution is to slow the filling process when we identify
> >>>>> that the cup is nearly full.
> >>>>>
> >>>>> Usama reported an improvement when we mitigate mTHP swap-in as the
> >>>>> memcg approaches full capacity[2]:
> >>>>>
> >>>>> int mem_cgroup_swapin_charge_folio(...)
> >>>>> {
> >>>>> 	...
> >>>>> 	if (folio_test_large(folio) &&
> >>>>> 	    mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH,
> >>>>> 					   folio_nr_pages(folio)))
> >>>>> 		ret = -ENOMEM;
> >>>>> 	else
> >>>>> 		ret = charge_memcg(folio, memcg, gfp);
> >>>>> 	...
> >>>>> }
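> >>>>>
> >>>>> Here mem_cgroup_margin() is the existing helper that reports how many
> >>>>> base pages can still be charged before the limit is hit; simplified,
> >>>>> it is roughly the following (a sketch, not the verbatim kernel code):
> >>>>>
> >>>>> static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
> >>>>> {
> >>>>> 	unsigned long count = page_counter_read(&memcg->memory);
> >>>>> 	unsigned long limit = READ_ONCE(memcg->memory.max);
> >>>>>
> >>>>> 	/* Headroom in base pages; 0 when the memcg is full. */
> >>>>> 	return count < limit ? limit - count : 0;
> >>>>> }
> >>>>>
> >>>>> So the check above refuses a large-folio swap-in unless at least
> >>>>> max(MEMCG_CHARGE_BATCH, folio_nr_pages(folio)) pages of headroom
> >>>>> remain, preventing a single swap-in from instantly refilling a
> >>>>> nearly full memcg.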
> >>>>>
> >>>>> AMD 16K+32K THP=always
> >>>>> metric       mm-unstable   mm-unstable + large folio zswapin series   mm-unstable + large folio zswapin + no swap thrashing fix
> >>>>> real         1m23.038s     1m23.050s     1m22.704s
> >>>>> user         53m57.210s    53m53.437s    53m52.577s
> >>>>> sys          7m24.592s     7m48.843s     7m22.519s
> >>>>> zswpin       612070        999244        815934
> >>>>> zswpout      2226403       2347979       2054980
> >>>>> pgfault      20667366      20481728      20478690
> >>>>> pgmajfault   385887        269117        309702
> >>>>>
> >>>>> AMD 16K+32K+64K THP=always
> >>>>> metric       mm-unstable   mm-unstable + large folio zswapin series   mm-unstable + large folio zswapin + no swap thrashing fix
> >>>>> real         1m22.975s     1m23.266s     1m22.549s
> >>>>> user         53m51.302s    53m51.069s    53m46.471s
> >>>>> sys          7m40.168s     7m57.104s     7m25.012s
> >>>>> zswpin       676492        1258573       1225703
> >>>>> zswpout      2449839       2714767       2899178
> >>>>> pgfault      17540746      17296555      17234663
> >>>>> pgmajfault   429629        307495        287859
> >>>>>
> >>>>> I wonder if we can extend the mitigation to do_anonymous_page() as
> >>>>> well. Without hardware like AMD and ARM with hardware TLB coalescing
> >>>>> or CONT-PTE, I conducted a quick test on my Intel i9 workstation with
> >>>>> 10 cores and 2 threads per core. I enabled one 12 GiB zRAM while
> >>>>> running kernel builds in a memcg with memory.max set to 1 GiB.
> >>>>>
> >>>>> $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
> >>>>> $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
> >>>>> $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
> >>>>> $ echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
> >>>>>
> >>>>> $ time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \
> >>>>>   CROSS_COMPILE=aarch64-linux-gnu- Image -j10 1>/dev/null 2>/dev/null
> >>>>>
> >>>>>              disable-mTHP-swapin   mm-unstable   with-this-patch
> >>>>> Real:        6m54.595s             7m4.832s      6m45.811s
> >>>>> User:        66m42.795s            66m59.984s    67m21.150s
> >>>>> Sys:         12m7.092s             15m18.153s    12m52.644s
> >>>>> pswpin:      4262327               11723248      5918690
> >>>>> pswpout:     14883774              19574347      14026942
> >>>>> 64k-swpout:  624447                889384        480039
> >>>>> 32k-swpout:  115473                242288        73874
> >>>>> 16k-swpout:  158203                294672        109142
> >>>>> 64k-swpin:   0                     495869        159061
> >>>>> 32k-swpin:   0                     219977        56158
> >>>>> 16k-swpin:   0                     223501        81445
> >>>>>
> >>>>
> >>>
> >>> Hi Usama,
> >>>
> >>>> Hmm, both the user and sys time are worse with the patch compared to
> >>>> disable-mTHP-swapin. I wonder if the real time is an anomaly, and whether
> >>>> the real time might be worse as well if you repeat the experiment?
> >>>
> >>> Well, I've improved my script to include a loop:
> >>>
> >>> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
> >>> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
> >>> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
> >>> echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
> >>>
> >>> for ((i=1; i<=100; i++))
> >>> do
> >>>   echo "Executing round $i"
> >>>   make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- clean 1>/dev/null 2>/dev/null
> >>>   echo 3 > /proc/sys/vm/drop_caches
> >>>   time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \
> >>>     CROSS_COMPILE=aarch64-linux-gnu- vmlinux -j15 1>/dev/null 2>/dev/null
> >>>   cat /proc/vmstat | grep pswp
> >>>   echo -n 64k-swpout: ; cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout
> >>>   echo -n 32k-swpout: ; cat /sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpout
> >>>   echo -n 16k-swpout: ; cat /sys/kernel/mm/transparent_hugepage/hugepages-16kB/stats/swpout
> >>>   echo -n 64k-swpin: ; cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpin
> >>>   echo -n 32k-swpin: ; cat /sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpin
> >>>   echo -n 16k-swpin: ; cat /sys/kernel/mm/transparent_hugepage/hugepages-16kB/stats/swpin
> >>> done
> >>>
> >>> I've noticed that the user/sys/real time on my i9 machine fluctuates
> >>> constantly, with results like:
> >>> real 6m52.087s
> >>> user 67m12.463s
> >>> sys  13m8.281s
> >>> ...
> >>> real 7m42.937s
> >>> user 66m55.250s
> >>> sys  12m56.330s
> >>> ...
> >>> real 6m49.374s
> >>> user 66m37.040s
> >>> sys  12m44.542s
> >>> ...
> >>> real 6m54.205s
> >>> user 65m49.732s
> >>> sys  11m33.078s
> >>> ...
> >>>
> >>> likely due to unstable temperatures and I/O latency. As a result, my
> >>> data doesn't seem reference-worthy.
> >>>
> >>
> >> So I had suggested retrying the experiment to see how reproducible it is,
> >> but I had not done that myself!
> >> Thanks for sharing this. I tried many times on the AMD server and I see
> >> varying numbers as well.
> >>
> >> AMD 16K THP always, cgroup = 4G, large folio zswapin patches
> >> real 1m28.351s
> >> user 54m14.476s
> >> sys 8m46.596s
> >> zswpin 811693
> >> zswpout 2137310
> >> pgfault 27344671
> >> pgmajfault 290510
> >> ..
> >> real 1m24.557s
> >> user 53m56.815s
> >> sys 8m10.200s
> >> zswpin 571532
> >> zswpout 1645063
> >> pgfault 26989075
> >> pgmajfault 205177
> >> ..
> >> real 1m26.083s
> >> user 54m5.303s
> >> sys 9m55.247s
> >> zswpin 1176292
> >> zswpout 2910825
> >> pgfault 27286835
> >> pgmajfault 419746
> >>
> >> The sys time especially can vary by large amounts. I think you see the same.
> >>
> >>> As phone engineers, we never use phones to run kernel builds. I'm also
> >>> quite certain that phones won't provide stable and reliable data for this
> >>> type of workload. Without access to a Linux server to conduct the test,
> >>> I really need your help.
> >>>
> >>> I used to work on optimizing the ARM server scheduler and memory
> >>> management, and I really miss that machine I had until three years ago :-)
> >>>
> >>>>
> >>>>> I need Usama's assistance to identify a suitable patch, as I lack
> >>>>> access to hardware such as AMD machines and ARM servers with TLB
> >>>>> optimization.
> >>>>>
> >>>>> [1] https://lore.kernel.org/all/b1c17b5e-acd9-4bef-820e-699768f1426d@gmail.com/
> >>>>> [2] https://lore.kernel.org/all/7a14c332-3001-4b9a-ada3-f4d6799be555@gmail.com/
> >>>>>
> >>>>> Cc: Kanchana P Sridhar
> >>>>> Cc: Usama Arif
> >>>>> Cc: David Hildenbrand
> >>>>> Cc: Baolin Wang
> >>>>> Cc: Chris Li
> >>>>> Cc: Yosry Ahmed
> >>>>> Cc: "Huang, Ying"
> >>>>> Cc: Kairui Song
> >>>>> Cc: Ryan Roberts
> >>>>> Cc: Johannes Weiner
> >>>>> Cc: Michal Hocko
> >>>>> Cc: Roman Gushchin
> >>>>> Cc: Shakeel Butt
> >>>>> Cc: Muchun Song
> >>>>> Signed-off-by: Barry Song
> >>>>> ---
> >>>>>  include/linux/memcontrol.h |  9 ++++++++
> >>>>>  mm/memcontrol.c            | 45 ++++++++++++++++++++++++++++++++++
> >>>>>  mm/memory.c                | 17 ++++++++++++++
> >>>>>  3 files changed, 71 insertions(+)
> >>>>>
> >>>>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> >>>>> index 524006313b0d..8bcc8f4af39f 100644
> >>>>> --- a/include/linux/memcontrol.h
> >>>>> +++ b/include/linux/memcontrol.h
> >>>>> @@ -697,6 +697,9 @@ static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm,
> >>>>>  int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
> >>>>>  		long nr_pages);
> >>>>>
> >>>>> +int mem_cgroup_precharge_large_folio(struct mm_struct *mm,
> >>>>> +		swp_entry_t *entry);
> >>>>> +
> >>>>>  int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
> >>>>>  		gfp_t gfp, swp_entry_t entry);
> >>>>>
> >>>>> @@ -1201,6 +1204,12 @@ static inline int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg,
> >>>>>  	return 0;
> >>>>>  }
> >>>>>
> >>>>> +static inline int mem_cgroup_precharge_large_folio(struct mm_struct *mm,
> >>>>> +		swp_entry_t *entry)
> >>>>> +{
> >>>>> +	return 0;
> >>>>> +}
> >>>>> +
> >>>>>  static inline int mem_cgroup_swapin_charge_folio(struct folio *folio,
> >>>>>  		struct mm_struct *mm, gfp_t gfp, swp_entry_t entry)
> >>>>>  {
> >>>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >>>>> index 17af08367c68..f3d92b93ea6d 100644
> >>>>> --- a/mm/memcontrol.c
> >>>>> +++ b/mm/memcontrol.c
> >>>>> @@ -4530,6 +4530,51 @@ int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
> >>>>>  	return 0;
> >>>>>  }
> >>>>>
> >>>>> +static inline bool mem_cgroup_has_margin(struct mem_cgroup *memcg)
> >>>>> +{
> >>>>> +	for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
> >>>>> +		if (mem_cgroup_margin(memcg) < HPAGE_PMD_NR)
> >>>>
> >>>> There might be 3 issues with the approach:
> >>>>
> >>>> It's a very big margin. Let's say you have ARM64_64K_PAGES and you have
> >>>> 256K THP set to always. As HPAGE_PMD_SIZE is 512M with 64K pages, you are
> >>>> basically saying you need 512M of free memory to swap in just 256K?
> >>>
> >>> Right, sorry for the noisy code. I was just thinking about 4KB pages
> >>> and wondering if we could simplify the code.
> >>>
> >>>> It's an uneven margin for different folio sizes.
> >>>> For a 16K folio swapin, you are checking if there is margin for 128
> >>>> folios, but for a 1M folio swapin, you are checking if there is margin
> >>>> for just 2 folios.
> >>>>
> >>>> Maybe it might be better to make this dependent on some factor of
> >>>> folio_nr_pages? (See the sketch below.)
> >>>
> >>> Agreed. This is similar to what we discussed regarding your zswap mTHP
> >>> swap-in series:
> >>>
> >>> int mem_cgroup_swapin_charge_folio(...)
> >>> {
> >>> 	...
> >>> 	if (folio_test_large(folio) &&
> >>> 	    mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH,
> >>> 					   folio_nr_pages(folio)))
> >>> 		ret = -ENOMEM;
> >>> 	else
> >>> 		ret = charge_memcg(folio, memcg, gfp);
> >>> 	...
> >>> }
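> >>>
> >>> A uniform, size-proportional version of that check might look like the
> >>> sketch below (MARGIN_FACTOR is an invented tunable, not an existing
> >>> kernel symbol; the same rule would then apply to every folio order,
> >>> 4K included):
> >>>
> >>> /* Hypothetical: demand headroom proportional to the folio size. */
> >>> #define MARGIN_FACTOR	2
> >>>
> >>> int mem_cgroup_swapin_charge_folio(...)
> >>> {
> >>> 	...
> >>> 	/* Same headroom rule for every order, scaled by folio size. */
> >>> 	if (mem_cgroup_margin(memcg) <
> >>> 	    MARGIN_FACTOR * folio_nr_pages(folio))
> >>> 		ret = -ENOMEM;
> >>> 	else
> >>> 		ret = charge_memcg(folio, memcg, gfp);
> >>> 	...
> >>> }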
> >>>
> >>> As someone focused on phones, my challenge is the absence of stable
> >>> platforms to benchmark this type of workload. If possible, Usama, I would
> >>> greatly appreciate it if you could take the lead on the patch.
> >>>
> >>>> As Johannes pointed out, the charging code already does the margin check.
> >>>> So for 4K, the check just checks if there is 4K available, but for 16K it
> >>>> checks if a lot more than 16K is available. Maybe there should be a
> >>>> similar policy for all? I guess this is similar to my 2nd point, but it
> >>>> just considers 4K folios as well.
> >>>
> >>> I don't think the charging code performs a margin check. It simply tries
> >>> to charge the specified nr_pages (whether 1 or more). If nr_pages are
> >>> available, the charge proceeds; otherwise, if the GFP flags allow
> >>> blocking, it triggers memory reclamation to reclaim
> >>> max(SWAP_CLUSTER_MAX, nr_pages) base pages.
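> >>>
> >>> Condensed, that path looks roughly like this (a sketch of the behaviour
> >>> described above, not the verbatim mm/memcontrol.c code):
> >>>
> >>> /* try_charge_memcg(), heavily condensed */
> >>> if (page_counter_try_charge(&memcg->memory, nr_pages, &counter))
> >>> 	return 0;	/* enough headroom: the charge just succeeds */
> >>> if (!gfpflags_allow_blocking(gfp_mask))
> >>> 	return -ENOMEM;	/* non-blockable: no reclaim, caller falls back */
> >>> /* Blockable: direct reclaim; vmscan raises the scan target to
> >>>  * max(nr_pages, SWAP_CLUSTER_MAX) base pages. */
> >>> nr_reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages,
> >>> 					    gfp_mask, reclaim_options);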
> >>
> >> So if you have defrag not set to always, it will not trigger reclamation.
> >> I think that is a bigger use case, i.e. defrag=madvise,defer,etc. is
> >> probably used much more than always.
> >>
> >> In the current code, in that case, try_charge_memcg will return -ENOMEM
> >> all the way to mem_cgroup_swapin_charge_folio, and alloc_swap_folio will
> >> then try the next order. So even though it might not be calling the
> >> mem_cgroup_margin function, it is kind of doing the same?
> >>
> >>> If, after reclamation, we have exactly SWAP_CLUSTER_MAX pages available,
> >>> a large folio with nr_pages == SWAP_CLUSTER_MAX will successfully charge,
> >>> immediately filling the memcg.
> >>>
> >>> Shortly after, smaller folios, typically with blockable GFP, will quickly
> >>> trigger additional reclamation. While nr_pages - 1 subpages of the large
> >>> folio may not be immediately needed, they still occupy enough space to
> >>> fill the memcg to capacity.
> >>>
> >>> My second point about the mitigation is as follows: for a system (or
> >>> memcg) under severe memory pressure, especially one without hardware TLB
> >>> optimization, is enabling mTHP always the right choice? Since mTHP
> >>> operates at a larger granularity, some internal fragmentation is
> >>> unavoidable, regardless of optimization. Could the mitigation code help
> >>> in automatically tuning this fragmentation?
> >>
> >> I agree with the point that enabling mTHP always is not the right thing
> >> to do on all platforms. I also think it might be the case that enabling
> >> mTHP is a good thing for some workloads, but enabling mTHP swapin along
> >> with it might not be.
> >>
> >> As you said, when you have apps switching between foreground and
> >> background in Android, it probably makes sense to have large folio
> >> swapping, as you want to bring in all the pages from the background app
> >> as quickly as possible, and you also get all the TLB optimizations and
> >> the smaller LRU overhead once you have brought in all the pages.
> >> A Linux kernel build test doesn't really get to benefit from the TLB
> >> optimization and smaller LRU overhead, as the pages are probably very
> >> short-lived. So I think it doesn't show the benefit of large folio
> >> swapin properly, and large folio swapin should probably be disabled for
> >> this kind of workload, even though mTHP should be enabled.
> >
> > I'm not entirely sure if this applies to platforms without TLB
> > optimization, especially in the absence of swap. In a memory-limited
> > cgroup without swap, would mTHP still cause significant thrashing of
> > file-backed folios? When a large swap file is present, the inability to
> > swap in mTHP seems to act as a workaround for fragmentation, allowing
> > fragmented pages of the original mTHP from do_anonymous_page() to
> > remain in swap.
> >
> >> I am not sure that the approach we are trying in this patch is the right way:
> >> - This patch makes it a memcg issue, but you could have memcg disabled,
> >>   and then the mitigation being tried here won't apply.
> >> - Instead of this being a large folio swapin issue, is it more of a
> >>   readahead issue? If we use zswap (without the large folio swapin
> >>   series) and change the window to 1 in swap_vma_readahead, we might see
> >>   an improvement in Linux kernel build time when cgroup memory is
> >>   limited, as readahead would probably cause swap thrashing as well.
> >> - Instead of looking at cgroup margin, maybe we should try to look at
> >>   the rate of change of workingset_restore_anon? This might be a lot
> >>   more complicated to do, but it is probably the right metric to
> >>   determine swap thrashing. It also means that this could be used in
> >>   both the synchronous swapcache skipping path and the swapin_readahead
> >>   path. (Thanks Johannes for suggesting this.)
> >>
> >> With the large folio swapin, I do see the large improvement when
> >> considering only swapin performance and latency, in the same way as you
> >> saw in zram. Maybe the right short-term approach is to have
> >> /sys/kernel/mm/transparent_hugepage/swapin
> >> and have that disabled by default to avoid regression.
> >
> > A crucial component is still missing: managing the compression and
> > decompression of multiple pages as a larger block. This could
> > significantly reduce system time and potentially resolve the kernel
> > build issue within a small memory cgroup, even with swap thrashing.
> >
> > I'll send an update ASAP so you can rebase for zswap.
>
> Did you mean https://lore.kernel.org/all/20241021232852.4061-1-21cnbao@gmail.com/?
> That won't benefit zswap, right?

That's right. I assume we can also make it work with zswap?

> I actually had a few questions about it. Mainly that the benefit comes if
> the page fault happens on page 0 of the large folio. But if the page fault
> happens on any other page, let's say page 1 of a 64K folio, then it will
> decompress the entire 64K chunk and just copy page 1? (The memcpy in
> zram_bvec_read_multi_pages_partial.)
> Could that cause a regression, as you have to decompress a large chunk
> just to get one 4K page?
> If we assume a uniform distribution of page faults, maybe it could make
> things worse?
>
> I probably should ask all of this in that thread.

With mTHP swap-in, a page fault on any page behaves the same as a fault
on page 0. Without mTHP swap-in, there is also no difference between
faults on page 0 and faults on other pages: a fault on any page means
that the entire block is decompressed. The only difference is that we
don't partially copy one page when mTHP swap-in is present.

> >> If the workload owner sees a benefit, they can enable it.
> >> I can add this when sending the next version of large folio zswapin,
> >> if that makes sense?
> >> Longer term, I can try to have a look at whether we can do something
> >> with workingset_restore_anon to improve things.
> >>
> >> Thanks,
> >> Usama
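For the longer-term workingset_restore_anon idea, something along these
lines might be a starting point (purely a sketch: the sampling field and
threshold below are invented, and only memcg_page_state() and the
WORKINGSET_RESTORE_ANON stat item exist today):

/*
 * Hypothetical detector: treat a fast-growing anon restore counter as
 * evidence of swap thrashing and back off mTHP swap-in while it lasts.
 */
static bool memcg_swapin_thrashing(struct mem_cgroup *memcg)
{
	/* Cumulative refaults of recently reclaimed anon pages. */
	unsigned long restores = memcg_page_state(memcg,
						  WORKINGSET_RESTORE_ANON);
	/* Growth since the last sample; last_restores is an invented field. */
	unsigned long rate = restores - READ_ONCE(memcg->last_restores);

	WRITE_ONCE(memcg->last_restores, restores);

	/* A fast-rising restore counter means recently reclaimed anon pages
	 * are being faulted straight back in: swap thrashing.
	 * THRASH_RESTORE_THRESHOLD is an invented tunable. */
	return rate > THRASH_RESTORE_THRESHOLD;
}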
Thanks
Barry