From: Barry Song <21cnbao@gmail.com>
Date: Thu, 31 Oct 2024 10:08:45 +1300
Subject: Re: [PATCH RFC] mm: mitigate large folios usage and swap thrashing for nearly full memcg
To: Usama Arif
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
    Barry Song, Kanchana P Sridhar, David Hildenbrand, Baolin Wang, Chris Li,
    Yosry Ahmed, "Huang, Ying", Kairui Song, Ryan Roberts, Johannes Weiner,
    Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song
In-Reply-To: <03b37d84-c167-48f2-9c18-24268b0e73e2@gmail.com>
References: <20241027001444.3233-1-21cnbao@gmail.com>
    <33c5d5ca-7bc4-49dc-b1c7-39f814962ae0@gmail.com>
    <03b37d84-c167-48f2-9c18-24268b0e73e2@gmail.com>

On Thu, Oct 31, 2024 at 10:00 AM Usama Arif wrote:
>
> On 30/10/2024 20:48, Barry Song wrote:
> > On Thu, Oct 31, 2024 at 9:41 AM Usama Arif wrote:
> >>
> >> On 30/10/2024 20:27, Barry Song wrote:
> >>> On Thu, Oct 31, 2024 at 3:51 AM Usama Arif wrote:
> >>>>
> >>>> On 28/10/2024 22:03, Barry Song wrote:
> >>>>> On Mon, Oct 28, 2024 at 8:07 PM Usama Arif wrote:
> >>>>>>
> >>>>>> On 27/10/2024 01:14, Barry Song wrote:
> >>>>>>> From: Barry Song
> >>>>>>>
> >>>>>>> In a memcg where mTHP is always utilized, even at full capacity, it
> >>>>>>> might not be the best option. Consider a system that uses only small
> >>>>>>> folios: after each reclamation, a process has at least SWAP_CLUSTER_MAX
> >>>>>>> pages of buffer space before it can initiate the next reclamation. However,
> >>>>>>> large folios can quickly fill this space, rapidly bringing the memcg
> >>>>>>> back to full capacity, even though some portions of the large folios
> >>>>>>> may not be immediately needed and used by the process.
> >>>>>>>
> >>>>>>> Usama and Kanchana identified a regression when building the kernel in
> >>>>>>> a memcg with memory.max set to a small value while enabling large
> >>>>>>> folio swap-in support on zswap [1].
> >>>>>>>
> >>>>>>> The issue arises from an edge case where the memory cgroup remains
> >>>>>>> nearly full most of the time. Consequently, bringing in mTHP can
> >>>>>>> quickly cause a memcg overflow, triggering a swap-out. The subsequent
> >>>>>>> swap-in then recreates the overflow, resulting in a repetitive cycle.
> >>>>>>>
> >>>>>>> We need a mechanism to stop the cup from overflowing continuously.
> >>>>>>> One potential solution is to slow the filling process when we identify
> >>>>>>> that the cup is nearly full.
> >>>>>>>
> >>>>>>> Usama reported an improvement when we mitigate mTHP swap-in as the
> >>>>>>> memcg approaches full capacity [2]:
> >>>>>>>
> >>>>>>> int mem_cgroup_swapin_charge_folio(...)
> >>>>>>> {
> >>>>>>>         ...
> >>>>>>>         if (folio_test_large(folio) &&
> >>>>>>>             mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, folio_nr_pages(folio)))
> >>>>>>>                 ret = -ENOMEM;
> >>>>>>>         else
> >>>>>>>                 ret = charge_memcg(folio, memcg, gfp);
> >>>>>>>         ...
> >>>>>>> }
> >>>>>>>
> >>>>>>> AMD 16K+32K THP=always
> >>>>>>> metric       mm-unstable   mm-unstable + large folio zswapin series   mm-unstable + large folio zswapin + no swap thrashing fix
> >>>>>>> real         1m23.038s     1m23.050s                                  1m22.704s
> >>>>>>> user         53m57.210s    53m53.437s                                 53m52.577s
> >>>>>>> sys          7m24.592s     7m48.843s                                  7m22.519s
> >>>>>>> zswpin       612070        999244                                     815934
> >>>>>>> zswpout      2226403       2347979                                    2054980
> >>>>>>> pgfault      20667366      20481728                                   20478690
> >>>>>>> pgmajfault   385887        269117                                     309702
> >>>>>>>
> >>>>>>> AMD 16K+32K+64K THP=always
> >>>>>>> metric       mm-unstable   mm-unstable + large folio zswapin series   mm-unstable + large folio zswapin + no swap thrashing fix
> >>>>>>> real         1m22.975s     1m23.266s                                  1m22.549s
> >>>>>>> user         53m51.302s    53m51.069s                                 53m46.471s
> >>>>>>> sys          7m40.168s     7m57.104s                                  7m25.012s
> >>>>>>> zswpin       676492        1258573                                    1225703
> >>>>>>> zswpout      2449839       2714767                                    2899178
> >>>>>>> pgfault      17540746      17296555                                   17234663
> >>>>>>> pgmajfault   429629        307495                                     287859
> >>>>>>>
> >>>>>>> I wonder if we can extend the mitigation to do_anonymous_page() as
> >>>>>>> well. Without hardware like AMD and ARM with hardware TLB coalescing
> >>>>>>> or CONT-PTE, I conducted a quick test on my Intel i9 workstation with
> >>>>>>> 10 cores and 2 threads. I enabled one 12 GiB zRAM while running kernel
> >>>>>>> builds in a memcg with memory.max set to 1 GiB.
> >>>>>>>
> >>>>>>> $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
> >>>>>>> $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
> >>>>>>> $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
> >>>>>>> $ echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
> >>>>>>>
> >>>>>>> $ time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \
> >>>>>>>   CROSS_COMPILE=aarch64-linux-gnu- Image -j10 1>/dev/null 2>/dev/null
> >>>>>>>
> >>>>>>>              disable-mTHP-swapin   mm-unstable   with-this-patch
> >>>>>>> Real:        6m54.595s             7m4.832s      6m45.811s
> >>>>>>> User:        66m42.795s            66m59.984s    67m21.150s
> >>>>>>> Sys:         12m7.092s             15m18.153s    12m52.644s
> >>>>>>> pswpin:      4262327               11723248      5918690
> >>>>>>> pswpout:     14883774              19574347      14026942
> >>>>>>> 64k-swpout:  624447                889384        480039
> >>>>>>> 32k-swpout:  115473                242288        73874
> >>>>>>> 16k-swpout:  158203                294672        109142
> >>>>>>> 64k-swpin:   0                     495869        159061
> >>>>>>> 32k-swpin:   0                     219977        56158
> >>>>>>> 16k-swpin:   0                     223501        81445
> >>>>>>>
> >>>>>
> >>>>> Hi Usama,
> >>>>>
> >>>>>> hmm, both the user and sys time are worse with the patch compared to
> >>>>>> disable-mTHP-swapin. I wonder if the real time is an anomaly, and if you
> >>>>>> repeat the experiment the real time might be worse as well?
> >>>>>
> >>>>> Well, I've improved my script to include a loop:
> >>>>>
> >>>>> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
> >>>>> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
> >>>>> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
> >>>>> echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
> >>>>>
> >>>>> for ((i=1; i<=100; i++))
> >>>>> do
> >>>>>         echo "Executing round $i"
> >>>>>         make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- clean 1>/dev/null 2>/dev/null
> >>>>>         echo 3 > /proc/sys/vm/drop_caches
> >>>>>         time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \
> >>>>>                 CROSS_COMPILE=aarch64-linux-gnu- vmlinux -j15 1>/dev/null 2>/dev/null
> >>>>>         cat /proc/vmstat | grep pswp
> >>>>>         echo -n 64k-swpout: ; cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout
> >>>>>         echo -n 32k-swpout: ; cat /sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpout
> >>>>>         echo -n 16k-swpout: ; cat /sys/kernel/mm/transparent_hugepage/hugepages-16kB/stats/swpout
> >>>>>         echo -n 64k-swpin: ; cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpin
> >>>>>         echo -n 32k-swpin: ; cat /sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpin
> >>>>>         echo -n 16k-swpin: ; cat /sys/kernel/mm/transparent_hugepage/hugepages-16kB/stats/swpin
> >>>>> done
> >>>>>
> >>>>> I've noticed that the user/sys/real times on my i9 machine fluctuate
> >>>>> constantly, with results like:
> >>>>>
> >>>>> real 6m52.087s
> >>>>> user 67m12.463s
> >>>>> sys 13m8.281s
> >>>>> ...
> >>>>> real 7m42.937s
> >>>>> user 66m55.250s
> >>>>> sys 12m56.330s
> >>>>> ...
> >>>>> real 6m49.374s
> >>>>> user 66m37.040s
> >>>>> sys 12m44.542s
> >>>>> ...
> >>>>> real 6m54.205s
> >>>>> user 65m49.732s
> >>>>> sys 11m33.078s
> >>>>> ...
> >>>>>
> >>>>> likely due to unstable temperatures and I/O latency. As a result, my
> >>>>> data doesn't seem reference-worthy.
> >>>>>
> >>>>
> >>>> So I had suggested retrying the experiment to see how reproducible it is,
> >>>> but had not done that myself!
> >>>> Thanks for sharing this. I tried many times on the AMD server and I see
> >>>> varying numbers as well.
> >>>>
> >>>> AMD 16K THP always, cgroup = 4G, large folio zswapin patches
> >>>> real 1m28.351s
> >>>> user 54m14.476s
> >>>> sys 8m46.596s
> >>>> zswpin 811693
> >>>> zswpout 2137310
> >>>> pgfault 27344671
> >>>> pgmajfault 290510
> >>>> ..
> >>>> real 1m24.557s
> >>>> user 53m56.815s
> >>>> sys 8m10.200s
> >>>> zswpin 571532
> >>>> zswpout 1645063
> >>>> pgfault 26989075
> >>>> pgmajfault 205177
> >>>> ..
> >>>> real 1m26.083s
> >>>> user 54m5.303s
> >>>> sys 9m55.247s
> >>>> zswpin 1176292
> >>>> zswpout 2910825
> >>>> pgfault 27286835
> >>>> pgmajfault 419746
> >>>>
> >>>> The sys time especially can vary by large amounts. I think you see the same.
> >>>>
> >>>>> As phone engineers, we never use phones to run kernel builds. I'm also
> >>>>> quite certain that phones won't provide stable and reliable data for this
> >>>>> type of workload. Without access to a Linux server to conduct the test,
> >>>>> I really need your help.
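
(As an aside: a quick way to quantify the run-to-run spread, assuming the
output of the loop above was captured to a file; "build.log" is a
hypothetical name, and the parsing assumes bash's "real XmY.ZZZs" format:)

grep '^real' build.log | sed 's/real[[:space:]]*//;s/s$//' | \
        awk -F'm' '{t = $1 * 60 + $2; s += t; ss += t * t; n++}
                END {m = s / n; printf "mean %.1fs, stddev %.1fs over %d runs\n",
                        m, sqrt(ss / n - m * m), n}'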
> >>>>>
> >>>>> I used to work on optimizing the ARM server scheduler and memory
> >>>>> management, and I really miss the machine I had until three years ago :-)
> >>>>>
> >>>>>>> I need Usama's assistance to identify a suitable patch, as I lack
> >>>>>>> access to hardware such as AMD machines and ARM servers with TLB
> >>>>>>> optimization.
> >>>>>>>
> >>>>>>> [1] https://lore.kernel.org/all/b1c17b5e-acd9-4bef-820e-699768f1426d@gmail.com/
> >>>>>>> [2] https://lore.kernel.org/all/7a14c332-3001-4b9a-ada3-f4d6799be555@gmail.com/
> >>>>>>>
> >>>>>>> Cc: Kanchana P Sridhar
> >>>>>>> Cc: Usama Arif
> >>>>>>> Cc: David Hildenbrand
> >>>>>>> Cc: Baolin Wang
> >>>>>>> Cc: Chris Li
> >>>>>>> Cc: Yosry Ahmed
> >>>>>>> Cc: "Huang, Ying"
> >>>>>>> Cc: Kairui Song
> >>>>>>> Cc: Ryan Roberts
> >>>>>>> Cc: Johannes Weiner
> >>>>>>> Cc: Michal Hocko
> >>>>>>> Cc: Roman Gushchin
> >>>>>>> Cc: Shakeel Butt
> >>>>>>> Cc: Muchun Song
> >>>>>>> Signed-off-by: Barry Song
> >>>>>>> ---
> >>>>>>>  include/linux/memcontrol.h |  9 ++++++++
> >>>>>>>  mm/memcontrol.c            | 45 ++++++++++++++++++++++++++++++++++
> >>>>>>>  mm/memory.c                | 17 ++++++++++++++
> >>>>>>>  3 files changed, 71 insertions(+)
> >>>>>>>
> >>>>>>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> >>>>>>> index 524006313b0d..8bcc8f4af39f 100644
> >>>>>>> --- a/include/linux/memcontrol.h
> >>>>>>> +++ b/include/linux/memcontrol.h
> >>>>>>> @@ -697,6 +697,9 @@ static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm,
> >>>>>>>  int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
> >>>>>>>                 long nr_pages);
> >>>>>>>
> >>>>>>> +int mem_cgroup_precharge_large_folio(struct mm_struct *mm,
> >>>>>>> +               swp_entry_t *entry);
> >>>>>>> +
> >>>>>>>  int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
> >>>>>>>                 gfp_t gfp, swp_entry_t entry);
> >>>>>>>
> >>>>>>> @@ -1201,6 +1204,12 @@ static inline int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg,
> >>>>>>>         return 0;
> >>>>>>>  }
> >>>>>>>
> >>>>>>> +static inline int mem_cgroup_precharge_large_folio(struct mm_struct *mm,
> >>>>>>> +               swp_entry_t *entry)
> >>>>>>> +{
> >>>>>>> +       return 0;
> >>>>>>> +}
> >>>>>>> +
> >>>>>>>  static inline int mem_cgroup_swapin_charge_folio(struct folio *folio,
> >>>>>>>                 struct mm_struct *mm, gfp_t gfp, swp_entry_t entry)
> >>>>>>>  {
> >>>>>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >>>>>>> index 17af08367c68..f3d92b93ea6d 100644
> >>>>>>> --- a/mm/memcontrol.c
> >>>>>>> +++ b/mm/memcontrol.c
> >>>>>>> @@ -4530,6 +4530,51 @@ int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
> >>>>>>>         return 0;
> >>>>>>>  }
> >>>>>>>
> >>>>>>> +static inline bool mem_cgroup_has_margin(struct mem_cgroup *memcg)
> >>>>>>> +{
> >>>>>>> +       for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
> >>>>>>> +               if (mem_cgroup_margin(memcg) < HPAGE_PMD_NR)
> >>>>>>
> >>>>>> There might be three issues with the approach:
> >>>>>>
> >>>>>> It's a very big margin. Let's say you have ARM64_64K_PAGES and 256K THP
> >>>>>> set to always. As HPAGE_PMD_SIZE is 512M with 64K pages, you are
> >>>>>> basically saying you need 512M of free memory to swap in just 256K?
> >>>>>
> >>>>> Right, sorry for the noisy code. I was just thinking about 4KB pages
> >>>>> and wondering if we could simplify the code.
> >>>>>
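(For concreteness: with 64K base pages on arm64, one PMD maps (65536 / 8) *
64K = 512M, so HPAGE_PMD_NR is 8192 pages; the check above would therefore
demand a 512M margin before swapping in a single 4-page, i.e. 256K, folio.
With 4K base pages the same constant is only 512 pages, i.e. 2M.)
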
> >>>>>> It's an uneven margin for different folio sizes.
> >>>>>> For a 16K folio swapin, you are checking if there is margin for 128 folios,
> >>>>>> but for a 1M folio swapin, you are checking if there is margin for just 2 folios.
> >>>>>>
> >>>>>> Maybe it might be better to make this dependent on some factor of folio_nr_pages?
> >>>>>
> >>>>> Agreed. This is similar to what we discussed regarding your zswap mTHP
> >>>>> swap-in series:
> >>>>>
> >>>>> int mem_cgroup_swapin_charge_folio(...)
> >>>>> {
> >>>>>         ...
> >>>>>         if (folio_test_large(folio) &&
> >>>>>             mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH,
> >>>>>                                            folio_nr_pages(folio)))
> >>>>>                 ret = -ENOMEM;
> >>>>>         else
> >>>>>                 ret = charge_memcg(folio, memcg, gfp);
> >>>>>         ...
> >>>>> }
> >>>>>
> >>>>> As someone focused on phones, my challenge is the absence of stable
> >>>>> platforms to benchmark this type of workload. If possible, Usama, I would
> >>>>> greatly appreciate it if you could take the lead on the patch.
> >>>>>
> >>>>>> As Johannes pointed out, the charging code already does the margin check.
> >>>>>> So for 4K, the check just checks if there is 4K available, but for 16K it
> >>>>>> checks whether a lot more than 16K is available. Maybe there should be a
> >>>>>> similar policy for all? I guess this is similar to my 2nd point, but just
> >>>>>> considers 4K folios as well.
> >>>>>
> >>>>> I don't think the charging code performs a margin check. It simply tries
> >>>>> to charge the specified nr_pages (whether 1 or more). If nr_pages are
> >>>>> available, the charge proceeds; otherwise, if the GFP flags allow blocking,
> >>>>> it triggers memory reclamation to reclaim max(SWAP_CLUSTER_MAX, nr_pages)
> >>>>> base pages.
> >>>>>
> >>>> So if you have defrag not set to always, it will not trigger reclamation.
> >>>> I think that is a bigger use case, i.e. defrag=madvise,defer,etc. is
> >>>> probably used much more than always.
> >>>>
> >>>> In that case, in the current code, try_charge_memcg will return -ENOMEM all
> >>>> the way to mem_cgroup_swapin_charge_folio, and alloc_swap_folio will then
> >>>> try the next order. So even though it might not be calling the
> >>>> mem_cgroup_margin function, it is kind of doing the same thing?
> >>>>
> >>>>> If, after reclamation, we have exactly SWAP_CLUSTER_MAX pages available, a
> >>>>> large folio with nr_pages == SWAP_CLUSTER_MAX will charge successfully,
> >>>>> immediately filling the memcg.
> >>>>>
> >>>>> Shortly after, smaller folios, typically with blockable GFP, will quickly
> >>>>> trigger additional reclamation. While nr_pages - 1 subpages of the large
> >>>>> folio may not be immediately needed, they still occupy enough space to
> >>>>> fill the memcg to capacity.
> >>>>>
> >>>>> My second point about the mitigation is as follows: for a system (or
> >>>>> memcg) under severe memory pressure, especially one without hardware TLB
> >>>>> optimization, is enabling mTHP always the right choice? Since mTHP operates
> >>>>> at a larger granularity, some internal fragmentation is unavoidable,
> >>>>> regardless of optimization. Could the mitigation code help in automatically
> >>>>> tuning this fragmentation?
> >>>>>
> >>>> I agree with the point that always enabling mTHP is not the right thing to
> >>>> do on all platforms. I also think it might be the case that enabling mTHP
> >>>> might be a good thing for some workloads, but enabling mTHP swapin along
> >>>> with it might not.
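
For reference, a minimal untested sketch of the size-proportional margin
check discussed above could look like the following. The helper name and the
factor of 2 are made up for illustration; the right factor is exactly what
would need benchmarking:

static inline bool mem_cgroup_swapin_margin_ok(struct mem_cgroup *memcg,
					       unsigned int nr_pages)
{
	/* small folios keep today's behaviour: no extra headroom required */
	if (nr_pages == 1)
		return true;

	/* demand headroom proportional to the incoming folio, up the hierarchy */
	for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
		if (mem_cgroup_margin(memcg) < 2 * nr_pages)
			return false;
	}
	return true;
}

alloc_swap_folio() could then fall back to the next lower order whenever this
returns false, instead of failing the charge outright.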
> >>>>
> >>>> As you said, when you have apps switching between foreground and background
> >>>> in Android, it probably makes sense to have large folio swapping, as you
> >>>> want to bring in all the pages from the background app as quickly as
> >>>> possible. You also get all the TLB optimizations and the smaller LRU
> >>>> overhead after you have brought in all the pages.
> >>>> A Linux kernel build test doesn't really get to benefit from the TLB
> >>>> optimization and smaller LRU overhead, as the pages are probably very
> >>>> short-lived. So I think it doesn't show the benefit of large folio swapin
> >>>> properly, and large folio swapin should probably be disabled for this kind
> >>>> of workload, even though mTHP should be enabled.
> >>>
> >>> I'm not entirely sure this applies to platforms without TLB
> >>> optimization, especially in the absence of swap. In a memory-limited
> >>> cgroup without swap, would mTHP still cause significant thrashing of
> >>> file-backed folios? When a large swap file is present, the inability to
> >>> swap in mTHP seems to act as a workaround for fragmentation, allowing
> >>> fragmented pages of the original mTHP from do_anonymous_page() to
> >>> remain in swap.
> >>>
> >>>> I am not sure that the approach we are trying in this patch is the right way:
> >>>> - This patch makes it a memcg issue, but you could have memcg disabled, and
> >>>> then the mitigation being tried here won't apply.
> >>>> - Instead of this being a large folio swapin issue, is it more of a
> >>>> readahead issue? If we use zswap (without the large folio swapin series)
> >>>> and change the window to 1 in swap_vma_readahead, we might see an
> >>>> improvement in Linux kernel build time when cgroup memory is limited, as
> >>>> readahead would probably cause swap thrashing as well.
> >>>> - Instead of looking at the cgroup margin, maybe we should try and look at
> >>>> the rate of change of workingset_restore_anon? This might be a lot more
> >>>> complicated to do, but it is probably the right metric to determine swap
> >>>> thrashing. It also means that this could be used in both the synchronous
> >>>> swapcache-skipping path and the swapin_readahead path.
> >>>> (Thanks Johannes for suggesting this)
> >>>>
> >>>> With the large folio swapin, I do see a large improvement when considering
> >>>> only swapin performance and latency, in the same way as you saw in zram.
> >>>> Maybe the right short-term approach is to have
> >>>> /sys/kernel/mm/transparent_hugepage/swapin
> >>>> and have that disabled by default to avoid regressions.
> >>>
> >>> A crucial component is still missing: managing the compression and
> >>> decompression of multiple pages as a larger block. This could significantly
> >>> reduce system time and potentially resolve the kernel build issue within a
> >>> small memory cgroup, even with swap thrashing.
> >>>
> >>> I'll send an update ASAP so you can rebase for zswap.
> >>
> >> Did you mean https://lore.kernel.org/all/20241021232852.4061-1-21cnbao@gmail.com/?
> >> That won't benefit zswap, right?
> >
> > That's right. I assume we can also make it work with zswap?
>
> Hopefully yes. That's mainly why I was looking at that series, to try and find
> a way to do something similar for zswap.
>
> >> I actually had a few questions about it. Mainly that the benefit comes if the
> >> page fault happens on page 0 of the large folio.
> >> But if the page fault happens
> >> on any other page, let's say page 1 of a 64K folio, then it will decompress
> >> the entire 64K chunk and just copy page 1? (memcpy in
> >> zram_bvec_read_multi_pages_partial.)
> >> Could that cause a regression, as you have to decompress a large chunk just
> >> to get one 4K page?
> >> If we assume a uniform distribution of page faults, maybe it could make
> >> things worse?
> >>
> >> I probably should ask all of this in that thread.
> >
> > With mTHP swap-in, a page fault on any page behaves the same as a fault on
> > page 0. Without mTHP swap-in, there's also no difference between faults on
> > page 0 and other pages.
>
> Ah ok, it's because of the ALIGN_DOWN in
> https://elixir.bootlin.com/linux/v6.12-rc5/source/mm/memory.c#L4158,
> right?

right.

> > A fault on any page means that the entire block is decompressed. The only
> > difference is that we don't partially copy one page when mTHP swap-in is
> > present.
>
> Ah, so zram_bvec_read_multi_pages_partial would be called only
> if someone swaps out mTHP, disables it, and then tries to do swapin?

For example, if the block contains 16KB of original data but only 4KB is
swapped in without mTHP swap-in, we decompress the entire 16KB while copying
only a portion of it to do_swap_page(). So compression/decompression of large
blocks without mTHP swap-in can likely make things worse, even though it
brings a higher compression ratio.

> Thanks
>
> >>>> If the workload owner sees a benefit, they can enable it.
> >>>> I can add this when sending the next version of large folio zswapin, if
> >>>> that makes sense?
> >>>> Longer term, I can try to have a look at whether we can do something with
> >>>> workingset_restore_anon to improve things.
> >>>>
> >>>> Thanks,
> >>>> Usama

Thanks
Barry