Date: Tue, 6 Aug 2024 13:38:40 -0400
From: Johannes Weiner <hannes@cmpxchg.org>
To: Yu Zhao
Cc: Usama Arif, akpm@linux-foundation.org, linux-mm@kvack.org, riel@surriel.com,
 shakeel.butt@linux.dev, roman.gushchin@linux.dev, david@redhat.com,
 baohua@kernel.org, ryan.roberts@arm.com, rppt@kernel.org, willy@infradead.org,
 cerasuolodomenico@gmail.com, corbet@lwn.net, linux-kernel@vger.kernel.org,
 linux-doc@vger.kernel.org, kernel-team@meta.com
Subject: Re: [PATCH 0/6] mm: split underutilized THPs
Message-ID: <20240806173840.GE322282@cmpxchg.org>
References: <20240730125346.1580150-1-usamaarif642@gmail.com>
On Thu, Aug 01, 2024 at 12:09:16AM -0600, Yu Zhao wrote:
> On Tue, Jul 30, 2024 at 6:54 AM Usama Arif wrote:
> >
> > The current upstream default policy for THP is always. However, Meta
> > uses madvise in production, as the THP=always policy vastly
> > overprovisions THPs in sparsely accessed memory areas, resulting in
> > excessive memory pressure and premature OOM killing.
> >
> > Using madvise and relying on khugepaged has certain drawbacks over
> > THP=always. Using madvise hints means THPs aren't "transparent" and
> > require userspace changes. Waiting for khugepaged to scan memory and
> > collapse pages into THPs can be slow and unpredictable in terms of
> > performance (i.e. you don't know when the collapse will happen),
> > while production environments require predictable performance. If
> > there is enough memory available, it's better for both performance
> > and predictability to have a THP from fault time (i.e. THP=always)
> > rather than to wait for khugepaged to collapse it, and to deal with
> > sparsely populated THPs when the system is running out of memory.
> >
> > This patch series is an attempt to mitigate the issue of running out
> > of memory when THP is always enabled. At runtime, whenever a THP is
> > faulted in or collapsed by khugepaged, the THP is added to a list.
> > Whenever memory reclaim happens, the kernel runs the deferred_split
> > shrinker, which goes through the list and checks whether each THP is
> > underutilized, i.e. how many of the base 4K pages of the entire THP
> > are zero-filled. If this number is above a certain threshold, the
> > shrinker attempts to split that THP. At remap time, the pages that
> > were zero-filled are not remapped, saving memory. This avoids the
> > downside of wasting memory in sparsely filled areas when THP is
> > always enabled, while still providing the upsides of THPs, like
> > reduced TLB misses, without having to use madvise.
> >
> > Meta production workloads that were CPU bound (>99% CPU utilization)
> > were tested with the THP shrinker. The results after 2 hours are as
> > follows:
> >
> >                         | THP=madvise | THP=always    | THP=always
> >                         |             |               | + shrinker series
> >                         |             |               | + max_ptes_none=409
> > -----------------------------------------------------------------------------
> > Performance improvement |      -      | +1.8%         | +1.7%
> > (over THP=madvise)      |             |               |
> > -----------------------------------------------------------------------------
> > Memory usage            | 54.6G       | 58.8G (+7.7%) | 55.9G (+2.4%)
> > -----------------------------------------------------------------------------
> >
> > max_ptes_none=409 means that any THP that has more than 409 out of
> > 512 (80%) zero-filled pages will be split.
> >
> > To test the patches: without the shrinker, the commands below invoke
> > the OOM killer immediately and kill stress, but with the shrinker
> > they do not fail:
> >
> > echo 450 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
> > mkdir /sys/fs/cgroup/test
> > echo $$ > /sys/fs/cgroup/test/cgroup.procs
> > echo 20M > /sys/fs/cgroup/test/memory.max
> > echo 0 > /sys/fs/cgroup/test/memory.swap.max
> > # allocate twice memory.max for each stress worker and touch 40/512 of
> > # each THP, i.e. vm-stride 50K.
> > # With the shrinker, max_ptes_none of 470 and below won't invoke the
> > # OOM killer.
> > # Without the shrinker, the OOM killer is invoked immediately,
> > # irrespective of the max_ptes_none value, and kills stress.
> > stress --vm 1 --vm-bytes 40M --vm-stride 50K
> >
> > Patches 1-2 add back helper functions that were previously removed
> > to operate on page lists (needed by patch 3).
> > Patch 3 is an optimization to free zapped tail pages rather than
> > waiting for page reclaim or migration.
> > Patch 4 is a prerequisite for the THP shrinker to not remap
> > zero-filled subpages when splitting a THP.
> > Patch 6 adds support for the THP shrinker.
> >
> > (This patch series restarts the work on having a THP shrinker in the
> > kernel originally done in
> > https://lore.kernel.org/all/cover.1667454613.git.alexlzhu@fb.com/.
> > The THP shrinker in this series is significantly different from the
> > original one, hence it's labelled v1 (although the prerequisite to
> > not remap clean subpages is the same).)
> >
> > Alexander Zhu (1):
> >   mm: add selftests to split_huge_page() to verify unmap/zap of zero
> >     pages
> >
> > Usama Arif (3):
> >   Revert "memcg: remove mem_cgroup_uncharge_list()"
> >   Revert "mm: remove free_unref_page_list()"
> >   mm: split underutilized THPs
> >
> > Yu Zhao (2):
> >   mm: free zapped tail pages when splitting isolated thp
> >   mm: don't remap unused subpages when splitting isolated thp
>
> I would recommend shatter [1] instead of splitting so that

I agree with Rik, this seems like a possible optimization, not a
pre-requisite.

> 1) whoever underutilized their THPs get punished for the overhead;

Is that true? The downgrade is done in a shrinker. With or without
shattering, the compaction effort will be on the allocation side.

> 2) underutilized THPs are kept intact and can be reused by others.

If migration of the subpages is possible, then compaction can clear
the block as quickly as shattering can. The only difference is that
compaction would do the work on-demand, whereas shattering would do it
unconditionally, whether a THP has been requested or not...

Anyway, I think it'd be better to keep those discussions separate.