From: Yu Zhao
Date: Sun, 4 Aug 2024 17:23:58 -0600
Subject: Re: [PATCH 0/6] mm: split underutilized THPs
To: Usama Arif
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev, roman.gushchin@linux.dev, david@redhat.com, baohua@kernel.org, ryan.roberts@arm.com, rppt@kernel.org, willy@infradead.org, cerasuolodomenico@gmail.com, corbet@lwn.net, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, kernel-team@meta.com
References: <20240730125346.1580150-1-usamaarif642@gmail.com>

On Thu, Aug 1, 2024 at 10:22 AM Usama Arif wrote:
>
>
>
> On 01/08/2024 07:09, Yu Zhao wrote:
> > On Tue, Jul 30, 2024 at 6:54 AM Usama Arif wrote:
> >>
> >> The current upstream default policy for THP is always. However, Meta
> >> uses madvise in production as the current THP=always policy vastly
> >> overprovisions THPs in sparsely accessed memory areas, resulting in
> >> excessive memory pressure and premature OOM killing.
> >> Using madvise + relying on khugepaged has certain drawbacks over
> >> THP=always. Using madvise hints means THPs aren't "transparent" and
> >> require userspace changes. Waiting for khugepaged to scan memory and
> >> collapse pages into THP can be slow and unpredictable in terms of performance
> >> (i.e. you don't know when the collapse will happen), while production
> >> environments require predictable performance. If there is enough memory
> >> available, it's better for both performance and predictability to have
> >> a THP from fault time, i.e. THP=always rather than wait for khugepaged
> >> to collapse it, and deal with sparsely populated THPs when the system is
> >> running out of memory.
> >>
> >> This patch-series is an attempt to mitigate the issue of running out of
> >> memory when THP is always enabled. During runtime whenever a THP is being
> >> faulted in or collapsed by khugepaged, the THP is added to a list.
> >> Whenever memory reclaim happens, the kernel runs the deferred_split
> >> shrinker which goes through the list and checks if the THP was underutilized,
> >> i.e. how many of the base 4K pages of the entire THP were zero-filled.
> >> If this number goes above a certain threshold, the shrinker will attempt
> >> to split that THP. Then at remap time, the pages that were zero-filled are
> >> not remapped, hence saving memory. This method avoids the downside of
> >> wasting memory in areas where THP is sparsely filled when THP is always
> >> enabled, while still providing the upside of THPs, like reduced TLB misses,
> >> without having to use madvise.
> >>
> >> Meta production workloads that were CPU bound (>99% CPU utilization) were
> >> tested with THP shrinker.
> >> The results after 2 hours are as follows:
> >>
> >>                         | THP=madvise |  THP=always   | THP=always
> >>                         |             |               | + shrinker series
> >>                         |             |               | + max_ptes_none=409
> >> -----------------------------------------------------------------------------
> >> Performance improvement |      -      |    +1.8%      |     +1.7%
> >> (over THP=madvise)      |             |               |
> >> -----------------------------------------------------------------------------
> >> Memory usage            |    54.6G    | 58.8G (+7.7%) | 55.9G (+2.4%)
> >> -----------------------------------------------------------------------------
> >> max_ptes_none=409 means that any THP that has more than 409 out of 512
> >> (80%) zero-filled pages will be split.
> >>
> >> To test out the patches, the below commands without the shrinker will
> >> invoke OOM killer immediately and kill stress, but will not fail with
> >> the shrinker:
> >>
> >> echo 450 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
> >> mkdir /sys/fs/cgroup/test
> >> echo $$ > /sys/fs/cgroup/test/cgroup.procs
> >> echo 20M > /sys/fs/cgroup/test/memory.max
> >> echo 0 > /sys/fs/cgroup/test/memory.swap.max
> >> # allocate twice memory.max for each stress worker and touch 40/512 of
> >> # each THP, i.e. vm-stride 50K.
> >> # With the shrinker, max_ptes_none of 470 and below won't invoke OOM
> >> # killer.
> >> # Without the shrinker, OOM killer is invoked immediately irrespective
> >> # of max_ptes_none value and kills stress.
> >> stress --vm 1 --vm-bytes 40M --vm-stride 50K
> >>
> >> Patches 1-2 add back helper functions that were previously removed
> >> to operate on page lists (needed by patch 3).
> >> Patch 3 is an optimization to free zapped tail pages rather than
> >> waiting for page reclaim or migration.
> >> Patch 4 is a prerequisite for THP shrinker to not remap zero-filled
> >> subpages when splitting THP.
> >> Patch 6 adds support for THP shrinker.
> >>
> >> (This patch-series restarts the work on having a THP shrinker in kernel
> >> originally done in
> >> https://lore.kernel.org/all/cover.1667454613.git.alexlzhu@fb.com/.
> >> The THP shrinker in this series is significantly different than the
> >> original one, hence it's labelled v1 (although the prerequisite to not
> >> remap clean subpages is the same).)
> >>
> >> Alexander Zhu (1):
> >>   mm: add selftests to split_huge_page() to verify unmap/zap of zero
> >>     pages
> >>
> >> Usama Arif (3):
> >>   Revert "memcg: remove mem_cgroup_uncharge_list()"
> >>   Revert "mm: remove free_unref_page_list()"
> >>   mm: split underutilized THPs
> >>
> >> Yu Zhao (2):
> >>   mm: free zapped tail pages when splitting isolated thp
> >>   mm: don't remap unused subpages when splitting isolated thp
> >
> > I would recommend shatter [1] instead of splitting so that
> > 1) whoever underutilized their THPs gets punished for the overhead;
> > 2) underutilized THPs are kept intact and can be reused by others.
> >
> > [1] https://lore.kernel.org/20240229183436.4110845-3-yuzhao@google.com/
>
> The objective of this series is to reduce memory usage, while trying to keep the performance benefits you get of using THP=always.

Of course.

> Punishing any application's performance is the opposite of what I am trying to do here.

For applications that prefer THP=always, you would punish them more by
using split.

> For example, if there is only one main application running in production, and it's using the majority of the THPs, then reducing its performance doesn't make sense.

Exactly, and that's why I recommended shatter.

Let's walk through the big picture, and hopefully you'll agree.

Applications prefer THP=always because they want to allocate THPs. As
you mentioned above, the majority of their memory would be backed by
THPs, highly utilized.
You also mentioned that those applications can run into memory
pressure or even OOMs, which I agree with, and this is essentially what
we are trying to solve here. Otherwise, with unlimited memory, we
wouldn't need to worry about internal fragmentation in this context.

So on one hand, we want to allocate THPs; on the other, we run into
memory pressure. It's obvious that splitting under this specific
condition can't fully solve our problem -- after splitting, we still
have to do compaction to fulfill new THP allocation requests.
Theoretically, splitting plus compaction is more expensive than
shattering itself: expressing the efficiency as
compact_success/(compact_success+fail), the latter is 100%; the former
is nowhere near it, and our experiments agree with this.

If applications opt for direct compaction, they'd pay with THP
allocation latency; if they don't want to wait, i.e., with background
compaction, they'd pay with less THP coverage. So they are punished
either way, not in the THP shrinker path, but in their allocation
path. In comparison, shattering wins in both cases, as I explained
above.

> Also, just going through the commit, and found the line "The advantage of shattering is that it keeps the original THP intact" a bit confusing. I am guessing the THP is freed?

Yes, so that we don't have to do compaction.

> i.e. if a 2M THP has 10 non-zero filled base pages and the rest are zero-filled, then after shattering we will have 10*4K memory and not 2M+10*4K?

Correct.

> Is it the case the THP is reused at next fault?

Yes, and this is central to our condition: we are under memory
pressure with THP=always.
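
To make the numbers in this thread concrete, here is a minimal userspace
sketch of the arithmetic. It is only a model, not kernel code: the
function names (is_underutilized, bytes_kept, compaction_efficiency) are
hypothetical and do not exist in the patch series, and the threshold
check is an assumption based on the cover letter's wording ("more than
409 out of 512 zero-filled pages will be split").

```python
PAGE_SIZE = 4 * 1024              # 4K base page, as in the thread
THP_SIZE = 2 * 1024 * 1024        # 2M THP
SUBPAGES = THP_SIZE // PAGE_SIZE  # 512 base pages per THP


def is_underutilized(zero_filled: int, max_ptes_none: int) -> bool:
    """Assumed shrinker test: more zero-filled subpages than the threshold."""
    return zero_filled > max_ptes_none


def bytes_kept(non_zero: int) -> int:
    """Memory still in use once zero-filled subpages are no longer mapped."""
    return non_zero * PAGE_SIZE


def compaction_efficiency(success: int, fail: int) -> float:
    """compact_success / (compact_success + fail), the ratio used above."""
    return success / (success + fail)


if __name__ == "__main__":
    non_zero = 10                       # the example from the thread
    zero_filled = SUBPAGES - non_zero   # remaining subpages are zero-filled
    print("underutilized:", is_underutilized(zero_filled, max_ptes_none=409))
    print("bytes kept:", bytes_kept(non_zero), "of", THP_SIZE)
```

In this model both approaches leave 10*4K in use for the example THP. The
difference discussed above is what happens to the rest: with shattering
the 2M THP is freed intact, whereas after a split the freed memory is
base pages, so a later THP allocation depends on compaction succeeding.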