From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <983a580f-11a9-442b-8b41-e9c2e4f0d113@gmail.com>
Date: Tue, 6 Aug 2024 12:18:06 +0100
Subject: Re: [PATCH 0/6] mm: split underutilized THPs
From: Usama Arif
To: Yu Zhao
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, hannes@cmpxchg.org,
 riel@surriel.com, shakeel.butt@linux.dev, roman.gushchin@linux.dev,
 david@redhat.com, baohua@kernel.org, ryan.roberts@arm.com, rppt@kernel.org,
 willy@infradead.org, cerasuolodomenico@gmail.com, corbet@lwn.net,
 linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, kernel-team@meta.com
References: <20240730125346.1580150-1-usamaarif642@gmail.com>
Content-Type: text/plain; charset=UTF-8
On 05/08/2024 00:23, Yu Zhao wrote:
> On Thu, Aug 1, 2024 at 10:22 AM Usama Arif wrote:
>>
>> On
>> 01/08/2024 07:09, Yu Zhao wrote:
>>> On Tue, Jul 30, 2024 at 6:54 AM Usama Arif wrote:
>>>>
>>>> The current upstream default policy for THP is always. However, Meta
>>>> uses madvise in production, as the current THP=always policy vastly
>>>> overprovisions THPs in sparsely accessed memory areas, resulting in
>>>> excessive memory pressure and premature OOM killing.
>>>> Using madvise and relying on khugepaged has certain drawbacks over
>>>> THP=always. madvise hints mean THPs aren't "transparent" and require
>>>> userspace changes. Waiting for khugepaged to scan memory and collapse
>>>> pages into THPs can be slow and unpredictable in terms of performance
>>>> (i.e. you don't know when the collapse will happen), while production
>>>> environments require predictable performance. If there is enough
>>>> memory available, it's better for both performance and predictability
>>>> to have a THP from fault time, i.e. THP=always, rather than wait for
>>>> khugepaged to collapse it, and to deal with sparsely populated THPs
>>>> when the system is running out of memory.
>>>>
>>>> This patch series is an attempt to mitigate the issue of running out
>>>> of memory when THP is always enabled. During runtime, whenever a THP
>>>> is faulted in or collapsed by khugepaged, it is added to a list.
>>>> Whenever memory reclaim happens, the kernel runs the deferred_split
>>>> shrinker, which goes through the list and checks if the THP is
>>>> underutilized, i.e. how many of the base 4K pages of the entire THP
>>>> are zero-filled. If this number goes above a certain threshold, the
>>>> shrinker will attempt to split that THP. Then at remap time, the
>>>> pages that were zero-filled are not remapped, hence saving memory.
>>>> This method avoids the downside of wasting memory in areas where THP
>>>> is sparsely filled when THP is always enabled, while still providing
>>>> the upsides of THPs, like reduced TLB misses, without having to use
>>>> madvise.
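To illustrate the per-THP check described above, here is a rough Python sketch of the accounting (names and structure are made up for illustration; the actual kernel code walks the folio's subpages in C):

```python
PAGE_SIZE = 4096          # base page size
PAGES_PER_THP = 512       # 2M THP / 4K base pages
MAX_PTES_NONE = 409       # example split threshold (80% of 512)

def count_zero_filled(subpages):
    """Count base pages that contain only zero bytes."""
    return sum(1 for page in subpages if not any(page))

def should_split(subpages, max_ptes_none=MAX_PTES_NONE):
    """Shrinker decision: split if more than max_ptes_none of the
    subpages are zero-filled."""
    return count_zero_filled(subpages) > max_ptes_none

# A THP with 500 zero-filled and 12 populated subpages gets split;
# a fully populated THP is left alone.
sparse = [bytes(PAGE_SIZE)] * 500 + [b"\x01" * PAGE_SIZE] * 12
dense = [b"\x01" * PAGE_SIZE] * PAGES_PER_THP
assert should_split(sparse) and not should_split(dense)
```

At remap time after the split, only the subpages that were not zero-filled are remapped, which is where the memory saving comes from.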
>>>>
>>>> Meta production workloads that were CPU bound (>99% CPU utilization)
>>>> were tested with the THP shrinker. The results after 2 hours are as
>>>> follows:
>>>>
>>>>                         | THP=madvise | THP=always    | THP=always
>>>>                         |             |               | + shrinker series
>>>>                         |             |               | + max_ptes_none=409
>>>> -----------------------------------------------------------------------------
>>>> Performance improvement |      -      | +1.8%         | +1.7%
>>>> (over THP=madvise)      |             |               |
>>>> -----------------------------------------------------------------------------
>>>> Memory usage            | 54.6G       | 58.8G (+7.7%) | 55.9G (+2.4%)
>>>> -----------------------------------------------------------------------------
>>>>
>>>> max_ptes_none=409 means that any THP with more than 409 out of 512
>>>> (80%) zero-filled pages will be split.
>>>>
>>>> To test the patches: without the shrinker, the commands below will
>>>> invoke the OOM killer immediately and kill stress, but with the
>>>> shrinker they will not fail:
>>>>
>>>> echo 450 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
>>>> mkdir /sys/fs/cgroup/test
>>>> echo $$ > /sys/fs/cgroup/test/cgroup.procs
>>>> echo 20M > /sys/fs/cgroup/test/memory.max
>>>> echo 0 > /sys/fs/cgroup/test/memory.swap.max
>>>> # allocate twice memory.max for each stress worker and touch 40/512 of
>>>> # each THP, i.e. vm-stride 50K.
>>>> # With the shrinker, max_ptes_none of 470 and below won't invoke the
>>>> # OOM killer.
>>>> # Without the shrinker, the OOM killer is invoked immediately,
>>>> # irrespective of the max_ptes_none value, and kills stress.
>>>> stress --vm 1 --vm-bytes 40M --vm-stride 50K
>>>>
>>>> Patches 1-2 add back helper functions that were previously removed,
>>>> to operate on page lists (needed by patch 3).
>>>> Patch 3 is an optimization to free zapped tail pages rather than
>>>> waiting for page reclaim or migration.
>>>> Patch 4 is a prerequisite for the THP shrinker to not remap
>>>> zero-filled subpages when splitting a THP.
>>>> Patch 6 adds support for the THP shrinker.
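As a sanity check on the numbers in the recipe above (illustrative arithmetic only, not part of the series): a 50K stride touches about 40 distinct 4K subpages per 2M THP, leaving roughly 472 zero-filled, which matches the observation that max_ptes_none values of 470 and below cause these THPs to be split:

```python
THP_SIZE = 2 * 1024 * 1024    # 2M THP
PAGE_SIZE = 4 * 1024          # 4K base page
STRIDE = 50 * 1024            # stress --vm-stride 50K

pages_per_thp = THP_SIZE // PAGE_SIZE   # 512 subpages per THP
# Each 50K stride is larger than a 4K page, so every touch lands in a
# distinct subpage: about 40 touched subpages per THP.
touched = THP_SIZE // STRIDE
zero_filled = pages_per_thp - touched

assert pages_per_thp == 512
assert touched == 40
assert zero_filled == 472   # > 470, so those max_ptes_none values split
```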
>>>>
>>>> (This patch series restarts the work on having a THP shrinker in the
>>>> kernel, originally done in
>>>> https://lore.kernel.org/all/cover.1667454613.git.alexlzhu@fb.com/.
>>>> The THP shrinker in this series is significantly different from the
>>>> original one, hence it's labelled v1 (although the prerequisite to
>>>> not remap clean subpages is the same).)
>>>>
>>>> Alexander Zhu (1):
>>>>   mm: add selftests to split_huge_page() to verify unmap/zap of zero
>>>>     pages
>>>>
>>>> Usama Arif (3):
>>>>   Revert "memcg: remove mem_cgroup_uncharge_list()"
>>>>   Revert "mm: remove free_unref_page_list()"
>>>>   mm: split underutilized THPs
>>>>
>>>> Yu Zhao (2):
>>>>   mm: free zapped tail pages when splitting isolated thp
>>>>   mm: don't remap unused subpages when splitting isolated thp
>>>
>>> I would recommend shatter [1] instead of splitting so that
>>> 1) whoever underutilized their THPs gets punished for the overhead;
>>> 2) underutilized THPs are kept intact and can be reused by others.
>>>
>>> [1] https://lore.kernel.org/20240229183436.4110845-3-yuzhao@google.com/
>>
>> The objective of this series is to reduce memory usage while trying to
>> keep the performance benefits you get from using THP=always.
>
> Of course.
>
>> Punishing any application's performance is the opposite of what I am
>> trying to do here.
>
> For applications that prefer THP=always, you would punish them more by
> using split.
>
>> For example, if there is only one main application running in
>> production, and it's using the majority of the THPs, then reducing its
>> performance doesn't make sense.
>
> Exactly, and that's why I recommended shatter.
>
> Let's walk through the big picture, and hopefully you'll agree.
>
> Applications prefer THP=always because they want to allocate THPs. As
> you mentioned above, the majority of their memory would be backed by
> THPs, highly utilized.
>
> You also mentioned that those applications can run into memory
> pressure or even OOMs, which I agree with, and this is essentially
> what we are trying to solve here. Otherwise, with unlimited memory, we
> wouldn't need to worry about internal fragmentation in this context.
>
> So on one hand, we want to allocate THPs; on the other, we run into
> memory pressure. It's obvious that splitting under this specific
> condition can't fully solve our problem -- after splitting, we still
> have to do compaction to fulfill new THP allocation requests.
> Theoretically, splitting plus compaction is more expensive than
> shattering itself: expressing the efficiency as
> compact_success/(compact_success+fail), the latter is 100%; the former
> is nowhere near it, and our experiments agree with this.
>
> If applications opt for direct compaction, they'd pay for THP
> allocation latency; if they don't want to wait, i.e., with background
> compaction, they'd pay with less THP coverage. So they are punished
> either way, not in the THP shrinker path, but in their allocation
> path. In comparison, shattering wins in both cases, as I explained
> above.

Thanks for the explanation. It makes the reasoning behind shattering
much clearer, and it explains why it could be an improvement over
splitting. As Rik mentioned, I think it's best to parallelize the
efforts of the THP low-utilization shrinker and shattering, as we
already have the low-utilization shrinker tested and ready with
splitting.

I did have a question about shattering. I will mention it here, but it
is probably best to move the discussion over to the shattering patches
you sent. If the system is close to running out of memory and the
shrinker has a very large number of folios to shatter, then the initial
THPs that are shattered by the shrinker will take the order-0 pages for
migration of the non-zero-filled pages.
The system could then run out of order-0 pages for the later THPs, so
the THPs that were preserved earlier in the shrinker process would need
to be split to provide the order-0 pages needed for migration when
shattering the later THPs. The cost then becomes the cost of migration,
plus the cost of splitting the THPs preserved earlier by the shrinker
to provide 4K pages for migration (which means the advantage of
preserving those THPs is lost?), plus the cost of compaction when we
later want THPs again. Hopefully I have understood correctly what would
happen with shattering in the above case.

>
>> Also, just going through the commit, I found the line "The advantage
>> of shattering is that it keeps the original THP intact" a bit
>> confusing. I am guessing the THP is freed?
>
> Yes, so that we don't have to do compaction.
>
>> i.e. if a 2M THP has 10 non-zero-filled base pages and the rest are
>> zero-filled, then after shattering we will have 10*4K memory and not
>> 2M+10*4K?
>
> Correct.
>
>> Is it the case that the THP is reused at the next fault?
>
> Yes, and this is central to our condition: we are under memory
> pressure with THP=always.
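For concreteness, Yu's efficiency metric from earlier in the thread can be written out with toy numbers (the 60/40 compaction rate below is invented purely for illustration, not measured):

```python
def compaction_efficiency(success, fail):
    """Yu's metric: compact_success / (compact_success + fail)."""
    return success / (success + fail)

# Shattering keeps the freed THP physically contiguous, so reusing it
# as a THP needs no compaction at all: effectively 100%.
assert compaction_efficiency(100, 0) == 1.0

# Splitting scatters order-0 pages; getting THPs back later depends on
# compaction succeeding. The 60/40 rate here is invented, only to show
# the metric falling below 100%.
assert compaction_efficiency(60, 40) == 0.6
```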