Date: Wed, 14 Aug 2024 11:13:35 +0100
Subject: Re: [PATCH v3 0/6] mm: split underutilized THPs
From: Usama Arif
To: Andi Kleen
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev, roman.gushchin@linux.dev, yuzhao@google.com, david@redhat.com, baohua@kernel.org, ryan.roberts@arm.com, rppt@kernel.org, willy@infradead.org, cerasuolodomenico@gmail.com, corbet@lwn.net, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, kernel-team@meta.com
References: <20240813120328.1275952-1-usamaarif642@gmail.com> <87y150mj6f.fsf@linux.intel.com>
In-Reply-To: <87y150mj6f.fsf@linux.intel.com>

On 13/08/2024 18:22, Andi Kleen wrote:
> Usama Arif writes:
>>
>> This patch-series is an attempt to mitigate the issue of running out of
>> memory when THP is always enabled.
>> During runtime whenever a THP is being
>> faulted in or collapsed by khugepaged, the THP is added to a list.
>> Whenever memory reclaim happens, the kernel runs the deferred_split
>> shrinker which goes through the list and checks if the THP was underutilized,
>> i.e. how many of the base 4K pages of the entire THP were zero-filled.
>
> Sometimes when writing a benchmark I fill things with zero explicitly
> to avoid faults later. For example if you want to measure memory
> read bandwidth you need to fault the pages first, but that fault
> pattern may well be zero.
>
> With your patch if there is memory pressure there are two effects:
>
> - If things are remapped to the zero page the benchmark
> reading memory may give unrealistically good results because
> what it thinks is a big memory area is actually only backed
> by a single page.
>
> - If I expect to write I may end up with an unexpected zeropage->real
> memory fault if the pages got remapped.
>
> I expect such patterns can happen without benchmarking too.
> I could see it being a problem for latency sensitive applications.
>
> Now you could argue that this all should only happen under memory
> pressure and when that happens things may be slow anyways and your
> patch will still be an improvement.
>
> Maybe that's true but there might be still corner cases
> which are negatively impacted by this. I don't have a good solution
> other than a tunable, but I expect it will cause problems for someone.
>

There are currently two knobs that control the behaviour of the THP low-utilization shrinker introduced in this series.

/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none:
The current default value is HPAGE_PMD_NR - 1 (511 on x86). If it is set to 511, the shrinker immediately removes the folio from the deferred_list (please see the first if statement in thp_underutilized in patch 5) and no split is attempted. Not a single page is checked at this point, so there are no memory accesses that could impact performance.
If someone sets it to 510, the check exits as soon as a single page containing non-zero data is encountered (the else part in thp_underutilized).

/sys/kernel/mm/transparent_hugepage/thp_low_util_shrinker:
Introduced in patch 6. If someone really doesn't want to enable the shrinker, they can set this to false. The folio will then not be added to the _deferred_list at fault or collapse time, and it will be as if these patches didn't exist. Personally, I don't think it's absolutely necessary to have this, but I added it in case someone comes up with some corner case.

For the first effect you mentioned: with the default behaviour of the patches, i.e. max_ptes_none set to 511, there will be no splitting of THPs, so you will get the same performance as without the series. If there is some benchmark that allocates all of the system memory with zero pages, causing the shrinker to run, and if someone has changed max_ptes_none, and if they have kept thp_low_util_shrinker enabled, and if all the benchmark does is read those pages, thus giving good memory results, then that benchmark is not really useful, and the good results it gives are not unrealistic but a result of these patches. The stress example in the cover letter is one such case. With these patches you can run stress, or any other benchmark that behaves like this, and still run other memory-consuming applications at the same time, so the improvement is not unrealistic.

For the second effect, memory faults affecting latency-sensitive applications: if THP is always enabled and such applications are running out of memory, causing the shrinker to run, then the higher priority should be to have memory to run with (which the shrinker will provide) rather than stalling for memory, which creates memory pressure and results in latency spikes and possibly the OOM killer being invoked and killing the application.
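
As a rough illustration of the max_ptes_none behaviour described above, here is a minimal userspace sketch in C. It is not the code from patch 5: the names thp_underutilized_sketch, page_is_zero, SKETCH_PAGE_SIZE and SKETCH_HPAGE_PMD_NR are made up for illustration, a plain memcmp stands in for the kernel's zero check, and the exact early-exit threshold for non-zero pages is an assumption derived from "more than max_ptes_none zero pages means underutilized".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical constants: 4K base pages, 2M THP (512 base pages) as on x86. */
#define SKETCH_PAGE_SIZE    4096
#define SKETCH_HPAGE_PMD_NR 512

/* True if the base page holds only zero bytes (stand-in for the kernel check). */
static bool page_is_zero(const unsigned char *page)
{
    /* First byte is zero and every byte equals its successor => all zero. */
    return page[0] == 0 &&
           memcmp(page, page + 1, SKETCH_PAGE_SIZE - 1) == 0;
}

/*
 * Userspace model of the decision: a THP counts as underutilized when it
 * has more than max_ptes_none zero-filled base pages.
 */
static bool thp_underutilized_sketch(const unsigned char *thp, int max_ptes_none)
{
    int num_zero = 0, num_filled = 0;

    assert(thp != NULL);

    /* Default setting: nothing is scanned, no memory is touched. */
    if (max_ptes_none >= SKETCH_HPAGE_PMD_NR - 1)
        return false;

    for (int i = 0; i < SKETCH_HPAGE_PMD_NR; i++) {
        if (page_is_zero(thp + (size_t)i * SKETCH_PAGE_SIZE)) {
            /* Enough zero pages seen: worth splitting. */
            if (++num_zero > max_ptes_none)
                return true;
        } else {
            /*
             * Assumed early exit: once this many pages are non-zero,
             * the zero count can no longer exceed max_ptes_none.
             */
            if (++num_filled >= SKETCH_HPAGE_PMD_NR - max_ptes_none)
                return false;
        }
    }
    return false;
}
```

With max_ptes_none at 511 the function returns before reading any byte of the THP, which is the "as if the series didn't exist" case; lower values trade some scanning for the chance to reclaim zero-filled pages.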

I think we should focus on real-world applications, for which I have posted numbers in the cover letter, and not tailor this to some benchmarks. If there is a real-world low-latency application where you can show these patches causing an issue, I would be happy to look into it. But again, with the default max_ptes_none of 511, they wouldn't.

> The other problem I have with your patch is that it may cause the kernel
> to pollute CPU caches in the background, which again will cause noise in
> the system. Instead of plain memchr_inv, you should probably use some
> primitive to bypass caches or use a NTA prefetch hint at least.
>

A few points on this:
- The page is checked in 2 places, at shrink time and at split time, so having the page in cache is useful and needed.
- Things like this are already done in the kernel under memory pressure, e.g. at swap time [1]. It's not memchr_inv, but it does exactly the same thing as memchr_inv.
- At the time the shrinker runs, one of the highest priorities of the kernel/system is to get free memory. We should not try to make this slower by messing around with caches.

I think the current behaviour in the patches is good because of the above points. Also, I don't think there is a standard way of doing an NTA prefetch across all architectures: x86 prefetch does it [2], but arm64 prefetch [3] does pldl1keep, i.e. it keeps the data in L1 cache, which is the opposite of what an NTA prefetch is intended to do.

[1] https://elixir.bootlin.com/linux/v6.10.4/source/mm/zswap.c#L1390
[2] https://elixir.bootlin.com/linux/v6.10.4/source/arch/x86/include/asm/processor.h#L614
[3] https://elixir.bootlin.com/linux/v6.10.4/source/arch/arm64/include/asm/processor.h#L360