From: Frank van der Linden
Date: Tue, 3 Dec 2024 10:43:42 -0800
Subject: Re: [PATCH] mm/hugetlb: optionally pre-zero hugetlb pages
To: Joao Martins
Cc: Michal Hocko, Mateusz Guzik, linux-mm@kvack.org, akpm@linux-foundation.org, Muchun Song, Miaohe Lin, Oscar Salvador, David Hildenbrand, Peter Xu, linux-kernel@vger.kernel.org
References: <20241202202058.3249628-1-fvdl@google.com> <3tqmyo3qqaykszxmrmkaa3fo5hndc4ok6xrxozjvlmq5qjv4cs@2geqqedyfzcf>

On Tue, Dec 3, 2024 at 6:26 AM Joao Martins wrote:
>
> On 03/12/2024 12:06, Michal Hocko wrote:
> > On Mon 02-12-24 14:50:49, Frank van der Linden wrote:
> >> On Mon, Dec 2, 2024 at 1:58 PM Mateusz Guzik wrote:
> >>> Any games with "background zeroing" are notoriously crappy and I would
> >>> argue one should exhaust other avenues before going there -- at the end
> >>> of the day the cost of zeroing will have to get paid.
> >>
> >> I understand that the concept of background prezeroing has been, and
> >> will be, met with some resistance. But, do you have any specific
> >> concerns with the patch I posted? It's pretty well isolated from the
> >> rest of the code, and optional.
> >
> > The biggest concern I have is that the overhead is payed by everybody on
> > the system - it is considered to be a system overhead regardless only
> > part of the workload benefits from hugetlb pages. In other words the
> > workload using those pages is not accounted for the use completely.
> >
> > If the startup latency is a real problem is there a way to workaround
> > that in the userspace by preallocating hugetlb pages ahead of time
> > before those VMs are launched and hand over already pre-allocated pages?
>
> It should be relatively simple to actually do this. Me and Mike had experimented
> ourselves a couple years back but we never had the chance to send it over. IIRC
> if we:
>
> - add the PageZeroed tracking bit when a page is zeroed
> - clear it in the write (fixup/non-fixup) fault-path
>
> [somewhat similar to this series I suspect]
>
> Then what's left is to change the lookup of free hugetlb pages
> (dequeue_hugetlb_folio_node_exact() I think) to search first for non-zeroed
> pages. Provided we don't track its 'cleared' state, there's no UAPI change in
> behaviour. A daemon can just allocate/mmap+touch/etc them with read-only and
> free them back 'as zeroed' to implement a userspace scrubber. And in principle
> existing apps should see no difference. The amount of changes is consequently
> significantly smaller (or it looked as such in a quick PoC years back).

This would work, and is easy to do, but:

* You now have a userspace daemon that depends on kernel-internal behavior.
* It has no way to track how much work is left to do or what needs to be
  done (unless it is part of an application that is the only user of
  hugetlbfs on the system).

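For readers who want the quoted scheme in concrete form, here is a rough userspace toy model in C. It is only a sketch under stated assumptions: `struct toy_page` and the `toy_*` helpers are hypothetical names invented for illustration, not kernel structures or functions, and none of this is taken from the posted patch. It models the three pieces described in the quote: a per-page "zeroed" bit set when the page is cleared, the bit being dropped on a write fault, and a free-page lookup that hands out non-zeroed pages first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a hugetlb page in the pool; not a kernel type. */
struct toy_page {
    bool free;    /* page is sitting in the free pool */
    bool zeroed;  /* contents are known to be all zero */
};

/*
 * Take a free page out of the pool, searching first for non-zeroed pages
 * and falling back to a zeroed one. Returns the page index, or -1 if the
 * pool has no free page.
 */
static int toy_dequeue(struct toy_page *pool, size_t n)
{
    int fallback = -1;

    assert(pool != NULL);
    for (size_t i = 0; i < n; i++) {
        if (!pool[i].free)
            continue;
        if (!pool[i].zeroed) {
            pool[i].free = false;
            return (int)i;
        }
        if (fallback < 0)
            fallback = (int)i;
    }
    if (fallback >= 0)
        pool[fallback].free = false;
    return fallback;
}

/* Model a first-touch fault: clear the page if needed, then track writes. */
static void toy_fault_in(struct toy_page *p, bool write)
{
    assert(p != NULL);
    if (!p->zeroed) {
        /* a real implementation would clear the page contents here */
        p->zeroed = true;
    }
    if (write)
        p->zeroed = false;  /* a write fault drops the tracking bit */
}

/* Return a page to the pool; the zeroed bit is kept as it is. */
static void toy_free(struct toy_page *p)
{
    assert(p != NULL);
    p->free = true;
}
```

In this model a scrubber is simply a loop of `toy_dequeue()`, `toy_fault_in(page, false)` and `toy_free()`. Note that nothing in it tells the scrubber how many non-zeroed pages remain, which is the second objection in the list above.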
>
> Something extra on the top would perhaps be the ability so select a lookup
> heuristic such that we can pick the search method of
> non-zero-first/only-nonzero/zeroed pages behind ioctl() (or a better generic
> UAPI) to allow a scrubber to easily coexist with hugepage user (e.g. a VMM, etc)
> without too much of a dance.

Again, that would probably work, but if you take a step back: you now
have a kernel behavior that can be guided in certain directions, but
with no guarantees and no stats to see if things are working out, plus
an explicit allocation method option (basically: take from the head or
the tail of the freelist). The picture is getting murkier. At least
with the patch I sent you have a clearly defined, optional behavior
that can be switched on or off, and stats to see if it's working.

I do understand the objection that pre-zeroing is not accounted to the
thread that ends up using the pages. I would counter that benefiting
from work done by kernel threads is not unheard of in the kernel today
already. Also, the other proposals so far also have another thread
doing the zeroing - it just is explicitly started by userspace. So, the
cost is still not paid by the user of the pages. You just end up
explicitly controlling who does pay the cost. Which I suppose is
better, but it's still not trivial to get it completely right (you
perhaps could do it at the container level with some trickery).

What we have done so far is to bind the khzerod threads introduced in
this patch to CPUs in such a way that they don't interfere with the
rest of the system. Which you would also have to do with any userspace
solution.

Again, this is optional - if you are a system manager who prefers to
have the resources used by zeroing hugetlb pages explicitly accounted
to the actual user, you can simply leave this behavior disabled (it's
off by default).

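As an illustration of the CPU binding a userspace zeroing daemon would need, the following is a minimal C sketch using the Linux `sched_getaffinity()` and `sched_setaffinity()` calls. The helper names `first_allowed_cpu()` and `pin_self_to_cpu()` are hypothetical, and picking the lowest allowed CPU is an arbitrary choice for the example; a real deployment would choose CPUs according to its own isolation policy. How the khzerod threads from the patch are bound is not shown here.

```c
#define _GNU_SOURCE
#include <sched.h>

/* Return the lowest CPU the calling thread may run on, or -1 on error. */
static int first_allowed_cpu(void)
{
    cpu_set_t set;

    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &set))
            return cpu;
    }
    return -1;
}

/*
 * Restrict the calling thread to a single CPU so that its zeroing work
 * stays off the CPUs used by the rest of the system.
 * Returns 0 on success, -1 on failure with errno set.
 */
static int pin_self_to_cpu(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}
```

A worker would call `pin_self_to_cpu()` once at startup, before it begins touching pages. Pinning only controls where the work runs; which cgroup or container gets charged for it is a separate question.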
I guess I can summarize my thoughts like this: while I understand the
argument against doing this outside of the context of the actual user
of the pages, 1) this is optional, and 2) the other solutions proposed
so far either introduce interfaces that I don't think are that great,
or would require maintaining a hugetlb 'shadow pool' in userspace
through hugetlbfs files.

- Frank