From: Mateusz Guzik
Date: Tue, 3 Dec 2024 16:57:11 +0100
Subject: Re: [PATCH] mm/hugetlb: optionally pre-zero hugetlb pages
To: Joao Martins
Cc: Michal Hocko, Frank van der Linden, linux-mm@kvack.org, akpm@linux-foundation.org, Muchun Song, Miaohe Lin, Oscar Salvador, David Hildenbrand, Peter Xu, linux-kernel@vger.kernel.org
References: <20241202202058.3249628-1-fvdl@google.com> <3tqmyo3qqaykszxmrmkaa3fo5hndc4ok6xrxozjvlmq5qjv4cs@2geqqedyfzcf>

On Tue, Dec 3, 2024 at 3:26 PM Joao Martins wrote:
>
> On 03/12/2024 12:06, Michal Hocko wrote:
> > If the startup latency is a real problem, is there a way to work around
> > that in userspace by preallocating hugetlb pages ahead of time, before
> > those VMs are launched, and handing over the already pre-allocated pages?
>
> It should be relatively simple to actually do this.
> Mike and I had experimented with this ourselves a couple of years back, but
> we never had the chance to send it over. IIRC, if we:
>
> - add the PageZeroed tracking bit when a page is zeroed
> - clear it in the write (fixup/non-fixup) fault path
>
> [somewhat similar to this series, I suspect]
>
> Then what's left is to change the lookup of free hugetlb pages
> (dequeue_hugetlb_folio_node_exact(), I think) to search first for
> non-zeroed pages. Provided we don't track its 'cleared' state, there's no
> UAPI change in behaviour. A daemon can just allocate/mmap+touch/etc. them
> read-only and free them back 'as zeroed' to implement a userspace
> scrubber. And in principle, existing apps should see no difference. The
> amount of change is consequently significantly smaller (or it looked that
> way in a quick PoC years back).
>
> Something extra on top would perhaps be the ability to select a lookup
> heuristic, so that we can pick the search method
> (non-zero-first/only-nonzero/zeroed pages) behind an ioctl() (or a better
> generic UAPI), allowing a scrubber to easily coexist with a hugepage user
> (e.g. a VMM) without too much of a dance.
>

Yeah, after the QEMU prefaulting was pointed out, I started thinking about
a userlevel daemon that would do the work proposed here.

However, I got stuck on finding a good way to do it. The mmap + load from
the area + munmap triple does work, but it entails more overhead than
necessary, and I only have some handwaving about how to avoid it. :)

Suppose a daemon of this sort exists and there is a machine with 4 or
more NUMA domains to deal with. Further suppose it spawns at least one
thread per such domain and tasksets them accordingly.

Then perhaps an ioctl somewhere on hugetlbfs(?) could take a parameter
indicating how many pages to zero out (or even just accept one page).
This would avoid the munmap overhead.

This would still need the majority of the patch, but all the zeroing
policy would be taken out.
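A rough userspace sketch of the daemon described above, for illustration only. It assumes Linux with glibc and a 2 MiB default huge page size; each worker thread is pinned to one NUMA node's CPUs (read from /sys/devices/system/node/nodeN/cpulist) and scrubs with the mmap + load + munmap triple that works today. All names here (parse_cpulist(), scrub_via_mmap(), struct node_worker and its batch knob) are made up, and the hugetlbfs ioctl mentioned in the comments is hypothetical. Without something like the PageZeroed tracking discussed above, the kernel does not remember that pages freed this way were zeroed, so this only shows the mechanics.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define HPAGE_SIZE (2UL * 1024 * 1024) /* assumed default: 2 MiB */

/* Parse a sysfs CPU list such as "0-3,8-11\n" into *set.
 * Returns the number of CPUs set, or -1 on malformed input. */
static int parse_cpulist(const char *s, cpu_set_t *set)
{
    int count = 0;
    assert(s != NULL && set != NULL);
    CPU_ZERO(set);
    while (*s && *s != '\n') {
        char *end;
        long lo = strtol(s, &end, 10), hi = lo;
        if (end == s || lo < 0)
            return -1;
        if (*end == '-') {
            s = end + 1;
            hi = strtol(s, &end, 10);
            if (end == s || hi < lo)
                return -1;
        }
        for (long cpu = lo; cpu <= hi && cpu < CPU_SETSIZE; cpu++)
            if (!CPU_ISSET(cpu, set)) {
                CPU_SET(cpu, set);
                count++;
            }
        s = (*end == ',') ? end + 1 : end;
        if (s == end && *s && *s != '\n')
            return -1;
    }
    return count;
}

/* The mmap + load + munmap triple: read-fault npages huge pages (the
 * kernel zeroes each one on first fault), then unmap them so they go
 * back to the hugetlb pool. Returns the number of pages touched, or -1
 * if mmap fails, e.g. when no free huge pages are available. */
static long scrub_via_mmap(size_t npages)
{
    if (npages == 0)
        return 0;
    size_t len = npages * HPAGE_SIZE;
    unsigned char *p = mmap(NULL, len, PROT_READ,
                            MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    for (size_t i = 0; i < npages; i++) {
        unsigned char v = ((volatile unsigned char *)p)[i * HPAGE_SIZE];
        (void)v;
    }
    munmap(p, len);
    return (long)npages;
}

struct node_worker {
    int node;       /* NUMA node id */
    size_t batch;   /* huge pages to scrub per pass (made-up knob) */
    cpu_set_t cpus; /* that node's CPUs, filled in from sysfs */
    long scrubbed;  /* result of the last pass */
};

static void *node_worker_fn(void *arg)
{
    struct node_worker *w = arg;
    /* "taskset" the thread onto its node (best effort, error ignored);
     * under the default mempolicy hugetlb faults prefer the local node. */
    pthread_setaffinity_np(pthread_self(), sizeof(w->cpus), &w->cpus);
    /* HYPOTHETICAL: a hugetlbfs ioctl taking "how many pages to zero"
     * would replace this call and skip the munmap cost. None exists. */
    w->scrubbed = scrub_via_mmap(w->batch);
    return NULL;
}

/* Read the node's CPU list from sysfs and start a pinned worker.
 * Returns 0 on success, -1 if the node is absent or has no CPUs. */
static int start_node_worker(struct node_worker *w, pthread_t *tid)
{
    char path[64], buf[1024];
    snprintf(path, sizeof(path),
             "/sys/devices/system/node/node%d/cpulist", w->node);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    char *ok = fgets(buf, sizeof(buf), f);
    fclose(f);
    if (!ok || parse_cpulist(buf, &w->cpus) <= 0)
        return -1;
    return pthread_create(tid, NULL, node_worker_fn, w) == 0 ? 0 : -1;
}
```

A caller would fill in .node and .batch per node, call start_node_worker() for each online node and pthread_join() the workers. Node IDs can be sparse, so a real daemon would rather walk /sys/devices/system/node/online than count up from node0.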
The key point is that whatever specific behavior one sees fit can be
implemented in userspace, which avoids the need for future kernel patches
that add more tweaks.

-- 
Mateusz Guzik