From: Mateusz Guzik
Date: Tue, 3 Dec 2024 17:17:48 +0100
Subject: Re: [PATCH] mm/hugetlb: optionally pre-zero hugetlb pages
To: Joao Martins
Cc: Michal Hocko, Frank van der Linden, linux-mm@kvack.org, akpm@linux-foundation.org, Muchun Song, Miaohe Lin, Oscar Salvador, David Hildenbrand, Peter Xu, linux-kernel@vger.kernel.org

On Tue, Dec 3, 2024 at 4:57 PM Mateusz Guzik wrote:
>
> On Tue, Dec 3, 2024 at 3:26 PM Joao Martins wrote:
> >
> > On 03/12/2024 12:06, Michal Hocko wrote:
> > > If the startup latency is a real problem is there a way to workaround
> > > that in the userspace by preallocating hugetlb pages ahead of time
> > > before those VMs are launched and hand over already pre-allocated pages?
> >
> > It should be relatively simple to actually do this. Me and Mike had experimented
> > ourselves a couple years back but we never had the chance to send it over. IIRC
> > if we:
> >
> > - add the PageZeroed tracking bit when a page is zeroed
> > - clear it in the write (fixup/non-fixup) fault-path
> >
> > [somewhat similar to this series I suspect]
> >
> > Then what's left is to change the lookup of free hugetlb pages
> > (dequeue_hugetlb_folio_node_exact() I think) to search first for non-zeroed
> > pages. Provided we don't track its 'cleared' state, there's no UAPI change in
> > behaviour. A daemon can just allocate/mmap+touch/etc them with read-only and
> > free them back 'as zeroed' to implement a userspace scrubber. And in principle
> > existing apps should see no difference. The amount of changes is consequently
> > significantly smaller (or it looked as such in a quick PoC years back).
> >
> > Something extra on the top would perhaps be the ability so select a lookup
> > heuristic such that we can pick the search method of
> > non-zero-first/only-nonzero/zeroed pages behind ioctl() (or a better generic
> > UAPI) to allow a scrubber to easily coexist with hugepage user (e.g. a VMM, etc)
> > without too much of a dance.
> >
>
> Ye after the qemu prefaulting got pointed out I started thinking about
> a userlevel daemon which would do the work proposed here.
>
> Except I got stuck at a good way to do it. The mmap + load from the
> area + munmap triple does work but also entails more overhead than
> necessary, but I only have some handwaving how to not do it. :)
>
> Suppose a daemon of the sort exists and there is a machine with 4 or
> more NUMA domains to deal with. Further suppose it spawns at least one
> thread per such domain and tasksets them accordingly.
>
> Then perhaps an ioctl somewhere on hugetlbfs(?) could take a parameter
> indicating how many pages to zero out (or even just accept one page).
> This would avoid crap on munmap.
>
> This would still need majority of the patch, but all the zeroing
> policy would be taken out. Key point being that whatever specific
> behavior one sees fit, they can implement it in userspace, preventing
> future kernel patches to add more tweaks.

How about this for a rough sketch (which I have 0 intention of
implementing myself):

/dev/hugepagectl or whatever is created with a bunch of ioctls, notably:
- something to query hugepage stats
- an event generated for epoll if count in any domain goes below a threshold
- something to zero a page of given size from the free list

Perhaps make it so that fds require an upfront ioctl to set a numa
domain of interest before poll works -- for example if there is one
thread per domain, each of them sleeps on its own relevant fd. Or
maybe someone still wants the main thread to get the full view so they
poll on all of them.

Then a Google-internal tool can react however it sees fit without waking
up in a periodic fashion (replace Google with any other company
which may want to mess with this).

Optional:
- allocating and zeroing (but not mmapping!) a page

Then a party which shares the file descriptor could obtain it by
passing the fd to mmap. munmap would just free it as it does now.

This would allow qemu et al to avoid the mmap/munmap dance just to
zero, but I don't know how useful it is for them.

-- 
Mateusz Guzik
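
For reference, the "mmap + load from the area + munmap" triple discussed
above can be sketched from userspace roughly as follows. This is a minimal
sketch under stated assumptions, not code from the patch or the thread: it
assumes Linux with hugetlb support, the function name touch_one_hugepage()
is made up for illustration, and the caller is assumed to pass the default
huge page size (the Hugepagesize value in /proc/meminfo). On a kernel
without the zeroed-page tracking proposed in the thread, the page is simply
zeroed at fault time and returned to the pool with no record of that, so
the triple only becomes a useful scrubber once such tracking exists.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: fault in one hugetlb page with a single load,
 * then unmap it again.
 *
 * Returns 0 on success or a negative errno value, for example -ENOMEM
 * when the hugetlb pool has no free page of the default size.
 */
int touch_one_hugepage(size_t hugepage_size)
{
	assert(hugepage_size != 0);

	/* Read-only private anonymous mapping backed by the hugetlb pool. */
	void *area = mmap(NULL, hugepage_size, PROT_READ,
			  MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (area == MAP_FAILED)
		return -errno;

	/* The load faults the page in; the kernel zeroes it at that point. */
	volatile const unsigned char *bytes = area;
	unsigned char first = bytes[0];

	/* A freshly faulted page must read back as zero. */
	int rc = (first == 0) ? 0 : -EIO;

	/* Unmapping hands the page back to the free pool. */
	if (munmap(area, hugepage_size) != 0 && rc == 0)
		rc = -errno;

	return rc;
}
```

A per-NUMA-domain thread would call this in a loop after pinning itself to
its domain; each call pays for a full mmap/fault/munmap round trip, which
is the overhead the ioctl idea above is meant to avoid.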