From: Frank van der Linden
Date: Wed, 30 Oct 2024 15:43:57 -0700
Subject: Re: Pmemfs/guestmemfs discussion recap and open questions
To: Mike Rapoport
Cc: David Rientjes, James Gowans, Dave Hansen, David Hildenbrand,
    Matthew Wilcox, Mike Rapoport, Pasha Tatashin, Peter Xu,
    Alexander Graf, Ashish Kalra, Tom Lendacky, David Woodhouse,
    Anthony Yznaga, Jason Gunthorpe, Andrew Morton, Vipin Sharma,
    David Matlack, Steve Rutherford, Erdem Aktas, Alper Gun,
    Vishal Annapurve, Ackerley Tng, Sagi Shahar, linux-mm@kvack.org,
    kexec@lists.infradead.org
References: <5b6613bc-beef-79e6-62ed-23de4dfafe51@google.com>

On Tue, Oct 29, 2024 at 9:39 AM Mike Rapoport wrote:
>
> Hi David,
>
> On Fri, Oct 25, 2024 at 11:07:27PM -0700, David Rientjes wrote:
> > On Wed, 16 Oct 2024, David Rientjes wrote:
> >
> > > ----->o-----
> > > My takeaway: based on the feedback that was provided in the discussion:
> > >
> > >  - we need an allocator abstraction for persistent memory that can return
> > >    memory with various characteristics: 1GB or not, kernel direct map or
> > >    not, HVO or not, etc.
> > >
> > >  - built on top of that, we need the ability to carve out very large
> > >    ranges of memory (cloud provider use case) with NUMA awareness on the
> > >    kernel command line
> > >
> >
> > Following up on this, I think this physical memory allocator would also be
> > possible to use as a backend for hugetlb.  Hopefully this would be an
> > allocator that would be generally useful for multiple purposes, something
> > like a mm/phys_alloc.c.
>
> Can you elaborate on this? mm/page_alloc.c already allocates physical
> memory :)
>
> Or you mean an allocator that will deal with memory carved out from what page
> allocator manages?
>
> > Frank van der Linden may also have thoughts on the above?

Yeah, 'physical allocator' is a bit of a misnomer. You're right, an
allocator that deals with memory not under page allocator control is a
better description.

To elaborate a bit: there are various scenarios where allocating
contiguous stretches of physical memory is useful, for example HugeTLB
or VM guest memory. Another is where you are presented with an external
range of VM_PFNMAP memory and need to manage it in a simple way and
hand it out for guest memory support (see NVIDIA's GitHub for
nvgrace-egm).

However, all of these cases may come with slightly different
requirements: is the memory purely external? Does it have struct pages?
If so, is it in the direct map? Is the memmap for the memory optimized
(HVO-style)? Does it need to be persistent? When does it need to be
zeroed out?

So that's why it seems like a good idea to come up with a slightly more
generalized version of a pool allocator - something that manages
(usually larger) chunks of physically contiguous memory. A pool is
initialized with certain properties (persistence, etc). It has methods
to grow and shrink the pool if needed. It's in no way meant to be
anywhere near as sophisticated as the page allocator; that would not be
useful (and would be pointless code duplication). A simple fixed-size
chunk pool will satisfy a lot of these cases.
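
To make the shape of such a pool concrete, here is a minimal user-space
sketch in C. It is only a toy model under stated assumptions, not the
prototype and not kernel code: every name in it (struct chunk_pool, the
POOL_* flags, the chunk_pool_*() functions) is hypothetical and chosen
for illustration. The flags only record the properties a pool was
created with; the model does not act on them.

```c
/* Toy user-space model of a fixed-size chunk pool. All names are
 * hypothetical; this is not kernel code and not the prototype's API. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Properties chosen at pool creation (assumed set, for illustration). */
enum chunk_pool_flags {
    POOL_PERSISTENT   = 1u << 0, /* contents should survive a kexec */
    POOL_NO_DIRECTMAP = 1u << 1, /* memory is not in the direct map */
    POOL_HVO          = 1u << 2, /* memmap is optimized, HVO-style */
    POOL_ZERO_ON_FREE = 1u << 3, /* zero chunks when they are returned */
};

struct chunk_pool {
    size_t chunk_size;  /* bytes per chunk, e.g. 1 GiB */
    unsigned int flags; /* enum chunk_pool_flags bits */
    size_t nr_chunks;   /* current pool size, in chunks */
    size_t nr_used;     /* chunks currently handed out */
    bool *used;         /* used[i] is true while chunk i is allocated */
};

/* Set up an empty pool; memory is added later with chunk_pool_grow(). */
void chunk_pool_init(struct chunk_pool *p, size_t chunk_size,
                     unsigned int flags)
{
    p->chunk_size = chunk_size;
    p->flags = flags;
    p->nr_chunks = 0;
    p->nr_used = 0;
    p->used = NULL;
}

/* Add nr free chunks at the tail. Returns 0 on success, -1 on failure. */
int chunk_pool_grow(struct chunk_pool *p, size_t nr)
{
    bool *used;

    if (nr == 0)
        return 0;
    used = realloc(p->used, (p->nr_chunks + nr) * sizeof(*used));
    if (!used)
        return -1;
    for (size_t i = 0; i < nr; i++)
        used[p->nr_chunks + i] = false;
    p->used = used;
    p->nr_chunks += nr;
    return 0;
}

/* Release up to nr free chunks from the tail; stops at the first chunk
 * still in use. Returns the number released. The tracking array is not
 * resized, to keep the sketch short. */
size_t chunk_pool_shrink(struct chunk_pool *p, size_t nr)
{
    size_t removed = 0;

    while (removed < nr && p->nr_chunks > 0 &&
           !p->used[p->nr_chunks - 1]) {
        p->nr_chunks--;
        removed++;
    }
    return removed;
}

/* Hand out the lowest free chunk. Returns its index, or -1 if none. */
long chunk_pool_alloc(struct chunk_pool *p)
{
    for (size_t i = 0; i < p->nr_chunks; i++) {
        if (!p->used[i]) {
            p->used[i] = true;
            p->nr_used++;
            return (long)i;
        }
    }
    return -1;
}

/* Return a previously allocated chunk to the pool. */
void chunk_pool_free(struct chunk_pool *p, long idx)
{
    assert(idx >= 0 && (size_t)idx < p->nr_chunks && p->used[idx]);
    p->used[idx] = false;
    p->nr_used--;
}
```

A caller would initialize a pool with, say, a 1 GiB chunk size and
POOL_PERSISTENT | POOL_HVO, grow it by a few chunks, and then allocate
and free chunk indices. A linear scan over a per-chunk flag is enough
here because the number of chunks is small when the chunks are large;
a real implementation would also need locking and NUMA placement, which
this sketch leaves out.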

A number of the building blocks are already there: there's CMA, and
there's ZONE_DEVICE, which has tools to manipulate some of these
properties (by going through a hotremove / hotplug cycle).

I created a simple prototype that essentially uses CMA as a pool
provider, and uses some ZONE_DEVICE tools to initialize memory however
you want it when it's added to the pool. I also added some new init
code to avoid things like unneeded memmap allocation at boot for
hugetlbfs pages. I put hugetlbfs on top of it - but in a restricted way
for prototyping purposes (no reservations, no demotion).

Anyway, this is the basic idea.

- Frank