From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Mon, 13 Apr 2026 13:05:19 -0400
From: Gregory Price <gourry@gourry.net>
To: "David Hildenbrand (Arm)"
Cc: lsf-pc@lists.linux-foundation.org, linux-kernel@vger.kernel.org,
	linux-cxl@vger.kernel.org, cgroups@vger.kernel.org, linux-mm@kvack.org,
	linux-trace-kernel@vger.kernel.org, damon@lists.linux.dev,
	kernel-team@meta.com, gregkh@linuxfoundation.org, rafael@kernel.org,
	dakr@kernel.org, dave@stgolabs.net, jonathan.cameron@huawei.com,
	dave.jiang@intel.com, alison.schofield@intel.com,
	vishal.l.verma@intel.com, ira.weiny@intel.com, dan.j.williams@intel.com,
	longman@redhat.com, akpm@linux-foundation.org,
	lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, vbabka@suse.cz,
	rppt@kernel.org, surenb@google.com, mhocko@suse.com, osalvador@suse.de,
	ziy@nvidia.com, matthew.brost@intel.com, joshua.hahnjy@gmail.com,
	rakie.kim@sk.com, byungchul@sk.com, ying.huang@linux.alibaba.com,
	apopple@nvidia.com, axelrasmussen@google.com, yuanchu@google.com,
	weixugc@google.com, yury.norov@gmail.com, linux@rasmusvillemoes.dk,
	mhiramat@kernel.org, mathieu.desnoyers@efficios.com, tj@kernel.org,
	hannes@cmpxchg.org, mkoutny@suse.com, jackmanb@google.com,
	sj@kernel.org, baolin.wang@linux.alibaba.com, npache@redhat.com,
	ryan.roberts@arm.com, dev.jain@arm.com, baohua@kernel.org,
	lance.yang@linux.dev, muchun.song@linux.dev, xu.xin16@zte.com.cn,
	chengming.zhou@linux.dev, jannh@google.com, linmiaohe@huawei.com,
	nao.horiguchi@gmail.com, pfalcato@suse.de, rientjes@google.com,
	shakeel.butt@linux.dev, riel@surriel.com, harry.yoo@oracle.com,
	cl@gentwo.org, roman.gushchin@linux.dev, chrisl@kernel.org,
	kasong@tencent.com, shikemeng@huaweicloud.com, nphamcs@gmail.com,
	bhe@redhat.com, zhengqi.arch@bytedance.com, terry.bowman@amd.com
Subject: Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
References: <20260222084842.1824063-1-gourry@gourry.net>
 <3342acb5-8d34-4270-98a2-866b1ff80faf@kernel.org>
 <2608a03b-72bb-4033-8e6f-a439502b5573@kernel.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <2608a03b-72bb-4033-8e6f-a439502b5573@kernel.org>

On Mon, Apr 13, 2026 at 03:11:12PM +0200, David Hildenbrand (Arm) wrote:
> > Normally cloud-hypervisor VMs with virtio-net can't be subject to KSM
> > because the entire boot region gets marked shared.
>
> What exactly do you mean with "mark shared". Do you mean, that "shared
> memory" is used in the hypervisor for all boot memory?
>

Sorry, I meant MAP_SHARED. But yes, in some setups the hypervisor simply
creates a memfd covering the entire main memory region and maps it
MAP_SHARED. This is because the virtio-net device / network stack does
GFP_KERNEL allocations and then pins them on the host to allow zero-copy -
so all of ZONE_NORMAL is a valid target. (At least that's my best
understanding of the entire setup.)

>
> You mean, in the VM, memory usable by virtio-net can only be consumed
> from a dedicated physical memory region, and that region would be a
> separate node?
>

Correct - it does require teaching the network stack NUMA awareness. I
was surprised by how little code this required, though I can't be 100%
sure of its correctness since networking isn't my normal space.

Alternatively, you could imagine this as a real device bringing its own
dedicated memory for network buffers, and then telling the network stack
"Hey, prefer this node over normal kernel allocations".

What I'd been hacking on was cobbled together with memfd + SRAT bits to
bring up a private node statically and then have the device claim it -
but this is just a proof of concept. A proper implementation would
extend virtio-net to report a dedicated EFI_RESERVED region.

> > I see you saw below that one of the extensions is removing the nodes
> > from the fallback list. That is part one, but it's insufficient to
> > prevent complete leakage (someone might iterate over the nodes-possible
> > list and try migrating memory).
>
> Which code would do that?
>

There are many callers of for_each_node() throughout the system, but
here is one concrete example:

int alloc_shrinker_info(struct mem_cgroup *memcg)
{
	... snip ...
	for_each_node(nid) {
		struct shrinker_info *info;

		info = kvzalloc_node(sizeof(*info) + array_size,
				     GFP_KERNEL, nid);
	... snip ...
}

If you disallow fallbacks in this scenario, this allocation always
fails. This partially answers your question about slub fallback
allocations: there are slab allocations like this that depend on
fallbacks (more below on this explicitly).

> > Basically the only isolation mechanism we have today is ZONE_DEVICE.
> >
> > Either via mbind and friends, or even just the driver itself managing it
> > directly via alloc_pages_node() and exposing some userland interface.
>
> Would mbind() work here? I thought mbind() would not suddenly give
> access to some ZONE_DEVICE memory.
>

Sorry, these were orthogonal thoughts.

1) We don't have such a mechanism. ZONE_DEVICE's preferred mechanism is
   setting up explicit migrations via migrate_device.c

2) mbind / alloc_pages_node would only work for private nodes.

Extending ZONE_DEVICE to enable mbind() would be an extreme lift, as the
kernel makes a lot of assumptions about folio->lru. This is why I went
the node route in the first place.

> > in the NP_OPS_MIGRATION patch, this gets covered.
>
> Right, but I am not sure if NP_OPS_MIGRATION is really the right
> approach for that. Have to think about that.
>

So, OPS is a bit misleading, but it's the closest I came to some
existing pattern. OPS does not necessarily need to imply callbacks.

I've been trying to minimize the patch set, and I'm starting to think
the MVP may actually be able to do away with the private_ops structure
for a basic migration+mempolicy example by simply teaching some services
(migrate.c, mempolicy.c) how/when to inject __GFP_PRIVATE. The
mempolicy.c patch already does this, but not migrate.c - I haven't
figured out the right pattern for that yet.
> > 1) as you note, removing it from the default bitmaps, which is actually
> >    hard. You can't remove it from the possible-node bitmap, so that
> >    just seemed non-tractable.
>
> What about making people use a different set of bitmaps here? Quite some
> work, but maybe that's the right direction given that we'll now treat
> some nodes differently.
>

It's an option, although it is fragile. That means having to police all
future users of possible-nodes, for_each_node, etc. I've been erring on
the side of "not fragile", but I'm open to rework.

> > 2) __GFP_THISNODE actually means (among other things) "don't fallback".
> >    And, in fact, there are some hotplug-time allocations that occur in
> >    SLAB (pglist_data) that target the private node that *must* fallback
> >    to successfully allocate for successful kernel operation.
>
> Can you point me at the code?
>

There is actually a comment in slub.c that addresses this directly:

static int slab_mem_going_online_callback(int nid)
{
	... snip ...
	/*
	 * XXX: kmem_cache_alloc_node will fallback to other nodes
	 *      since memory is not yet available from the node that
	 *      is brought up.
	 */
	n = kmem_cache_alloc(kmem_cache_node, GFP_KERNEL);
	... snip ...
}

Slab basically acknowledges that this fallback behavior is required on
existing nodes and just falls back immediately for the "going online"
path.

Other specific calls in the hotplug path:

mm/sparse.c:         kzalloc_node(size, GFP_KERNEL, nid)
mm/sparse-vmemmap.c: alloc_pages_node(nid, GFP_KERNEL|...)
mm/slub.c:           kmalloc_node(sizeof(*barn), GFP_KERNEL, nid)

There are quite a number of callers of kmem_cache_alloc_node() that
would have to be individually audited.
And some non-slab interface examples as well:

alloc_shrinker_info
alloc_node_nr_active

I've been looking at this for a while, but I'm starting to think trying
to touch all this surface area is simply too fragile compared to just
letting normal memory be a fallback for private nodes and adding:

__GFP_PRIVATE - unlocks the private node, but allows fallback

#define GFP_PRIVATE (__GFP_PRIVATE | __GFP_THISNODE) - only this node

__GFP_PRIVATE vs GFP_PRIVATE then is just a matter of use case. For
mbind() it probably makes sense we'd use GFP_PRIVATE - either it
succeeds or it OOMs.

> > The flexibility is kind of the point :]
>
> Yeah, but it would be interesting which minimal support we would need to
> just let some special memory be managed by the kernel, allowing mbind()
> users to use it, but not have any other fallback allocations end up on it.
>
> Something very basic, on which we could build additional functionality.
>

I actually have a simplistic CXL driver that does exactly this:

https://github.com/gourryinverse/linux/blob/072ecf7cbebd9871e76c0b52fd99aa1321405a59/drivers/cxl/type3_drivers/cxl_mempolicy/mempolicy.c#L65

We have to support migration because mbind can migrate on bind if the
VMA already has memory - but all this means is that the migrate
interfaces are live, not that the kernel actually uses them.

So mbind requires (OPS_MIGRATE | OPS_MEMPOLICY).

All these flags say is:
- move_pages() syscalls can accept these nodes
- migrate_pages() function calls can accept these nodes
- mempolicy.c nodemasks allow the nodes (should restrict to mbind)
- VMAs with these nodes now inject __GFP_PRIVATE on fault

All other services (reclaim, compaction, khugepaged, etc.) do not scan
these nodes and do not know about __GFP_PRIVATE, so they never see
private node folios and can't allocate from the node.
In the CXL mempolicy driver example, all migrate_to() really does is
inject __GFP_THISNODE, and I've been thinking about whether we can just
do this in migrate.c and leave implementing the .ops to a user that
requires it. But otherwise "it just works".

One note here though - OOM conditions and allocation failures are not
intuitive, especially when THP/non-order-0 allocations are involved.
But that might just mean this minimal setup should only allow order-0
allocations - which is fiiiiiiiiiiiiiine :P.

-----------------

I've implemented 4 basic examples to consider building on:

1) CXL mempolicy driver:
https://github.com/gourryinverse/linux/blob/072ecf7cbebd9871e76c0b52fd99aa1321405a59/drivers/cxl/type3_drivers/cxl_mempolicy/mempolicy.c#L65

As described above.

2) Virtio-net / CXL.mem Network Card (not published yet)

This doesn't require any ops at all - the plumbing happens entirely
inside the kernel. I onlined the node with an SRAT hack and no ops
structure at all associated with the device (just set node affinity on
the pcie_dev and plumbed it through the network stack). A proper
implementation would have virtio-net register its own reserved memory
region and online it during probe.

3) Accelerator (not published yet)

I have converted an open source but out-of-tree GPU driver which uses
NUMA nodes to use private nodes instead. This required:

NP_OPS_MIGRATION
NP_OPS_MEMPOLICY

The pattern is very similar to the CXL mempolicy driver, except that
the driver had alloc_pages_node() calls that needed __GFP_PRIVATE added
to ensure allocations landed on the device.

4) CXL Compressed RAM driver:
https://github.com/gourryinverse/linux/blob/55c06eb6bced58132d9001e318f2958e8ac80614/mm/cram.c#L340

This needs pretty much everything - it's "normal memory" with access
rules, so the driver isn't really in the management lifecycle. In this
example, the only way to allocate memory on the node is via demotion.
This allows us to close off the device to new allocations if the
hardware reports low memory but the OS perceives the device to still
have free memory. Which is a cool example: the driver just sets up the
node with certain attributes and then lets the kernel deal with it.

I have started compacting the _OPS_* flags related to reclaim into a
single NP_OPS_RECLAIM flag while testing with this. Really I've come
around to thinking many mm/ services need to be taken as a package, not
fully piecemeal.

The tl;dr: once you cede some control over to the kernel, you're very
close to ceding ALL control - but you still get some control over
how/when allocations on the node can be made.

It is important to note that even if we don't expose callbacks, we do
still need a modicum of node filtering in some places that still use
for_each_node() (vmscan.c, compaction.c, oom_kill.c, etc.). These are
basically all the places ZONE_DEVICE *implicitly* opts itself out of by
having managed_pages=0. We have to make those situations explicit - but
that doesn't mean we need callbacks.

> > I would simply state: "That depends on the memory device"
>
> Let's keep it very simple: just some memory that you mbind(), and you
> only want the mbind() user to make use of that memory.
>
> What would be the minimal set of hooks to guarantee that.
>

If you want the mbind contract to stay intact:

NP_OPS_MIGRATION (mbind can generate migrations)
NP_OPS_MEMPOLICY (this just tells mempolicy.c to allow the node)

The set of callbacks required should be exactly 0 (assuming we teach
migrate.c to inject __GFP_PRIVATE like we have mempolicy.c).

If your device requires some special notification on allocation, free,
or migration to/from the node, you need:

ops.free_folio(folio)
ops.migrate_to(folios, nid, mode, reason, nr_success)
ops.migrate_folio(src_folio, dst_folio)

The free path is the tricky one to get right.
You can imagine:

buf = malloc(...);
mbind(buf, private_node);
memset(buf, 0x42, ...);
ioctl(driver, CHECK_OUT_THIS_DATA, buf);
exit(0);

The task dies and frees the pages back to the buddy - the question is
whether the 4-5 free-folio paths (put_folio, put_unref_folios, etc.)
can all eat an ops.free_folio() callback to inform the driver that the
memory has been freed.

In practice, this worked in my accelerator and compressed-RAM examples,
but I can't say it's 100% safe in all contexts. The free path needs
more scrutiny.

> For example, I assume compaction could just be supported for such
> memory? Similarly, longterm-pinning.
>
> For some of the other hooks it's rather unclear how they would affect
> the very simple mbind() rule. What is the effect of demotion or NUMA
> balancing?
>
> I'm afraid we're making things too complicated here or it might be the
> wrong abstraction, if i cannot even figure out how to make the simplest
> use case work.
>
> Maybe I'm wrong :)
>

Actually, quite the opposite: none of that should be engaged by
default. In our above example:

OPS_MIGRATION | OPS_MEMPOLICY

all this should say is that migration and mempolicy are supported - not
that anything in the kernel that uses migration will suddenly operate
on that memory.

So: compaction, longterm pin, NUMA balancing, demotion, etc. - none of
these ever operate on this memory by default. Your device driver or
service would have to specifically opt in to those services and must be
capable of dealing with the implications of that.

---

Kind of neat aside: you can hotplug private ZONE_NORMAL without
NP_OPS_LONGTERMPIN, and as long as the driver/service controls the
type/lifetime of allocations, the node can remain hot-unpluggable in
the future. E.g. if the service only ever makes movable allocations,
the lack of NP_OPS_LONGTERMPIN prevents those pages from being pinned.
If you add NP_OPS_MIGRATION, the attempt to pin will cause migration :]

~Gregory