Date: Wed, 16 Oct 2024 21:42:04 -0700 (PDT)
From: David Rientjes
To: James Gowans, Dave Hansen, David Hildenbrand, Matthew Wilcox, Mike Rapoport, Pasha Tatashin, Peter Xu, Alexander Graf, Ashish Kalra, Tom Lendacky, David Woodhouse, Anthony Yznaga, Jason Gunthorpe, Andrew Morton, Frank van der Linden, Vipin Sharma, David Matlack, Steve Rutherford, Erdem Aktas, Alper Gun, Vishal Annapurve, Ackerley Tng, Sagi Shahar
cc: linux-mm@kvack.org, kexec@lists.infradead.org
Subject: Pmemfs/guestmemfs discussion recap and open questions

Hi all,

We had a very interesting discussion today led by James Gowans in the Linux MM Alignment Session, thank you James! And thanks to everybody who attended and provided great questions, suggestions, and feedback.

Guestmemfs[*] is proposed to provide an in-memory persistent filesystem primarily aimed at Kexec Hand-Over (KHO) use cases: 1GB allocations, no struct pages, unmapped from the kernel direct map. The memory for this filesystem is set aside by the memblock allocator as defined by the kernel command line (like guestmemfs=900G on a 1TB system).

----->o-----

Feedback from David Hildenbrand was that we may want to leverage HVO to get struct page savings, and the alignment was to define this as part of the filesystem configuration: do you want all struct pages to be gone and memory unmapped from the kernel direct map, or memory in the kernel direct map with tail pages freed for I/O? You get to choose!

----->o-----

It was noted that the premise for guestmemfs sounded very similar to guest_memfd, a filesystem that would index non-anonymous guest_memfds; indeed, this is not dissimilar to persistent guest_memfd. The new kernel would need to present the fds to userspace so they can be used once again, so a filesystem abstraction may make sense. We may also want to use uid and gid permissions.

It's highly desirable for the kernel to share the same infrastructure and source code, like struct page optimizations and unmapping from the kernel direct map, and name the guest_memfd. We'd want to avoid duplicating this, but it's still questionable how this would be glued together.

David Hildenbrand brought up the idea of a persistent filesystem that even databases could use that may not be guest_memfd. Persistent filesystems do exist, but lack the 1GB memory allocation requirement; if we were to support databases or other workloads that want to persist memory across kexec, this instead would become a new optimized filesystem for generic use cases that require persistence. Mike Rapoport noted that tying the ability to persist memory across kexec to only guests would preclude this without major changes.

Frank van der Linden noted the abstraction between guest_memfd and guestmemfs doesn't mesh very well and we may want to do this at the allocator level instead: basically a factory that gives you exactly what you want -- memory unmapped from the kernel direct map, with HVO instead, etc.

Jason Gunthorpe noted there's a desire to add iommufd connections to guest_memfd and that would have to be duplicated for guestmemfs. KVM has special connections to it, ioctls, etc. So likely a whole new API surface is coming around guest_memfd that guestmemfs will want to re-use.

To support this, it was also noted that guest_memfd is largely used for confidential computing and pKVM today, and confidential computing is a requirement for cloud providers: they need to expose a guest_memfd style interface for such VMs as well.

Jason suggested that when you create a file on the filesystem, you tell it exactly what you want: unmapped memory, guest_memfd semantics, or just a plain file. James expanded on this by brainstorming an API for such use cases, backed by this new kind of allocator to provide exactly what you need.

----->o-----

James also noted some users are interested in smaller regions of memory that aren't preallocated, like tmpfs, so there is interest in "persistent tmpfs," including dynamic sizing. This may be tricky because tmpfs uses page cache. In this case, the preallocation would not be needed. Mike Rapoport noted the same is the case for memory mapped into the kernel direct map which is not required for persistence (including if you want to do I/O).

The tricky part of this is to determine what should and should not be solved with the same solution. Is it acceptable to have something like guestmemfs which is very specific to cloud providers running VMs in most of their host memory?

Matthew Wilcox noted there perhaps are ways to support persistence in tmpfs, such as with swap, for this other use case. James noted this could be used for things like systemd information that people have brought up for containerization. He indicated we should ensure KHO can mark tmpfs pages to be persistent. We'd need to follow up with Alex.

----->o-----

Pasha Tatashin asked about NUMA support with the current guestmemfs proposal. James noted this would be an essential requirement. When specifying the kernel command line with guestmemfs=, we could specify the lengths required from each NUMA node. This would result in per-node mount points.

----->o-----

Peter Xu asked if IOMMU page tables could be stored on the guestmemfs themselves to preserve across kexec. James noted previous solutions for this existed, but were tricky because of filesystem ordering at boot. This led to the conclusion that if we want persistent devices, then we need persistent memory as well; only files from guestmemfs that are known to be persistent can be mapped into a persistent VMA domain. In the case of IOMMU page tables, the IOMMU driver needs to tell KHO that they must be persisted.

----->o-----

My takeaway, based on the feedback that was provided in the discussion:

 - we need an allocator abstraction for persistent memory that can return memory with various characteristics: 1GB or not, kernel direct map or not, HVO or not, etc.
 - built on top of that, we need the ability to carve out very large ranges of memory (cloud provider use case) with NUMA awareness on the kernel command line

 - we also need the ability to dynamically resize this or provide hints at allocation time that memory must be persisted across kexec to support the non-cloud provider use case

 - we need a filesystem abstraction that maps memory of the type that is requested, including guest_memfd, and then deals with all the fun of multitenancy since it would be drawing from a finite per-NUMA node pool of persistent memory

 - absolutely critical to this discussion is defining what is the core infrastructure that is required for a generally acceptable solution and then what builds off of that to be more special cased (like the cloud provider use case or persistent tmpfs use case)

We're looking to continue that discussion here and then come together again in a few weeks.

Thanks!

[*] https://lore.kernel.org/kvm/20240805093245.889357-1-jgowans@amazon.com/
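
To make the allocator takeaway above a little more concrete, here is a rough userspace model of the idea, written in Python purely for illustration. Nothing in it is a proposed kernel interface: the names Traits, NodePool and alloc are hypothetical, and the rule that HVO only applies to memory kept in the direct map is an assumption of this sketch that mirrors the two options raised in the discussion. It only models the bookkeeping: a caller states the characteristics it wants, and the request is served from a finite per-NUMA node pool.

```python
from dataclasses import dataclass
from enum import Flag, auto

GIB = 1 << 30


class Traits(Flag):
    """Hypothetical characteristics a caller can request."""
    NONE = 0
    HUGE_1G = auto()        # allocate in 1GB units
    DIRECT_MAPPED = auto()  # keep the range in the kernel direct map
    HVO = auto()            # keep struct pages, free the tail pages
    PERSIST = auto()        # must survive kexec


@dataclass
class NodePool:
    """Finite persistent pool reserved for one NUMA node (hypothetical)."""
    node: int
    free_bytes: int

    def alloc(self, size: int, traits: Traits) -> dict:
        # Assumption of this sketch: HVO only applies to memory that stays
        # in the direct map, mirroring the two options from the discussion.
        if Traits.HVO in traits and Traits.DIRECT_MAPPED not in traits:
            raise ValueError("HVO requested for memory outside the direct map")
        if Traits.HUGE_1G in traits:
            size = -(-size // GIB) * GIB  # round up to a 1GB multiple
        if size > self.free_bytes:
            raise MemoryError(f"node {self.node}: pool exhausted")
        self.free_bytes -= size
        return {"node": self.node, "size": size, "traits": traits}


# Pools sized per node, as a per-node guestmemfs= reservation might produce.
pools = {0: NodePool(0, 4 * GIB), 1: NodePool(1, 2 * GIB)}
guest = pools[0].alloc(3 * GIB // 2, Traits.HUGE_1G | Traits.PERSIST)
print(guest["size"] // GIB, pools[0].free_bytes // GIB)
```

The multitenancy concern shows up even in a toy like this: once a node's pool is exhausted, further requests against that node fail, so whatever sits on top (filesystem or otherwise) needs a policy for who gets to draw from it.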