* [RFC PATCH 1/4] mm: Add support for File Based Memory Management
From: Bijan Tabatabai @ 2024-11-22 20:38 UTC
To: linux-fsdevel, linux-mm, btabatabai; +Cc: akpm, viro, brauner, mingo
This patch introduces File Based Memory Management (FBMM), which allows
memory managers that are written as filesystems, similar to HugeTLBFS, to
be used transparently by applications.
The steps for using FBMM are the following:
1) Mount the memory management filesystem (MFS)
2) Enable FBMM by writing 1 to /sys/kernel/mm/fbmm/state
3) Set the MFS an application should allocate its memory from by writing
the MFS's mount directory to /proc/<pid>/fbmm_mnt_dir, where <pid> is the
PID of the target process.
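For example, with a hypothetical MFS type "mymfs" mounted at /mnt/mfs and a
target PID of 1234 (illustrative values, not part of this patch), the steps
could be performed as:

  mount -t mymfs mymfs /mnt/mfs
  echo 1 > /sys/kernel/mm/fbmm/state
  echo /mnt/mfs > /proc/1234/fbmm_mnt_dir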
To have a process use an MFS for the entirety of its execution, one can
use a shim program that writes /proc/self/fbmm_mnt_dir and then calls exec
for the target process. We have created such a shim, which can be found
at [1].
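For reference, a minimal sketch of such a shim is shown below. Error
handling is trimmed and the argument layout is an assumption of this
sketch; the actual shim is at [1].

  /* Assumed usage: fbmm_wrapper <mfs mount dir> <program> [args...] */
  #include <stdio.h>
  #include <unistd.h>

  int main(int argc, char *argv[])
  {
          FILE *f;

          if (argc < 3)
                  return 1;

          /* Tell FBMM which MFS this process should allocate from */
          f = fopen("/proc/self/fbmm_mnt_dir", "w");
          if (!f)
                  return 1;
          fprintf(f, "%s", argv[1]);
          fclose(f);

          /* The setting persists across exec, so the target inherits it */
          execvp(argv[2], &argv[2]);
          return 1;
  }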
Providing this transparency is useful because it allows an application to
use an arbitrary MFS to manage its memory without having to modify the
application.
Writing memory management functionality as an MFS makes it easier to
prototype new MM functionality and to maintain support for a variety of
hardware configurations and application needs.
FBMM was originally created as a research project at the University of
Wisconsin-Madison [2].
The core of FBMM is found in fs/file_based_mm.c. Other parts of the kernel
call into functions in that file to allow processes to allocate their
memory from an MFS without changing the application's code. For example,
the do_mmap function is modified so that when it is called with the
MAP_ANONYMOUS flag by a process using FBMM, fbmm_get_file is called to
acquire a file in the MFS used by the process along with the page offset to
map that file to. do_mmap then proceeds to map that file instead of
anonymous memory, allowing the desired MFS to control the memory behavior
of the mapped region. A similar process happens inside the brk syscall
implementation. Another example is handle_mm_fault, which is modified to
call fbmm_fault for regions using FBMM; fbmm_fault invokes the MFS's page
fault handler.
The main overhead of FBMM comes from creating the files for the process
to memory map. To amortize this cost, we give files created by FBMM a
large virtual size (currently 128GB) and have multiple calls to mmap/brk
share a file. The fbmm_get_file function handles this logic. It takes the
size of a new allocation and the virtual address it will be mapped to. On
a process's first call to fbmm_get_file, it creates a new file and assigns
the file a virtual address range that it can be mapped to. Files created by
FBMM are added to a per-process tree indexed by each file's virtual address
range. On subsequent calls to fbmm_get_file, it searches the tree for a
file that can fit the new memory allocation. If such a file does not exist,
a new file is created and added to the tree of files.
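As an illustrative example: a process's first anonymous mmap of 1GB at
address A creates a file of the default 128GB size and assigns it the range
[A, A + 128GB) (for a bottom-up layout). A later 2GB mmap at an address B
inside that range reuses the same file, mapped at page offset
(B - A) >> PAGE_SHIFT.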
A pointer to an fbmm_info struct is added to task_struct to keep track of
the state used by FBMM. This includes the path to the MFS used by the
process and the tree of files used by the process.
Signed-off-by: Bijan Tabatabai <btabatabai@wisc.edu>
[1] https://github.com/multifacet/fbmm-workspace/blob/main/bmks/fbmm_wrapper.c
[2] https://www.usenix.org/conference/atc24/presentation/tabatabai
---
fs/Kconfig | 7 +
fs/Makefile | 1 +
fs/file_based_mm.c | 564 ++++++++++++++++++++++++++++++++++
fs/proc/base.c | 4 +
include/linux/file_based_mm.h | 81 +++++
include/linux/mm.h | 10 +
include/linux/sched.h | 4 +
kernel/exit.c | 3 +
kernel/fork.c | 3 +
mm/gup.c | 1 +
mm/memory.c | 2 +
mm/mmap.c | 42 ++-
12 files changed, 719 insertions(+), 3 deletions(-)
create mode 100644 fs/file_based_mm.c
create mode 100644 include/linux/file_based_mm.h
diff --git a/fs/Kconfig b/fs/Kconfig
index a46b0cbc4d8f..52994b2491fe 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -96,6 +96,13 @@ config FS_DAX_PMD
depends on ZONE_DEVICE
depends on TRANSPARENT_HUGEPAGE
+config FILE_BASED_MM
+ bool "File Based Memory Management"
+ help
+ This option enables file based memory management (FBMM). FBMM allows users
+ to have a process transparently allocate its memory from a memory manager
+ that is written as a filesystem.
+
# Selected by DAX drivers that do not expect filesystem DAX to support
# get_user_pages() of DAX mappings. I.e. "limited" indicates no support
# for fork() of processes with MAP_SHARED mappings or support for
diff --git a/fs/Makefile b/fs/Makefile
index 6ecc9b0a53f2..f1a5e540fe72 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -45,6 +45,7 @@ obj-$(CONFIG_FS_POSIX_ACL) += posix_acl.o
obj-$(CONFIG_NFS_COMMON) += nfs_common/
obj-$(CONFIG_COREDUMP) += coredump.o
obj-$(CONFIG_SYSCTL) += drop_caches.o sysctls.o
+obj-$(CONFIG_FILE_BASED_MM) += file_based_mm.o
obj-$(CONFIG_FHANDLE) += fhandle.o
obj-y += iomap/
diff --git a/fs/file_based_mm.c b/fs/file_based_mm.c
new file mode 100644
index 000000000000..c05797d51cb3
--- /dev/null
+++ b/fs/file_based_mm.c
@@ -0,0 +1,564 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/types.h>
+#include <linux/file_based_mm.h>
+#include <linux/sysfs.h>
+#include <linux/kobject.h>
+#include <linux/namei.h>
+#include <linux/fs.h>
+#include <linux/mman.h>
+#include <linux/security.h>
+#include <linux/vmalloc.h>
+#include <linux/falloc.h>
+#include <linux/timekeeping.h>
+#include <linux/maple_tree.h>
+#include <linux/sched.h>
+#include <linux/kthread.h>
+#include <linux/pagemap.h>
+#include <linux/mm.h>
+
+#include "proc/internal.h"
+
+enum file_based_mm_state {
+ FBMM_OFF = 0,
+ FBMM_ON = 1,
+};
+
+#define FBMM_DEFAULT_FILE_SIZE (128L << 30)
+struct fbmm_file {
+ struct file *f;
+ /* The starting virtual address assigned to this file (inclusive) */
+ unsigned long va_start;
+ /* The ending virtual address assigned to this file (exclusive) */
+ unsigned long va_end;
+};
+
+static enum file_based_mm_state fbmm_state = FBMM_OFF;
+
+const int GUA_OPEN_FLAGS = O_EXCL | O_TMPFILE | O_RDWR;
+const umode_t GUA_OPEN_MODE = S_IFREG | 0600;
+
+static struct fbmm_info *fbmm_create_new_info(char *mnt_dir_str)
+{
+ struct fbmm_info *info;
+ int ret;
+
+ info = kmalloc(sizeof(struct fbmm_info), GFP_KERNEL);
+ if (!info)
+ return NULL;
+
+ info->mnt_dir_str = mnt_dir_str;
+ ret = kern_path(mnt_dir_str, LOOKUP_DIRECTORY | LOOKUP_FOLLOW, &info->mnt_dir_path);
+ if (ret) {
+ kfree(info);
+ return NULL;
+ }
+
+ info->get_unmapped_area_file = file_open_root(&info->mnt_dir_path, "",
+ GUA_OPEN_FLAGS, GUA_OPEN_MODE);
+ if (IS_ERR(info->get_unmapped_area_file)) {
+ path_put(&info->mnt_dir_path);
+ kfree(info);
+ return NULL;
+ }
+
+ mt_init(&info->files_mt);
+
+ return info;
+}
+
+static void drop_fbmm_file(struct fbmm_file *file)
+{
+ if (atomic_dec_return(&file->refcount) == 0) {
+ fput(file->f);
+ kfree(file);
+ }
+}
+
+static pmdval_t fbmm_alloc_pmd(struct vm_fault *vmf)
+{
+ struct mm_struct *mm = vmf->vma->vm_mm;
+ unsigned long address = vmf->address;
+ pgd_t *pgd;
+ p4d_t *p4d;
+
+ pgd = pgd_offset(mm, address);
+ p4d = p4d_alloc(mm, pgd, address);
+ if (!p4d)
+ return VM_FAULT_OOM;
+
+ vmf->pud = pud_alloc(mm, p4d, address);
+ if (!vmf->pud)
+ return VM_FAULT_OOM;
+
+ vmf->pmd = pmd_alloc(mm, vmf->pud, address);
+ if (!vmf->pmd)
+ return VM_FAULT_OOM;
+
+ vmf->orig_pmd = pmdp_get_lockless(vmf->pmd);
+
+ return pmd_val(*vmf->pmd);
+}
+
+inline bool is_vm_fbmm_page(struct vm_area_struct *vma)
+{
+ return !!(vma->vm_flags & VM_FBMM);
+}
+
+int fbmm_fault(struct vm_area_struct *vma, unsigned long address, unsigned int flags)
+{
+ struct vm_fault vmf = {
+ .vma = vma,
+ .address = address & PAGE_MASK,
+ .real_address = address,
+ .flags = flags,
+ .pgoff = linear_page_index(vma, address),
+ .gfp_mask = mapping_gfp_mask(vma->vm_file->f_mapping) | __GFP_FS | __GFP_IO,
+ };
+
+ if (fbmm_alloc_pmd(&vmf) == VM_FAULT_OOM)
+ return VM_FAULT_OOM;
+
+ return vma->vm_ops->fault(&vmf);
+}
+
+bool use_file_based_mm(struct task_struct *tsk)
+{
+ if (fbmm_state == FBMM_OFF)
+ return false;
+ else
+ return tsk->fbmm_info && tsk->fbmm_info->mnt_dir_str;
+}
+
+unsigned long fbmm_get_unmapped_area(unsigned long addr, unsigned long len,
+ unsigned long pgoff, unsigned long flags)
+{
+ struct fbmm_info *info;
+
+ info = current->fbmm_info;
+ if (!info)
+ return -EINVAL;
+
+ return get_unmapped_area(info->get_unmapped_area_file, addr, len, pgoff, flags);
+}
+
+struct file *fbmm_get_file(struct task_struct *tsk, unsigned long addr, unsigned long len,
+ unsigned long prot, int flags, bool topdown, unsigned long *pgoff)
+{
+ struct file *f;
+ struct fbmm_file *fbmm_file;
+ struct fbmm_info *info;
+ struct path *path;
+ int open_flags = O_EXCL | O_TMPFILE;
+ unsigned long truncate_len;
+ umode_t open_mode = S_IFREG;
+ s64 ret = 0;
+
+ info = tsk->fbmm_info;
+ if (!info)
+ return NULL;
+
+ /* Does a file exist that will already fit this mmap call? */
+ fbmm_file = mt_prev(&info->files_mt, addr + 1, 0);
+ if (fbmm_file) {
+ /*
+ * Just see if this mmap will fit inside the file.
+ * We don't need to check if other mappings in the file overlap
+ * because get_unmapped_area should have done that already.
+ */
+ if (fbmm_file->va_start <= addr && addr + len <= fbmm_file->va_end) {
+ f = fbmm_file->f;
+ goto end;
+ }
+ }
+
+ /* Determine what flags to use for the call to open */
+ if (prot & PROT_EXEC)
+ open_mode |= 0100;
+
+ if ((prot & (PROT_READ | PROT_WRITE)) == (PROT_READ | PROT_WRITE)) {
+ open_flags |= O_RDWR;
+ open_mode |= 0600;
+ } else if (prot & PROT_WRITE) {
+ open_flags |= O_WRONLY;
+ open_mode |= 0200;
+ } else if (prot & PROT_READ) {
+ /* It doesn't make sense for anon memory to be read only */
+ return NULL;
+ }
+
+ path = &info->mnt_dir_path;
+ f = file_open_root(path, "", open_flags, open_mode);
+ if (IS_ERR(f))
+ return NULL;
+
+ /*
+ * It takes time to create new files and create new VMAs for mappings
+ * with different files, so we want to create huge files that we can reuse
+ * for different calls to mmap
+ */
+ if (len < FBMM_DEFAULT_FILE_SIZE)
+ truncate_len = FBMM_DEFAULT_FILE_SIZE;
+ else
+ truncate_len = len;
+ ret = vfs_truncate(&f->f_path, truncate_len);
+ if (ret) {
+ filp_close(f, current->files);
+ return (struct file *)ret;
+ }
+
+ fbmm_file = kmalloc(sizeof(struct fbmm_file), GFP_KERNEL);
+ if (!fbmm_file) {
+ filp_close(f, current->files);
+ return NULL;
+ }
+ fbmm_file->f = f;
+ if (topdown) {
+ /*
+ * Since VAs in this region grow down, this mapping will be the
+ * "end" of the file
+ */
+ fbmm_file->va_end = addr + len;
+ fbmm_file->va_start = fbmm_file->va_end - truncate_len;
+ } else {
+ fbmm_file->va_start = addr;
+ fbmm_file->va_end = addr + truncate_len;
+ }
+
+ mtree_store(&info->files_mt, fbmm_file->va_start, fbmm_file, GFP_KERNEL);
+
+end:
+ if (f && !IS_ERR(f))
+ *pgoff = (addr - fbmm_file->va_start) >> PAGE_SHIFT;
+
+ return f;
+}
+
+void fbmm_populate_file(unsigned long start, unsigned long len)
+{
+ struct fbmm_info *info;
+ struct fbmm_file *file = NULL;
+ loff_t offset;
+
+ info = current->fbmm_info;
+ if (!info)
+ return;
+
+ file = mt_prev(&info->files_mt, start, 0);
+ if (!file || file->va_end <= start)
+ return;
+
+ offset = start - file->va_start;
+ vfs_fallocate(file->f, 0, offset, len);
+}
+
+int fbmm_munmap(struct task_struct *tsk, unsigned long start, unsigned long len)
+{
+ struct fbmm_info *info = NULL;
+ struct fbmm_file *fbmm_file = NULL;
+ struct fbmm_file *prev_file = NULL;
+ unsigned long end = start + len;
+ unsigned long falloc_start_offset, falloc_end_offset, falloc_len;
+ int ret = 0;
+
+ info = tsk->fbmm_info;
+ if (!info)
+ return 0;
+
+ /*
+ * Finds the last (by va_start) mapping where file->va_start <= start, so we have to
+ * check this file is actually within the range
+ */
+ fbmm_file = mt_prev(&info->files_mt, start + 1, 0);
+ if (!fbmm_file || fbmm_file->va_end <= start)
+ goto exit;
+
+ /*
+ * Since the ranges overlap, we have to keep going backwards until we find
+ * the first mapping where file->va_start <= start and file->va_end > start
+ */
+ while (1) {
+ prev_file = mt_prev(&info->files_mt, fbmm_file->va_start, 0);
+ if (!prev_file || prev_file->va_end <= start)
+ break;
+ fbmm_file = prev_file;
+ }
+
+ /*
+ * A munmap call can span multiple memory ranges, so we might have to do this
+ * multiple times
+ */
+ while (fbmm_file) {
+ if (start > fbmm_file->va_start)
+ falloc_start_offset = start - fbmm_file->va_start;
+ else
+ falloc_start_offset = 0;
+
+ if (fbmm_file->va_end <= end)
+ falloc_end_offset = fbmm_file->va_end - fbmm_file->va_start;
+ else
+ falloc_end_offset = end - fbmm_file->va_start;
+
+ falloc_len = falloc_end_offset - falloc_start_offset;
+
+ ret = vfs_fallocate(fbmm_file->f,
+ FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+ falloc_start_offset, falloc_len);
+
+ fbmm_file = mt_next(&info->files_mt, fbmm_file->va_start, ULONG_MAX);
+ if (!fbmm_file || fbmm_file->va_end <= start)
+ break;
+ }
+
+exit:
+ return ret;
+}
+
+static void fbmm_free_info(struct task_struct *tsk)
+{
+ struct fbmm_file *file;
+ struct fbmm_info *info = tsk->fbmm_info;
+ unsigned long index = 0;
+
+ mt_for_each(&info->files_mt, file, index, ULONG_MAX) {
+ drop_fbmm_file(file);
+ }
+ mtree_destroy(&info->files_mt);
+
+ if (info->mnt_dir_str) {
+ path_put(&info->mnt_dir_path);
+ fput(info->get_unmapped_area_file);
+ kfree(info->mnt_dir_str);
+ }
+ kfree(info);
+}
+
+void fbmm_exit(struct task_struct *tsk)
+{
+ if (tsk->tgid != tsk->pid)
+ return;
+
+ if (!tsk->fbmm_info)
+ return;
+
+ fbmm_free_info(tsk);
+}
+
+int fbmm_copy(struct task_struct *src_tsk, struct task_struct *dst_tsk, u64 clone_flags)
+{
+ struct fbmm_info *info;
+ char *buffer;
+ char *src_dir;
+
+ /* If this new task is just a thread, not a new process, just copy fbmm info */
+ if (clone_flags & CLONE_THREAD) {
+ dst_tsk->fbmm_info = src_tsk->fbmm_info;
+ return 0;
+ }
+
+ /* Does the src actually have a default mnt dir */
+ if (!use_file_based_mm(src_tsk)) {
+ dst_tsk->fbmm_info = NULL;
+ return 0;
+ }
+ info = src_tsk->fbmm_info;
+
+ /* Make a new fbmm_info with the same mnt dir */
+ src_dir = info->mnt_dir_str;
+
+ buffer = kstrndup(src_dir, PATH_MAX, GFP_KERNEL);
+ if (!buffer)
+ return -ENOMEM;
+
+ dst_tsk->fbmm_info = fbmm_create_new_info(buffer);
+ if (!dst_tsk->fbmm_info) {
+ kfree(buffer);
+ return -ENOMEM;
+ }
+
+ return 0;
+}
+
+static ssize_t fbmm_state_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ return sprintf(buf, "%d\n", fbmm_state);
+}
+
+static ssize_t fbmm_state_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ int state;
+ int ret;
+
+ ret = kstrtoint(buf, 0, &state);
+
+ if (ret != 0) {
+ fbmm_state = FBMM_OFF;
+ return ret;
+ } else if (state == 0) {
+ fbmm_state = FBMM_OFF;
+ } else {
+ fbmm_state = FBMM_ON;
+ }
+ return count;
+}
+static struct kobj_attribute fbmm_state_attribute =
+__ATTR(state, 0644, fbmm_state_show, fbmm_state_store);
+
+static struct attribute *file_based_mm_attr[] = {
+ &fbmm_state_attribute.attr,
+ NULL,
+};
+
+static const struct attribute_group file_based_mm_attr_group = {
+ .attrs = file_based_mm_attr,
+};
+
+static ssize_t fbmm_mnt_dir_read(struct file *file, char __user *ubuf,
+ size_t count, loff_t *ppos)
+{
+ struct task_struct *task = get_proc_task(file_inode(file));
+ char *buffer;
+ struct fbmm_info *info;
+ size_t len, ret;
+
+ if (!task)
+ return -ESRCH;
+
+ buffer = kmalloc(PATH_MAX + 1, GFP_KERNEL);
+ if (!buffer) {
+ put_task_struct(task);
+ return -ENOMEM;
+ }
+
+ info = task->fbmm_info;
+ if (info && info->mnt_dir_str)
+ len = sprintf(buffer, "%s\n", info->mnt_dir_str);
+ else
+ len = sprintf(buffer, "not enabled\n");
+
+ ret = simple_read_from_buffer(ubuf, count, ppos, buffer, len);
+
+ kfree(buffer);
+ put_task_struct(task);
+
+ return ret;
+}
+
+static ssize_t fbmm_mnt_dir_write(struct file *file, const char __user *ubuf,
+ size_t count, loff_t *ppos)
+{
+ struct task_struct *task;
+ struct path p;
+ char *buffer;
+ struct fbmm_info *info;
+ int ret = 0;
+
+ if (count > PATH_MAX)
+ return -ENOMEM;
+
+ buffer = kmalloc(count + 1, GFP_KERNEL);
+ if (!buffer)
+ return -ENOMEM;
+
+ if (copy_from_user(buffer, ubuf, count)) {
+ kfree(buffer);
+ return -EFAULT;
+ }
+ buffer[count] = 0;
+
+ /*
+ * echo likes to put an extra \n at the end of the string;
+ * if it's there, remove it
+ */
+ if (buffer[count - 1] == '\n')
+ buffer[count - 1] = 0;
+
+ task = get_proc_task(file_inode(file));
+ if (!task) {
+ kfree(buffer);
+ return -ESRCH;
+ }
+
+ /* Check if the given path is actually a valid directory */
+ ret = kern_path(buffer, LOOKUP_DIRECTORY | LOOKUP_FOLLOW, &p);
+ if (!ret) {
+ path_put(&p);
+ info = task->fbmm_info;
+
+ if (!info) {
+ info = fbmm_create_new_info(buffer);
+ task->fbmm_info = info;
+ if (!info)
+ ret = -ENOMEM;
+ } else {
+ /*
+ * Clean up the old directory info, but keep the fbmm files
+ * because the application may still be using them
+ */
+ if (info->mnt_dir_str) {
+ path_put(&info->mnt_dir_path);
+ fput(info->get_unmapped_area_file);
+ kfree(info->mnt_dir_str);
+ }
+
+ info->mnt_dir_str = buffer;
+ ret = kern_path(buffer, LOOKUP_DIRECTORY | LOOKUP_FOLLOW,
+ &info->mnt_dir_path);
+ if (ret)
+ goto end;
+
+ fput(info->get_unmapped_area_file);
+ info->get_unmapped_area_file = file_open_root(&info->mnt_dir_path, "",
+ GUA_OPEN_FLAGS, GUA_OPEN_MODE);
+ if (IS_ERR(info->get_unmapped_area_file))
+ ret = PTR_ERR(info->get_unmapped_area_file);
+ }
+ } else {
+ kfree(buffer);
+
+ info = task->fbmm_info;
+ if (info && info->mnt_dir_str) {
+ kfree(info->mnt_dir_str);
+ path_put(&info->mnt_dir_path);
+ fput(info->get_unmapped_area_file);
+ info->mnt_dir_str = NULL;
+ }
+ }
+
+end:
+ put_task_struct(task);
+ if (ret)
+ return ret;
+ return count;
+}
+
+const struct file_operations proc_fbmm_mnt_dir = {
+ .read = fbmm_mnt_dir_read,
+ .write = fbmm_mnt_dir_write,
+ .llseek = default_llseek,
+};
+
+
+static int __init file_based_mm_init(void)
+{
+ struct kobject *fbmm_kobj;
+ int err;
+
+ fbmm_kobj = kobject_create_and_add("fbmm", mm_kobj);
+ if (unlikely(!fbmm_kobj)) {
+ pr_warn("failed to create the fbmm kobject\n");
+ return -ENOMEM;
+ }
+
+ err = sysfs_create_group(fbmm_kobj, &file_based_mm_attr_group);
+ if (err) {
+ pr_warn("failed to register the fbmm group\n");
+ kobject_put(fbmm_kobj);
+ return err;
+ }
+
+ return 0;
+}
+subsys_initcall(file_based_mm_init);
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 72a1acd03675..ef5688f0ab95 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -97,6 +97,7 @@
#include <linux/resctrl.h>
#include <linux/cn_proc.h>
#include <linux/ksm.h>
+#include <linux/file_based_mm.h>
#include <uapi/linux/lsm.h>
#include <trace/events/oom.h>
#include "internal.h"
@@ -3359,6 +3360,9 @@ static const struct pid_entry tgid_base_stuff[] = {
ONE("ksm_merging_pages", S_IRUSR, proc_pid_ksm_merging_pages),
ONE("ksm_stat", S_IRUSR, proc_pid_ksm_stat),
#endif
+#ifdef CONFIG_FILE_BASED_MM
+ REG("fbmm_mnt_dir", S_IRUGO|S_IWUSR, proc_fbmm_mnt_dir),
+#endif
};
static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx)
diff --git a/include/linux/file_based_mm.h b/include/linux/file_based_mm.h
new file mode 100644
index 000000000000..c1c5e82e36ec
--- /dev/null
+++ b/include/linux/file_based_mm.h
@@ -0,0 +1,81 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _FILE_BASED_MM_H_
+#define _FILE_BASED_MM_H_
+
+#include <linux/types.h>
+#include <linux/mm_types.h>
+#include <linux/fs.h>
+#include <linux/maple_tree.h>
+
+struct fbmm_info {
+ char *mnt_dir_str;
+ struct path mnt_dir_path;
+ /* This file exists just to be passed to get_unmapped_area in mmap */
+ struct file *get_unmapped_area_file;
+ struct maple_tree files_mt;
+};
+
+
+#ifdef CONFIG_FILE_BASED_MM
+extern const struct file_operations proc_fbmm_mnt_dir;
+
+bool use_file_based_mm(struct task_struct *task);
+
+bool is_vm_fbmm_page(struct vm_area_struct *vma);
+int fbmm_fault(struct vm_area_struct *vma, unsigned long address, unsigned int flags);
+unsigned long fbmm_get_unmapped_area(unsigned long addr, unsigned long len, unsigned long pgoff,
+ unsigned long flags);
+struct file *fbmm_get_file(struct task_struct *tsk, unsigned long addr, unsigned long len,
+ unsigned long prot, int flags, bool topdown, unsigned long *pgoff);
+void fbmm_populate_file(unsigned long start, unsigned long len);
+int fbmm_munmap(struct task_struct *tsk, unsigned long start, unsigned long len);
+void fbmm_exit(struct task_struct *tsk);
+int fbmm_copy(struct task_struct *src_tsk, struct task_struct *dst_tsk, u64 clone_flags);
+
+#else /* CONFIG_FILE_BASED_MM */
+
+static inline bool is_vm_fbmm_page(struct vm_area_struct *vma)
+{
+ return 0;
+}
+
+static inline bool use_file_based_mm(struct task_struct *tsk)
+{
+ return false;
+}
+
+static inline int fbmm_fault(struct vm_area_struct *vma, unsigned long address, unsigned int flags)
+{
+ return 0;
+}
+
+static inline unsigned long fbmm_get_unmapped_area(unsigned long addr, unsigned long len,
+ unsigned long pgoff, unsigned long flags)
+{
+ return 0;
+}
+
+static inline struct file *fbmm_get_file(struct task_struct *tsk, unsigned long addr,
+ unsigned long len, unsigned long prot, int flags, bool topdown,
+ unsigned long *pgoff)
+{
+ return NULL;
+}
+
+static inline void fbmm_populate_file(unsigned long start, unsigned long len) {}
+
+static inline int fbmm_munmap(struct task_struct *tsk, unsigned long start, unsigned long len)
+{
+ return 0;
+}
+
+static inline void fbmm_exit(struct task_struct *tsk) {}
+
+static inline int fbmm_copy(struct task_struct *src_tsk, struct task_struct *dst_tsk,
+ u64 clone_flags)
+{
+ return 0;
+}
+#endif /* CONFIG_FILE_BASED_MM */
+
+#endif /* _FILE_BASED_MM_H_ */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index eb7c96d24ac0..614d40ef249a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -31,6 +31,7 @@
#include <linux/kasan.h>
#include <linux/memremap.h>
#include <linux/slab.h>
+#include <linux/file_based_mm.h>
struct mempolicy;
struct anon_vma;
@@ -321,12 +322,14 @@ extern unsigned int kobjsize(const void *objp);
#define VM_HIGH_ARCH_BIT_3 35 /* bit only usable on 64-bit architectures */
#define VM_HIGH_ARCH_BIT_4 36 /* bit only usable on 64-bit architectures */
#define VM_HIGH_ARCH_BIT_5 37 /* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_6 38 /* bit only usable on 64-bit architectures */
#define VM_HIGH_ARCH_0 BIT(VM_HIGH_ARCH_BIT_0)
#define VM_HIGH_ARCH_1 BIT(VM_HIGH_ARCH_BIT_1)
#define VM_HIGH_ARCH_2 BIT(VM_HIGH_ARCH_BIT_2)
#define VM_HIGH_ARCH_3 BIT(VM_HIGH_ARCH_BIT_3)
#define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4)
#define VM_HIGH_ARCH_5 BIT(VM_HIGH_ARCH_BIT_5)
+#define VM_HIGH_ARCH_6 BIT(VM_HIGH_ARCH_BIT_6)
#endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
#ifdef CONFIG_ARCH_HAS_PKEYS
@@ -357,6 +360,12 @@ extern unsigned int kobjsize(const void *objp);
# define VM_SHADOW_STACK VM_NONE
#endif
+#ifdef CONFIG_FILE_BASED_MM
+# define VM_FBMM VM_HIGH_ARCH_6
+#else
+# define VM_FBMM VM_NONE
+#endif
+
#if defined(CONFIG_X86)
# define VM_PAT VM_ARCH_1 /* PAT reserves whole VMA at once (x86) */
#elif defined(CONFIG_PPC)
@@ -3465,6 +3474,7 @@ extern int __mm_populate(unsigned long addr, unsigned long len,
int ignore_errors);
static inline void mm_populate(unsigned long addr, unsigned long len)
{
+ fbmm_populate_file(addr, len);
/* Ignore errors */
(void) __mm_populate(addr, len, 1);
}
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a5f4b48fca18..8a98490618b0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1554,6 +1554,10 @@ struct task_struct {
struct user_event_mm *user_event_mm;
#endif
+#ifdef CONFIG_FILE_BASED_MM
+ struct fbmm_info *fbmm_info;
+#endif
+
/*
* New fields for task_struct should be added above here, so that
* they are included in the randomized portion of task_struct.
diff --git a/kernel/exit.c b/kernel/exit.c
index 81fcee45d630..49a76f7f6cc6 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -70,6 +70,7 @@
#include <linux/sysfs.h>
#include <linux/user_events.h>
#include <linux/uaccess.h>
+#include <linux/file_based_mm.h>
#include <uapi/linux/wait.h>
@@ -824,6 +825,8 @@ void __noreturn do_exit(long code)
WARN_ON(tsk->plug);
+ fbmm_exit(tsk);
+
kcov_task_exit(tsk);
kmsan_task_exit(tsk);
diff --git a/kernel/fork.c b/kernel/fork.c
index 99076dbe27d8..2b47276b1300 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2369,6 +2369,9 @@ __latent_entropy struct task_struct *copy_process(
goto bad_fork_cleanup_perf;
/* copy all the process information */
shm_init_task(p);
+ retval = fbmm_copy(current, p, clone_flags);
+ if (retval)
+ goto bad_fork_cleanup_audit;
retval = security_task_alloc(p, clone_flags);
if (retval)
goto bad_fork_cleanup_audit;
diff --git a/mm/gup.c b/mm/gup.c
index f1d6bc06eb52..762bbaf1cabf 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -22,6 +22,7 @@
#include <asm/mmu_context.h>
#include <asm/tlbflush.h>
+#include <linux/file_based_mm.h>
#include "internal.h"
diff --git a/mm/memory.c b/mm/memory.c
index d10e616d7389..fa2fe3ee0867 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5685,6 +5685,8 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
if (unlikely(is_vm_hugetlb_page(vma)))
ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
+ else if (unlikely(is_vm_fbmm_page(vma)))
+ ret = fbmm_fault(vma, address, flags);
else
ret = __handle_mm_fault(vma, address, flags);
diff --git a/mm/mmap.c b/mm/mmap.c
index 83b4682ec85c..d684d8bd218b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -182,6 +182,7 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
struct vm_area_struct *brkvma, *next = NULL;
unsigned long min_brk;
bool populate = false;
+ bool used_fbmm = false;
LIST_HEAD(uf);
struct vma_iterator vmi;
@@ -256,8 +257,23 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
brkvma = vma_prev_limit(&vmi, mm->start_brk);
/* Ok, looks good - let it rip. */
- if (do_brk_flags(&vmi, brkvma, oldbrk, newbrk - oldbrk, 0) < 0)
- goto out;
+ if (use_file_based_mm(current)) {
+ vm_flags_t vm_flags;
+ unsigned long prot = PROT_READ | PROT_WRITE;
+ unsigned long pgoff = 0;
+ struct file *f = fbmm_get_file(current, oldbrk, newbrk-oldbrk, prot, 0, false,
+ &pgoff);
+
+ if (f) {
+ vm_flags = VM_DATA_DEFAULT_FLAGS | VM_ACCOUNT | mm->def_flags | VM_FBMM;
+ mmap_region(f, oldbrk, newbrk-oldbrk, vm_flags, pgoff, NULL);
+ used_fbmm = true;
+ }
+ }
+ if (!used_fbmm) {
+ if (do_brk_flags(&vmi, brkvma, oldbrk, newbrk - oldbrk, 0) < 0)
+ goto out;
+ }
mm->brk = brk;
if (mm->def_flags & VM_LOCKED)
@@ -1219,6 +1235,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
{
struct mm_struct *mm = current->mm;
int pkey = 0;
+ bool used_fbmm = false;
*populate = 0;
@@ -1278,10 +1295,28 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
+ /* Do we want to use FBMM? */
+ if (!file && (flags & MAP_ANONYMOUS) && use_file_based_mm(current)) {
+ addr = fbmm_get_unmapped_area(addr, len, pgoff, flags);
+
+ if (!IS_ERR_VALUE(addr)) {
+ bool topdown = test_bit(MMF_TOPDOWN, &mm->flags);
+
+ file = fbmm_get_file(current, addr, len, prot, flags, topdown, &pgoff);
+
+ if (file) {
+ used_fbmm = true;
+ flags = flags & ~MAP_ANONYMOUS;
+ vm_flags |= VM_FBMM;
+ }
+ }
+ }
+
/* Obtain the address to map to. we verify (or select) it and ensure
* that it represents a valid section of the address space.
*/
- addr = __get_unmapped_area(file, addr, len, pgoff, flags, vm_flags);
+ if (!used_fbmm)
+ addr = __get_unmapped_area(file, addr, len, pgoff, flags, vm_flags);
if (IS_ERR_VALUE(addr))
return addr;
@@ -2690,6 +2725,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
mmap_read_unlock(mm);
__mt_destroy(&mt_detach);
+ fbmm_munmap(current, start, end - start);
return 0;
clear_tree_failed:
--
2.34.1
* [RFC PATCH 2/4] fbmm: Add helper functions for FBMM MM Filesystems
From: Bijan Tabatabai @ 2024-11-22 20:38 UTC
To: linux-fsdevel, linux-mm, btabatabai; +Cc: akpm, viro, brauner, mingo
This patch adds four helper functions to simplify the implementation of
MFSs.
fbmm_swapout_folio: Takes a folio to swap out.
fbmm_writepage: An implementation of the address_space_operations.writepage
callback that simply writes the page to the default swap space.
fbmm_read_swap_entry: Reads the contents of a swap entry into a page.
fbmm_copy_page_range: Copies the page table corresponding to the VMA of
one process into the page table of another. The pages in both processes
are write protected for CoW.
This patch also adds infrastructure for FBMM to support copy on write.
The dup_mmap function is modified to create new FBMM files for a forked
process. We also add a callback to the super_operations struct called
copy_page_range, which is called in place of the normal copy_page_range
function in dup_mmap to copy the page table entries to the forked process.
The fbmm_copy_page_range helper is our base implementation of this that
MFSs can use to write protect pages for CoW. However, an MFS can have its
own copy_page_range implementation if, for example, its creators prefer to
do a deep copy of the pages on fork.
Logic is added to FBMM to handle multiple processes sharing files. A forked
process keeps a list of the FBMM files it depends on for CoW and takes a
reference on each of them. To ensure one process doesn't free memory
used by another, FBMM will only free memory from a file if its
reference count is 1.
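To illustrate how an MFS might consume these helpers, the snippet below
(hypothetical; names are placeholders) wires them into the relevant
operations structs. BasicMFS in patch 4 is the complete reference.

  static const struct address_space_operations example_mfs_aops = {
          /* Write dirty pages to the default swap space */
          .writepage       = fbmm_writepage,
  };

  static const struct super_operations example_mfs_sops = {
          .statfs          = simple_statfs,
          /* Called from dup_mmap() in place of copy_page_range();
           * write protects pages in both processes for CoW
           */
          .copy_page_range = fbmm_copy_page_range,
  };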
Signed-off-by: Bijan Tabatabai <btabatabai@wisc.edu>
---
fs/exec.c | 2 +
fs/file_based_mm.c | 105 +++++++++-
include/linux/file_based_mm.h | 18 ++
include/linux/fs.h | 1 +
kernel/fork.c | 54 ++++-
mm/Makefile | 1 +
mm/fbmm_helpers.c | 372 ++++++++++++++++++++++++++++++++++
mm/internal.h | 13 ++
mm/vmscan.c | 14 +-
9 files changed, 558 insertions(+), 22 deletions(-)
create mode 100644 mm/fbmm_helpers.c
diff --git a/fs/exec.c b/fs/exec.c
index 40073142288f..f8f8d3d3ccd1 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -68,6 +68,7 @@
#include <linux/user_events.h>
#include <linux/rseq.h>
#include <linux/ksm.h>
+#include <linux/file_based_mm.h>
#include <linux/uaccess.h>
#include <asm/mmu_context.h>
@@ -1900,6 +1901,7 @@ static int bprm_execve(struct linux_binprm *bprm)
user_events_execve(current);
acct_update_integrals(current);
task_numa_free(current, false);
+ fbmm_clear_cow_files(current);
return retval;
out:
diff --git a/fs/file_based_mm.c b/fs/file_based_mm.c
index c05797d51cb3..1feabdea1b77 100644
--- a/fs/file_based_mm.c
+++ b/fs/file_based_mm.c
@@ -30,6 +30,12 @@ struct fbmm_file {
unsigned long va_start;
/* The ending virtual address assigned to this file (exclusive) */
unsigned long va_end;
+ atomic_t refcount;
+};
+
+struct fbmm_cow_list_entry {
+ struct list_head node;
+ struct fbmm_file *file;
};
static enum file_based_mm_state fbmm_state = FBMM_OFF;
@@ -62,6 +68,7 @@ static struct fbmm_info *fbmm_create_new_info(char *mnt_dir_str)
}
mt_init(&info->files_mt);
+ INIT_LIST_HEAD(&info->cow_files);
return info;
}
@@ -74,6 +81,11 @@ static void drop_fbmm_file(struct fbmm_file *file)
}
}
+static void get_fbmm_file(struct fbmm_file *file)
+{
+ atomic_inc(&file->refcount);
+}
+
static pmdval_t fbmm_alloc_pmd(struct vm_fault *vmf)
{
struct mm_struct *mm = vmf->vma->vm_mm;
@@ -212,6 +224,7 @@ struct file *fbmm_get_file(struct task_struct *tsk, unsigned long addr, unsigned
return NULL;
}
fbmm_file->f = f;
+ atomic_set(&fbmm_file->refcount, 1);
if (topdown) {
/*
* Since VAs in this region grow down, this mapping will be the
@@ -300,9 +313,18 @@ int fbmm_munmap(struct task_struct *tsk, unsigned long start, unsigned long len)
falloc_len = falloc_end_offset - falloc_start_offset;
- ret = vfs_fallocate(fbmm_file->f,
- FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
- falloc_start_offset, falloc_len);
+ /*
+ * Because shared mappings via fork are hard, only punch a hole if there
+ * is only one proc using this file.
+ * It would be nice to be able to free the memory if all procs sharing
+ * the file have unmapped it, but that would require tracking usage at
+ * a page granularity.
+ */
+ if (atomic_read(&fbmm_file->refcount) == 1) {
+ ret = vfs_fallocate(fbmm_file->f,
+ FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+ falloc_start_offset, falloc_len);
+ }
fbmm_file = mt_next(&info->files_mt, fbmm_file->va_start, ULONG_MAX);
if (!fbmm_file || fbmm_file->va_end <= start)
@@ -324,6 +346,8 @@ static void fbmm_free_info(struct task_struct *tsk)
}
mtree_destroy(&info->files_mt);
+ fbmm_clear_cow_files(tsk);
+
if (info->mnt_dir_str) {
path_put(&info->mnt_dir_path);
fput(info->get_unmapped_area_file);
@@ -346,6 +370,7 @@ void fbmm_exit(struct task_struct *tsk)
int fbmm_copy(struct task_struct *src_tsk, struct task_struct *dst_tsk, u64 clone_flags)
{
struct fbmm_info *info;
+ struct fbmm_cow_list_entry *src_cow, *dst_cow;
char *buffer;
char *src_dir;
@@ -375,9 +400,83 @@ int fbmm_copy(struct task_struct *src_tsk, struct task_struct *dst_tsk, u64 clon
return -ENOMEM;
}
+ /*
+ * If the source has CoW files, they may also be CoW files in the destination
+ * so we need to copy that too
+ */
+ list_for_each_entry(src_cow, &info->cow_files, node) {
+ dst_cow = kmalloc(sizeof(struct fbmm_cow_list_entry), GFP_KERNEL);
+ if (!dst_cow) {
+ fbmm_free_info(dst_tsk);
+ dst_tsk->fbmm_info = NULL;
+ return -ENOMEM;
+ }
+
+ get_fbmm_file(src_cow->file);
+ dst_cow->file = src_cow->file;
+
+ list_add(&dst_cow->node, &dst_tsk->fbmm_info->cow_files);
+ }
+
return 0;
}
+int fbmm_add_cow_file(struct task_struct *new_tsk, struct task_struct *old_tsk,
+ struct file *file, unsigned long start)
+{
+ struct fbmm_info *new_info;
+ struct fbmm_info *old_info;
+ struct fbmm_file *fbmm_file;
+ struct fbmm_cow_list_entry *cow_entry;
+ unsigned long search_start = start + 1;
+
+ new_info = new_tsk->fbmm_info;
+ old_info = old_tsk->fbmm_info;
+ if (!new_info || !old_info)
+ return -EINVAL;
+
+ /*
+ * Find the fbmm_file that corresponds with the struct file.
+ * fbmm files can overlap, so make sure to find the one that corresponds
+ * to this file
+ */
+ do {
+ fbmm_file = mt_prev(&old_info->files_mt, search_start, 0);
+ if (!fbmm_file || fbmm_file->va_end <= start) {
+ /* Could not find the corresponding fbmm file */
+ return -ENOMEM;
+ }
+ search_start = fbmm_file->va_start;
+ } while (fbmm_file->f != file);
+
+ cow_entry = kmalloc(sizeof(struct fbmm_cow_list_entry), GFP_KERNEL);
+ if (!cow_entry)
+ return -ENOMEM;
+
+ get_fbmm_file(fbmm_file);
+ cow_entry->file = fbmm_file;
+
+ list_add(&cow_entry->node, &new_info->cow_files);
+ return 0;
+}
+
+void fbmm_clear_cow_files(struct task_struct *tsk)
+{
+ struct fbmm_info *info;
+ struct fbmm_cow_list_entry *cow_entry, *tmp;
+
+ info = tsk->fbmm_info;
+ if (!info)
+ return;
+
+ list_for_each_entry_safe(cow_entry, tmp, &info->cow_files, node) {
+ list_del(&cow_entry->node);
+
+ drop_fbmm_file(cow_entry->file);
+ kfree(cow_entry);
+ }
+}
+
static ssize_t fbmm_state_show(struct kobject *kobj,
struct kobj_attribute *attr, char *buf)
{
diff --git a/include/linux/file_based_mm.h b/include/linux/file_based_mm.h
index c1c5e82e36ec..22bb8e890144 100644
--- a/include/linux/file_based_mm.h
+++ b/include/linux/file_based_mm.h
@@ -13,6 +13,7 @@ struct fbmm_info {
/* This file exists just to be passed to get_unmapped_area in mmap */
struct file *get_unmapped_area_file;
struct maple_tree files_mt;
+ struct list_head cow_files;
};
@@ -31,6 +32,16 @@ void fbmm_populate_file(unsigned long start, unsigned long len);
int fbmm_munmap(struct task_struct *tsk, unsigned long start, unsigned long len);
void fbmm_exit(struct task_struct *tsk);
int fbmm_copy(struct task_struct *src_tsk, struct task_struct *dst_tsk, u64 clone_flags);
+int fbmm_add_cow_file(struct task_struct *new_tsk, struct task_struct *old_tsk,
+ struct file *file, unsigned long start);
+void fbmm_clear_cow_files(struct task_struct *tsk);
+
+/* FBMM helper functions for MFSs */
+int fbmm_swapout_folio(struct folio *folio);
+int fbmm_writepage(struct page *page, struct writeback_control *wbc);
+struct page *fbmm_read_swap_entry(struct vm_fault *vmf, swp_entry_t entry, unsigned long pgoff,
+ struct page *page);
+int fbmm_copy_page_range(struct vm_area_struct *dst, struct vm_area_struct *src);
#else /* CONFIG_FILE_BASED_MM */
@@ -76,6 +87,13 @@ static inline int fbmm_copy(struct task_struct *src_tsk, struct task_struct *dst
{
return 0;
}
+
+static inline int fbmm_add_cow_file(struct task_struct *new_tsk, struct task_struct *old_tsk,
+ struct file *file, unsigned long start) {
+ return 0;
+}
+
+static inline void fbmm_clear_cow_files(struct task_struct *tsk) {}
#endif /* CONFIG_FILE_BASED_MM */
#endif /* __FILE_BASED_MM_H */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0283cf366c2a..d38691819880 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2181,6 +2181,7 @@ struct super_operations {
long (*free_cached_objects)(struct super_block *,
struct shrink_control *);
void (*shutdown)(struct super_block *sb);
+ int (*copy_page_range)(struct vm_area_struct *dst, struct vm_area_struct *src);
};
/*
diff --git a/kernel/fork.c b/kernel/fork.c
index 2b47276b1300..249367110519 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -625,8 +625,8 @@ static void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm)
}
#ifdef CONFIG_MMU
-static __latent_entropy int dup_mmap(struct mm_struct *mm,
- struct mm_struct *oldmm)
+static __latent_entropy int dup_mmap(struct task_struct *tsk,
+ struct mm_struct *mm, struct mm_struct *oldmm)
{
struct vm_area_struct *mpnt, *tmp;
int retval;
@@ -732,7 +732,45 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
tmp->vm_ops->open(tmp);
file = tmp->vm_file;
- if (file) {
+ if (file && use_file_based_mm(tsk) &&
+ (tmp->vm_flags & (VM_SHARED | VM_FBMM)) == VM_FBMM) {
+ /*
+ * If this is a private FBMM file, we need to create a new
+ * file for this allocation
+ */
+ bool topdown = test_bit(MMF_TOPDOWN, &mm->flags);
+ unsigned long len = tmp->vm_end - tmp->vm_start;
+ unsigned long prot = 0;
+ unsigned long pgoff;
+ struct file *orig_file = file;
+
+ if (tmp->vm_flags & VM_READ)
+ prot |= PROT_READ;
+ if (tmp->vm_flags & VM_WRITE)
+ prot |= PROT_WRITE;
+ if (tmp->vm_flags & VM_EXEC)
+ prot |= PROT_EXEC;
+
+ /*
+ * topdown may be incorrect if it is true but this region was created
+ * by brk, which grows up; if it is wrong, it will only affect the next
+ * brk allocation
+ */
+ file = fbmm_get_file(tsk, tmp->vm_start, len, prot, 0, topdown, &pgoff);
+ if (!file) {
+ retval = -ENOMEM;
+ goto loop_out;
+ }
+
+ tmp->vm_pgoff = pgoff;
+ tmp->vm_file = get_file(file);
+ call_mmap(file, tmp);
+
+ retval = fbmm_add_cow_file(tsk, current, orig_file, tmp->vm_start);
+ if (retval) {
+ goto loop_out;
+ }
+ } else if (file) {
struct address_space *mapping = file->f_mapping;
get_file(file);
@@ -747,8 +785,12 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
i_mmap_unlock_write(mapping);
}
- if (!(tmp->vm_flags & VM_WIPEONFORK))
- retval = copy_page_range(tmp, mpnt);
+ if (!(tmp->vm_flags & VM_WIPEONFORK)) {
+ if (file && file->f_inode->i_sb->s_op->copy_page_range)
+ retval = file->f_inode->i_sb->s_op->copy_page_range(tmp, mpnt);
+ else
+ retval = copy_page_range(tmp, mpnt);
+ }
if (retval) {
mpnt = vma_next(&vmi);
@@ -1685,7 +1727,7 @@ static struct mm_struct *dup_mm(struct task_struct *tsk,
if (!mm_init(mm, tsk, mm->user_ns))
goto fail_nomem;
- err = dup_mmap(mm, oldmm);
+ err = dup_mmap(tsk, mm, oldmm);
if (err)
goto free_pt;
diff --git a/mm/Makefile b/mm/Makefile
index 8fb85acda1b1..fc5d1c4e0d5e 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -139,3 +139,4 @@ obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o
obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
obj-$(CONFIG_EXECMEM) += execmem.o
+obj-$(CONFIG_FILE_BASED_MM) += fbmm_helpers.o
diff --git a/mm/fbmm_helpers.c b/mm/fbmm_helpers.c
new file mode 100644
index 000000000000..2c3c5522f34c
--- /dev/null
+++ b/mm/fbmm_helpers.c
@@ -0,0 +1,372 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/types.h>
+#include <linux/file_based_mm.h>
+#include <linux/mm.h>
+#include <linux/mm_types.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
+#include <linux/rmap.h>
+#include <linux/blkdev.h>
+#include <linux/mmu_notifier.h>
+#include <linux/swap_slots.h>
+#include <linux/pagewalk.h>
+#include <linux/zswap.h>
+
+#include <asm/tlbflush.h>
+
+#include "internal.h"
+#include "swap.h"
+
+/******************************************************************************
+ * Swap Helpers
+ *****************************************************************************/
+static bool fbmm_try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
+ unsigned long address, void *arg)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
+ pte_t pteval, swp_pte;
+ swp_entry_t entry;
+ struct page *page;
+ bool ret = true;
+ struct mmu_notifier_range range;
+
+ range.end = vma_address_end(&pvmw);
+ mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
+ address, range.end);
+ mmu_notifier_invalidate_range_start(&range);
+
+ while (page_vma_mapped_walk(&pvmw)) {
+ page = folio_page(folio, pte_pfn(*pvmw.pte) - folio_pfn(folio));
+ address = pvmw.address;
+
+ pteval = ptep_clear_flush(vma, address, pvmw.pte);
+
+ if (pte_dirty(pteval))
+ folio_mark_dirty(folio);
+
+ entry.val = page_private(page);
+
+ if (swap_duplicate(entry) < 0) {
+ set_pte_at(mm, address, pvmw.pte, pteval);
+ ret = false;
+ page_vma_mapped_walk_done(&pvmw);
+ break;
+ }
+
+ dec_mm_counter(mm, MM_FILEPAGES);
+ inc_mm_counter(mm, MM_SWAPENTS);
+ swp_pte = swp_entry_to_pte(entry);
+ if (pte_soft_dirty(pteval))
+ swp_pte = pte_swp_mksoft_dirty(swp_pte);
+
+ set_pte_at(mm, address, pvmw.pte, swp_pte);
+
+ folio_remove_rmap_pte(folio, page, vma);
+ folio_put(folio);
+ }
+
+ mmu_notifier_invalidate_range_end(&range);
+
+ return ret;
+}
+
+static int folio_not_mapped(struct folio *folio)
+{
+ return !folio_mapped(folio);
+}
+
+static void fbmm_try_to_unmap(struct folio *folio)
+{
+ struct rmap_walk_control rwc = {
+ .rmap_one = fbmm_try_to_unmap_one,
+ .arg = NULL,
+ .done = folio_not_mapped,
+ };
+
+ rmap_walk(folio, &rwc);
+}
+
+/*
+ * fbmm_swapout_folio - Helper function for MFSs to swapout a folio
+ * @folio: The folio to swap out. Must have a reference count of at least 3:
+ * one that the calling thread holds, one for the file mapping, and one for
+ * each page table entry it is mapped by
+ *
+ * Returns 0 on success and nonzero otherwise
+ */
+int fbmm_swapout_folio(struct folio *folio)
+{
+ struct address_space *mapping;
+ struct swap_info_struct *si;
+ unsigned long offset;
+ struct swap_iocb *plug = NULL;
+ swp_entry_t entry;
+
+ if (!folio_trylock(folio))
+ return 1;
+
+ entry = folio_alloc_swap(folio);
+ if (!entry.val)
+ goto unlock;
+
+ offset = swp_offset(entry);
+
+ folio->swap = entry;
+
+ folio_mark_dirty(folio);
+
+ if (folio_ref_count(folio) < 3)
+ goto unlock;
+
+ if (folio_mapped(folio)) {
+ fbmm_try_to_unmap(folio);
+ if (folio_mapped(folio))
+ goto unlock;
+ }
+
+ mapping = folio_mapping(folio);
+ if (folio_test_dirty(folio)) {
+ try_to_unmap_flush_dirty();
+ switch (pageout(folio, mapping, &plug)) {
+ case PAGE_KEEP:
+ fallthrough;
+ case PAGE_ACTIVATE:
+ goto unlock;
+ case PAGE_SUCCESS:
+ /* pageout eventually unlocks the folio on success, so lock it */
+ if (!folio_trylock(folio))
+ return 1;
+ fallthrough;
+ case PAGE_CLEAN:
+ ;
+ }
+ }
+
+ remove_mapping(mapping, folio);
+ folio_unlock(folio);
+
+ si = get_swap_device(entry);
+ si->swap_map[offset] &= ~SWAP_HAS_CACHE;
+ put_swap_device(si);
+
+ return 0;
+
+unlock:
+ folio_unlock(folio);
+ return 1;
+}
+EXPORT_SYMBOL(fbmm_swapout_folio);
+
+static void fbmm_end_swap_bio_write(struct bio *bio)
+{
+ struct folio *folio = bio_first_folio_all(bio);
+ int ret;
+
+ /* This is the simplification of __folio_end_writeback */
+ ret = folio_test_clear_writeback(folio);
+ if (!ret)
+ return;
+
+ sb_clear_inode_writeback(folio_mapping(folio)->host);
+
+ /* Simplification of folio_end_writeback */
+ smp_mb__after_atomic();
+ acct_reclaim_writeback(folio);
+}
+
+/* Analogue to __swap_writepage */
+static void __fbmm_writepage(struct folio *folio, struct writeback_control *wbc)
+{
+ struct bio bio;
+ struct bio_vec bv;
+ struct swap_info_struct *sis = swp_swap_info(folio->swap);
+
+ bio_init(&bio, sis->bdev, &bv, 1,
+ REQ_OP_WRITE | REQ_SWAP | wbc_to_write_flags(wbc));
+ bio.bi_iter.bi_sector = swap_folio_sector(folio);
+ bio_add_folio_nofail(&bio, folio, folio_size(folio), 0);
+
+ count_vm_events(PSWPOUT, folio_nr_pages(folio));
+ folio_start_writeback(folio);
+ folio_unlock(folio);
+
+ submit_bio_wait(&bio);
+ fbmm_end_swap_bio_write(&bio);
+}
+
+int fbmm_writepage(struct page *page, struct writeback_control *wbc)
+{
+ struct folio *folio = page_folio(page);
+ int ret = 0;
+
+ ret = arch_prepare_to_swap(folio);
+ if (ret) {
+ folio_mark_dirty(folio);
+ folio_unlock(folio);
+ return 0;
+ }
+
+ __fbmm_writepage(folio, wbc);
+ return 0;
+}
+EXPORT_SYMBOL(fbmm_writepage);
+
+struct page *fbmm_read_swap_entry(struct vm_fault *vmf, swp_entry_t entry, unsigned long pgoff,
+ struct page *page)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ struct address_space *mapping = vma->vm_file->f_mapping;
+ struct swap_info_struct *si;
+ struct folio *folio;
+
+ if (unlikely(non_swap_entry(entry)))
+ return NULL;
+
+ /*
+ * If a folio is already mapped here, just return that.
+ * Another process has probably already brought in the shared page
+ */
+ folio = filemap_get_folio(mapping, pgoff);
+ if (!IS_ERR(folio))
+ return folio_page(folio, 0);
+
+ si = get_swap_device(entry);
+ if (!si)
+ return NULL;
+
+ folio = page_folio(page);
+
+ folio_lock(folio);
+ folio->swap = entry;
+ /* swap_read_folio unlocks the folio */
+ swap_read_folio(folio, true, NULL);
+ folio->private = NULL;
+
+ swap_free(entry);
+
+ put_swap_device(si);
+ count_vm_events(PSWPIN, folio_nr_pages(folio));
+ dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
+ return folio_page(folio, 0);
+}
+EXPORT_SYMBOL(fbmm_read_swap_entry);
+
+/******************************************************************************
+ * Copy on write helpers
+ *****************************************************************************/
+struct page_walk_levels {
+ struct vm_area_struct *vma;
+ pgd_t *pgd;
+ p4d_t *p4d;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *pte;
+};
+
+static int fbmm_copy_pgd(pgd_t *pgd, unsigned long addr, unsigned long next, struct mm_walk *walk)
+{
+ struct page_walk_levels *dst_levels = walk->private;
+
+ dst_levels->pgd = pgd_offset(dst_levels->vma->vm_mm, addr);
+ return 0;
+}
+
+static int fbmm_copy_p4d(p4d_t *p4d, unsigned long addr, unsigned long next, struct mm_walk *walk)
+{
+ struct page_walk_levels *dst_levels = walk->private;
+
+ dst_levels->p4d = p4d_alloc(dst_levels->vma->vm_mm, dst_levels->pgd, addr);
+ if (!dst_levels->p4d)
+ return -ENOMEM;
+ return 0;
+}
+
+static int fbmm_copy_pud(pud_t *pud, unsigned long addr, unsigned long next, struct mm_walk *walk)
+{
+ struct page_walk_levels *dst_levels = walk->private;
+
+ dst_levels->pud = pud_alloc(dst_levels->vma->vm_mm, dst_levels->p4d, addr);
+ if (!dst_levels->pud)
+ return -ENOMEM;
+ return 0;
+}
+
+static int fbmm_copy_pmd(pmd_t *pmd, unsigned long addr, unsigned long next, struct mm_walk *walk)
+{
+ struct page_walk_levels *dst_levels = walk->private;
+
+ dst_levels->pmd = pmd_alloc(dst_levels->vma->vm_mm, dst_levels->pud, addr);
+ if (!dst_levels->pmd)
+ return -ENOMEM;
+ return 0;
+}
+
+static int fbmm_copy_pte(pte_t *pte, unsigned long addr, unsigned long next, struct mm_walk *walk)
+{
+ struct page_walk_levels *dst_levels = walk->private;
+ struct mm_struct *dst_mm = dst_levels->vma->vm_mm;
+ struct mm_struct *src_mm = walk->mm;
+ pte_t *src_pte = pte;
+ pte_t *dst_pte;
+ spinlock_t *dst_ptl;
+ pte_t entry;
+ struct page *page;
+ struct folio *folio;
+ int ret = 0;
+
+ dst_pte = pte_alloc_map(dst_mm, dst_levels->pmd, addr);
+ if (!dst_pte)
+ return -ENOMEM;
+ dst_ptl = pte_lockptr(dst_mm, dst_levels->pmd);
+ /* The spinlock for the src pte should already be taken */
+ spin_lock_nested(dst_ptl, SINGLE_DEPTH_NESTING);
+
+ if (pte_none(*src_pte))
+ goto unlock;
+
+ /* I don't really want to handle the swap case, so I won't for now */
+ if (unlikely(!pte_present(*src_pte))) {
+ ret = -EIO;
+ goto unlock;
+ }
+
+ entry = ptep_get(src_pte);
+ page = vm_normal_page(walk->vma, addr, entry);
+ if (page)
+ folio = page_folio(page);
+
+ folio_get(folio);
+ folio_dup_file_rmap_pte(folio, page);
+ percpu_counter_inc(&dst_mm->rss_stat[MM_FILEPAGES]);
+
+ if (!(walk->vma->vm_flags & VM_SHARED) && pte_write(entry)) {
+ ptep_set_wrprotect(src_mm, addr, src_pte);
+ entry = pte_wrprotect(entry);
+ }
+
+ entry = pte_mkold(entry);
+ set_pte_at(dst_mm, addr, dst_pte, entry);
+
+unlock:
+ pte_unmap_unlock(dst_pte, dst_ptl);
+ return ret;
+}
+
+int fbmm_copy_page_range(struct vm_area_struct *dst, struct vm_area_struct *src)
+{
+ struct page_walk_levels dst_levels;
+ struct mm_walk_ops walk_ops = {
+ .pgd_entry = fbmm_copy_pgd,
+ .p4d_entry = fbmm_copy_p4d,
+ .pud_entry = fbmm_copy_pud,
+ .pmd_entry = fbmm_copy_pmd,
+ .pte_entry = fbmm_copy_pte,
+ };
+
+ dst_levels.vma = dst;
+
+ return walk_page_range(src->vm_mm, src->vm_start, src->vm_end,
+ &walk_ops, &dst_levels);
+}
+EXPORT_SYMBOL(fbmm_copy_page_range);
diff --git a/mm/internal.h b/mm/internal.h
index cc2c5e07fad3..bed53f3a6ed3 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1515,4 +1515,17 @@ static inline void shrinker_debugfs_remove(struct dentry *debugfs_entry,
void workingset_update_node(struct xa_node *node);
extern struct list_lru shadow_nodes;
+/* possible outcome of pageout() */
+typedef enum {
+ /* failed to write folio out, folio is locked */
+ PAGE_KEEP,
+ /* move folio to the active list, folio is locked */
+ PAGE_ACTIVATE,
+ /* folio has been sent to the disk successfully, folio is unlocked */
+ PAGE_SUCCESS,
+ /* folio is clean and locked */
+ PAGE_CLEAN,
+} pageout_t;
+pageout_t pageout(struct folio *folio, struct address_space *mapping,
+ struct swap_iocb **plug);
#endif /* __MM_INTERNAL_H */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2e34de9cd0d4..93291d25eb11 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -591,23 +591,11 @@ void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio,
wake_up(&pgdat->reclaim_wait[VMSCAN_THROTTLE_WRITEBACK]);
}
-/* possible outcome of pageout() */
-typedef enum {
- /* failed to write folio out, folio is locked */
- PAGE_KEEP,
- /* move folio to the active list, folio is locked */
- PAGE_ACTIVATE,
- /* folio has been sent to the disk successfully, folio is unlocked */
- PAGE_SUCCESS,
- /* folio is clean and locked */
- PAGE_CLEAN,
-} pageout_t;
-
/*
* pageout is called by shrink_folio_list() for each dirty folio.
* Calls ->writepage().
*/
-static pageout_t pageout(struct folio *folio, struct address_space *mapping,
+pageout_t pageout(struct folio *folio, struct address_space *mapping,
struct swap_iocb **plug)
{
/*
--
2.34.1
* [RFC PATCH 3/4] mm: Export functions for writing MM Filesystems
From: Bijan Tabatabai @ 2024-11-22 20:38 UTC
To: linux-fsdevel, linux-mm, btabatabai; +Cc: akpm, viro, brauner, mingo
This patch exports memory management functions that are useful to memory
managers so that they can be used by memory management filesystems
implemented as kernel modules.
Signed-off-by: Bijan Tabatabai <btabatabai@wisc.edu>
---
arch/x86/include/asm/tlbflush.h | 2 --
arch/x86/mm/tlb.c | 1 +
mm/filemap.c | 2 ++
mm/memory.c | 1 +
mm/mmap.c | 2 ++
mm/pgtable-generic.c | 1 +
mm/rmap.c | 2 ++
7 files changed, 9 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 25726893c6f4..9877176d396f 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -57,7 +57,6 @@ static inline void cr4_clear_bits(unsigned long mask)
local_irq_restore(flags);
}
-#ifndef MODULE
/*
* 6 because 6 should be plenty and struct tlb_state will fit in two cache
* lines.
@@ -417,7 +416,6 @@ static inline void set_tlbstate_lam_mode(struct mm_struct *mm)
{
}
#endif
-#endif /* !MODULE */
static inline void __native_tlb_flush_global(unsigned long cr4)
{
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 44ac64f3a047..f054cee7bc7c 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1036,6 +1036,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
put_cpu();
mmu_notifier_arch_invalidate_secondary_tlbs(mm, start, end);
}
+EXPORT_SYMBOL_GPL(flush_tlb_mm_range);
static void do_flush_tlb_all(void *info)
diff --git a/mm/filemap.c b/mm/filemap.c
index 657bcd887fdb..8532ddd37e7f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -269,6 +269,7 @@ void filemap_remove_folio(struct folio *folio)
filemap_free_folio(mapping, folio);
}
+EXPORT_SYMBOL_GPL(filemap_remove_folio);
/*
* page_cache_delete_batch - delete several folios from page cache
@@ -955,6 +956,7 @@ noinline int __filemap_add_folio(struct address_space *mapping,
return xas_error(&xas);
}
ALLOW_ERROR_INJECTION(__filemap_add_folio, ERRNO);
+EXPORT_SYMBOL_GPL(__filemap_add_folio);
int filemap_add_folio(struct address_space *mapping, struct folio *folio,
pgoff_t index, gfp_t gfp)
diff --git a/mm/memory.c b/mm/memory.c
index fa2fe3ee0867..23e74a0397fa 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -448,6 +448,7 @@ int __pte_alloc(struct mm_struct *mm, pmd_t *pmd)
pte_free(mm, new);
return 0;
}
+EXPORT_SYMBOL_GPL(__pte_alloc);
int __pte_alloc_kernel(pmd_t *pmd)
{
diff --git a/mm/mmap.c b/mm/mmap.c
index d684d8bd218b..1090ef982929 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1780,6 +1780,7 @@ generic_get_unmapped_area(struct file *filp, unsigned long addr,
info.high_limit = mmap_end;
return vm_unmapped_area(&info);
}
+EXPORT_SYMBOL_GPL(generic_get_unmapped_area);
#ifndef HAVE_ARCH_UNMAPPED_AREA
unsigned long
@@ -1844,6 +1845,7 @@ generic_get_unmapped_area_topdown(struct file *filp, unsigned long addr,
return addr;
}
+EXPORT_SYMBOL_GPL(generic_get_unmapped_area_topdown);
#ifndef HAVE_ARCH_UNMAPPED_AREA_TOPDOWN
unsigned long
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index a78a4adf711a..1a3b4a86b005 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -304,6 +304,7 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
rcu_read_unlock();
return NULL;
}
+EXPORT_SYMBOL_GPL(__pte_offset_map);
pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
unsigned long addr, spinlock_t **ptlp)
diff --git a/mm/rmap.c b/mm/rmap.c
index e8fc5ecb59b2..fdade910cc95 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1468,6 +1468,7 @@ void folio_add_file_rmap_ptes(struct folio *folio, struct page *page,
{
__folio_add_file_rmap(folio, page, nr_pages, vma, RMAP_LEVEL_PTE);
}
+EXPORT_SYMBOL_GPL(folio_add_file_rmap_ptes);
/**
* folio_add_file_rmap_pmd - add a PMD mapping to a page range of a folio
@@ -1594,6 +1595,7 @@ void folio_remove_rmap_ptes(struct folio *folio, struct page *page,
{
__folio_remove_rmap(folio, page, nr_pages, vma, RMAP_LEVEL_PTE);
}
+EXPORT_SYMBOL_GPL(folio_remove_rmap_ptes);
/**
* folio_remove_rmap_pmd - remove a PMD mapping from a page range of a folio
--
2.34.1
* [RFC PATCH 4/4] Add base implementation of an MFS
From: Bijan Tabatabai @ 2024-11-22 20:38 UTC (permalink / raw)
To: linux-fsdevel, linux-mm, btabatabai; +Cc: akpm, viro, brauner, mingo
Mount by running
sudo mount -t BasicMFS BasicMFS -o numpages=<pages> <mntdir>
Where <pages> is the maximum number of 4KB pages the filesystem may use, and
<mntdir> is the directory to mount the filesystem on.
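For example, a hypothetical invocation capping the filesystem at 1GB of
memory (262144 4KB pages) and mounting it at /mnt/basicmfs:
sudo mount -t BasicMFS BasicMFS -o numpages=262144 /mnt/basicmfs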
This patch is meant to serve as a reference for the reviewers and is not
intended to be upstreamed.
Signed-off-by: Bijan Tabatabai <btabatabai@wisc.edu>
---
BasicMFS/Kconfig | 3 +
BasicMFS/Makefile | 8 +
BasicMFS/basic.c | 717 ++++++++++++++++++++++++++++++++++++++++++++++
BasicMFS/basic.h | 29 ++
4 files changed, 757 insertions(+)
create mode 100644 BasicMFS/Kconfig
create mode 100644 BasicMFS/Makefile
create mode 100644 BasicMFS/basic.c
create mode 100644 BasicMFS/basic.h
diff --git a/BasicMFS/Kconfig b/BasicMFS/Kconfig
new file mode 100644
index 000000000000..3b536eded0ed
--- /dev/null
+++ b/BasicMFS/Kconfig
@@ -0,0 +1,3 @@
+config BASICMMFS
+ tristate "Adds the BasicMMFS"
+ default m
diff --git a/BasicMFS/Makefile b/BasicMFS/Makefile
new file mode 100644
index 000000000000..e50d27819c3c
--- /dev/null
+++ b/BasicMFS/Makefile
@@ -0,0 +1,8 @@
+obj-m += basicmfs.o
+basicmfs-y += basic.o
+
+all:
+ make -C ../kbuild M=$(PWD) modules
+
+clean:
+ make -C ../kbuild M=$(PWD) clean
diff --git a/BasicMFS/basic.c b/BasicMFS/basic.c
new file mode 100644
index 000000000000..88490de64db4
--- /dev/null
+++ b/BasicMFS/basic.c
@@ -0,0 +1,717 @@
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/gfp.h>
+#include <linux/fs_context.h>
+#include <linux/fs_parser.h>
+#include <linux/pagemap.h>
+#include <linux/statfs.h>
+#include <linux/module.h>
+#include <linux/rmap.h>
+#include <linux/string.h>
+#include <linux/falloc.h>
+#include <linux/pagewalk.h>
+#include <linux/file_based_mm.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
+#include <linux/pagevec.h>
+
+#include <asm/tlbflush.h>
+
+#include "basic.h"
+
+static const struct super_operations basicmfs_ops;
+static const struct inode_operations basicmfs_dir_inode_operations;
+
+static struct basicmfs_sb_info *BMFS_SB(struct super_block *sb)
+{
+ return sb->s_fs_info;
+}
+
+static struct basicmfs_inode_info *BMFS_I(struct inode *inode)
+{
+ return inode->i_private;
+}
+
+/*
+ * Allocate a base page and assign it to the inode at the given page offset
+ * Takes the sbi->lock.
+ * Returns the allocated page if there is one, else NULL
+ */
+static struct page *basicmfs_alloc_page(struct basicmfs_inode_info *inode_info,
+ struct basicmfs_sb_info *sbi, u64 page_offset)
+{
+ u8 *kaddr;
+ u64 pages_added;
+ u64 alloc_size = 64;
+ struct page *page = NULL;
+
+ spin_lock(&sbi->lock);
+
+ /* First, do we have any free pages available? */
+ if (sbi->free_pages == 0) {
+ /* Try to allocate more pages if we can */
+ alloc_size = min(alloc_size, sbi->max_pages - sbi->num_pages);
+ if (alloc_size == 0)
+ goto unlock;
+
+ pages_added = alloc_pages_bulk_list(GFP_HIGHUSER, alloc_size, &sbi->free_list);
+
+ if (pages_added == 0)
+ goto unlock;
+
+ sbi->num_pages += pages_added;
+ sbi->free_pages += pages_added;
+ }
+
+ page = list_first_entry(&sbi->free_list, struct page, lru);
+ list_del(&page->lru);
+ sbi->free_pages--;
+
+ /* Zero the page outside of the critical section */
+ spin_unlock(&sbi->lock);
+
+ kaddr = kmap_local_page(page);
+ memset(kaddr, 0, PAGE_SIZE);
+ kunmap_local(kaddr);
+
+ spin_lock(&sbi->lock);
+
+ list_add(&page->lru, &sbi->active_list);
+
+unlock:
+ spin_unlock(&sbi->lock);
+ return page;
+}
+
+static void basicmfs_return_page(struct page *page, struct basicmfs_sb_info *sbi)
+{
+ spin_lock(&sbi->lock);
+
+ list_del(&page->lru);
+ /*
+ * We don't need to put the page here for being unmapped; that appears
+ * to be handled by the unmapping code.
+ */
+
+ list_add_tail(&page->lru, &sbi->free_list);
+ sbi->free_pages++;
+
+ spin_unlock(&sbi->lock);
+}
+
+static void basicmfs_free_range(struct inode *inode, u64 offset, loff_t len)
+{
+ struct basicmfs_sb_info *sbi = BMFS_SB(inode->i_sb);
+ struct basicmfs_inode_info *inode_info = BMFS_I(inode);
+ struct address_space *mapping = inode_info->mapping;
+ struct folio_batch fbatch;
+ int i;
+ pgoff_t cur_offset = offset >> PAGE_SHIFT;
+ pgoff_t end_offset = (offset + len) >> PAGE_SHIFT;
+
+ folio_batch_init(&fbatch);
+ while (cur_offset < end_offset) {
+ filemap_get_folios(mapping, &cur_offset, end_offset - 1, &fbatch);
+ /*
+ * filemap_get_folios() does not advance cur_offset when nothing is
+ * found, so bail out once the range is empty.
+ */
+ if (!folio_batch_count(&fbatch))
+ break;
+
+ for (i = 0; i < fbatch.nr; i++) {
+ folio_lock(fbatch.folios[i]);
+ filemap_remove_folio(fbatch.folios[i]);
+ folio_unlock(fbatch.folios[i]);
+ basicmfs_return_page(folio_page(fbatch.folios[i], 0), sbi);
+ }
+
+ folio_batch_release(&fbatch);
+ }
+}
+
+static vm_fault_t basicmfs_fault(struct vm_fault *vmf)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ struct address_space *mapping = vma->vm_file->f_mapping;
+ struct inode *inode = vma->vm_file->f_inode;
+ struct basicmfs_inode_info *inode_info;
+ struct basicmfs_sb_info *sbi;
+ struct page *page = NULL;
+ bool new_page = true;
+ bool cow_fault = false;
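+ /* Page offset into the backing file for the faulting address */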
+ u64 pgoff = ((vmf->address - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
+ vm_fault_t ret = 0;
+ pte_t entry;
+
+ inode_info = BMFS_I(inode);
+ sbi = BMFS_SB(inode->i_sb);
+
+ if (!vmf->pte) {
+ if (pte_alloc(vma->vm_mm, vmf->pmd))
+ return VM_FAULT_OOM;
+ }
+
+ vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
+ vmf->orig_pte = *vmf->pte;
+ if (!pte_none(vmf->orig_pte) && pte_present(vmf->orig_pte)) {
+ if (!(vmf->flags & FAULT_FLAG_WRITE)) {
+ /*
+ * It looks like the PTE is already populated,
+ * so maybe two threads raced to first fault.
+ */
+ ret = VM_FAULT_NOPAGE;
+ goto unmap;
+ }
+
+ cow_fault = true;
+ }
+
+ /* Get the page if it was preallocated */
+ page = mtree_erase(&inode_info->falloc_mt, pgoff);
+
+ /* Try to allocate the page if it hasn't been already */
+ if (!page) {
+ page = basicmfs_alloc_page(inode_info, sbi, pgoff);
+ if (!page) {
+ ret = VM_FAULT_OOM;
+ goto unmap;
+ }
+ }
+
+ if (!pte_none(vmf->orig_pte) && !pte_present(vmf->orig_pte)) {
+ /* Swapped out page */
+ struct page *ret_page;
+ swp_entry_t swp_entry = pte_to_swp_entry(vmf->orig_pte);
+
+ ret_page = fbmm_read_swap_entry(vmf, swp_entry, pgoff, page);
+ if (page != ret_page) {
+ /*
+ * A physical page was already being used for this virt page
+ * or there was an error, so we can return the page we allocated.
+ */
+ basicmfs_return_page(page, sbi);
+ page = ret_page;
+ new_page = false;
+ }
+ if (!page) {
+ pr_warn("BasicMFS: Error swapping in page! %lx\n", vmf->address);
+ goto unmap;
+ }
+ }
+
+ vmf->ptl = pte_lockptr(vma->vm_mm, vmf->pmd);
+ spin_lock(vmf->ptl);
+ /* Check if some other thread faulted here */
+ if (!pte_same(vmf->orig_pte, *vmf->pte)) {
+ if (new_page)
+ basicmfs_return_page(page, sbi);
+ goto unlock;
+ }
+
+ /* Handle COW fault */
+ if (cow_fault) {
+ u8 *src_kaddr, *dst_kaddr;
+ struct page *old_page;
+ struct folio *old_folio;
+ unsigned long old_pfn;
+
+ old_pfn = pte_pfn(vmf->orig_pte);
+ old_page = pfn_to_page(old_pfn);
+
+ lock_page(old_page);
+
+ /*
+ * If there's more than one reference to this page, we need to copy it.
+ * Otherwise, we can just reuse it
+ */
+ if (page_mapcount(old_page) > 1) {
+ src_kaddr = kmap_local_page(old_page);
+ dst_kaddr = kmap_local_page(page);
+ memcpy(dst_kaddr, src_kaddr, PAGE_SIZE);
+ kunmap_local(dst_kaddr);
+ kunmap_local(src_kaddr);
+ } else {
+ basicmfs_return_page(page, sbi);
+ page = old_page;
+ }
+ /*
+ * Drop a reference to old_page even if we are going to keep it
+ * because the reference will be increased at the end of the fault
+ */
+ put_page(old_page);
+ /* Decrease the filepage and rmap count for the same reason */
+ percpu_counter_dec(&vma->vm_mm->rss_stat[MM_FILEPAGES]);
+ folio_remove_rmap_pte(page_folio(old_page), old_page, vma);
+
+ old_folio = page_folio(old_page);
+ /*
+ * If we are copying a page for the process that originally faulted the
+ * page, we have to replace the mapping.
+ */
+ if (mapping == old_folio->mapping) {
+ if (old_page != page)
+ replace_page_cache_folio(old_folio, page_folio(page));
+ new_page = false;
+ }
+ unlock_page(old_page);
+ }
+
+ if (new_page)
+ /*
+ * We want to manage the folio ourselves, and don't want it on the LRU lists,
+ * so we use __filemap_add_folio instead of filemap_add_folio.
+ */
+ __filemap_add_folio(mapping, page_folio(page), pgoff, GFP_KERNEL, NULL);
+
+ /* Construct the pte entry */
+ entry = mk_pte(page, vma->vm_page_prot);
+ entry = pte_mkyoung(entry);
+ if (vma->vm_flags & VM_WRITE)
+ entry = pte_mkwrite_novma(pte_mkdirty(entry));
+
+ folio_add_file_rmap_pte(page_folio(page), page, vma);
+ percpu_counter_inc(&vma->vm_mm->rss_stat[MM_FILEPAGES]);
+ set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
+
+ update_mmu_cache(vma, vmf->address, vmf->pte);
+ vmf->page = page;
+ get_page(page);
+ flush_tlb_page(vma, vmf->address);
+ ret = VM_FAULT_NOPAGE;
+
+unlock:
+ spin_unlock(vmf->ptl);
+unmap:
+ pte_unmap(vmf->pte);
+ return ret;
+}
+
+const struct vm_operations_struct basicmfs_vm_ops = {
+ .fault = basicmfs_fault,
+ .page_mkwrite = basicmfs_fault,
+ .pfn_mkwrite = basicmfs_fault,
+};
+
+static int basicmfs_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ struct inode *inode = file_inode(file);
+ struct basicmfs_inode_info *inode_info = BMFS_I(inode);
+
+ file_accessed(file);
+ vma->vm_ops = &basicmfs_vm_ops;
+
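+ /* Record the virtual address corresponding to file offset 0 */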
+ inode_info->file_va_start = vma->vm_start - (vma->vm_pgoff << PAGE_SHIFT);
+ inode_info->mapping = file->f_mapping;
+
+ return 0;
+}
+
+static int basicmfs_release(struct inode *inode, struct file *file)
+{
+ struct basicmfs_sb_info *sbi = BMFS_SB(inode->i_sb);
+ struct basicmfs_inode_info *inode_info = BMFS_I(inode);
+ struct page *page;
+ unsigned long index = 0;
+ unsigned long free_count = 0;
+
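+ /* Free mapped pages first, then any preallocated-but-unmapped pages */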
+ basicmfs_free_range(inode, 0, inode->i_size);
+
+ mt_for_each(&inode_info->falloc_mt, page, index, ULONG_MAX) {
+ basicmfs_return_page(page, sbi);
+ free_count++;
+ }
+
+ mtree_destroy(&inode_info->falloc_mt);
+ kfree(inode_info);
+
+ return 0;
+}
+
+static long basicmfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
+{
+ struct inode *inode = file_inode(file);
+ struct basicmfs_sb_info *sbi = BMFS_SB(inode->i_sb);
+ struct basicmfs_inode_info *inode_info = BMFS_I(inode);
+ struct page *page;
+ loff_t off;
+
+ if (mode & FALLOC_FL_PUNCH_HOLE) {
+ basicmfs_free_range(inode, offset, len);
+ return 0;
+ } else if (mode != 0) {
+ return -EOPNOTSUPP;
+ }
+
+ for (off = offset; off < offset + len; off += PAGE_SIZE) {
+ page = basicmfs_alloc_page(inode_info, sbi, off >> PAGE_SHIFT);
+ if (!page)
+ return -ENOMEM;
+ mtree_store(&inode_info->falloc_mt, off >> PAGE_SHIFT, page, GFP_KERNEL);
+ }
+
+ return 0;
+}
+
+const struct file_operations basicmfs_file_operations = {
+ .mmap = basicmfs_mmap,
+ .release = basicmfs_release,
+ .fsync = noop_fsync,
+ .llseek = generic_file_llseek,
+ .get_unmapped_area = generic_get_unmapped_area_topdown,
+ .fallocate = basicmfs_fallocate,
+};
+
+const struct inode_operations basicmfs_file_inode_operations = {
+ .setattr = simple_setattr,
+ .getattr = simple_getattr,
+};
+
+const struct address_space_operations basicmfs_aops = {
+ .direct_IO = noop_direct_IO,
+ .dirty_folio = noop_dirty_folio,
+ .writepage = fbmm_writepage,
+};
+
+static struct inode *basicmfs_get_inode(struct super_block *sb,
+ const struct inode *dir, umode_t mode, dev_t dev)
+{
+ struct inode *inode = new_inode(sb);
+ struct basicmfs_inode_info *info;
+
+ if (!inode)
+ return NULL;
+
+ info = kzalloc(sizeof(struct basicmfs_inode_info), GFP_KERNEL);
+ if (!info) {
+ iput(inode);
+ return NULL;
+ }
+ mt_init(&info->falloc_mt);
+ info->file_va_start = 0;
+
+ inode->i_ino = get_next_ino();
+ inode_init_owner(&nop_mnt_idmap, inode, dir, mode);
+ inode->i_mapping->a_ops = &basicmfs_aops;
+ inode->i_flags |= S_DAX;
+ inode->i_private = info;
+ switch (mode & S_IFMT) {
+ case S_IFREG:
+ inode->i_op = &basicmfs_file_inode_operations;
+ inode->i_fop = &basicmfs_file_operations;
+ break;
+ case S_IFDIR:
+ inode->i_op = &basicmfs_dir_inode_operations;
+ inode->i_fop = &simple_dir_operations;
+
+ /* Directory inodes start off with i_nlink == 2 (for "." entry) */
+ inc_nlink(inode);
+ break;
+ default:
+ kfree(info);
+ iput(inode);
+ return NULL;
+ }
+
+ return inode;
+}
+
+static int basicmfs_mknod(struct mnt_idmap *idmap, struct inode *dir,
+ struct dentry *dentry, umode_t mode, dev_t dev)
+{
+ struct inode *inode = basicmfs_get_inode(dir->i_sb, dir, mode, dev);
+ int error = -ENOSPC;
+
+ if (inode) {
+ d_instantiate(dentry, inode);
+ dget(dentry); /* Extra count - pin the dentry in core */
+ error = 0;
+ }
+
+ return error;
+}
+
+static int basicmfs_mkdir(struct mnt_idmap *idmap, struct inode *dir,
+ struct dentry *dentry, umode_t mode)
+{
+ return -EINVAL;
+}
+
+static int basicmfs_create(struct mnt_idmap *idmap, struct inode *dir,
+ struct dentry *dentry, umode_t mode, bool excl)
+{
+ // TODO: Replace 0777 with mode and see if anything breaks
+ return basicmfs_mknod(idmap, dir, dentry, 0777 | S_IFREG, 0);
+}
+
+static int basicmfs_symlink(struct mnt_idmap *idmap, struct inode *dir,
+ struct dentry *dentry, const char *symname)
+{
+ return -EINVAL;
+}
+
+static int basicmfs_tmpfile(struct mnt_idmap *idmap,
+ struct inode *dir, struct file *file, umode_t mode)
+{
+ struct inode *inode;
+
+ inode = basicmfs_get_inode(dir->i_sb, dir, mode, 0);
+ if (!inode)
+ return -ENOSPC;
+ d_tmpfile(file, inode);
+ return finish_open_simple(file, 0);
+}
+
+static const struct inode_operations basicmfs_dir_inode_operations = {
+ .create = basicmfs_create,
+ .lookup = simple_lookup,
+ .link = simple_link,
+ .unlink = simple_unlink,
+ .symlink = basicmfs_symlink,
+ .mkdir = basicmfs_mkdir,
+ .rmdir = simple_rmdir,
+ .mknod = basicmfs_mknod,
+ .rename = simple_rename,
+ .tmpfile = basicmfs_tmpfile,
+};
+
+static int basicmfs_statfs(struct dentry *dentry, struct kstatfs *buf)
+{
+ struct super_block *sb = dentry->d_sb;
+ struct basicmfs_sb_info *sbi = BMFS_SB(sb);
+
+ buf->f_type = sb->s_magic;
+ buf->f_bsize = PAGE_SIZE;
+ buf->f_blocks = sbi->num_pages;
+ buf->f_bfree = buf->f_bavail = sbi->free_pages;
+ buf->f_files = LONG_MAX;
+ buf->f_ffree = LONG_MAX;
+ buf->f_namelen = 255;
+
+ return 0;
+}
+
+static int basicmfs_show_options(struct seq_file *m, struct dentry *root)
+{
+ return 0;
+}
+
+#define BASICMFS_MAX_PAGEOUT 512
+static long basicmfs_nr_cached_objects(struct super_block *sb, struct shrink_control *sc)
+{
+ struct basicmfs_sb_info *sbi = BMFS_SB(sb);
+ long nr = 0;
+
+ spin_lock(&sbi->lock);
+ if (sbi->free_pages > 0)
+ nr = sbi->free_pages;
+ else
+ nr = max(sbi->num_pages - sbi->free_pages, (u64)BASICMFS_MAX_PAGEOUT);
+ spin_unlock(&sbi->lock);
+
+ return nr;
+}
+
+static long basicmfs_free_cached_objects(struct super_block *sb, struct shrink_control *sc)
+{
+ LIST_HEAD(folio_list);
+ LIST_HEAD(fail_list);
+ struct basicmfs_sb_info *sbi = BMFS_SB(sb);
+ struct page *page;
+ u64 i, num_scanned;
+
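+ /* If unused pages are available, hand those back to the kernel first */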
+ if (sbi->free_pages > 0) {
+ spin_lock(&sbi->lock);
+ for (i = 0; i < sc->nr_to_scan && i < sbi->free_pages; i++) {
+ page = list_first_entry(&sbi->free_list, struct page, lru);
+ list_del(&page->lru);
+ put_page(page);
+ }
+
+ sbi->num_pages -= i;
+ sbi->free_pages -= i;
+ spin_unlock(&sbi->lock);
+ } else if (sbi->num_pages > 0) {
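+ /* No free pages; try to swap out in-use pages via the FBMM helpers */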
+ spin_lock(&sbi->lock);
+ for (i = 0; i < sc->nr_to_scan && sbi->num_pages > 0; i++) {
+ page = list_first_entry(&sbi->active_list, struct page, lru);
+ list_move(&page->lru, &folio_list);
+ sbi->num_pages--;
+ }
+ spin_unlock(&sbi->lock);
+
+ num_scanned = i;
+ for (i = 0; i < num_scanned && !list_empty(&folio_list); i++) {
+ page = list_first_entry(&folio_list, struct page, lru);
+ list_del(&page->lru);
+ if (fbmm_swapout_folio(page_folio(page)))
+ list_add_tail(&page->lru, &fail_list);
+ else
+ put_page(page);
+ }
+
+ spin_lock(&sbi->lock);
+ while (!list_empty(&fail_list)) {
+ page = list_first_entry(&fail_list, struct page, lru);
+ list_del(&page->lru);
+ list_add_tail(&page->lru, &sbi->active_list);
+ sbi->num_pages++;
+ }
+ spin_unlock(&sbi->lock);
+
+ }
+
+ sc->nr_scanned = i;
+ return i;
+}
+
+static const struct super_operations basicmfs_ops = {
+ .statfs = basicmfs_statfs,
+ .drop_inode = generic_delete_inode,
+ .show_options = basicmfs_show_options,
+ .nr_cached_objects = basicmfs_nr_cached_objects,
+ .free_cached_objects = basicmfs_free_cached_objects,
+ .copy_page_range = fbmm_copy_page_range,
+};
+
+static int basicmfs_fill_super(struct super_block *sb, struct fs_context *fc)
+{
+ struct inode *inode;
+ struct basicmfs_sb_info *sbi = kzalloc(sizeof(struct basicmfs_sb_info), GFP_KERNEL);
+ u64 nr_pages = *(u64 *)fc->fs_private;
+ u64 alloc_size = 1024;
+
+ if (!sbi)
+ return -ENOMEM;
+
+ sb->s_fs_info = sbi;
+ sb->s_maxbytes = MAX_LFS_FILESIZE;
+ sb->s_magic = 0xDEADBEEF;
+ sb->s_op = &basicmfs_ops;
+ sb->s_time_gran = 1;
+ sb->s_blocksize = PAGE_SIZE;
+ sb->s_blocksize_bits = PAGE_SHIFT;
+
+ spin_lock_init(&sbi->lock);
+ INIT_LIST_HEAD(&sbi->free_list);
+ INIT_LIST_HEAD(&sbi->active_list);
+ sbi->max_pages = nr_pages;
+ sbi->num_pages = 0;
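+ /* Preallocate the page pool up front, in bulk chunks of 1024 pages */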
+ for (int i = 0; i < nr_pages / alloc_size; i++)
+ sbi->num_pages += alloc_pages_bulk_list(GFP_HIGHUSER, alloc_size, &sbi->free_list);
+ sbi->free_pages = sbi->num_pages;
+
+ inode = basicmfs_get_inode(sb, NULL, S_IFDIR | 0755, 0);
+ sb->s_root = d_make_root(inode);
+ if (!sb->s_root) {
+ /* ->kill_sb() runs on this error path and will free sbi */
+ return -ENOMEM;
+ }
+
+ return 0;
+}
+
+static int basicmfs_get_tree(struct fs_context *fc)
+{
+ return get_tree_nodev(fc, basicmfs_fill_super);
+}
+
+enum basicmfs_param {
+ Opt_numpages,
+};
+
+const struct fs_parameter_spec basicmfs_fs_parameters[] = {
+ fsparam_u64("numpages", Opt_numpages),
+ {},
+};
+
+static int basicmfs_parse_param(struct fs_context *fc, struct fs_parameter *param)
+{
+ struct fs_parse_result result;
+ u64 *num_pages = (u64 *)fc->fs_private;
+ int opt;
+
+ opt = fs_parse(fc, basicmfs_fs_parameters, param, &result);
+ if (opt < 0) {
+ /*
+ * We might like to report bad mount options here;
+ * but traditionally ramfs has ignored all mount options,
+ * and as it is used as a !CONFIG_SHMEM simple substitute
+ * for tmpfs, better continue to ignore other mount options.
+ */
+ if (opt == -ENOPARAM)
+ opt = 0;
+ return opt;
+ }
+
+ switch (opt) {
+ case Opt_numpages:
+ *num_pages = result.uint_64;
+ break;
+ };
+
+ return 0;
+}
+
+static void basicmfs_free_fc(struct fs_context *fc)
+{
+ kfree(fc->fs_private);
+}
+
+static const struct fs_context_operations basicmfs_context_ops = {
+ .free = basicmfs_free_fc,
+ .parse_param = basicmfs_parse_param,
+ .get_tree = basicmfs_get_tree,
+};
+
+static int basicmfs_init_fs_context(struct fs_context *fc)
+{
+ fc->ops = &basicmfs_context_ops;
+
+ fc->fs_private = kzalloc(sizeof(u64), GFP_KERNEL);
+ if (!fc->fs_private)
+ return -ENOMEM;
+ /* Set a default number of pages to use (128K 4KB pages == 512MB) */
+ *(u64 *)fc->fs_private = 128 * 1024;
+ return 0;
+}
+
+static void basicmfs_kill_sb(struct super_block *sb)
+{
+ struct basicmfs_sb_info *sbi = BMFS_SB(sb);
+ struct page *page, *tmp;
+
+ spin_lock(&sbi->lock);
+
+ /*
+ * Return the pages we took to the kernel.
+ * All the pages should be in the free list at this point
+ */
+ list_for_each_entry_safe(page, tmp, &sbi->free_list, lru) {
+ list_del(&page->lru);
+ put_page(page);
+ }
+
+ spin_unlock(&sbi->lock);
+
+ kfree(sbi);
+
+ kill_litter_super(sb);
+}
+
+static struct file_system_type basicmfs_fs_type = {
+ .owner = THIS_MODULE,
+ .name = "BasicMFS",
+ .init_fs_context = basicmfs_init_fs_context,
+ .parameters = basicmfs_fs_parameters,
+ .kill_sb = basicmfs_kill_sb,
+ .fs_flags = FS_USERNS_MOUNT,
+};
+
+static int __init init_basicmfs(void)
+{
+ printk(KERN_INFO "Starting BasicMFS");
+ register_filesystem(&basicmfs_fs_type);
+
+ return 0;
+}
+module_init(init_basicmfs);
+
+static void cleanup_basicmfs(void)
+{
+ printk(KERN_INFO "Removing BasicMFS");
+ unregister_filesystem(&basicmfs_fs_type);
+}
+module_exit(cleanup_basicmfs);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Bijan Tabatabai");
diff --git a/BasicMFS/basic.h b/BasicMFS/basic.h
new file mode 100644
index 000000000000..8e727201aca3
--- /dev/null
+++ b/BasicMFS/basic.h
@@ -0,0 +1,29 @@
+#ifndef BASIC_MMFS_H
+#define BASIC_MMFS_H
+
+#include <linux/types.h>
+#include <linux/fs.h>
+#include <linux/maple_tree.h>
+#include <linux/spinlock.h>
+#include <linux/sched.h>
+
+struct basicmfs_sb_info {
+ spinlock_t lock;
+ struct list_head free_list;
+ struct list_head active_list;
+ u64 num_pages;
+ u64 max_pages;
+ u64 free_pages;
+};
+
+struct basicmfs_inode_info {
+ // Maple tree mapping the page offset to the folio mapped to that offset
+ // Used to hold preallocated pages that haven't been mapped yet
+ struct maple_tree falloc_mt;
+ // The first virtual address this file is associated with.
+ u64 file_va_start;
+ // The file offset to folio mapping from the file
+ struct address_space *mapping;
+};
+
+#endif //BASIC_MMFS_H
--
2.34.1
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC PATCH 4/4] Add base implementation of an MFS
2024-11-22 20:38 ` [RFC PATCH 4/4] Add base implementation of an MFS Bijan Tabatabai
@ 2024-12-02 15:56 ` Jeff Johnson
0 siblings, 0 replies; 9+ messages in thread
From: Jeff Johnson @ 2024-12-02 15:56 UTC (permalink / raw)
To: Bijan Tabatabai, linux-fsdevel, linux-mm, btabatabai
Cc: akpm, viro, brauner, mingo
On 11/22/24 12:38, Bijan Tabatabai wrote:
> Mount by running
> sudo mount -t BasicMFS BasicMFS -o numpages=<pages> <mntdir>
>
> Where <pages> is the max number of 4KB pages it can use, and <mntdir> is
> the directory to mount the filesystem to.
>
> This patch is meant to serve as a reference for the reviewers and is not
> intended to be upstreamed.
>
> Signed-off-by: Bijan Tabatabai <btabatabai@wisc.edu>
...
> +static int __init init_basicmfs(void)
> +{
> + printk(KERN_INFO "Starting BasicMFS");
> + register_filesystem(&basicmfs_fs_type);
> +
> + return 0;
> +}
> +module_init(init_basicmfs);
> +
> +static void cleanup_basicmfs(void)
> +{
> + printk(KERN_INFO "Removing BasicMFS");
> + unregister_filesystem(&basicmfs_fs_type);
> +}
> +module_exit(cleanup_basicmfs);
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Bijan Tabatabai");
Based on the other feedback it looks like this won't be accepted, but
for completeness I have a specific commit check which flagged this patch.
Since commit 1fffe7a34c89 ("script: modpost: emit a warning when the
description is missing"), a module without a MODULE_DESCRIPTION() will
result in a warning when built with make W=1. Recently, multiple
developers have been eradicating these warnings treewide, and very few
(if any) are left, so please don't introduce a new one :)
Please add the missing MODULE_DESCRIPTION()
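For example, something along these lines next to the existing
MODULE_LICENSE() would address it (the wording is only a suggestion):
MODULE_DESCRIPTION("Basic memory management filesystem (MFS) example for FBMM");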
/jeff
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC PATCH 0/4] Add support for File Based Memory Management
2024-11-22 20:38 [RFC PATCH 0/4] Add support for File Based Memory Management Bijan Tabatabai
` (3 preceding siblings ...)
2024-11-22 20:38 ` [RFC PATCH 4/4] Add base implementation of an MFS Bijan Tabatabai
@ 2024-11-23 12:23 ` Lorenzo Stoakes
2024-11-24 16:53 ` Bijan Tabatabai
2024-11-28 10:22 ` David Hildenbrand
4 siblings, 2 replies; 9+ messages in thread
From: Lorenzo Stoakes @ 2024-11-23 12:23 UTC (permalink / raw)
To: Bijan Tabatabai
Cc: linux-fsdevel, linux-mm, btabatabai, akpm, viro, brauner, mingo,
Liam Howlett, Vlastimil Babka, Jann Horn
+ VMA guys, it's important to run scripts/get_maintainers.pl on your
changes so the right people are pinged :)
On Fri, Nov 22, 2024 at 02:38:26PM -0600, Bijan Tabatabai wrote:
> This patch set implements file based memory management (FBMM) [1], a
> research project from the University of Wisconsin-Madison where a process's
> memory can be transparently managed by memory managers which are written as
> filesystems. When using FBMM, instead of using the traditional anonymous
> memory path, a process's memory is managed by mapping files from a memory
> management filesystem (MFS) into its address space. The MFS implements the
> memory management related callback functions provided by the VFS to
> implement the desired memory management functionality. After presenting
> this work at a conference, a handful of people asked if we were going to
> upstream the work, so we decided to see if the Linux community would be
> interested in this functionality as well.
>
While it's a cool project, I don't think it's upstreamable in its current
form - it essentially bypasses core mm functionality and 'does mm'
somewhere else (which strikes me, in effect, as the entire purpose of the
series).
mm is a subsystem that is in constant flux with many assumptions that one
might make about it being changed, which make it wholly unsuited to having
its functionality exported like this.
So in effect it, by its nature, has to export internals somewhere else,
and that somewhere else now assumes things about mm that might change at
any point, additionally bypassing a great deal of highly sensitive and
purposeful logic.
This series also adds a lot of if (fbmm) { ... } changes to core logic
which is really not how we want to do things. hugetlbfs does this kind of
thing, but it is more or less universally seen as a _bad thing_ and
something we are trying to refactor.
So any upstreamable form of this would need to a. be part of mm, b. use
existing extensible mechanisms or create them, and c. not have _core_ mm
tasks or activities be performed 'elsewhere'.
Sadly I think the latter part may make a refactoring in this direction
infeasible, as it seems to me this is sort of the point of this.
This also means it's not acceptable to export highly sensitive mm internals
as you do in patch 3/4. Certainly in 1/4, as a co-maintainer of the mmap
logic, I can't accept the changes you suggest to brk() and mmap(), sorry.
There are huge subtleties in much of mm, including very very sensitive lock
mechanisms, and keeping such things within mm means we can have confidence
they work, and that fixes resolve issues.
I hope this isn't too discouraging, the fact you got this functioning is
amazing and as an out-of-tree research and experimentation project it looks
really cool, but for me, I don't think this is for upstream.
Thanks, Lorenzo
> This work is inspired by the increase in heterogeneity in memory hardware,
> such as from Optane and CXL. This heterogeneity is leading to a lot of
> research involving extending Linux's memory management subsystem. However,
> the monolithic design of the memory management subsystem makes it difficult
> to extend, and this difficulty grows as the complexity of the subsystem
> increases. Others in the research community have identified this problem as
> well [2,3]. We believe the kernel would benefit from some sort of extension
> interface to more easily prototype and implement memory management
> behaviors for a world with more diverse memory hierarchies.
>
> Filesystems are a natural extension mechanism for memory management because
> it already exists and memory mapping files into processes works. Also,
> precedent exists for writing memory managers as filesystems in the kernel
> with HugeTLBFS.
>
> While FBMM is easiest used for research and prototyping, I have also
> received feedback from people who work in industry that it would be useful
> for them as well. One person I talked to mentioned that they have made
> several changes to the memory management system in their branch that are
> not upstreamed, and it would be convenient to modularize those changes to
> avoid the headaches of rebasing when upgrading the kernel version.
>
> To use FBMM, one would perform the following steps:
> 1) Mount the MFS(s) they want to use
> 2) Enable FBMM by writing 1 to /sys/kernel/mm/fbmm/state
> 3) Set the MFS an application should allocate its memory from by writing
> the desired MFS's mount directory to /proc/<pid>/fbmm_mnt_dir, where <pid>
> is the PID of the target process.
>
> To have a process use an MFS for the entirety of the execution, one could
> use a wrapper program that writes /proc/self/fbmm_mount_dir then calls exec
> for the target process. We have created such a wrapper, which can be found
> at [4]. ld could also be extended to do this, using an environment variable
> similar to LD_PRELOAD.
>
> The first patch in this series adds the core of FBMM, allowing a user to
> set the MFS an application should allocate its anonymous memory from,
> transparently to the application.
>
> The second patch adds helper functions for common MM functionality that may
> be useful to MFS implementors for supporting swapping and handling
> fork/copy on write. Because fork is complicated, this patch adds a callback
> function to the super_operations struct to allow an MFS to decide its fork
> behavior, e.g. allow it to decide to do a deep copy of memory on fork
> instead of copy on write, and adds logic to the dup_mmap function to handle
> FBMM files.
>
> The third patch exports some kernel functions that are needed to implement
> an MFS to allow for MFSs to be written as kernel modules.
>
> The fourth and final patch in this series provides a sample implementation
> of a simple MFS, and is not actually intended to be upstreamed.
>
> [1] https://www.usenix.org/conference/atc24/presentation/tabatabai
> [2] https://www.usenix.org/conference/atc24/presentation/jalalian
> [3] https://www.usenix.org/conference/atc24/presentation/cao
> [4] https://github.com/multifacet/fbmm-workspace/blob/main/bmks/fbmm_wrapper.c
>
> Bijan Tabatabai (4):
> mm: Add support for File Based Memory Management
> fbmm: Add helper functions for FBMM MM Filesystems
> mm: Export functions for writing MM Filesystems
> Add base implementation of an MFS
>
> BasicMFS/Kconfig | 3 +
> BasicMFS/Makefile | 8 +
> BasicMFS/basic.c | 717 ++++++++++++++++++++++++++++++++
> BasicMFS/basic.h | 29 ++
> arch/x86/include/asm/tlbflush.h | 2 -
> arch/x86/mm/tlb.c | 1 +
> fs/Kconfig | 7 +
> fs/Makefile | 1 +
> fs/exec.c | 2 +
> fs/file_based_mm.c | 663 +++++++++++++++++++++++++++++
> fs/proc/base.c | 4 +
> include/linux/file_based_mm.h | 99 +++++
> include/linux/fs.h | 1 +
> include/linux/mm.h | 10 +
> include/linux/sched.h | 4 +
> kernel/exit.c | 3 +
> kernel/fork.c | 57 ++-
> mm/Makefile | 1 +
> mm/fbmm_helpers.c | 372 +++++++++++++++++
> mm/filemap.c | 2 +
> mm/gup.c | 1 +
> mm/internal.h | 13 +
> mm/memory.c | 3 +
> mm/mmap.c | 44 +-
> mm/pgtable-generic.c | 1 +
> mm/rmap.c | 2 +
> mm/vmscan.c | 14 +-
> 27 files changed, 2040 insertions(+), 24 deletions(-)
> create mode 100644 BasicMFS/Kconfig
> create mode 100644 BasicMFS/Makefile
> create mode 100644 BasicMFS/basic.c
> create mode 100644 BasicMFS/basic.h
> create mode 100644 fs/file_based_mm.c
> create mode 100644 include/linux/file_based_mm.h
> create mode 100644 mm/fbmm_helpers.c
>
> --
> 2.34.1
>
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC PATCH 0/4] Add support for File Based Memory Management
2024-11-23 12:23 ` [RFC PATCH 0/4] Add support for File Based Memory Management Lorenzo Stoakes
@ 2024-11-24 16:53 ` Bijan Tabatabai
2024-11-28 10:22 ` David Hildenbrand
1 sibling, 0 replies; 9+ messages in thread
From: Bijan Tabatabai @ 2024-11-24 16:53 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: linux-fsdevel, linux-mm, BIJAN TABATABAI, akpm, viro, brauner,
mingo, Liam Howlett, Vlastimil Babka, Jann Horn
On Sat, Nov 23, 2024 at 6:23 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> + VMA guys, it's important to run scripts/get_maintainers.pl on your
> changes so the right people are pinged :)
Sorry about that. I'll be more mindful of this next time I send a patch.
> While it's a cool project, I don't think it's upstreamable in its current
> form - it essentially bypasses core mm functionality and 'does mm'
> somewhere else (which strikes me, in effect, as the entire purpose of the
> series).
Understandable.
Thank you for spending the time to review the patches and giving a
thorough reply!
Bijan
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC PATCH 0/4] Add support for File Based Memory Management
2024-11-23 12:23 ` [RFC PATCH 0/4] Add support for File Based Memory Management Lorenzo Stoakes
2024-11-24 16:53 ` Bijan Tabatabai
@ 2024-11-28 10:22 ` David Hildenbrand
1 sibling, 0 replies; 9+ messages in thread
From: David Hildenbrand @ 2024-11-28 10:22 UTC (permalink / raw)
To: Lorenzo Stoakes, Bijan Tabatabai
Cc: linux-fsdevel, linux-mm, btabatabai, akpm, viro, brauner, mingo,
Liam Howlett, Vlastimil Babka, Jann Horn
On 23.11.24 13:23, Lorenzo Stoakes wrote:
> + VMA guys, it's important to run scripts/get_maintainers.pl on your
> changes so the right people are pinged :)
>
> On Fri, Nov 22, 2024 at 02:38:26PM -0600, Bijan Tabatabai wrote:
>> This patch set implements file based memory management (FBMM) [1], a
>> research project from the University of Wisconsin-Madison where a process's
>> memory can be transparently managed by memory managers which are written as
>> filesystems. When using FBMM, instead of using the traditional anonymous
>> memory path, a process's memory is managed by mapping files from a memory
>> management filesystem (MFS) into its address space. The MFS implements the
>> memory management related callback functions provided by the VFS to
>> implement the desired memory management functionality. After presenting
>> this work at a conference, a handful of people asked if we were going to
>> upstream the work, so we decided to see if the Linux community would be
>> interested in this functionality as well.
>>
>
> While it's a cool project, I don't think it's upstreamable in its current
> form - it essentially bypasses core mm functionality and 'does mm'
> somewhere else (which strikes me, in effect, as the entire purpose of the
> series).
>
> mm is a subsystem that is in constant flux with many assumptions that one
> might make about it being changed, which make it wholly unsuited to having
> its functionality exported like this.
>
> So in in effect it, by its nature, has to export internals somewhere else,
> and that somewhere else now assumes things about mm that might change at
> any point, additionally bypassing a great deal of highly sensitive and
> purposeful logic.
>
> This series also adds a lot of if (fbmm) { ... } changes to core logic
> which is really not how we want to do things. hugetlbfs does this kind of
> thing, but it is more or less universally seen as a _bad thing_ and
> something we are trying to refactor.
>
> So any upstreamable form of this would need to a. be part of mm, b. use
> existing extensible mechanisms or create them, and c. not have _core_ mm
> tasks or activities be performed 'elsewhere'.
>
> Sadly I think the latter part may make a refactoring in this direction
> infeasible, as it seems to me this is sort of the point of this.
>
> This also means it's not acceptable to export highly sensitive mm internals
> as you do in patch 3/4. Certainly in 1/4, as a co-maintainer of the mmap
> logic, I can't accept the changes you suggest to brk() and mmap(), sorry.
>
> There are huge subtleties in much of mm, including very very sensitive lock
> mechanisms, and keeping such things within mm means we can have confidence
> they work, and that fixes resolve issues.
>
> I hope this isn't too discouraging, the fact you got this functioning is
> amazing and as an out-of-tree research and experimentation project it looks
> really cool, but for me, I don't think this is for upstream.
I agree with this sentiment. It looks like something a research OS
might want to consider as its way of dealing with anonymous memory in
general, but nothing one squeezes into an existing MM implementation.
I'm also not 100% sure on statements like "Providing this transparency"
-- what about fork() and COW? What about memory statistics?
"Transparency" is a strong word :)
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 9+ messages in thread