From: Gregory Price
To: linux-mm@kvack.org
Cc: linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-api@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-kernel@vger.kernel.org, akpm@linux-foundation.org,
	arnd@arndb.de, tglx@linutronix.de, luto@kernel.org,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	x86@kernel.org, hpa@zytor.com, mhocko@kernel.org, tj@kernel.org,
	ying.huang@intel.com, Gregory Price
Subject: [RFC PATCH 07/11] mm/mempolicy: add task mempolicy syscall variants
Date: Wed, 22 Nov 2023 16:11:56 -0500
Message-Id: <20231122211200.31620-8-gregory.price@memverge.com>
In-Reply-To: <20231122211200.31620-1-gregory.price@memverge.com>
References: <20231122211200.31620-1-gregory.price@memverge.com>

Add system calls to allow one task to view or change another task's
mempolicy settings.

The task mempolicy has traditionally been a feature that could only
be changed by the task itself. This creates issues with task
migrations between cgroups where cpusets may differ.
Attempts were made to allow policy nodemasks to be shifted via a flag
(MPOL_F_RELATIVE_NODES), but this is not foolproof.

Additionally, as new policies emerge (like weighted interleave), it may
be necessary to allow not just the policy to be changed, but individual
attributes of the policy (such as a node weight) in response to other
system events - such as memory hotplug.

If pid is 0, this behaves the same as the original mempolicy syscalls;
otherwise this interface requires CAP_SYS_NICE.

Syscalls in this patch:
	sys_set_task_mempolicy
	sys_get_task_mempolicy
	sys_set_task_mempolicy_home_node
	sys_task_mbind

Signed-off-by: Gregory Price
---
 arch/x86/entry/syscalls/syscall_32.tbl |   4 +
 arch/x86/entry/syscalls/syscall_64.tbl |   4 +
 include/linux/syscalls.h               |  14 +++
 include/uapi/asm-generic/unistd.h      |  10 ++-
 include/uapi/linux/mempolicy.h         |  10 +++
 mm/mempolicy.c                         | 119 +++++++++++++++++++++++++
 6 files changed, 160 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index c8fac5205803..358bd91d7461 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -461,3 +461,7 @@
 454	i386	futex_wake	sys_futex_wake
 455	i386	futex_wait	sys_futex_wait
 456	i386	futex_requeue	sys_futex_requeue
+457	i386	set_task_mempolicy	sys_set_task_mempolicy
+458	i386	get_task_mempolicy	sys_get_task_mempolicy
+459	i386	set_task_mempolicy_home_node	sys_set_task_mempolicy_home_node
+460	i386	task_mbind	sys_task_mbind
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 8cb8bf68721c..c83b0c5c1ff9 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -378,6 +378,10 @@
 454	common	futex_wake	sys_futex_wake
 455	common	futex_wait	sys_futex_wait
 456	common	futex_requeue	sys_futex_requeue
+457	common	set_task_mempolicy	sys_set_task_mempolicy
+458	common	get_task_mempolicy	sys_get_task_mempolicy
+459	common	set_task_mempolicy_home_node	sys_set_task_mempolicy_home_node
+460	common	task_mbind	sys_task_mbind
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index fd9d12de7e92..fd1a8863b5c1 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -816,12 +816,21 @@ asmlinkage long sys_mbind(unsigned long start, unsigned long len,
 				const unsigned long __user *nmask,
 				unsigned long maxnode,
 				unsigned flags);
+asmlinkage long sys_task_mbind(const struct mbind_args __user *uargs,
+			       size_t usize);
 asmlinkage long sys_get_mempolicy(int __user *policy,
 				unsigned long __user *nmask,
 				unsigned long maxnode,
 				unsigned long addr, unsigned long flags);
 asmlinkage long sys_set_mempolicy(int mode, const unsigned long __user *nmask,
 				unsigned long maxnode);
+asmlinkage long sys_get_task_mempolicy(pid_t pid, int __user *policy,
+				       unsigned long __user *nmask,
+				       unsigned long maxnode,
+				       unsigned long addr, unsigned long flags);
+asmlinkage long sys_set_task_mempolicy(pid_t pid, int mode,
+				       const unsigned long __user *nmask,
+				       unsigned long maxnode);
 asmlinkage long sys_migrate_pages(pid_t pid, unsigned long maxnode,
 				const unsigned long __user *from,
 				const unsigned long __user *to);
@@ -945,6 +954,11 @@ asmlinkage long sys_memfd_secret(unsigned int flags);
 asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
 					    unsigned long home_node,
 					    unsigned long flags);
+asmlinkage long sys_set_task_mempolicy_home_node(pid_t pid,
+						 unsigned long start,
+						 unsigned long len,
+						 unsigned long home_node,
+						 unsigned long flags);
 asmlinkage long sys_cachestat(unsigned int fd,
 		struct cachestat_range __user *cstat_range,
 		struct cachestat __user *cstat, unsigned int flags);
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 756b013fb832..f179715f1d59 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -828,9 +828,17 @@ __SYSCALL(__NR_futex_wake, sys_futex_wake)
 __SYSCALL(__NR_futex_wait, sys_futex_wait)
 #define __NR_futex_requeue 456
 __SYSCALL(__NR_futex_requeue, sys_futex_requeue)
+#define __NR_set_task_mempolicy 457
+__SYSCALL(__NR_set_task_mempolicy, sys_set_task_mempolicy)
+#define __NR_get_task_mempolicy 458
+__SYSCALL(__NR_get_task_mempolicy, sys_get_task_mempolicy)
+#define __NR_set_task_mempolicy_home_node 459
+__SYSCALL(__NR_set_task_mempolicy_home_node, sys_set_task_mempolicy_home_node)
+#define __NR_task_mbind 460
+__SYSCALL(__NR_task_mbind, sys_task_mbind)
 
 #undef __NR_syscalls
-#define __NR_syscalls 457
+#define __NR_syscalls 461
 
 /*
  * 32 bit systems traditionally used different
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index a8963f7ef4c2..c29cfb25db29 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -26,6 +26,16 @@ enum {
 	MPOL_MAX,	/* always last member of enum */
 };
 
+struct mbind_args {
+	pid_t pid;
+	unsigned long start;
+	unsigned long len;
+	unsigned long mode;
+	unsigned long *nmask;
+	unsigned long maxnode;
+	unsigned int flags;
+};
+
 /* Flags for set_mempolicy */
 #define MPOL_F_STATIC_NODES	(1 << 15)
 #define MPOL_F_RELATIVE_NODES	(1 << 14)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 3d2171ac4098..fb295ade8ad7 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1654,6 +1654,32 @@ SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, le
 	return __set_mempolicy_home_node(current, start, len, home_node, flags);
 }
 
+SYSCALL_DEFINE5(set_task_mempolicy_home_node, pid_t, pid, unsigned long, start,
+		unsigned long, len, unsigned long, home_node,
+		unsigned long, flags)
+{
+	struct task_struct *task;
+	int err;
+
+	if (pid && !capable(CAP_SYS_NICE))
+		return -EPERM;
+
+	rcu_read_lock();
+	task = pid ? find_task_by_vpid(pid) : current;
+	if (!task) {
+		rcu_read_unlock();
+		err = -ESRCH;
+		goto out;
+	}
+	get_task_struct(task);
+	rcu_read_unlock();
+
+	err = __set_mempolicy_home_node(task, start, len, home_node, flags);
+	put_task_struct(task);
+out:
+	return err;
+}
+
 SYSCALL_DEFINE6(mbind, unsigned long, start, unsigned long, len,
 		unsigned long, mode, const unsigned long __user *, nmask,
 		unsigned long, maxnode, unsigned int, flags)
@@ -1661,6 +1687,48 @@ SYSCALL_DEFINE6(mbind, unsigned long, start, unsigned long, len,
 	return kernel_mbind(current, start, len, mode, nmask, maxnode, flags);
 }
 
+static long kernel_task_mbind(const struct mbind_args __user *uargs,
+			      size_t usize)
+{
+	struct mbind_args kargs;
+	struct task_struct *task;
+	int err;
+
+	if (usize < sizeof(kargs))
+		return -EINVAL;
+
+	err = copy_struct_from_user(&kargs, sizeof(kargs), uargs, usize);
+	if (err)
+		return err;
+
+
+	if (kargs.pid && !capable(CAP_SYS_NICE))
+		return -EPERM;
+
+	rcu_read_lock();
+	task = kargs.pid ? find_task_by_vpid(kargs.pid) : current;
+	if (!task) {
+		rcu_read_unlock();
+		err = -ESRCH;
+		goto out;
+	}
+	get_task_struct(task);
+	rcu_read_unlock();
+
+	err = kernel_mbind(task, kargs.start, kargs.len, kargs.mode,
+			   kargs.nmask, kargs.maxnode, kargs.flags);
+
+	put_task_struct(task);
+out:
+	return err;
+}
+
+SYSCALL_DEFINE2(task_mbind, const struct mbind_args __user *, args,
+		size_t, size)
+{
+	return kernel_task_mbind(args, size);
+}
+
 /* Set the process memory policy */
 static long kernel_set_mempolicy(struct task_struct *task, int mode,
 				 const unsigned long __user *nmask,
@@ -1688,6 +1756,31 @@ SYSCALL_DEFINE3(set_mempolicy, int, mode, const unsigned long __user *, nmask,
 	return kernel_set_mempolicy(current, mode, nmask, maxnode);
 }
 
+SYSCALL_DEFINE4(set_task_mempolicy, pid_t, pid, int, mode,
+		const unsigned long __user *, nmask, unsigned long, maxnode)
+{
+	struct task_struct *task;
+	int err;
+
+	if (pid && !capable(CAP_SYS_NICE))
+		return -EPERM;
+
+	rcu_read_lock();
+	task = pid ? find_task_by_vpid(pid) : current;
+	if (!task) {
+		rcu_read_unlock();
+		err = -ESRCH;
+		goto out;
+	}
+	get_task_struct(task);
+	rcu_read_unlock();
+
+	err = kernel_set_mempolicy(task, mode, nmask, maxnode);
+	put_task_struct(task);
+out:
+	return err;
+}
+
 static int kernel_migrate_pages(pid_t pid, unsigned long maxnode,
 				const unsigned long __user *old_nodes,
 				const unsigned long __user *new_nodes)
@@ -1821,6 +1914,32 @@ SYSCALL_DEFINE5(get_mempolicy, int __user *, policy,
 				    flags);
 }
 
+SYSCALL_DEFINE6(get_task_mempolicy, pid_t, pid, int __user *, policy,
+		unsigned long __user *, nmask, unsigned long, maxnode,
+		unsigned long, addr, unsigned long, flags)
+{
+	struct task_struct *task;
+	int err;
+
+	if (pid && !capable(CAP_SYS_NICE))
+		return -EPERM;
+
+	rcu_read_lock();
+	task = pid ? find_task_by_vpid(pid) : current;
+	if (!task) {
+		rcu_read_unlock();
+		err = -ESRCH;
+		goto out;
+	}
+	get_task_struct(task);
+	rcu_read_unlock();
+
+	err = kernel_get_mempolicy(task, policy, nmask, maxnode, addr, flags);
+	put_task_struct(task);
+out:
+	return err;
+}
+
 bool vma_migratable(struct vm_area_struct *vma)
 {
 	if (vma->vm_flags & (VM_IO | VM_PFNMAP))
-- 
2.39.1
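
Note on the pid handling: all four syscalls in this patch open with the
same capability check and task lookup. The block below is a minimal
userspace model of that control flow, offered only as a reading aid. It
is not kernel code and does not invoke any syscall. The names
resolve_target and enum target_kind are hypothetical, and the two
boolean parameters are assumptions standing in for
capable(CAP_SYS_NICE) and a successful find_task_by_vpid().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical model: which task a call would end up operating on. */
enum target_kind {
	TARGET_SELF,	/* pid == 0: the calling task ("current") */
	TARGET_OTHER,	/* pid != 0: another task, looked up by pid */
};

/*
 * Userspace model of the prologue shared by the four syscalls.
 * has_cap_sys_nice stands in for capable(CAP_SYS_NICE) and
 * pid_exists for a successful find_task_by_vpid(); both are
 * assumptions made so the logic can run outside the kernel.
 * Returns 0 and fills *out on success, or a negative errno.
 */
int resolve_target(pid_t pid, bool has_cap_sys_nice, bool pid_exists,
		   enum target_kind *out)
{
	assert(out != NULL);

	/* A nonzero pid is privileged; this is checked before the lookup. */
	if (pid && !has_cap_sys_nice)
		return -EPERM;

	/* pid == 0 keeps the behaviour of the original syscalls. */
	if (!pid) {
		*out = TARGET_SELF;
		return 0;
	}

	/* The lookup fails when no task with that pid is visible. */
	if (!pid_exists)
		return -ESRCH;

	*out = TARGET_OTHER;
	return 0;
}
```

One consequence the model makes visible: because the capability check
precedes the lookup, an unprivileged caller that passes any nonzero pid
gets -EPERM, even if that pid is its own.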