From: Weixi Zhu
Subject: [RFC PATCH 4/6] mm/gmem: add new syscall hmadvise() to issue memory hints for heterogeneous NUMA nodes
Date: Tue, 28 Nov 2023 20:50:23 +0800
Message-ID: <20231128125025.4449-5-weixi.zhu@huawei.com>
In-Reply-To: <20231128125025.4449-1-weixi.zhu@huawei.com>
References: <20231128125025.4449-1-weixi.zhu@huawei.com>
X-Mailer: git-send-email 2.25.1

This patch adds a new syscall, hmadvise(), to issue memory hints for
heterogeneous NUMA nodes. The new syscall effectively extends madvise()
with one additional argument that indicates the NUMA id of a heterogeneous
device, which is not necessarily accessible by the CPU.

The implemented memory hint is MADV_PREFETCH, which guarantees that the
physical data of the given VMA [VA, VA+size) is migrated to a designated
NUMA id, so subsequent accesses from the corresponding device can obtain
local memory access speed.
This prefetch hint is internally parallelized with multiple workqueue
threads, allowing the page table management to be overlapped. In a test
with Huawei's Ascend NPU card, MADV_PREFETCH is able to saturate the
host-device bandwidth if the given VMA size is larger than 16MB.

Signed-off-by: Weixi Zhu
---
 arch/arm64/include/asm/unistd.h         |   2 +-
 arch/arm64/include/asm/unistd32.h       |   2 +
 include/linux/gmem.h                    |   9 +
 include/uapi/asm-generic/mman-common.h  |   3 +
 include/uapi/asm-generic/unistd.h       |   5 +-
 kernel/sys_ni.c                         |   2 +
 mm/gmem.c                               | 222 ++++++++++++++++++++++++
 tools/include/uapi/asm-generic/unistd.h |   5 +-
 8 files changed, 247 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 531effca5f1f..298313d2e0af 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -39,7 +39,7 @@
 #define __ARM_NR_compat_set_tls		(__ARM_NR_COMPAT_BASE + 5)
 #define __ARM_NR_COMPAT_END		(__ARM_NR_COMPAT_BASE + 0x800)
 
-#define __NR_compat_syscalls		457
+#define __NR_compat_syscalls		458
 #endif
 
 #define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 9f7c1bf99526..0d44383b98be 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -919,6 +919,8 @@ __SYSCALL(__NR_futex_wake, sys_futex_wake)
 __SYSCALL(__NR_futex_wait, sys_futex_wait)
 #define __NR_futex_requeue 456
 __SYSCALL(__NR_futex_requeue, sys_futex_requeue)
+#define __NR_hmadvise 457
+__SYSCALL(__NR_hmadvise, sys_hmadvise)
 
 /*
  * Please add new compat syscalls above this comment and update
diff --git a/include/linux/gmem.h b/include/linux/gmem.h
index f424225daa03..97186f29638d 100644
--- a/include/linux/gmem.h
+++ b/include/linux/gmem.h
@@ -22,6 +22,11 @@ static inline bool gmem_is_enabled(void)
 	return static_branch_likely(&gmem_status);
 }
 
+static inline bool vma_is_peer_shared(struct vm_area_struct *vma)
+{
+	return false;
+}
+
 struct gm_dev {
 	int id;
 
@@ -280,6 +285,10 @@ int gm_as_attach(struct gm_as *as, struct gm_dev *dev, enum gm_mmu_mode mode,
 		 bool activate, struct gm_context **out_ctx);
 #else
 static inline bool gmem_is_enabled(void) { return false; }
+static inline bool vma_is_peer_shared(struct vm_area_struct *vma)
+{
+	return false;
+}
 static inline void hnuma_init(void) {}
 static inline void __init vm_object_init(void)
 {
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 6ce1f1ceb432..49b22a497c5d 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -79,6 +79,9 @@
 
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
 
+/* for hmadvise */
+#define MADV_PREFETCH	26		/* prefetch pages for hNUMA node */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 756b013fb832..a0773d4f7fa5 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -829,8 +829,11 @@ __SYSCALL(__NR_futex_wait, sys_futex_wait)
 #define __NR_futex_requeue 456
 __SYSCALL(__NR_futex_requeue, sys_futex_requeue)
 
+#define __NR_hmadvise 457
+__SYSCALL(__NR_hmadvise, sys_hmadvise)
+
 #undef __NR_syscalls
-#define __NR_syscalls 457
+#define __NR_syscalls 458
 
 /*
  * 32 bit systems traditionally used different
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index e1a6e3c675c0..73bc1b35b8c6 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -374,3 +374,5 @@ COND_SYSCALL(setuid16);
 
 /* restartable sequence */
 COND_SYSCALL(rseq);
+
+COND_SYSCALL(hmadvise);
diff --git a/mm/gmem.c b/mm/gmem.c
index b95b6b42ed6d..4eb522026a0d 100644
--- a/mm/gmem.c
+++ b/mm/gmem.c
@@ -9,6 +9,8 @@
 #include
 #include
 #include
+#include
+#include
 
 DEFINE_STATIC_KEY_FALSE(gmem_status);
 EXPORT_SYMBOL_GPL(gmem_status);
@@ -484,3 +486,223 @@ int gm_as_attach(struct gm_as *as, struct gm_dev *dev, enum gm_mmu_mode mode,
 	return GM_RET_SUCCESS;
 }
 EXPORT_SYMBOL_GPL(gm_as_attach);
+
+struct prefetch_data {
+	struct mm_struct *mm;
+	struct gm_dev *dev;
+	unsigned long addr;
+	size_t size;
+	struct work_struct work;
+	int *res;
+};
+
+static void prefetch_work_cb(struct work_struct *work)
+{
+	struct prefetch_data *d =
+		container_of(work, struct prefetch_data, work);
+	unsigned long addr = d->addr, end = d->addr + d->size;
+	int page_size = HPAGE_SIZE;
+	int ret;
+
+	do {
+		/*
+		 * Pass a hint to tell gm_dev_fault() to invoke peer_map anyways
+		 * and implicitly mark the mapped physical page as recently-used.
+		 */
+		ret = gm_dev_fault(d->mm, addr, d->dev, GM_FAULT_HINT_MARK_HOT);
+		if (ret == GM_RET_PAGE_EXIST) {
+			pr_info("%s: device has done page fault, ignore prefetch\n", __func__);
+		} else if (ret != GM_RET_SUCCESS) {
+			*d->res = -EFAULT;
+			pr_err("%s: call dev fault error %d\n", __func__, ret);
+		}
+	} while (addr += page_size, addr != end);
+
+	kfree(d);
+}
+
+static int hmadvise_do_prefetch(struct gm_dev *dev, unsigned long addr, size_t size)
+{
+	unsigned long start, end, per_size;
+	int page_size = HPAGE_SIZE;
+	struct prefetch_data *data;
+	struct vm_area_struct *vma;
+	int res = GM_RET_SUCCESS;
+
+	end = round_up(addr + size, page_size);
+	start = round_down(addr, page_size);
+	size = end - start;
+
+	mmap_read_lock(current->mm);
+	vma = find_vma(current->mm, start);
+	if (!vma || start < vma->vm_start || end > vma->vm_end) {
+		mmap_read_unlock(current->mm);
+		return GM_RET_FAILURE_UNKNOWN;
+	}
+	mmap_read_unlock(current->mm);
+
+	per_size = (size / GM_WORK_CONCURRENCY) & ~(page_size - 1);
+
+	while (start < end) {
+		data = kzalloc(sizeof(struct prefetch_data), GFP_KERNEL);
+		if (!data) {
+			flush_workqueue(prefetch_wq);
+			return GM_RET_NOMEM;
+		}
+
+		INIT_WORK(&data->work, prefetch_work_cb);
+		data->mm = current->mm;
+		data->dev = dev;
+		data->addr = start;
+		data->res = &res;
+		if (per_size == 0)
+			data->size = size;
+		else
+			data->size = (end - start < 2 * per_size) ? (end - start) : per_size;
+		queue_work(prefetch_wq, &data->work);
+		start += data->size;
+	}
+
+	flush_workqueue(prefetch_wq);
+	return res;
+}
+
+static int gm_unmap_page_range(struct vm_area_struct *vma, unsigned long start,
+			       unsigned long end, int page_size)
+{
+	struct gm_fault_t gmf = {
+		.mm = current->mm,
+		.size = page_size,
+		.copy = false,
+	};
+	struct gm_mapping *gm_mapping;
+	struct vm_object *obj;
+	int ret;
+
+	obj = current->mm->vm_obj;
+	if (!obj) {
+		pr_err("gmem: peer-shared vma should have vm_object\n");
+		return -EINVAL;
+	}
+
+	for (; start < end; start += page_size) {
+		xa_lock(obj->logical_page_table);
+		gm_mapping = vm_object_lookup(obj, start);
+		if (!gm_mapping) {
+			xa_unlock(obj->logical_page_table);
+			continue;
+		}
+		xa_unlock(obj->logical_page_table);
+		mutex_lock(&gm_mapping->lock);
+		if (gm_mapping_nomap(gm_mapping)) {
+			mutex_unlock(&gm_mapping->lock);
+			continue;
+		} else if (gm_mapping_cpu(gm_mapping)) {
+			zap_page_range_single(vma, start, page_size, NULL);
+		} else {
+			gmf.va = start;
+			gmf.dev = gm_mapping->dev;
+			ret = gm_mapping->dev->mmu->peer_unmap(&gmf);
+			if (ret) {
+				pr_err("gmem: peer_unmap failed. ret %d\n",
+				       ret);
+				mutex_unlock(&gm_mapping->lock);
+				continue;
+			}
+		}
+		gm_mapping_flags_set(gm_mapping, GM_PAGE_NOMAP);
+		mutex_unlock(&gm_mapping->lock);
+	}
+
+	return 0;
+}
+
+static int hmadvise_do_eagerfree(unsigned long addr, size_t size)
+{
+	unsigned long start, end, i_start, i_end;
+	int page_size = HPAGE_SIZE;
+	struct vm_area_struct *vma;
+	int ret = GM_RET_SUCCESS;
+	unsigned long old_start;
+
+	if (check_add_overflow(addr, size, &end))
+		return -EINVAL;
+
+	old_start = addr;
+
+	end = round_down(addr + size, page_size);
+	start = round_up(addr, page_size);
+	if (start >= end)
+		return ret;
+
+	mmap_read_lock(current->mm);
+	do {
+		vma = find_vma_intersection(current->mm, start, end);
+		if (!vma) {
+			pr_info("gmem: there is no valid vma\n");
+			break;
+		}
+
+		if (!vma_is_peer_shared(vma)) {
+			pr_debug("gmem: not peer-shared vma, skip dontneed\n");
+			start = vma->vm_end;
+			continue;
+		}
+
+		i_start = start > vma->vm_start ? start : vma->vm_start;
+		i_end = end < vma->vm_end ? end : vma->vm_end;
+		ret = gm_unmap_page_range(vma, i_start, i_end, page_size);
+		if (ret)
+			break;
+
+		start = vma->vm_end;
+	} while (start < end);
+
+	mmap_read_unlock(current->mm);
+	return ret;
+}
+
+static bool check_hmadvise_behavior(int behavior)
+{
+	return behavior == MADV_DONTNEED;
+}
+
+SYSCALL_DEFINE4(hmadvise, int, hnid, unsigned long, start, size_t, len_in, int, behavior)
+{
+	int error = -EINVAL;
+	struct hnode *node;
+
+	if (hnid == -1) {
+		if (check_hmadvise_behavior(behavior)) {
+			goto no_hnid;
+		} else {
+			pr_err("hmadvise: behavior %d need hnid or is invalid\n",
+			       behavior);
+			return error;
+		}
+	}
+
+	if (hnid < 0)
+		return error;
+
+	if (!is_hnode(hnid) || !is_hnode_allowed(hnid))
+		return error;
+
+	node = get_hnode(hnid);
+	if (!node) {
+		pr_err("hmadvise: hnode id %d is invalid\n", hnid);
+		return error;
+	}
+
+no_hnid:
+	switch (behavior) {
+	case MADV_PREFETCH:
+		return hmadvise_do_prefetch(node->dev, start, len_in);
+	case MADV_DONTNEED:
+		return hmadvise_do_eagerfree(start, len_in);
+	default:
+		pr_err("hmadvise: unsupported behavior %d\n", behavior);
+	}
+
+	return error;
+}
diff --git a/tools/include/uapi/asm-generic/unistd.h b/tools/include/uapi/asm-generic/unistd.h
index 76d946445391..6d28d7a4096c 100644
--- a/tools/include/uapi/asm-generic/unistd.h
+++ b/tools/include/uapi/asm-generic/unistd.h
@@ -823,8 +823,11 @@ __SYSCALL(__NR_cachestat, sys_cachestat)
 #define __NR_fchmodat2 452
 __SYSCALL(__NR_fchmodat2, sys_fchmodat2)
 
+#define __NR_hmadvise 453
+__SYSCALL(__NR_hmadvise, sys_hmadvise)
+
 #undef __NR_syscalls
-#define __NR_syscalls 453
+#define __NR_syscalls 454
 
 /*
  * 32 bit systems traditionally used different
-- 
2.25.1
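
Note for readers following the range splitting in hmadvise_do_prefetch():
the loop can be modeled in plain userspace C to see which work items a
given request would produce. The sketch below is only an illustration, not
part of the patch. HPAGE_SIZE_ASSUMED (2 MiB) and
GM_WORK_CONCURRENCY_ASSUMED (4) are assumptions, since this patch does not
show the real values, and struct chunk and split_prefetch_range() are
hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for illustration only; the patch does not define them. */
#define HPAGE_SIZE_ASSUMED          (2UL << 20) /* 2 MiB huge page */
#define GM_WORK_CONCURRENCY_ASSUMED 4UL

/* Hypothetical stand-in for the (addr, size) pair of one work item. */
struct chunk {
	unsigned long addr;
	unsigned long size;
};

/*
 * Userspace model of the splitting loop in hmadvise_do_prefetch():
 * widen [addr, addr + size) to huge-page boundaries, then cut the range
 * into work items. Returns the number of chunks written to out[], which
 * is capped at max.
 */
static size_t split_prefetch_range(unsigned long addr, unsigned long size,
				   struct chunk *out, size_t max)
{
	unsigned long page = HPAGE_SIZE_ASSUMED;
	unsigned long start = addr & ~(page - 1);                   /* round_down */
	unsigned long end = (addr + size + page - 1) & ~(page - 1); /* round_up */
	unsigned long total = end - start;
	unsigned long per = (total / GM_WORK_CONCURRENCY_ASSUMED) & ~(page - 1);
	size_t n = 0;

	assert(out != NULL);

	while (start < end && n < max) {
		unsigned long len;

		if (per == 0)
			len = total;	/* too small to split: one work item */
		else
			len = (end - start < 2 * per) ? (end - start) : per;

		out[n].addr = start;
		out[n].size = len;
		n++;
		start += len;
	}
	return n;
}
```

Under these assumed values, a huge-page-aligned 16 MiB request becomes four
4 MiB work items, while any rounded range smaller than 8 MiB ends up as a
single work item because per_size rounds down to zero.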