From mboxrd@z Thu Jan  1 00:00:00 1970
From: Baolin Wang <baolin.wang@linux.alibaba.com>
To: akpm@linux-foundation.org
Cc: arnd@arndb.de, baolin.wang@linux.alibaba.com, jingshan@linux.alibaba.com, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [RFC PATCH] mm: Introduce new MADV_NOMOVABLE behavior
Date: Mon, 17 Oct 2022 15:32:01 +0800
X-Mailer: git-send-email 1.8.3.1

When creating a virtual machine, we use memfd_create() to get a file descriptor that can be used to create shared memory mappings with mmap(); the mmap() call also passes the MAP_POPULATE flag so that physical pages are allocated for the virtual machine up front.
When allocating physical pages for the guest, the host can fall back to allocating CMA pages when more than half of a zone's free memory sits in the CMA area. When an application in the guest OS wants to perform DMA, QEMU calls the VFIO_IOMMU_MAP_DMA ioctl to longterm-pin the backing pages and create IOMMU mappings for them. However, we found that this longterm-pin sometimes fails. After some investigation, it turned out that the pages backing the DMA mapping can include CMA pages, and the longterm-pin fails when those CMA pages cannot be migrated out of the CMA area, either because of a temporary reference count on a page or because of a memory allocation failure. The VFIO_IOMMU_MAP_DMA ioctl then returns an error, and the application fails to start.

To fix this, this patch introduces a new madvise behavior, MADV_NOMOVABLE, which avoids allocating movable and CMA pages for a mapping that the user intends to longterm-pin, removing the possible failure caused by movable or CMA page migration.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 include/linux/mm.h                     | 6 ++++++
 include/uapi/asm-generic/mman-common.h | 2 ++
 mm/madvise.c                           | 6 ++++++
 mm/memory.c                            | 6 ++++++
 4 files changed, 20 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c63dfc804f1e..c9b2ab6e96fc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -307,6 +307,7 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_HUGEPAGE	0x20000000	/* MADV_HUGEPAGE marked this vma */
 #define VM_NOHUGEPAGE	0x40000000	/* MADV_NOHUGEPAGE marked this vma */
 #define VM_MERGEABLE	0x80000000	/* KSM may merge identical pages */
+#define VM_NOMOVABLE	0x100000000	/* Avoid movable pages */
 
 #ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
 #define VM_HIGH_ARCH_BIT_0	32	/* bit only usable on 64-bit architectures */
@@ -661,6 +662,11 @@ static inline bool vma_is_accessible(struct vm_area_struct *vma)
 	return vma->vm_flags & VM_ACCESS_FLAGS;
 }
 
+static inline bool vma_no_movable(struct vm_area_struct *vma)
+{
+	return vma->vm_flags & VM_NOMOVABLE;
+}
+
 static inline
 struct vm_area_struct *vma_find(struct vma_iterator *vmi, unsigned long max)
 {
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 6ce1f1ceb432..d6e64eda28b6 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -79,6 +79,8 @@
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
 
+#define MADV_NOMOVABLE	26		/* Avoid movable pages */
+
 /* compatibility flags */
 #define MAP_FILE	0
diff --git a/mm/madvise.c b/mm/madvise.c
index 2baa93ca2310..fc59d4f1f123 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1045,6 +1045,9 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
 	case MADV_DONTDUMP:
 		new_flags |= VM_DONTDUMP;
 		break;
+	case MADV_NOMOVABLE:
+		new_flags |= VM_NOMOVABLE;
+		break;
 	case MADV_DODUMP:
 		if (!is_vm_hugetlb_page(vma) && new_flags & VM_SPECIAL)
 			return -EINVAL;
@@ -1150,6 +1153,7 @@ madvise_behavior_valid(int behavior)
 	case MADV_PAGEOUT:
 	case MADV_POPULATE_READ:
 	case MADV_POPULATE_WRITE:
+	case MADV_NOMOVABLE:
 #ifdef CONFIG_KSM
 	case MADV_MERGEABLE:
 	case MADV_UNMERGEABLE:
@@ -1360,6 +1364,8 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
 *		triggering read faults if required
 *  MADV_POPULATE_WRITE - populate (prefault) page tables writable by
 *		triggering write faults if required
+*  MADV_NOMOVABLE - avoid movable page allocation in the page fault path
+*		when longterm-pin is required.
 *
 * return values:
 *  zero    - success
diff --git a/mm/memory.c b/mm/memory.c
index f88c351aecd4..5b75be6ba659 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5189,6 +5189,7 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 			   unsigned int flags, struct pt_regs *regs)
 {
 	vm_fault_t ret;
+	unsigned long pf_flags;
 
 	__set_current_state(TASK_RUNNING);
 
@@ -5203,6 +5204,8 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 			    flags & FAULT_FLAG_REMOTE))
 		return VM_FAULT_SIGSEGV;
 
+	if (vma_no_movable(vma))
+		pf_flags = memalloc_pin_save();
 	/*
 	 * Enable the memcg OOM handling for faults triggered in user
 	 * space.  Kernel faults are handled more gracefully.
@@ -5231,6 +5234,9 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 		mem_cgroup_oom_synchronize(false);
 	}
 
+	if (vma_no_movable(vma))
+		memalloc_pin_restore(pf_flags);
+
 	mm_account_fault(regs, address, flags, ret);
 
 	return ret;
-- 
2.27.0