Date: Wed, 9 Nov 2022 15:54:25 -0800
From: Andrei Vagin <avagin@gmail.com>
To: Muhammad Usama Anjum <usama.anjum@collabora.com>
Cc: Michał Mirosław, Danylo Mocherniuk, Alexander Viro, Andrew Morton,
	Suren Baghdasaryan, Greg
	KH, Christian Brauner, Peter Xu, Yang Shi, Vlastimil Babka,
	Zach O'Keefe, "Matthew Wilcox (Oracle)", "Gustavo A. R. Silva",
	Dan Williams, kernel@collabora.com, Gabriel Krisman Bertazi,
	David Hildenbrand, Peter Enderborg,
	"open list : KERNEL SELFTEST FRAMEWORK", Shuah Khan, open list,
	"open list : PROC FILESYSTEM", "open list : MEMORY MANAGEMENT",
	Paul Gofman
Subject: Re: [PATCH v6 2/3] fs/proc/task_mmu: Implement IOCTL to get and/or
	clear info about PTEs
References: <20221109102303.851281-1-usama.anjum@collabora.com>
	<20221109102303.851281-3-usama.anjum@collabora.com>
In-Reply-To: <20221109102303.851281-3-usama.anjum@collabora.com>

Hi Muhammad,

Here are a few inline comments.

On Wed, Nov 09, 2022 at 03:23:02PM +0500, Muhammad Usama Anjum wrote:
> This IOCTL, PAGEMAP_SCAN, can be used to get and/or clear info about
> page table entries. The following operations are supported in this
> ioctl:
> - Get information on whether the pages are soft-dirty, file-mapped,
>   present or swapped.
> - Clear the soft-dirty PTE bit of the pages.
> - Get and clear the soft-dirty PTE bit of the pages.
>
> Only the soft-dirty bit can be read and cleared atomically. struct
> pagemap_scan_arg is used as the argument of the IOCTL. In this struct:
> - The range is specified through start and len.
> - The output buffer and its size are specified as vec and vec_len.
> - The optional maximum number of requested pages is specified in
>   max_pages.
> - The flags can be specified in the flags field. PAGEMAP_SOFTDIRTY_CLEAR
>   and PAGEMAP_NO_REUSED_REGIONS are supported.
> - The masks are specified in required_mask, anyof_mask, excluded_mask
>   and return_mask.
>
> This IOCTL can be extended to get information about more PTE bits.
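
For readers following along, here is a rough sketch of how I read this
interface from user space. It is untested and assumes the request code is
exposed as PAGEMAP_SCAN in the new uapi header; area/area_len stand for a
page-aligned mapping and its size:

	#include <fcntl.h>
	#include <stdint.h>
	#include <sys/ioctl.h>
	#include <linux/fs.h>	/* assuming the new uapi definitions land here */

	struct page_region regions[32];
	struct pagemap_scan_arg arg = {
		.start = (uintptr_t)area,		/* must be page-aligned */
		.len = area_len,
		.vec = (uintptr_t)regions,
		.vec_len = 32,				/* in page_region units */
		.required_mask = PAGE_IS_SOFTDIRTY,
		.return_mask = PAGE_IS_SOFTDIRTY,
		.flags = PAGEMAP_SOFTDIRTY_CLEAR,	/* get and clear atomically */
	};
	int fd = open("/proc/self/pagemap", O_RDONLY);
	int ret = ioctl(fd, PAGEMAP_SCAN, &arg);
	/*
	 * On success, regions[] holds merged ranges of soft-dirty pages; I
	 * assume ret reports how many page_region entries were filled.
	 */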
>
> This is based on a patch from Gabriel Krisman Bertazi.
>
> Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
> ---
> Changes in v6:
> - Rename variables and update comments
> - Make IOCTL independent of soft_dirty config
> - Change masks and bitmap type to _u64
> - Improve code quality
>
> Changes in v5:
> - Remove tlb flushing even for clear operation
>
> Changes in v4:
> - Update the interface and implementation
>
> Changes in v3:
> - Tighten the user-kernel interface by using explicit types and add more
>   error checking
>
> Changes in v2:
> - Convert the interface from syscall to ioctl
> - Remove pidfd support as it doesn't make sense in ioctl
> ---
>  fs/proc/task_mmu.c            | 328 ++++++++++++++++++++++++++++++++++
>  include/uapi/linux/fs.h       |  56 ++++++
>  tools/include/uapi/linux/fs.h |  56 ++++++
>  3 files changed, 440 insertions(+)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 8235c536ac70..8d6a84ec5ef7 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -19,6 +19,9 @@
>  #include
>  #include
>  #include
> +#include
> +#include
> +#include
>
>  #include
>  #include
> @@ -1775,11 +1778,336 @@ static int pagemap_release(struct inode *inode, struct file *file)
>  	return 0;
>  }
>
> +#define PAGEMAP_OP_MASK		(PAGE_IS_SOFTDIRTY | PAGE_IS_FILE |	\
> +				 PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
> +#define PAGEMAP_NONSD_OP_MASK	(PAGE_IS_FILE | PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
> +#define PAGEMAP_SD_FLAGS	(PAGEMAP_SOFTDIRTY_CLEAR | PAGEMAP_NO_REUSED_REGIONS)
> +#define IS_CLEAR_OP(a)		(a->flags & PAGEMAP_SOFTDIRTY_CLEAR)
> +#define IS_GET_OP(a)		(a->vec)
> +#define IS_SD_OP(a)		(a->flags & PAGEMAP_SD_FLAGS)
> +
> +struct pagemap_scan_private {
> +	struct page_region *vec;
> +	unsigned long vec_len;
> +	unsigned long vec_index;
> +	unsigned int max_pages;
> +	unsigned int found_pages;
> +	unsigned int flags;
> +	unsigned long required_mask;
> +	unsigned long anyof_mask;
> +	unsigned long excluded_mask;
> +	unsigned long return_mask;
> +};
> +
> +static int pagemap_scan_pmd_test_walk(unsigned long start, unsigned long end, struct mm_walk *walk)
> +{
> +	struct pagemap_scan_private *p = walk->private;
> +	struct vm_area_struct *vma = walk->vma;
> +
> +	if (IS_GET_OP(p) && p->max_pages && (p->found_pages == p->max_pages))
> +		return -1;
> +
> +	if (vma->vm_flags & VM_PFNMAP)
> +		return 1;
> +
> +	return 0;
> +}
> +
> +static int add_to_out(bool sd, bool file, bool pres, bool swap, struct pagemap_scan_private *p,
> +		      unsigned long addr, unsigned int len)
> +{
> +	unsigned long bitmap, cur = sd | file << 1 | pres << 2 | swap << 3;

Should we define constants for each of these bits?

> +	bool cpy = true;
> +
> +	if (p->required_mask)
> +		cpy = ((p->required_mask & cur) == p->required_mask);
> +	if (cpy && p->anyof_mask)
> +		cpy = (p->anyof_mask & cur);
> +	if (cpy && p->excluded_mask)
> +		cpy = !(p->excluded_mask & cur);
> +
> +	bitmap = cur & p->return_mask;
> +
> +	if (cpy && bitmap) {
> +		if ((p->vec_index) && (p->vec[p->vec_index - 1].bitmap == bitmap) &&
> +		    (p->vec[p->vec_index - 1].start + p->vec[p->vec_index - 1].len * PAGE_SIZE ==
> +		     addr)) {

I think it is better to define a variable for p->vec_index - 1.

nit: len can be in bytes rather than pages.
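
For example, the merge case could become an early path (an untested
sketch, just to show the shape):

	if (p->vec_index) {
		struct page_region *prev = &p->vec[p->vec_index - 1];

		/* extend the previous region if it is adjacent and matching */
		if (prev->bitmap == bitmap &&
		    prev->start + prev->len * PAGE_SIZE == addr) {
			prev->len += len;
			p->found_pages += len;
			return 0;
		}
	}
	/* otherwise fall through and start a new region as before */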
> +			p->vec[p->vec_index - 1].len += len;
> +			p->found_pages += len;
> +		} else if (p->vec_index < p->vec_len) {
> +			p->vec[p->vec_index].start = addr;
> +			p->vec[p->vec_index].len = len;
> +			p->found_pages += len;
> +			p->vec[p->vec_index].bitmap = bitmap;
> +			p->vec_index++;
> +		} else {
> +			return -ENOMEM;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long addr,
> +				  unsigned long end, struct mm_walk *walk)
> +{
> +	struct pagemap_scan_private *p = walk->private;
> +	struct vm_area_struct *vma = walk->vma;
> +	unsigned int len;
> +	spinlock_t *ptl;
> +	int ret = 0;
> +	pte_t *pte;
> +	bool dirty_vma = (p->flags & PAGEMAP_NO_REUSED_REGIONS) ?
> +			 (false) : (vma->vm_flags & VM_SOFTDIRTY);
> +
> +	if ((walk->vma->vm_end < addr) || (p->max_pages && p->found_pages == p->max_pages))
> +		return 0;
> +
> +	end = min(end, walk->vma->vm_end);
> +
> +	ptl = pmd_trans_huge_lock(pmd, vma);
> +	if (ptl) {
> +		if (dirty_vma || check_soft_dirty_pmd(vma, addr, pmd, false)) {
> +			/*
> +			 * Break the huge page into small pages if the operation
> +			 * needs to be performed on a portion of the huge page or
> +			 * if the return buffer cannot store complete data.
> +			 */
> +			if ((IS_CLEAR_OP(p) && (end - addr < HPAGE_SIZE))) {
> +				spin_unlock(ptl);
> +				split_huge_pmd(vma, pmd, addr);
> +				goto process_smaller_pages;
> +			}
> +
> +			if (IS_GET_OP(p)) {
> +				len = (end - addr)/PAGE_SIZE;
> +				if (p->max_pages && p->found_pages + len > p->max_pages)
> +					len = p->max_pages - p->found_pages;
> +
> +				ret = add_to_out(dirty_vma ||
> +						 check_soft_dirty_pmd(vma, addr, pmd, false),

Can we reuse the return code of the previous call of check_soft_dirty_pmd()?

> +						 vma->vm_file, pmd_present(*pmd), is_swap_pmd(*pmd),
> +						 p, addr, len);
> +			}
> +			if (!ret && IS_CLEAR_OP(p))
> +				check_soft_dirty_pmd(vma, addr, pmd, true);

Should we return an error in this case? We need to be sure that:
* we stop walking page tables after this point;
* we return this error to user space if we were not able to add anything
  to the vector.
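
Something along these lines, perhaps (an untested sketch that reuses the
first check_soft_dirty_pmd() result and propagates the error; the
split_huge_pmd() path is left aside for brevity):

	ptl = pmd_trans_huge_lock(pmd, vma);
	if (ptl) {
		bool dirty = dirty_vma || check_soft_dirty_pmd(vma, addr, pmd, false);

		if (dirty && IS_GET_OP(p)) {
			len = (end - addr) / PAGE_SIZE;
			if (p->max_pages && p->found_pages + len > p->max_pages)
				len = p->max_pages - p->found_pages;

			ret = add_to_out(dirty, vma->vm_file, pmd_present(*pmd),
					 is_swap_pmd(*pmd), p, addr, len);
		}
		if (dirty && !ret && IS_CLEAR_OP(p))
			check_soft_dirty_pmd(vma, addr, pmd, true);

		spin_unlock(ptl);
		/* a non-zero ret stops the walk and is returned to the caller */
		return ret;
	}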
> +		}
> +		spin_unlock(ptl);
> +		return 0;
> +	}
> +
> +process_smaller_pages:
> +	if (pmd_trans_unstable(pmd))
> +		return 0;
> +
> +	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> +	for (; addr < end && !ret && (!p->max_pages || (p->found_pages < p->max_pages))
> +	     ; pte++, addr += PAGE_SIZE) {
> +		if (IS_GET_OP(p))
> +			ret = add_to_out(dirty_vma || check_soft_dirty(vma, addr, pte, false),
> +					 vma->vm_file, pte_present(*pte),
> +					 is_swap_pte(*pte), p, addr, 1);
> +		if (!ret && IS_CLEAR_OP(p))
> +			check_soft_dirty(vma, addr, pte, true);
> +	}
> +	pte_unmap_unlock(pte - 1, ptl);
> +	cond_resched();
> +
> +	return 0;
> +}
> +
> +static int pagemap_scan_pte_hole(unsigned long addr, unsigned long end, int depth,
> +				 struct mm_walk *walk)
> +{
> +	struct pagemap_scan_private *p = walk->private;
> +	struct vm_area_struct *vma = walk->vma;
> +	unsigned int len;
> +	bool sd;
> +
> +	if (vma) {
> +		/* Individual pages haven't been allocated and written */
> +		sd = (p->flags & PAGEMAP_NO_REUSED_REGIONS) ? (false) :
> +		     (vma->vm_flags & VM_SOFTDIRTY);
> +
> +		len = (end - addr)/PAGE_SIZE;
> +		if (p->max_pages && p->found_pages + len > p->max_pages)
> +			len = p->max_pages - p->found_pages;
> +
> +		add_to_out(sd, vma->vm_file, false, false, p, addr, len);
> +	}
> +
> +	return 0;
> +}
> +
> +#ifdef CONFIG_MEM_SOFT_DIRTY
> +static int pagemap_scan_pre_vma(unsigned long start, unsigned long end, struct mm_walk *walk)
> +{
> +	struct pagemap_scan_private *p = walk->private;
> +	struct vm_area_struct *vma = walk->vma;
> +	unsigned long end_cut = end;
> +	int ret;
> +
> +	if (!(p->flags & PAGEMAP_NO_REUSED_REGIONS) && IS_CLEAR_OP(p) &&
> +	    (vma->vm_flags & VM_SOFTDIRTY)) {
> +		if (vma->vm_start < start) {
> +			ret = split_vma(vma->vm_mm, vma, start, 1);
> +			if (ret)
> +				return ret;
> +		}
> +		/* Calculate end_cut because of max_pages */
> +		if (IS_GET_OP(p) && p->max_pages)
> +			end_cut = min(start + (p->max_pages - p->found_pages) * PAGE_SIZE, end);
> +
> +		if (vma->vm_end > end_cut) {
> +			ret = split_vma(vma->vm_mm, vma, end_cut, 0);
> +			if (ret)
> +				return ret;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +static void pagemap_scan_post_vma(struct mm_walk *walk)
> +{
> +	struct pagemap_scan_private *p = walk->private;
> +	struct vm_area_struct *vma = walk->vma;
> +
> +	if (!(p->flags & PAGEMAP_NO_REUSED_REGIONS) && IS_CLEAR_OP(p) &&
> +	    (vma->vm_flags & VM_SOFTDIRTY)) {
> +		vma->vm_flags &= ~VM_SOFTDIRTY;
> +		vma_set_page_prot(vma);
> +	}
> +}
> +#endif /* CONFIG_MEM_SOFT_DIRTY */
> +
> +static const struct mm_walk_ops pagemap_scan_ops = {
> +	.test_walk = pagemap_scan_pmd_test_walk,
> +	.pmd_entry = pagemap_scan_pmd_entry,
> +	.pte_hole = pagemap_scan_pte_hole,
> +
> +#ifdef CONFIG_MEM_SOFT_DIRTY
> +	/* Only for clearing SD bit over VMAs */
> +	.pre_vma = pagemap_scan_pre_vma,
> +	.post_vma = pagemap_scan_post_vma,
> +#endif /* CONFIG_MEM_SOFT_DIRTY */
> +};
> +
> +static long do_pagemap_sd_cmd(struct mm_struct *mm, struct pagemap_scan_arg *arg)
> +{
> +	struct mmu_notifier_range range;
> +	unsigned long __user start, end;
> +	struct pagemap_scan_private p;
> +	int ret;
> +
> +	start = (unsigned long)untagged_addr(arg->start);
> +	if ((!IS_ALIGNED(start, PAGE_SIZE)) || (!access_ok((void __user *)start, arg->len)))
> +		return -EINVAL;
> +
> +	if (IS_GET_OP(arg) &&
> +	    ((arg->vec_len == 0) || (!access_ok((struct page_region *)arg->vec, arg->vec_len))))
> +		return -ENOMEM;
> +
> +#ifndef CONFIG_MEM_SOFT_DIRTY
> +	if (IS_SD_OP(arg) || (arg->required_mask & PAGE_IS_SOFTDIRTY) ||
> +	    (arg->anyof_mask & PAGE_IS_SOFTDIRTY))
> +		return -EINVAL;
> +#endif
> +
> +	if ((arg->flags & ~PAGEMAP_SD_FLAGS) || (arg->required_mask & ~PAGEMAP_OP_MASK) ||
> +	    (arg->anyof_mask & ~PAGEMAP_OP_MASK) || (arg->excluded_mask & ~PAGEMAP_OP_MASK) ||
> +	    (arg->return_mask & ~PAGEMAP_OP_MASK))
> +		return -EINVAL;
> +
> +	if ((!arg->required_mask && !arg->anyof_mask && !arg->excluded_mask) || !arg->return_mask)
> +		return -EINVAL;
> +
> +	if (IS_SD_OP(arg) && ((arg->required_mask & PAGEMAP_NONSD_OP_MASK) ||
> +	    (arg->anyof_mask & PAGEMAP_NONSD_OP_MASK)))
> +		return -EINVAL;
> +
> +	end = start + arg->len;
> +	p.max_pages = arg->max_pages;
> +	p.found_pages = 0;
> +	p.flags = arg->flags;
> +	p.required_mask = arg->required_mask;
> +	p.anyof_mask = arg->anyof_mask;
> +	p.excluded_mask = arg->excluded_mask;
> +	p.return_mask = arg->return_mask;
> +	p.vec_index = 0;
> +	p.vec_len = arg->vec_len;
> +
> +	if (IS_GET_OP(arg)) {
> +		p.vec = vzalloc(arg->vec_len * sizeof(struct page_region));

I think we need to set a reasonable limit for vec_len to avoid large
allocations in the kernel. We can consider using kmalloc or kvmalloc here.
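
For illustration, a rough sketch of what I mean (untested; the cap of 512
entries is an arbitrary example, not a recommendation):

	/* hypothetical limit on the user-supplied vector length */
	#define PAGEMAP_SCAN_MAX_VEC_LEN	512

	if (arg->vec_len > PAGEMAP_SCAN_MAX_VEC_LEN)
		return -EINVAL;

	p.vec = kvzalloc(array_size(arg->vec_len, sizeof(struct page_region)),
			 GFP_KERNEL);
	if (!p.vec)
		return -ENOMEM;

with the matching free switched to kvfree(), of course.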
Thanks,
Andrei