From: Muhammad Usama Anjum <usama.anjum@collabora.com>
Date: Sun, 13 Aug 2023 13:14:28 +0500
Subject: Re: [PATCH v29 2/6] fs/proc/task_mmu: Implement IOCTL to get and optionally clear info about PTEs
To: Michał Mirosław
Cc: Muhammad Usama Anjum, Peter Xu, David Hildenbrand, Andrew Morton, Andrei Vagin, Danylo Mocherniuk, Paul Gofman, Cyrill Gorcunov, Mike Rapoport, Nadav Amit, Alexander Viro, Shuah Khan, Christian Brauner, Yang Shi, Vlastimil Babka, Liam R. Howlett, Yun Zhou, Suren Baghdasaryan, Alex Sierra, Matthew Wilcox, Pasha Tatashin, Axel Rasmussen, Gustavo A. R. Silva, Dan Williams, Greg KH, kernel@collabora.com, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kselftest@vger.kernel.org
References: <20230811180842.3141781-1-usama.anjum@collabora.com> <20230811180842.3141781-3-usama.anjum@collabora.com>
On 8/12/23 8:49 PM, Michał Mirosław wrote:
> On Fri, Aug 11, 2023 at 11:08:38PM +0500, Muhammad Usama Anjum wrote:
>> The PAGEMAP_SCAN IOCTL on the pagemap file can be used to get or optionally
>> clear the info about page table entries. The following operations are supported
>> in this IOCTL:
>> - Scan the address range and get the memory ranges matching the provided criteria.
>>   This is performed by default when the output buffer is specified.
>
> Nit: This is actually performed always, but you can disable the output part
> by passing {NULL, 0} for the buffer.
I'll update it to: "This is performed when the output buffer is specified."

> [...]
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
>> @@ -19,6 +19,8 @@
>>  #include
>>  #include
>>  #include
>> +#include
>> +#include
>>
>>  #include
>>  #include
>> @@ -1749,11 +1751,682 @@ static int pagemap_release(struct inode *inode, struct file *file)
>>  	return 0;
>>  }
>>
>> +#define PM_SCAN_CATEGORIES	(PAGE_IS_WPALLOWED | PAGE_IS_WRITTEN | \
>> +				 PAGE_IS_FILE | PAGE_IS_PRESENT | \
>> +				 PAGE_IS_SWAPPED | PAGE_IS_PFNZERO | \
>> +				 PAGE_IS_HUGE)
>> +#define PM_SCAN_FLAGS	(PM_SCAN_WP_MATCHING | PM_SCAN_CHECK_WPASYNC)
>> +
>> +struct pagemap_scan_private {
>> +	struct pm_scan_arg arg;
>> +	unsigned long masks_of_interest, cur_vma_category;
>> +	struct page_region *vec_buf;
>> +	unsigned long vec_buf_len, vec_buf_index, found_pages, walk_end_addr;
>> +	struct page_region __user *vec_out;
>> +};
> [...]
>> +static unsigned long pagemap_thp_category(pmd_t pmd)
>> +{
>> +	unsigned long categories = PAGE_IS_HUGE;
>> +
>> +	/*
>> +	 * THPs don't support file-backed memory. So PAGE_IS_FILE
>> +	 * hasn't been checked here.
>
> "hasn't been" -> "is not"
> (same for HugeTLB comment)

I'll update.

>
>> +static bool pagemap_scan_push_range(unsigned long categories,
>> +				    struct pagemap_scan_private *p,
>> +				    unsigned long addr, unsigned long end)
>> +{
>> +	struct page_region *cur_buf = &p->vec_buf[p->vec_buf_index];
>> +
>> +	/*
>> +	 * When there is no output buffer provided at all, the sentinel values
>> +	 * won't match here. There is no other way for `cur_buf->end` to be
>> +	 * non-zero other than it being non-empty.
>> +	 */
>> +	if (addr == cur_buf->end && categories == cur_buf->categories) {
>> +		cur_buf->end = end;
>> +		return true;
>> +	}
>> +
>> +	if (cur_buf->end) {
>> +		if (p->vec_buf_index >= p->vec_buf_len - 1)
>> +			return false;
>> +
>> +		cur_buf = &p->vec_buf[++p->vec_buf_index];
>> +	}
>> +
>> +	cur_buf->start = addr;
>> +	cur_buf->end = end;
>> +	cur_buf->categories = categories;
>> +
>> +	return true;
>> +}
>> +
>> +static void pagemap_scan_backout_range(struct pagemap_scan_private *p,
>> +				       unsigned long addr, unsigned long end)
>> +{
>> +	struct page_region *cur_buf = &p->vec_buf[p->vec_buf_index];
>> +
>> +	if (cur_buf->start != addr) {
>> +		cur_buf->end = addr;
>> +	} else {
>> +		cur_buf->start = cur_buf->end = 0;
>> +		if (p->vec_buf_index > 0)
>> +			p->vec_buf_index--;
>
> There is no need to move to the previous index, as if the walk ends at
> this moment, the flush_buffer() code will ignore the empty last range.

Yeah, I'll update.

>
>> +	}
>> +
>> +	p->found_pages -= (end - addr) / PAGE_SIZE;
>> +}
>> +
>> +static int pagemap_scan_output(unsigned long categories,
>> +			       struct pagemap_scan_private *p,
>> +			       unsigned long addr, unsigned long *end)
>> +{
>> +	unsigned long n_pages, total_pages;
>> +	int ret = 0;
>> +
>> +	if (!p->vec_buf)
>> +		return 0;
>> +
>> +	categories &= p->arg.return_mask;
>> +
>> +	n_pages = (*end - addr) / PAGE_SIZE;
>> +	if (check_add_overflow(p->found_pages, n_pages, &total_pages) ||
>> +	    total_pages > p->arg.max_pages) {
>> +		size_t n_too_much = total_pages - p->arg.max_pages;
>> +		*end -= n_too_much * PAGE_SIZE;
>> +		n_pages -= n_too_much;
>> +		ret = -ENOSPC;
>> +	}
>> +
>> +	if (!pagemap_scan_push_range(categories, p, addr, *end)) {
>> +		*end = addr;
>> +		n_pages = 0;
>> +		ret = -ENOSPC;
>> +	}
>> +
>> +	p->found_pages += n_pages;
>> +	if (ret)
>> +		p->walk_end_addr = *end;
>> +
>> +	return ret;
>> +}
> [...]
>> +static int pagemap_scan_init_bounce_buffer(struct pagemap_scan_private *p)
>> +{
>> +	if (!p->arg.vec_len)
>> +		return 0;
>
> The removal of `cur_buf` lost the case of empty non-NULL output buffer
> passed in args. That was requesting the walk to stop at first matching
> page (with the address returned in `walk_end`). The push_range() call
> is still checking that, but since neither the buffer nor the sentinel
> values are set, the case is not possible to invoke.

Yeah, this is why I've removed all that logic here. The vec_len is set to 0
and vec_buf to NULL. This handles all the cases.

>
>> +	/*
>> +	 * Allocate a smaller buffer to get output from inside the page
>> +	 * walk functions and walk the range in PAGEMAP_WALK_SIZE chunks.
>> +	 */
>
> I think this is no longer true? We can now allocate arbitrary number of
> entries, but should probably have at least 512 to cover one PMD of pages.
> So it would be better to have a constant that holds the number of
> entries in the bounce buffer.

I'll remove the comment. PAGEMAP_WALK_SIZE >> PAGE_SHIFT is a constant
already, just a fancy one. Although if we can increase 512 to a bigger
number, it'll be better in terms of performance. I'm not sure how much we
can increase it.

>
>> +	p->vec_buf_len = min_t(size_t, PAGEMAP_WALK_SIZE >> PAGE_SHIFT,
>> +			       p->arg.vec_len);
>> +	p->vec_buf = kmalloc_array(p->vec_buf_len, sizeof(*p->vec_buf),
>> +				   GFP_KERNEL);
>> +	if (!p->vec_buf)
>> +		return -ENOMEM;
>> +
>> +	p->vec_buf[0].end = 0;
>
> p->vec_buf->start = p->vec_buf->end = 0;

Sure.

>
>> +	p->vec_out = (struct page_region __user *)p->arg.vec;
>> +
>> +	return 0;
>> +}
>> +
>> +static int pagemap_scan_flush_buffer(struct pagemap_scan_private *p)
>> +{
>> +	const struct page_region *buf = p->vec_buf;
>> +	int n = (int)p->vec_buf_index;
>
> Why do you need an `int` here (requiring a cast)?

Just looked at it, n and the return code of pagemap_scan_flush_buffer()
should be long. Changed.
>
>> +	if (p->arg.vec_len == 0)
>> +		return 0;
>
> This should be actually `if (!buf)` as this notes that we don't have any
> buffer allocated (due to no output requested).

I'll update. !buf seems more reasonable.

>
>> +	if (buf[n].end && buf[n].end != buf[n].start)
>> +		n++;
>
> Testing `buf[n].end` is redundant, as the range is nonempty if
> `end != start`.

Sure.

>
>> +	if (!n)
>> +		return 0;
>> +
>> +	if (copy_to_user(p->vec_out, buf, n * sizeof(*buf)))
>> +		return -EFAULT;
>> +
>> +	p->arg.vec_len -= n;
>> +	p->vec_out += n;
>> +
>> +	p->vec_buf_index = 0;
>> +	p->vec_buf_len = min_t(size_t, p->vec_buf_len, p->arg.vec_len);
>> +	p->vec_buf[0].end = 0;
>
> buf->start = buf->end = 0;

Sure.

>
>> +	return n;
>> +}
>> +
>> +static long do_pagemap_scan(struct mm_struct *mm, unsigned long uarg)
>> +{
>> +	struct mmu_notifier_range range;
>> +	struct pagemap_scan_private p = {0};
>> +	unsigned long walk_start;
>> +	size_t n_ranges_out = 0;
>> +	int ret;
>> +
>> +	ret = pagemap_scan_get_args(&p.arg, uarg);
>> +	if (ret)
>> +		return ret;
>> +
>> +	p.masks_of_interest = p.arg.category_inverted | p.arg.category_mask |
>> +			      p.arg.category_anyof_mask | p.arg.return_mask;
>
> `category_inverted` can be left out, because if a set bit is not also in one
> of the masks, then its value is going to be ignored.

Okay.

>
> [...]
>> +	for (walk_start = p.arg.start; walk_start < p.arg.end;
>> +	     walk_start = p.arg.walk_end) {
>> +		int n_out;
>> +
>> +		if (fatal_signal_pending(current)) {
>> +			ret = -EINTR;
>> +			break;
>> +		}
>> +
>> +		ret = mmap_read_lock_killable(mm);
>> +		if (ret)
>> +			break;
>> +		ret = walk_page_range(mm, walk_start, p.arg.end,
>> +				      &pagemap_scan_ops, &p);
>> +		mmap_read_unlock(mm);
>> +
>> +		n_out = pagemap_scan_flush_buffer(&p);
>> +		if (n_out < 0)
>> +			ret = n_out;
>> +		else
>> +			n_ranges_out += n_out;
>> +
>> +		p.walk_end_addr = p.walk_end_addr ? p.walk_end_addr : p.arg.end;
>
> Why is `p.walk_end_addr` needed? It is not used in the loop code. Shouldn't
> it be `p.arg.walk_end` as used in the `for` loop continuation statement?

It isn't needed for the loop. But we need to note down the ending address of
the walk. We can switch to using p.arg.walk_end, which makes more sense
logically. I'll update the code.

>
>> +		if (ret != -ENOSPC || p.arg.vec_len == 0 ||
>> +		    p.found_pages == p.arg.max_pages)
>> +			break;
>
> Nit: I think you could split this into two or three separate `if (x)
> break;` for easier reading. The `vec_len` and `found_pages` are
> buffer-full tests, so could go along, but `ret != ENOSPC` is checking an
> error condition aborting the scan before it ends.

Can be done.

>
>> +	}
>> +
>> +	/* ENOSPC signifies early stop (buffer full) from the walk. */
>> +	if (!ret || ret == -ENOSPC)
>> +		ret = n_ranges_out;
>> +
>> +	p.arg.walk_end = p.walk_end_addr ? p.walk_end_addr : walk_start;
>> +	if (pagemap_scan_writeback_args(&p.arg, uarg))
>> +		ret = -EFAULT;
> [...]
>> --- a/include/uapi/linux/fs.h
>> +++ b/include/uapi/linux/fs.h
> [...]
>> +/*
>> + * struct pm_scan_arg - Pagemap ioctl argument
>> + * @size:		Size of the structure
>> + * @flags:		Flags for the IOCTL
>> + * @start:		Starting address of the region
>> + * @end:		Ending address of the region
>> + * @walk_end		Address where the scan stopped (written by kernel).
>> + *			walk_end == end (tag removed) informs that the scan completed on entire range.
>
> I'm not sure `tag removed` is enough to know what tag was removed.
> Maybe something like "with address tags cleared" would fit?

Okay.

>
> Best Regards
> Michał Mirosław

--
BR,
Muhammad Usama Anjum