From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 44202EB64D9 for ; Thu, 15 Jun 2023 15:12:16 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 8AF4D6B0074; Thu, 15 Jun 2023 11:12:15 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 85F158E0003; Thu, 15 Jun 2023 11:12:15 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 74D6C8E0002; Thu, 15 Jun 2023 11:12:15 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12]) by kanga.kvack.org (Postfix) with ESMTP id 676246B0074 for ; Thu, 15 Jun 2023 11:12:15 -0400 (EDT) Received: from smtpin23.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay05.hostedemail.com (Postfix) with ESMTP id 0DC6A40948 for ; Thu, 15 Jun 2023 15:12:15 +0000 (UTC) X-FDA: 80905323030.23.9C03026 Received: from madras.collabora.co.uk (madras.collabora.co.uk [46.235.227.172]) by imf09.hostedemail.com (Postfix) with ESMTP id B5F1414002E for ; Thu, 15 Jun 2023 15:11:56 +0000 (UTC) Authentication-Results: imf09.hostedemail.com; dkim=pass header.d=collabora.com header.s=mail header.b=cvv4yiSD; dmarc=pass (policy=quarantine) header.from=collabora.com; spf=pass (imf09.hostedemail.com: domain of usama.anjum@collabora.com designates 46.235.227.172 as permitted sender) smtp.mailfrom=usama.anjum@collabora.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1686841917; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=T9NZ7xgu3m48edmJD5B4kCODSvBtFQEyUkZiXPSnSrk=; b=W8CqIM2tuBbuvOXQSwoJHd57us72xBCpvoVtaEFTDKnoTYgXV0c8RbXtP03baQ8cshwKMU A+J2JHOOnRfllbpbnFXknnyyCjjv/3a489Z5lnF9JiuseUZMSpNpnrBqapKIB6aPlI+6aY Co1RW1DA7wJImP1FCca0Qs7F/nRrENE= ARC-Authentication-Results: i=1; imf09.hostedemail.com; dkim=pass header.d=collabora.com header.s=mail header.b=cvv4yiSD; dmarc=pass (policy=quarantine) header.from=collabora.com; spf=pass (imf09.hostedemail.com: domain of usama.anjum@collabora.com designates 46.235.227.172 as permitted sender) smtp.mailfrom=usama.anjum@collabora.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1686841917; a=rsa-sha256; cv=none; b=TvBolHi/QBKmvLA0G04F1rD22fH/uiZvaxjikrMyz1LymEWf7YHNacHoMV+Mud6VRZQWEf YZrWjq/ezzWEj0+g2C4oKy5BPCG6Bqh5+Dh1u3JB5qF2GZzJf9ZT2Ac+PIG2giWCRq74dl 3GJyYSTPmdIGfC5sImdoN9O+AyPRbB0= Received: from [192.168.10.55] (unknown [119.155.33.163]) (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) (Authenticated sender: usama.anjum) by madras.collabora.co.uk (Postfix) with ESMTPSA id 3C6876606F67; Thu, 15 Jun 2023 16:11:47 +0100 (BST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=collabora.com; s=mail; t=1686841914; bh=3MdXnp0rnSSfhyFgYUFW/54WeVSBpiq5xrBJ+WJvLTA=; h=Date:Cc:Subject:To:References:From:In-Reply-To:From; b=cvv4yiSDjEqgmaVk30lc/A1N18hlWQXP3t4XFBlUoaAfMgs9xX8+1fJ5yWk2NNoZj AJ5m9Lp/q9ZnW30G5WveI2rouU5wWSBuXVAsJKd4oImv66qsD9JWkacs4Vtd2rrkZJ cc1/RWaVKeHBPd9N62Ee7eClRJfHdeNtDZx6GTCQLp5cwipiuxWKv/lrPHoE99BxVI dOAQjfANwl/WlLvqh572WAxqxz4Ih3vvdqLhLl78MhDSDLAzyi00oQ2L+BsoxafukV 42aI37Al5R1KTGWwek9J/wPw+dDcRmAewfO6178tNZDU9KlbEKJ22FKmqWZBvzHFTc JP+tS0tnOv6Bw== Message-ID: <39bc8212-9ee8-dbc1-d468-f6be438b683b@collabora.com> Date: Thu, 15 Jun 2023 20:11:42 +0500 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.11.0 Cc: Muhammad Usama Anjum , Peter Xu , David Hildenbrand , Andrew Morton , Andrei Vagin , Danylo Mocherniuk , Paul Gofman , Cyrill Gorcunov , Mike Rapoport , Nadav Amit , Alexander Viro , Shuah Khan , Christian Brauner , Yang Shi , Vlastimil Babka , "Liam R . Howlett" , Yun Zhou , Suren Baghdasaryan , Alex Sierra , Matthew Wilcox , Pasha Tatashin , Axel Rasmussen , "Gustavo A . R . Silva" , Dan Williams , linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kselftest@vger.kernel.org, Greg KH , kernel@collabora.com Subject: Re: [PATCH v18 2/5] fs/proc/task_mmu: Implement IOCTL to get and optionally clear info about PTEs Content-Language: en-US To: =?UTF-8?B?TWljaGHFgiBNaXJvc8WCYXc=?= References: <20230613102905.2808371-1-usama.anjum@collabora.com> <20230613102905.2808371-3-usama.anjum@collabora.com> <0db01d90-09d6-08a4-bbb8-70670d3baa94@collabora.com> <34203acf-7270-7ade-a60e-ae0f729dcf70@collabora.com> <96b7cc00-d213-ad7d-1b48-b27f75b04d22@collabora.com> From: Muhammad Usama Anjum In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-Rspam-User: X-Stat-Signature: t79spk9tfbpbta3s1b4augjh1srijio7 X-Rspamd-Server: rspam07 X-Rspamd-Queue-Id: B5F1414002E X-HE-Tag: 1686841916-198192 X-HE-Meta: U2FsdGVkX1+A0kYOFkF7fAAkMszoRG6EsyQiJNbR+SMusmU7OeBjgFrts+R9DijcyhX67wDCQi5kuQY6zYbKgd18amU360OxDGDZH/+zcZYoP8V6TmwrU7P/j4iifKkVXxpENJprgMEVhj/jpLXW0KDyNsEtVyFYX8n7QYFkjUf8pl/vWo/ZNRd2vMUg83JF4nKN9f+PcePEKMXXMxNQXsGZzvkLoUGdYAILuHF9ldqqkxLTo5HVYW3EJu7D9EDgZTj9BXfr7gb3bQG9kFrlMuvie6/YlB9dT0hWeSZFExRBpkPqnXCzJtiZNF0JMoM/2HLvqtugXAc2jl1nzEydKKe5vF0Ddqxf4Cbw42AB/VMvY9WxGTJcg9MPZa5B/uDvuwEJZgs3wMg9Z/76Uv8du7sxqg56z8o3+tSJkwqob6UCBCDKsk8J/LIDuIC/WvCD9oQkkieH3S/R/hwusNHAHN5dlfjMNivtNADKrXkeOFzl/WCU9+hcK2LPInWcsL0xKzmA8PrpyVgFiT8sbVCPagXOuuV9oV98zMYK/0ZYpP7+u2g1b/yLWkqUd/UOmjhvXUYd4IxStuEVBzkP9d2351nMhqI/g+/WP6hDdR07HPO98E2wXIIf7ao9v8MwDF2nSqYBdEVUguA633TDEZOQdtd0CF3R/bCK7K1+eVr0dkmDAd/u8finzaFRJXzfWIj4p0jQNtrcpMdJ1JyPo8ECAT67MxRZxF1/jzmx/hvO2/nufMoDYn3aiEKSlrwJKHyeDoV0eSvMIW8Qgsu7TFKAotsvzpWG5G/OPgrleHAD7TPaRi+D8KAmBQSJAO5DyUYPn9RPpswkSwwL21S5Y2Q46ChTOfrFJLbHqwkYHYVdS1eDyHjOwHU5+4f88Nc3V+TARC/cHAAWLN8OnAHLzghPilf0jnbt2jN7OIthsgudrBix3eQ5o0kvm+IffGLh1Tq5ES7dIWhVYsmYqLSmA6z 9C/AlWzF waZGUtz8duRkDdpWYc/8jnStrXM9x+Ed4yvFC/K4nmJuwJ6TcGHaNMIkWmLwj+YufZrOZrlREv5PaLXmxWLkL1+HrsKjkHgSDeRZWWdA3S+5gMKnSosdNsO/dptpvwdGvh7zyRjAsVtf42tVueCqdLOYrUTFS+Lyqk9KXo1taQtwxCSl0RVH+eNc+pEIqV8LTRdxIHs4PjGHQubgyK15ZsYuufPPihaLgoaAf/8sShKvMobPk4zlBSOUr23neulCzl5u97/wuugIOPFc5kmsTfwSru/VAntppTFhNCQEp2s8+bmhl1noTibfA9N1gD/t0imupOzv3gaSqKMrUEpt7qRdU+FTP9xSh6H2puM79SiKnda0= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On 6/15/23 7:52 PM, Michał Mirosław wrote: > On Thu, 15 Jun 2023 at 15:58, Muhammad Usama Anjum > wrote: >> I'll send next revision now. >> On 6/14/23 11:00 PM, Michał Mirosław wrote: >>> (A quick reply to answer open questions in case they help the next version.) >>> >>> On Wed, 14 Jun 2023 at 19:10, Muhammad Usama Anjum >>> wrote: >>>> On 6/14/23 8:14 PM, Michał Mirosław wrote: >>>>> On Wed, 14 Jun 2023 at 15:46, Muhammad Usama Anjum >>>>> wrote: >>>>>> >>>>>> On 6/14/23 3:36 AM, Michał Mirosław wrote: >>>>>>> On Tue, 13 Jun 2023 at 12:29, Muhammad Usama Anjum >>>>>>> wrote: >>> [...] >>>>>>>> + if (cur_buf->bitmap == bitmap && >>>>>>>> + cur_buf->start + cur_buf->len * PAGE_SIZE == addr) { >>>>>>>> + cur_buf->len += n_pages; >>>>>>>> + p->found_pages += n_pages; >>>>>>>> + } else { >>>>>>>> + if (cur_buf->len && p->vec_buf_index >= p->vec_buf_len) >>>>>>>> + return -ENOMEM; >>>>>>> >>>>>>> Shouldn't this be -ENOSPC? -ENOMEM usually signifies that the kernel >>>>>>> ran out of memory when allocating, not that there is no space in a >>>>>>> user-provided buffer. >>>>>> There are 3 kinds of return values here: >>>>>> * PM_SCAN_FOUND_MAX_PAGES (1) ---> max_pages have been found. Abort the >>>>>> page walk from next entry >>>>>> * 0 ---> continue the page walk >>>>>> * -ENOMEM --> Abort the page walk from current entry, user buffer is full >>>>>> which is not error, but only a stop signal. This -ENOMEM is just >>>>>> differentiater from (1). This -ENOMEM is for internal use and isn't >>>>>> returned to user. >>>>> >>>>> But why ENOSPC is not good here? I was used before, I think. >>>> -ENOSPC is being returned in form of true error from >>>> pagemap_scan_hugetlb_entry(). So I'd to remove -ENOSPC from here as it >>>> wasn't true error here, it was only a way to abort the walk immediately. >>>> I'm liking the following erturn code from here now: >>>> >>>> #define PM_SCAN_BUFFER_FULL (-256) >>> >>> I guess this will be reworked anyway, but I'd prefer this didn't need >>> custom errors etc. If we agree to decoupling the selection and GET >>> output, it could be: >>> >>> bool is_interesting_page(p, flags); // this one does the >>> required/anyof/excluded match >>> size_t output_range(p, start, len, flags); // this one fills the >>> output vector and returns how many pages were fit >>> >>> In this setup, `is_interesting_page() && (n_out = output_range()) < >>> n_pages` means this is the final range, no more will fit. And if >>> `n_out == 0` then no pages fit and no WP is needed (no other special >>> cases). >> Right now, pagemap_scan_output() performs the work of both of these two >> functions. The part can be broken into is_interesting_pages() and we can >> leave the remaining part as it is. >> >> Saying that n_out < n_pages tells us the buffer is full covers one case. >> But there is case of maximum pages have been found and walk needs to be >> aborted. > > This case is exactly what `n_out < n_pages` will cover (if scan_output > uses max_pages properly to limit n_out). > Isn't it that when the buffer is full we want to abort the scan always > (with WP if `n_out > 0`)? Wouldn't it be duplication of condition if buffer is full inside pagemap_scan_output() and just outside it. Inside pagemap_scan_output() we check if we have space before putting data inside it. I'm using this same condition to indicate that buffer is full. > >>>>>>> For flags name: PM_REQUIRE_WRITE_ACCESS? >>>>>>> Or Is it intended to be checked only if doing WP (as the current name >>>>>>> suggests) and so it would be redundant as WP currently requires >>>>>>> `p->required_mask = PAGE_IS_WRITTEN`? >>>>>> This is intended to indicate that if userfaultfd is needed. If >>>>>> PAGE_IS_WRITTEN is mentioned in any of mask, we need to check if >>>>>> userfaultfd has been initialized for this memory. I'll rename to >>>>>> PM_SCAN_REQUIRE_UFFD. >>>>> >>>>> Why do we need that check? Wouldn't `is_written = false` work for vmas >>>>> not registered via uffd? >>>> UFFD_FEATURE_WP_ASYNC and UNPOPULATED needs to be set on the memory region >>>> for it to report correct written values on the memory region. Without UFFD >>>> WP ASYNC and UNPOUPULATED defined on the memory, we consider UFFD_WP state >>>> undefined. If user hasn't initialized memory with UFFD, he has no right to >>>> set is_written = false. >>> >>> How about calculating `is_written = is_uffd_registered() && >>> is_uffd_wp()`? This would enable a user to apply GET+WP for the whole >>> address space of a process regardless of whether all of it is >>> registered. >> I wouldn't want to check if uffd is registered again and again. This is why >> we are doing it only once every walk in pagemap_scan_test_walk(). > > There is no need to do the checks repeatedly. If I understand the code > correctly, uffd registration is per-vma, so it can be communicated > from test_walk to entry/hole callbacks via a field in > pagemap_scan_private. > >>>>> While here, I wonder if we really need to fail the call if there are >>>>> unknown bits in those masks set: if this bit set is expanded with >>>>> another category flags, a newer userspace run on older kernel would >>>>> get EINVAL even if the "treat unknown as 0" be what it requires. >>>>> There is no simple way in the API to discover what bits the kernel >>>>> supports. We could allow a no-op (no WP nor GET) call to help with >>>>> that and then rejecting unknown bits would make sense. >>>> I've not seen any examples of this. But I've seen examples of returning >>>> error if kernel doesn't support a feature. Each new feature comes with a >>>> kernel version, greater than this version support this feature. If user is >>>> trying to use advanced feature which isn't present in a kernel, we should >>>> return error and not proceed to confuse the user/kernel. In fact if we look >>>> at userfaultfd_api(), we return error immediately if feature has some bit >>>> set which kernel doesn't support. >>> >>> I think we should have a way of detecting the supported flags if we >>> don't want a forward compatibility policy for flags here. Maybe it >>> would be enough to allow all the no-op combinations for this purpose? >> Again I don't think UFFD is doing anything like this. > > If it's cheap and easy to provide a user with a way to detect the > supported features - why not do it? I'm sorry. But it would bring up something new and iterations will be needed to just play around. I like the UFFD way. > > Best Regards > Michał Mirosław -- BR, Muhammad Usama Anjum