From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 87A0EC74A5B for ; Fri, 17 Mar 2023 14:15:56 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id DE8716B0075; Fri, 17 Mar 2023 10:15:55 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id D99426B0078; Fri, 17 Mar 2023 10:15:55 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id BEB4A6B007B; Fri, 17 Mar 2023 10:15:55 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17]) by kanga.kvack.org (Postfix) with ESMTP id AAD386B0075 for ; Fri, 17 Mar 2023 10:15:55 -0400 (EDT) Received: from smtpin05.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay04.hostedemail.com (Postfix) with ESMTP id 760751A1966 for ; Fri, 17 Mar 2023 14:15:55 +0000 (UTC) X-FDA: 80578589070.05.17E55CF Received: from mail-ed1-f48.google.com (mail-ed1-f48.google.com [209.85.208.48]) by imf01.hostedemail.com (Postfix) with ESMTP id 623F24001F for ; Fri, 17 Mar 2023 14:15:53 +0000 (UTC) Authentication-Results: imf01.hostedemail.com; dkim=pass header.d=google.com header.s=20210112 header.b=W4zjRDSR; spf=pass (imf01.hostedemail.com: domain of emmir@google.com designates 209.85.208.48 as permitted sender) smtp.mailfrom=emmir@google.com; dmarc=pass (policy=reject) header.from=google.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1679062553; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=5tBpag1M9/QK3J44nio5y/z0shlcd59R/V5AkvISOo0=; b=q3R5sAz/XyBIEFcDtCaH4Vxfo3Q+4Vc+MYBkrmPARwCK7UfRK0DdgGA02es/jgPPxRU289 ZxAZdTXCZJ8n+GQuNLsNZHSFsnIu0GvLSCaZtLPva9myvr7ioECJ6WFn+pDwvVOD3SxWBD 1eYXSXU2U+aZ5AqfNSym+P3E6MkCDTM= ARC-Authentication-Results: i=1; imf01.hostedemail.com; dkim=pass header.d=google.com header.s=20210112 header.b=W4zjRDSR; spf=pass (imf01.hostedemail.com: domain of emmir@google.com designates 209.85.208.48 as permitted sender) smtp.mailfrom=emmir@google.com; dmarc=pass (policy=reject) header.from=google.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1679062553; a=rsa-sha256; cv=none; b=HxpEtNsghWbDKlny2AcfK+wjFEnycQAK3ixPDmYvri5pDDkS2S+WdPzolSJj5MDWJ+KvPQ kyOiDURNY2WgMXAiMF8T43L0srPOmbte4C6cOYuaKlpEZxWLZSPnSLZzGis+sVmacv+5Qe I9zhBh4aI8OU6zzUi3e6uUrAi9ewKDg= Received: by mail-ed1-f48.google.com with SMTP id ek18so20938325edb.6 for ; Fri, 17 Mar 2023 07:15:53 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; t=1679062552; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=5tBpag1M9/QK3J44nio5y/z0shlcd59R/V5AkvISOo0=; b=W4zjRDSR1VwMi34d268bsm/oghzr1RPH1PKphPT5eretqaJvJLC5v2Oppto80nguAm CXmyOlupS0vT2fYXp6aHhdMZQBWNoOUK+qj9g8VspYuwvYytNlmF63RfTenZRCv0Ai+1 CnYRAF63jCEiSXqiOUouDtYZ7TKtEEF4zUEF4fRa9AecrI9Rq5NcwC+M/HuZV0qoOrXc lXsDcg/r0c/TxjriLZ0dku0sirbVxAlCKPVLRzDSl18++YjunB/KcM/mKGzv+v6kpThi wsC1C6QYWmSs2IOKBiPgpd5I4Mw94bbhddwbLAvet35CcKxC3TfBP4xSxq6p1J/cQTof AA7w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; t=1679062552; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=5tBpag1M9/QK3J44nio5y/z0shlcd59R/V5AkvISOo0=; b=XRk5oYqA4Gc8myrIz/dndDStM802A4oxkyeUvoacgNHzZUal2VSZDpnGO8xxREAg0k bBS5j0FDjEp5yJV6Yfm8NXY8jLoewZ5M5B9qON1W46tHY88kMjEUxHZmiapAckXIwmgJ /wbZVX3PyLKhgK4rLVqImupKcLJvr6tPHHpXGNcKESrSUJhdZ8drGmWIBXjw/azSNf50 35/l8/pyzp5PsFHDS70FReqP2oeWXEcDoVm7szPS3vz78WYndeF0tB/ky8bBezSnw5ib YnDx7P4y6UzzP3/TgdHn5bJJ5gDlqbRv8Wwv2sZvaTIc53Jq1c0R/yaH0gui+lHqQSUk sAsQ== X-Gm-Message-State: AO0yUKXeBzYnaNdxoHo9m3kd4DbJVWT1+NuQbOE3+VDe9b7LEMNjmJmv /wnlQTKwwZ3vdj1XCLFrfk28IZ4c/6NXiGqwWow8ug== X-Google-Smtp-Source: AK7set+DTtXSR9Avk0Jn6IT86sdVUC8+M9HffKHHDzCIBOtRhHD0v7nzpuzVOguMSBd6PF9nH5JdX241VByBNqeyCvk= X-Received: by 2002:a17:907:8dce:b0:931:99b5:67a4 with SMTP id tg14-20020a1709078dce00b0093199b567a4mr1593357ejc.15.1679062551783; Fri, 17 Mar 2023 07:15:51 -0700 (PDT) MIME-Version: 1.0 References: <20230309135718.1490461-1-usama.anjum@collabora.com> <20230309135718.1490461-5-usama.anjum@collabora.com> <3c8d9ea0-1382-be0c-8dd2-d490eedd3b55@collabora.com> <44d3f94e-fb9f-49df-7460-75dcee61802f@collabora.com> In-Reply-To: <44d3f94e-fb9f-49df-7460-75dcee61802f@collabora.com> From: =?UTF-8?B?TWljaGHFgiBNaXJvc8WCYXc=?= Date: Fri, 17 Mar 2023 15:15:40 +0100 Message-ID: Subject: Re: [PATCH v11 4/7] fs/proc/task_mmu: Implement IOCTL to get and optionally clear info about PTEs To: Muhammad Usama Anjum Cc: Peter Xu , David Hildenbrand , Andrew Morton , Andrei Vagin , Danylo Mocherniuk , Paul Gofman , Cyrill Gorcunov , Mike Rapoport , Nadav Amit , Alexander Viro , Shuah Khan , Christian Brauner , Yang Shi , Vlastimil Babka , "Liam R . Howlett" , Yun Zhou , Suren Baghdasaryan , Alex Sierra , Matthew Wilcox , Pasha Tatashin , Axel Rasmussen , "Gustavo A . R . Silva" , Dan Williams , linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kselftest@vger.kernel.org, Greg KH , kernel@collabora.com Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Server: rspam05 X-Rspamd-Queue-Id: 623F24001F X-Stat-Signature: 9y84do19jceeh4jtwaw9hr6hu64asui5 X-Rspam-User: X-HE-Tag: 1679062553-146865 X-HE-Meta: U2FsdGVkX1+ggJqpR+7jmuI1pnxp7X/eS/buxlo8heyz7YLOZfQaw6MEL7I7W5e+RSUeytsWI2nL0uR8xZ4MJxLv03K1aFLmEHtKAhKnewwVWZgyL9wuTwuRJJjkJ09t5ucAsGhm6cuNTbhVPahkSKdZh+le6CVrOmOf2GOd8TymN7LPln6L1CrH2JvkO0m1xgI1jmOxltvIK2lIbnlUkswqEmw+iPJBsnVYoMSOwOKfUxzHAbfV6p0TsS8di7zfPmIDrEci16mnCu/dtx0u6O18ZleDSimJFUSp+wuc7qnjTKTJKNjmKjpMGUsd64ZlKmPALNld1/GAdj0N8xWfgJn3PUWFJ2wIbpWDnRw8gBBJv6Ssq3az1oRftMVQyiljytFdoVQ1CqzLUdZygyK+0Py25mn4K8CgL1LuvklvHLIyLktlOGZU/gr8ys03NEDS+/M8rUPol9xeFeJiGE6FbyBLMGBDz0hSifu6AoiM5+JObupsnfv9S3RWNTEkvD7IS4GQVtm6Ra1ScF141GFXYWRwkvhcw08nbp34PLgAL7EkdmNRxWSo0e7jWlHJQgII3zK6FahkQgyup121vOEmE5BCauWChqdCrdPV/tzKIow5Pd5V8xoqdJYBuqKA/NBUHnn7r14K1JI4ljGCL0pWX6eSNbiXXfd5EBUAhMMXbcZcXoa0ptDJFpx/DyAV19uBh2OeDu93FGeqpVG2omC3k9+WBWu+WZotr3U/EWLfG81sNNg4Bkx8gXcKDm/WPLEDVJtdJEB2w/Ez9HEHRu1/GSS+MNI/+UrRfpCCmtYCYqx2xdh+VwIBBvN73LkPlGa0nmlQXCCEb7QlMIzgXGav8TKxsfXfArUsVxPegmcOplDviPRsjplefrWWcXT61pvZBfIoWYXWhxUkFVC/0xpZIPkO/wYSWxrcp5leGXhLfvfvPfj/04iTDEWIDdQrW5Pfcyk8/8LesyJ22b4cqu0 oa6grMKE ckJLAKS4ltpWIVS4tesXeb9DRkjvbDx40g9X61DtNTuYwbHssEcwulj8y1PBAmUAoJ9RGimNdlrnkSUvqPpOr8660NTWuVHBU8Q1NWlui+4yt/4rQOGH/pODHnIfGF0QdrA/7iKMQ3V+zfcnSC/qwGaP4FRYM/THwKdMeNVeR4fYPoliiP9XH+MsF5LzHMTrm5lXdX8mzR1m9cp5/CJb4hiXmU2I7g96SSZk3vjGfaIeum0L3146G/gYxPaGTYMUAM2D9qKnPwKd395AYKOVJCCITi6ytfUdhY++6IWRNEtWgRcLt/UDfIdeSIPpez3sanIn7pZRWTiRCbco3k6aey7SfPfEhmJLrJqlP6Ir3zT+XHrzEbL0PXjmnBumFNwG7843HfJx+6ryxiyrA8zFg5iD8gr6laHAni+G8NYfkbC1yooY= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Fri, 17 Mar 2023 at 13:44, Muhammad Usama Anjum wrote: > On 3/17/23 2:28=E2=80=AFAM, Micha=C5=82 Miros=C5=82aw wrote: > > On Thu, 16 Mar 2023 at 18:53, Muhammad Usama Anjum > > wrote: > >> On 3/13/23 9:02=E2=80=AFPM, Micha=C5=82 Miros=C5=82aw wrote: > >>> On Thu, 9 Mar 2023 at 14:58, Muhammad Usama Anjum > >>> wrote: > > [...] > >>>> --- a/fs/proc/task_mmu.c > >>>> +++ b/fs/proc/task_mmu.c > > [...] > >>>> +static int pagemap_scan_output(bool wt, bool file, bool pres, bool = swap, > > [...] > >>> The `cur->len` test seems redundant: is it possible to have > >>> `cur->start =3D=3D addr` in that case (I guess it would have to get > >>> `n_pages =3D=3D 0` in an earlier invocation)? > >> No, both wouldn't work. cur->len =3D=3D 0 means that it has only garba= ge. It is > >> essential to check the validity from cur->len before performing other > >> checks. Also cur->start can never be equal to addr as we are walking o= ver > >> page addressing in serial manner. We want to see here if the current > >> address matches the previous data by finding the ending address of las= t > >> stored data (cur->start + cur->len * PAGE_SIZE). > > > > If cur->len =3D=3D 0, then it doesn't matter if it gets merged or not -= it > > can be filtered out during the flush (see below). > > [...] > >>>> + } else if ((!p->vec_index) || > >>>> + ((p->vec_index + 1) < p->vec_len)) { > >>> > >>> Can you explain this test? Why not just `p->vec_index < p->vec_len`? = Or better: > >>> > >>> if (vec_index >=3D p->vec_len) > >>> return -ENOSPC; > >> > >> No, it'll not work. Lets leave it as it is. :) > >> > >> It has gotten somewhat complex, but I don't have any other way to make= it > >> simpler which works. First note the following points: > >> 1) We walk over 512 page or 1 thp at a time to not over allocate memor= y in > >> kernel (p->vec). > >> 2) We also want to merge the consecutive pages with the same flags int= o one > >> struct page_region. p->vec of current walk may merge with next walk. S= o we > >> cannot write to user memory until we find the results of the next walk= . > >> > >> So most recent data is put into p->cur. When non-intersecting or merge= able > >> data is found, we move p->cur to p->vec[p->index] inside the page walk= . > >> After the page walk, p->vec[0 to p->index] is moved to arg->vec. After= all > >> the walks are over. We move the p->cur to arg->vec. It completes the d= ata > >> transfer to user buffer. > > [...] > >> I'm so sorry that it has gotten this much complex. It was way simpler = when > >> we were walking over all the memory in one go. But then we needed an > >> unbounded memory from the kernel which we don't want. > > [...] > > > > I've gone through and hopefully understood the code. I'm not sure this > > needs to be so complicated: when traversing a single PMD you can > > always copy p->cur to p->vec[p->vec_index++] because you can have at > > most pages_per_PMD non-merges (in the worst case the last page always > > is left in p->cur and whole p->vec is used). After each PMD p->vec > > needs a flush if p->vec_index > 0, skipping the dummy entry at front > > (len =3D=3D 0; if present). (This is mostly how it is implemented now, = but > > I propose to remove the "overflow" check and do the starting guard > > removal only every PMD.) > Sorry, unable to understand where to remove the guard? Instead of checking for it in pagemap_scan_output() for each page you can skip it in do_pagemap_cmd() when doing the flush. > > BTW#2, I think the ENOSPC return in pagemap_scan_output() should > > happen later - only if the pages would match and that caused the count > > to exceed the limit. For THP n_pages should be truncated to the limit > > (and ENOSPC returned right away) only after the pages were verified to > > match. > We have 2 counters here: > * the p->max_pages optionally can be set to find out only N pages of > interest. So p->found_pages is counting this. We need to return early if > the page limit is complete. > * the p->vec_index keeps track of output buffer array size I think I get how the limits are supposed to work, but I also think the implementation is not optimal. An example (assuming max_pages =3D 1 and vec_len =3D 1): - a matching page has been found - a second - non-matching - is tried but results in immediate -ENOSPC. -> In this case I'd expect the early return to happen just after the first page is found so that non A similar problem occurs for hugepage: when the limit is hit (we found >=3D max_pages, n_pages is possibly truncated), but the scan continues until next page / PMD. [...] > >>>> + if (!arg->required_mask && !arg->anyof_mask && > >>>> + !arg->excluded_mask) > >>>> + return false; > >>> > >>> Is there an assumption in the code that those checks are needed? I'd > >>> expect that no selection criteria makes a valid page set? > >> In my view, selection criterion must be specified for the ioctl to wor= k. If > >> there is no criterio, user should go and read pagemap file directly. S= o the > >> assumption is that at least one selection criterion must be specified. > > > > Yes. I'm not sure we need to prevent multiple ways of doing the same > > thing. But doesn't pagemap reading lack the range aggregation feature? > Yeah, correct. But note that we are supporting only selective 4 flags in > this ioctl, not all pagemap flags. So it is useful for only those users w= ho > depend only on these 4 flags. Out pagemap_ioctl interface is not so much > generic that we can cater anyone. Its interface is specific and we are > adding only those cases which are of our interest. So if someone wants > range aggregation from pagemap_ioctl, he'll need to add that flag in the > IOCTL first. When IOCTL support is added, he can specify the selection > criterion etc. The available flag set is not a problem. An example usecase: dumping the memory state for debugging: ioctl(return_mask=3DALL) returns a conveniently compact vector of ranges of pages that are actually used by the process (not only having reserved the virtual space). This is actually something that helps dumping processes with using tools like AddressSanitizer that create huge sparse mappings. Best Regards Micha=C5=82 Miros=C5=82aw