Date: Tue, 28 Feb 2023 10:55:40 -0500
From: Peter Xu
To: Nadav Amit
Cc: Muhammad Usama Anjum, Mike Rapoport, Michał Mirosław, Andrew Morton,
    Alexander Viro, Cyrill Gorcunov, Paul Gofman, Danylo Mocherniuk,
    Shuah Khan, Christian Brauner, Yang Shi, Vlastimil Babka,
    "Liam R . Howlett", Yun Zhou, Suren Baghdasaryan, Alex Sierra,
    Matthew Wilcox, Pasha Tatashin, Axel Rasmussen,
    "Gustavo A . R . Silva", Dan Williams, kernel list, linux-fsdevel,
    linux-mm, linux-kselftest, Greg KH, "kernel@collabora.com",
    David Hildenbrand, Andrei Vagin
Subject: Re: [PATCH v10 3/6] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs
References: <20230202112915.867409-1-usama.anjum@collabora.com>
    <20230202112915.867409-4-usama.anjum@collabora.com>
    <2fe790e5-89e0-d660-79cb-15160dffd907@collabora.com>
    <751CCD6C-BFD1-42BD-A651-AE8E9568568C@vmware.com>
    <5D5DEEED-55EB-457B-9EB7-C6D5B326FE99@vmware.com>
In-Reply-To: <5D5DEEED-55EB-457B-9EB7-C6D5B326FE99@vmware.com>

On Mon, Feb 27, 2023 at 11:09:12PM +0000, Nadav Amit wrote:
>
>
> > On Feb 27, 2023, at 1:18 PM, Peter Xu wrote:
> >
> > On Thu, Feb 23, 2023 at 05:11:11PM +0000, Nadav Amit wrote:
> >> From my experience with UFFD, proper ordering of events is crucial, although it
> >> is not always done well. Therefore, we should aim for improvement, not
> >> regression. I believe that utilizing the pagemap-based mechanism for WP'ing
> >> might be a step in the wrong direction. I think that it would have been better
> >> to emit a 'UFFD_FEATURE_WP_ASYNC' WP-log (and ordered) with UFFD #PF and
> >> events. The 'UFFD_FEATURE_WP_ASYNC'-log may not need to wake waiters on the
> >> file descriptor unless the log is full.
> >
> > Yes this is an interesting question to think about..
> >
> > Keeping the data in the pgtable has one good thing that it doesn't need any
> > complexity on maintaining the log, and no possibility of "log full".
>
> I understand your concern, but I think that eventually it might be simpler
> to maintain, since the logic of how to process the log is moved to userspace.
>
> At the same time, handling inputs from pagemap and uffd handlers and sync’ing
> them would not be too easy for userspace.

I do not expect a common uffd-wp async user to provide a fault handler at
all.  In my imagination it's in most cases used standalone from other uffd
modes; it means all the faults will still be handled by the kernel.  Here
we only leverage the accuracy of userfaultfd compared to soft-dirty, so
these are not really "user" faults.

> But yes, allocation on the heap for userfaultfd_wait_queue-like entries would
> be needed, and there are some issues of ordering the events (I think all #PF
> and other events should be ordered regardless) and how not to traverse all
> async-userfaultfd_wait_queue’s (except those that block if the log is full)
> when a wakeup is needed.

Will there be an ordering requirement for an async mode?  Considering it
should be async to whatever else, I would think it's not a problem, but
maybe I missed something.

> >
> > If there's possible "log full" then the next question is whether we should
> > let the worker wait the monitor if the monitor is not fast enough to
> > collect those data. It adds some slight dependency on the two threads, I
> > think it can make the tracking harder or impossible in latency sensitive
> > workloads.
>
> Again, I understand your concern. But this model that I propose is not new.
> It is used with PML (page-modification logging) and KVM, and IIRC there is
> a similar interface between KVM and QEMU to provide this information. There
> are endless other examples for similar producer-consumer mechanisms that
> might lead to stall in extreme cases.

Yes, I'm not against thinking of using similar structures here.  It's just
that it's definitely more complicated on the interface side; at least we
need yet another interface to set up the rings and define how they are used.

Note that although Muhammad is defining another new interface here too for
pagemap, I don't think it's strictly needed for uffd-wp async mode.
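
As a rough illustration of what the current pagemap interface already
exposes, here is a minimal Python sketch (not taken from the patch series)
that decodes the uffd-wp bit of /proc/<pid>/pagemap entries.  The bit
positions follow the kernel's pagemap documentation (bit 55 soft-dirty,
bit 57 uffd-wp write-protected, bit 63 present); the helper names are made
up for this example, and treating "uffd-wp bit cleared" as "written"
assumes the range was registered with userfaultfd and write-protected
beforehand.

```python
import os
import struct

# Bit positions as documented in Documentation/admin-guide/mm/pagemap.rst.
PM_SOFT_DIRTY = 1 << 55  # pte is soft-dirty
PM_UFFD_WP = 1 << 57     # pte is uffd-wp write-protected
PM_PRESENT = 1 << 63     # page is present in RAM


def decode_pagemap_entry(entry):
    """Decode the flag bits of one 64-bit pagemap entry (hypothetical helper)."""
    return {
        "present": bool(entry & PM_PRESENT),
        "soft_dirty": bool(entry & PM_SOFT_DIRTY),
        "uffd_wp": bool(entry & PM_UFFD_WP),
    }


def written_pages(entries):
    """Indices of pages whose uffd-wp bit is gone.

    Assumption: every page in the range was write-protected via uffd-wp
    before the workload ran, so a cleared bit means the page was written.
    """
    return [i for i, e in enumerate(entries) if not (e & PM_UFFD_WP)]


def read_pagemap_entries(pid, vaddr, npages):
    """Read npages raw entries starting at vaddr from /proc/<pid>/pagemap."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    offset = (vaddr // page_size) * 8  # one 8-byte entry per virtual page
    with open(f"/proc/{pid}/pagemap", "rb") as f:
        f.seek(offset)
        data = f.read(npages * 8)
    # Entries are native-endian unsigned 64-bit integers.
    return list(struct.unpack(f"={len(data) // 8}Q", data))
```

On Linux, written_pages(read_pagemap_entries(os.getpid(), addr, n)) would
list the pages of a tracked range that lost their write protection.  The
sketch only reads; it does not re-protect anything.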
One can use uffd-wp async mode with PM_UFFD_WP, which is already part of
the current pagemap interface.

So what Muhammad is proposing here are two things to me: (1) uffd-wp async,
plus (2) a new pagemap interface (which will closely work with (1) only if
we need atomicity on get-dirty and reprotect).

Defining a new interface for uffd-wp async mode would be something extra, so
IMHO besides the heap allocation on the rings, we also need to justify
whether that is needed.  That's why I think it's fine to go with what
Muhammad proposed, because it's a minimum changeset at least for userfault
to support an async mode, and anything else can be done on top if
necessary.

Going a bit back to the "lead to stall in extreme cases" above, just also
want to mention that the VM use case is slightly different - dirty tracking
is only heavily used during migration afaict, and it's a short period.  Not
a lot of people will complain that performance degrades during that period
because that's just rare.  And, even without the ring the perf is really
bad during migration anyway...  Especially when huge pages are used to back
the guest RAM.

Here it's slightly different to me: it's about tracking dirty pages during
any possible workload, and it can be monitored periodically and frequently.
So IMHO the requirements are stricter than in the VM use case, where
migration is the only period that uses it.

> >
> > The other thing is we can also make the log "never gonna full" by making it
> > a bitmap covering any registered ranges, but I don't either know whether
> > it'll be worth it for the effort.
>
> I do not see a benefit of half-log half-scan. It tries to take the
> data-structure of one format and combine it with another.

What I'm saying here is not half-log / half-scan, but using a single bitmap
to store which pages are dirty, just like KVM_GET_DIRTY_LOG.  I think it
avoids the "stall" issue above.

>
> Anyhow, I was just giving my 2 cents.
> Admittedly, I did not follow the
> threads of previous versions and I did not see userspace components that
> use the API to say something smart.

Actually similar here. :)  So I'm probably not the best one to describe
what the best API would look like.

What I know is that the new pagemap interface is welcomed by CRIU
developers, so it may be something good with/without userfaultfd getting
involved already.  I see this as "let's add one more bit for uffd-wp" in
the new interface only.

Quoting some links I got from Muhammad before with CRIU usage:

https://lore.kernel.org/all/YyiDg79flhWoMDZB@gmail.com
https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com

> Personally, I do not find the current API proposal to be very consistent
> and simple, and it seems to me that it lets pagemap do
> userfaultfd-related tasks, which might be considered inappropriate and
> non-intuitive.

Yes, I agree.  I just don't know what's the best way to avoid this.

The issue here IIUC is that Muhammad needs one operation to do what Windows
does with the GetWriteWatch() API.  It means we need to mix up GET and
PROTECT in a single shot.  If we want to use pagemap as GET, then to me
there is no choice but to do PROTECT here too.

I think it'll be the same for soft-dirty if it's used: it means we'll
extend soft-dirty modifications from clear_refs to pagemap too, which I
also don't think is as clean.

>
> If I derailed the discussion, I apologize.

Not at all.  I just wish you had joined earlier!

-- 
Peter Xu