From: Nadav Amit
To: Mike Rapoport
Cc: David Hildenbrand, Peter Xu, Linux MM, Mike Kravetz, Hugh Dickins, Andrew Morton, Axel Rasmussen
Subject: Re: [PATCH RFC] userfaultfd: introduce UFFDIO_COPY_MODE_YOUNG
Date: Tue, 14 Jun 2022 12:25:36 -0700
Message-Id: <06230F13-F08C-474E-A06B-62A89AE856D2@gmail.com>
References: <20220613204043.98432-1-namit@vmware.com> <3eea2e6e-1646-546a-d9ef-d30052c00c7d@redhat.com>

On Jun 14, 2022, at 11:56 AM, Mike Rapoport wrote:

> On Tue, Jun 14, 2022 at 09:18:43AM -0700, Nadav Amit wrote:
>> On Jun 14, 2022, at 8:22 AM, David Hildenbrand wrote:
>>
>>> On 13.06.22 22:40, Nadav Amit wrote:
>>>> From: Nadav Amit
>>>>
>>>> As we know, using a PTE on x86 with cleared access-bit (aka young-bit)
>>>> takes ~600 cycles more than when the access-bit is set. At the same
>>>> time, setting the access-bit for memory that is not used (e.g.,
>>>> prefetched) can introduce greater overheads, as the prefetched memory is
>>>> reclaimed later than it should be.
>>>>
>>>> Userfaultfd currently does not set the access-bit (excluding the
>>>> huge-pages case). Arguably, it is best to let the uffd monitor control
>>>> whether the access-bit should be set or not. The expected use is for the
>>>> monitor to request userfaultfd to set the access-bit when the copy
>>>> operation is done to resolve a page-fault, and not to set the young-bit
>>>> when the memory is prefetched.
>>>
>>> Thinking out loud about existing users: postcopy live migration in QEMU
>>> has two usage for placement of pages
>>>
>>> a) Resolving a fault. E.g., a VCPU might be waiting for resolution to
>>> make progress.
>>> b) Background migration to converge without faults on all relevant
>>> pages.
>>>
>>> I guess in a) we'd want UFFDIO_COPY_MODE_YOUNG in b) we don't want it.
>>>
>>>
>>> I wonder, however, instead of calling this "young", which implies what
>>> the OS should or shouldn't do, to define this as a hint that the placed
>>> page is very likely to be accessed next.
>>>
>>> I'm bad at naming, UFFDIO_COPY_MODE_ACCESS_LIKELY would express what I
>>> have in mind.
>>
>> How about UFFDIO_COPY_MODE_WILLNEED_READ ?
>>
>>>> Introduce UFFDIO_COPY_MODE_YOUNG to enable userspace to request the
>>>> young bit to be set. For UFFDIO_CONTINUE and UFFDIO_ZEROPAGE set the bit
>>>> unconditionally since the former is only used to resolve page-faults and
>>>> the latter would not benefit from not setting the access-bit.
>>>>
>>>> Cc: Mike Kravetz
>>>> Cc: Hugh Dickins
>>>> Cc: Andrew Morton
>>>> Cc: Axel Rasmussen
>>>> Cc: Peter Xu
>>>> Cc: David Hildenbrand
>>>> Cc: Mike Rapoport
>>>> Signed-off-by: Nadav Amit
>>>>
>>>> ---
>>>>
>>>> There are 2 possible enhancements:
>>>>
>>>> 1. Use the flag to decide on whether to mark the PTE as dirty (for
>>>> writable PTEs). I guess that setting the dirty-bit is as expensive as
>>>> setting the access-bit, and setting it introduces similar tradeoffs,
>>>> as mentioned above.
>>>>
>>>> 2. Introduce a similar mode for write-protect and use this information
>>>> for setting both the young and dirty bits. Makes one wonder whether
>>>> mprotect() should also set the bit in certain cases...
>>>
>>> I wonder if UFFDIO_COPY_MODE_READ_ACCESS_LIKELY vs.
>>> UFFDIO_COPY_WRITE_ACCESS_LIKELY could even make sense. I feel like it could.
>>>
>>> For example, QEMU knows if a page fault it's resolving was due to a read
>>> or a write fault and could use that information accordingly. Of course,
>>> we don't completely know if we currently have a read fault, if we could
>>> get a write fault immediately after.
>>>
>>> Especially in the context of UFFDIO_ZEROPAGE,
>>> UFFDIO_ZEROPAGE_WRITE_ACCESS_LIKELY could ... not place the zeropage but
>>> instead populate an actual page and mark it accessed+dirty. I even have
>>> a use case for that ;)
>>>
>>>
>>> The kernel could decide how to treat these hints -- for example, if it
>>> doesn't want user space to mess with access/dirty bits, it could just
>>> mostly ignore the hints.
>>
>> I can do that. I think users can do the zero page-copy themselves today, but
>> whatever you prefer.
>>
>> But, I cannot take it anymore: the list of arguments for uffd stuff is
>> crazy. I would like to collect all the possible arguments that are used for
>> uffd operation into some “struct uffd_op”.
>
> Squashing boolean parameters into int flags will also reduce the insane
> amount of parameters. No strong feelings though.
>
>> Any objection?

Thanks. I also noticed a couple of embarrassing bugs that I made. Will send v1 with fixes.
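
For illustration only, here is a rough sketch in C of how the two ideas in this thread could fit together: access hints instead of a "young" flag, and a single op struct with int flags instead of a long argument list. Every identifier in it (struct uffd_op_sketch, the UFFD_OP_* flags, enum placement_reason and the helpers) is a hypothetical name invented for this sketch. None of them is an existing kernel or uapi symbol, and the bit values are arbitrary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flags: boolean parameters squashed into one int. */
enum uffd_op_flag_sketch {
	UFFD_OP_WP            = 1u << 0, /* map the page write-protected */
	UFFD_OP_DONTWAKE      = 1u << 1, /* do not wake the faulting thread */
	UFFD_OP_ACCESS_LIKELY = 1u << 2, /* hint: page is likely read next */
	UFFD_OP_WRITE_LIKELY  = 1u << 3, /* hint: page is likely written next */
};

/* Hypothetical container for what is passed as separate arguments today. */
struct uffd_op_sketch {
	uint64_t dst;        /* destination address in the registered range */
	uint64_t src;        /* source buffer address */
	uint64_t len;        /* length in bytes */
	unsigned int flags;  /* OR of enum uffd_op_flag_sketch values */
};

/* Why the monitor places the page (cases a and b from the thread). */
enum placement_reason {
	PLACE_FAULT_READ,   /* resolving a read fault */
	PLACE_FAULT_WRITE,  /* resolving a write fault */
	PLACE_BACKGROUND,   /* prefetch or background migration */
};

/* Monitor side: pick hints from the reason for the placement. */
static inline unsigned int hint_flags_for(enum placement_reason reason)
{
	switch (reason) {
	case PLACE_FAULT_READ:
		return UFFD_OP_ACCESS_LIKELY;
	case PLACE_FAULT_WRITE:
		return UFFD_OP_ACCESS_LIKELY | UFFD_OP_WRITE_LIKELY;
	case PLACE_BACKGROUND:
		return 0; /* no hint, so reclaim is not delayed */
	}
	assert(!"unknown placement reason");
	return 0;
}

static inline struct uffd_op_sketch
make_copy_op(uint64_t dst, uint64_t src, uint64_t len,
	     enum placement_reason reason)
{
	struct uffd_op_sketch op = {
		.dst = dst, .src = src, .len = len,
		.flags = hint_flags_for(reason),
	};
	return op;
}

/* Kernel side: the hints may be honored or ignored; this sketch honors them. */
static inline bool op_wants_young(const struct uffd_op_sketch *op)
{
	return (op->flags & UFFD_OP_ACCESS_LIKELY) != 0;
}

static inline bool op_wants_dirty(const struct uffd_op_sketch *op)
{
	return (op->flags & UFFD_OP_WRITE_LIKELY) != 0;
}
```

With this shape, adding another hint costs one flag bit rather than one more parameter on every function in the call chain, and the kernel stays free to ignore hints it does not want user space to control.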