From: Nadav Amit
Subject: Re: [PATCH v1 2/5] userfaultfd: introduce access-likely mode for common operations
Date: Tue, 12 Jul 2022 18:09:35 -0700
To: Peter Xu
Cc: Linux MM, Mike Kravetz, Hugh Dickins, Andrew Morton, Axel Rasmussen, David Hildenbrand, Mike Rapoport
References: <20220622185038.71740-1-namit@vmware.com> <20220622185038.71740-3-namit@vmware.com> <5D85870C-CBDF-45F7-A3A5-5F889521BE41@vmware.com>

On Jul 12, 2022, at 7:56 AM, Peter Xu wrote:

> Hi, Nadav,
>
> On Tue, Jul 12, 2022 at 06:19:08AM +0000, Nadav Amit wrote:
>> On Jun 22, 2022, at 11:50 AM, Nadav Amit wrote:
>>
>>> From: Nadav Amit
>>>
>>> Using a PTE on x86 with cleared access-bit (aka young-bit)
>>> takes ~600 cycles more than when the access bit is set. At the same
>>> time, setting the access-bit for memory that is not used (e.g.,
>>> prefetched) can introduce greater overheads, as the prefetched memory is
>>> reclaimed later than it should be.
>>>
>>> Userfaultfd currently does not set the access-bit (excluding the
>>> huge-pages case). Arguably, it is best to let the user control whether
>>> the access bit should be set or not. The expected use is to request
>>> userfaultfd to set the access-bit when the copy/wp operation is done to
>>> resolve a page-fault, and not to set the access-bit when the memory is
>>> prefetched.
>>>
>>> Introduce UFFDIO_[op]_ACCESS_LIKELY to enable userspace to request the
>>> young bit to be set.
>>
>> I reply to my own email, but this mostly addresses the concerns that Peter
>> has raised.
>>
>> So I ran the test below on my Haswell (x86), which showed two things:
>>
>> 1. Accessing an address using a clean PTE or old PTE takes ~500 cycles
>> more than with dirty+young (depending on the access, of course: dirty
>> does not matter for read, dirty+young both matter for write).
>>
>> 2. I made a mistake in my implementation. PTEs are - at least on x86 -
>> created as young with mk_pte(). So the logic should be similar to
>> do_set_pte():
>>
>>	if (prefault && arch_wants_old_prefaulted_pte())
>>		entry = pte_mkold(entry);
>>	else
>>		entry = pte_sw_mkyoung(entry);
>>
>> Based on these results, I will send another version for both young and
>> dirty. Let me know if these results are not convincing.
>
> Thanks for trying to verify this idea, but I'm not fully sure this is what
> my concern was on WRITE_LIKELY.
>
> AFAICT the test below was trying to measure the overhead of hardware
> setting either access or dirty or both bits when they're not set for
> read/write.

Indeed.

>
> What I wanted as a justification is whether WRITE_LIKELY would be helpful
> in any real world scenario at all. AFAIK the only way to prove it so far
> is to measure any tlb flush difference (probably only on x86, since that
> tlb code is only compiled on x86) that may trigger with W=0,D=1 but may not
> trigger with W=0,D=0 (where W stands for "write bit", and D stands for
> "dirty bit").
>
> It's not about the slowness when D is cleared.
>
> The core thing is (sorry to rephrase, but just hope we're on the same page)
> we'll set D bit always for all uffd pages so far. Even if we want to
> change that behavior so we skip setting D bit for RO pages (we'll need to
> convert the dirty bit into PageDirty though), we'll still always set D bit
> for writable pages. So we always set D bit as long as possible and we'll
> never suffer from hardware overhead on setting D bit for uffd pages.

Thanks as usual for your clarifications.
As you see, I also do my best to be on the same page with you, even if
from time to time I fail. I had some recent communication challenges on
lkml, so I hope that you understand that everything I say is said with
full respect, and if I use double-quotes while arguing with you, it is in
good spirit, and I really appreciate your feedback.

Ok. So there is a lot to digest in what you just said, and I politely
disagree with some of the assertions that you made. You focus on
discussing the issue of whether we set the dirty bit for RO pages, which
in my opinion is so intuitively wrong. But I think that discussing this
issue really distracts us from the benefits of not setting the D-bit when
it is unnecessary for RW pages, which is the main question.

But before we get to it, I want to argue with some of the "facts" that
you present:

1. "D bit always for all uffd pages" - This is true for almost all UFFD
operations, but not true for write-unprotect. The moment we use David's
MM_CP_TRY_CHANGE_WRITABLE in UFFD, the PTE would be writable but not
dirty. [Yes, the patches that I sent do not deal with that: as I noted
before, I want to send it as part of v2.] Arguably, one of the places
where setting the D-bit matters the most is when you change a PTE from
RO->RW. And anyhow, as you can see, the API is inconsistent.

2. "we'll still always set D bit for writable pages" - To continue my
previous bullet - why? Why would we? Besides the uffd-wrunprotect path,
why would we always set it for MCOPY_CONTINUE? (more to follow on this
one)

3. "measure any tlb flush difference ... that may trigger with W=0,D=1
but may not trigger with W=0,D=0" - This is really a boring case, which,
even if it were under-optimized, could have been resolved on its own.
This is certainly not the motivation for the write hints.

So please allow me to go back to the reasons why you want write-hints.
RO entries are mostly left out here; they are implicitly covered by the
other arguments, which are mainly about RW entries:

1. You can avoid TLB flushes when write-protecting clean writable PTEs
(on x86). Such PTEs can be created when a userspace monitor prefaults or
speculatively write-unprotects memory that might be needed. When a
monitor removes the mapping, using MADV_DONTNEED, it would not need to
flush anything.

2. Hopefully you agree that write-hints are needed if during
uffd-write-unprotect we actually write-unprotect the PTE (not just
clearing the logical flag). So you do need a write-hint for zero-page (to
get a clear-page) and for write-unprotect, so why not be consistent and
provide it for all operations?

3. For UFFDIO_CONTINUE. Admittedly, I am not very familiar with
UFFDIO_CONTINUE, but from the discussion (and skimming the code) I
understood that it can be used for prefetching. In such a case, why would
you assume that the page is dirty?

4. It allows you to treat softdirty properly. If softdirty is on,
presumably, if you did not get a WRITE_HINT, you would keep it
write-protected.

5. Consistency with UFFDIO_ZERO that needs a write-hint to clear a page.

Now I will "kill myself" over support of write-hints. Not everything that
I mentioned here is in v1 that I sent.

But, if you decide you do not want write-hints, I would still need to
leave the UFFDIO_ZERO write-hints (that clear the page) and to find some
solution for UFFDIO_WRITEPROTECT (that should set the dirty-bit for
better performance of WP page-fault handling).

If you decide you do want it, I would run some tests to check that indeed
the access-time of UFFDIO_WRITEPROTECT reflects the fact that no TLB
flush was needed.

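To make reason 1 above concrete, here is a rough userspace sketch of the
flush decision. It is a toy model, not kernel code: struct toy_pte and the
toy_* functions are hypothetical names made up for illustration, and the
rule it encodes (only a writable and dirty PTE needs a flush when it is
write-protected on x86) is the assumption stated in this thread, not
something the sketch verifies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy PTE: only the two bits the argument depends on. Not a kernel type. */
struct toy_pte {
	bool writable;	/* W: hardware may write through this entry */
	bool dirty;	/* D: hardware has already recorded a write */
};

/*
 * Assumption taken from the discussion: on x86, a write through a clean
 * PTE makes the CPU go back to the in-memory PTE to set D first, so a
 * stale TLB entry cannot be used to write silently. Under that
 * assumption, only a PTE that is both writable and dirty needs a TLB
 * flush when it is write-protected.
 */
bool toy_wp_needs_flush(const struct toy_pte *pte)
{
	assert(pte != NULL);
	return pte->writable && pte->dirty;
}

/* Write-protect the toy PTE; returns whether a flush would be needed. */
bool toy_write_protect(struct toy_pte *pte)
{
	bool flush = toy_wp_needs_flush(pte);

	pte->writable = false;
	return flush;
}
```

In this model a PTE that was prefaulted without a write hint (W=1,D=0)
can later be write-protected without a flush, while one that was
installed dirty (W=1,D=1) cannot.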
> The other worry of having WRITE_HINT is, after we have it we probably need
> to _not_ apply dirty bit when WRITE_HINT is not set (which is actually a
> very light ABI change since we used to always set it), then I'll start to
> worry the hardware setting D bit overhead you just measured because we'll
> have that overhead when user didn't specify WRITE_HINT with the old code.

Good point, which I have already got to. Because I was mistaken in my
understanding of ACCESS_HINT (the young-bit is set by default), the hints
should only be honored when the "hints" feature is enabled, for backward
compatibility. I have this code ready for v2.

>
> So again, I'm totally fine if you want to start with ACCESS_HINT only, but
> I still don't see why we should need WRITE_HINT too..

I really hate how the "features" evolve, so I think it is better to
decide now.

I think that in the absence of agreement, the best thing to do is to put
all of the write-hints in the API now, and only honor them for
UFFDIO_ZERO (which would clear the page) and UFFDIO_WRITE(un)PROTECT
(that would set the old bit).

Again, thanks for the feedback, and hopefully (at least) I understood you
this time.
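
The backward-compatibility rule discussed above (hints are only honored
once the "hints" feature is enabled) can be sketched roughly as below.
This is not the v2 code: the TOY_HINT_* flags, struct toy_pte_bits and
toy_resolve_bits() are hypothetical names, and the legacy defaults (young
and dirty both set) are simply what this thread describes as today's
behavior.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical hint flags; the real UFFDIO_* names and values may differ. */
#define TOY_HINT_ACCESS_LIKELY	(1u << 0)
#define TOY_HINT_WRITE_LIKELY	(1u << 1)

/* The two PTE bits whose initial value the hints would control. */
struct toy_pte_bits {
	bool young;
	bool dirty;
};

/*
 * hints_enabled models whether userspace opted in to the "hints"
 * feature. Without it, keep the legacy behavior described in the
 * thread: the PTE is created young and the dirty bit is always set.
 * With it, each bit is set only if the matching hint was passed.
 */
struct toy_pte_bits toy_resolve_bits(bool hints_enabled, unsigned int hints)
{
	struct toy_pte_bits bits = { .young = true, .dirty = true };

	if (hints_enabled) {
		bits.young = (hints & TOY_HINT_ACCESS_LIKELY) != 0;
		bits.dirty = (hints & TOY_HINT_WRITE_LIKELY) != 0;
	}
	return bits;
}
```

With this shape, an old binary that never enables the feature sees no
change, while a monitor that opts in and prefetches with no hints gets an
old, clean PTE.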