From: Kyle Huey
Date: Sun, 11 May 2025 20:06:03 -0700
Subject: Re: Suppress pte soft-dirty bit with UFFDIO_COPY?
To: Peter Xu
Cc: Andrew Morton, open list, linux-mm@kvack.org, criu@lists.linux.dev, "Robert O'Callahan"

On Mon, May 5, 2025 at 3:15 PM Kyle Huey wrote:
>
> On Mon, May 5, 2025 at 1:05 PM Peter Xu wrote:
> >
> > Hi, Kyle,
> >
> > On Mon, May 05, 2025 at 09:37:01AM -0700, Kyle Huey wrote:
> > > tl;dr I'd like to add UFFDIO_COPY_MODE_DONTSOFTDIRTY that does not add
> > > the _PAGE_SOFT_DIRTY bit to the relevant pte flags. Any
> > > thoughts/objections?
> > >
> > > The kernel has a "soft-dirty" bit on ptes which tracks if they've been
> > > written to since the last time /proc/pid/clear_refs was used to clear
> > > the soft-dirty bit. CRIU uses this to track which pages have been
> > > modified since a previous checkpoint and reduce the size of the
> > > checkpoints taken. I would like to use this in my debugger[0] to track
> > > which pages a program function dirties when that function is invoked
> > > from the debugger.
> > >
> > > However, the runtime environment for this function is rather unusual.
> > > In my debugger, the process being debugged doesn't actually exist
> > > while it's being debugged. Instead, we have a database of all program
> > > state (including registers and memory values) from when the process
> > > was executed. It's in some sense a giant core dump that spans multiple
> > > points in time. To execute a program function from the debugger we
> > > rematerialize the program state at the desired point in time from our
> > > database.
> > >
> > > For performance reasons, we fill in the memory lazily[1] via
> > > userfaultfd. This makes it difficult to use the soft-dirty bit to
> > > track the writes the function triggers, because UFFDIO_COPY (and
> > > friends) mark every page they touch as soft-dirty. Because we have the
> > > canonical source of truth for the pages we materialize via UFFDIO_COPY
> > > we're only interested in what happens after the userfaultfd operation.
> > >
> > > Clearing the soft-dirty bit is complicated by two things:
> > > 1. There's no way to clear the soft-dirty bit on a single pte, so
> > > instead we have to clear the soft-dirty bits for the entire process.
> > > That requires us to process all the soft-dirty bits on every other pte
> > > immediately to avoid data loss.
> > > 2. We need to clear the soft-dirty bits after the userfaultfd
> > > operation, but in order to avoid racing with the task that triggered
> > > the page fault we have to do a non-waking copy, then clear the bits,
> > > and then separately wake up the task.
> > >
> > > To work around all of this, we currently have a 4 step process:
> > > 1. Read /proc/pid/pagemap and note all ptes that are soft-dirty.
> > > 2. Do the UFFDIO_COPY with UFFDIO_COPY_MODE_DONTWAKE.
> > > 3. Write to /proc/pid/clear_refs to clear soft-dirty bits across the process.
> > > 4. Do a UFFDIO_WAKE.
> > >
> > > The overhead of all of this (particularly step 1) is a millisecond or
> > > two *per page* that we lazily materialize, and while that's not
> > > crippling for our purposes, it is rather undesirable. What I would
> > > like to have instead is a UFFDIO_COPY mode that leaves the soft-dirty
> > > bit unchanged, i.e. a UFFDIO_COPY_MODE_DONTSOFTDIRTY. Since we clear
> > > all the soft-dirty bits once after setting up all the mmaps in the
> > > process the relevant ptes would then "just do the right thing" from
> > > our perspective.
> > >
> > > But I do want to get some feedback on this before I spend time writing
> > > any code. Is there a reason not to do this? Or an alternate way to
> > > achieve the same goal?
> >
> > Have you looked at the wr-protect mode, and UFFDIO_COPY_MODE_WP for _COPY?
> >
> > If sync fault is a perf concern for frequent writes, just to mention at
> > least latest Linux also supports async tracking (UFFD_FEATURE_WP_ASYNC),
> > which is almost exactly soft dirty bits to me, though it solves a few
> > issues it has on e.g. false positives over vma merging and swapping, or
> > like you said missing of finer granule reset mechanisms.
> >
> > Maybe you also want to have a look at the pagemap ioctl introduced some
> > time ago ("Pagemap Scan IOCTL", which, IIRC was trying to use uffd-wp in
> > soft-dirty-like way):
> >
> > https://www.kernel.org/doc/Documentation/admin-guide/mm/pagemap.rst
> >
> Thanks. This is all very helpful and I think I can construct what I
> need out of these building blocks.
>
> - Kyle

That works like a charm, thanks. The only problem I ran into is that
the man page for userfaultfd(2) claims there's a handshake pattern
where you can call UFFDIO_API twice, once with 0 to enumerate all
supported features, and then again with the feature mask you want to
initialize the API. In reality the API only permits a single
UFFDIO_API call because of the internal UFFD_FEATURE_INITIALIZED flag,
so doing this handshake requires creating a sacrificial fd. If the man
page is not just totally wrong then this may have been an
unintentional regression from 22e5fe2a2a279.

- Kyle

> > > If this is generally sensible, then a couple questions:
> > > 1. Do I need a UFFD_FEATURE flag for this, or is it enough for a
> > > program to be able to detect the existence of a
> > > UFFDIO_COPY_MODE_DONTSOFTDIRTY by whether the ioctl accepts the flag
> > > or returns EINVAL? I would tend to think the latter.
> >
> > The latter requires all the setups needed, and an useless ioctl to probe.
> > Not a huge issue, but since userfaultfd is extensible, a feature flag might
> > be better as long as a new feature is well defined.
> >
> > > 2. Should I add this mode for the other UFFDIO variants (ZEROPAGE,
> > > MOVE, etc) at the same time even if I don't have any use for them?
> >
> > Probably not. I don't see a need to implement something just to make the
> > API look good.. If any chunk of code in the Linux kernel has no plan to be
> > used, we should probably not adding them since the start..
> >
> > Thanks,
> >
> > --
> > Peter Xu
> >
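
For illustration, the sacrificial-fd handshake described in the reply
above might look roughly like the C sketch below. It is a sketch under
stated assumptions, not code from this thread: the helper names
open_uffd, uffd_features_supported and uffd_open_with_features are
hypothetical, and it assumes (as reported above) that a second
UFFDIO_API call on the same fd is rejected, so the probe and the real
initialization each get their own fd.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* Hypothetical helper: thin wrapper around the raw userfaultfd(2) syscall. */
static int open_uffd(void)
{
    return (int)syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
}

/* Pure check: are all wanted feature bits present in the supported mask? */
static int uffd_features_supported(uint64_t supported, uint64_t wanted)
{
    return (supported & wanted) == wanted;
}

/*
 * Hypothetical helper: returns a userfaultfd initialized with `wanted`,
 * or -1 with errno set. If `supported_out` is non-NULL it receives the
 * feature mask reported by the probe.
 */
static int uffd_open_with_features(uint64_t wanted, uint64_t *supported_out)
{
    struct uffdio_api api = { .api = UFFD_API, .features = 0 };
    int saved;

    /* Step 1: sacrificial fd, used only to enumerate features. */
    int probe = open_uffd();
    if (probe < 0)
        return -1;
    if (ioctl(probe, UFFDIO_API, &api) < 0) {
        saved = errno;
        close(probe);
        errno = saved;
        return -1;
    }
    close(probe);

    if (supported_out)
        *supported_out = api.features;
    if (!uffd_features_supported(api.features, wanted)) {
        errno = ENOTSUP;
        return -1;
    }

    /* Step 2: fresh fd, initialized once with the desired mask. */
    int fd = open_uffd();
    if (fd < 0)
        return -1;
    api = (struct uffdio_api){ .api = UFFD_API, .features = wanted };
    if (ioctl(fd, UFFDIO_API, &api) < 0) {
        saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}
```

A caller wanting the async write-protect tracking mentioned in the
thread could pass UFFD_FEATURE_WP_ASYNC as `wanted`, provided the
installed kernel headers define it. Opening the fd can also fail for
unprivileged callers depending on system configuration, so the -1
return needs handling either way.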