From: Kyle Huey
Date: Mon, 5 May 2025 15:15:00 -0700
Subject: Re: Suppress pte soft-dirty bit with UFFDIO_COPY?
To: Peter Xu
Cc: Andrew Morton, open list, linux-mm@kvack.org, criu@lists.linux.dev, "Robert O'Callahan"

On Mon, May 5, 2025 at 1:05 PM Peter Xu wrote:
>
> Hi, Kyle,
>
> On Mon, May 05, 2025 at 09:37:01AM -0700, Kyle Huey wrote:
> > tl;dr I'd like to add UFFDIO_COPY_MODE_DONTSOFTDIRTY that does not add
> > the _PAGE_SOFT_DIRTY bit to the relevant pte flags. Any
> > thoughts/objections?
> >
> > The kernel has a "soft-dirty" bit on ptes which tracks if they've been
> > written to since the last time /proc/pid/clear_refs was used to clear
> > the soft-dirty bit. CRIU uses this to track which pages have been
> > modified since a previous checkpoint and reduce the size of the
> > checkpoints taken. I would like to use this in my debugger[0] to track
> > which pages a program function dirties when that function is invoked
> > from the debugger.
> >
> > However, the runtime environment for this function is rather unusual.
> > In my debugger, the process being debugged doesn't actually exist
> > while it's being debugged. Instead, we have a database of all program
> > state (including registers and memory values) from when the process
> > was executed. It's in some sense a giant core dump that spans multiple
> > points in time. To execute a program function from the debugger we
> > rematerialize the program state at the desired point in time from our
> > database.
> >
> > For performance reasons, we fill in the memory lazily[1] via
> > userfaultfd. This makes it difficult to use the soft-dirty bit to
> > track the writes the function triggers, because UFFDIO_COPY (and
> > friends) mark every page they touch as soft-dirty. Because we have the
> > canonical source of truth for the pages we materialize via UFFDIO_COPY
> > we're only interested in what happens after the userfaultfd operation.
> >
> > Clearing the soft-dirty bit is complicated by two things:
> > 1. There's no way to clear the soft-dirty bit on a single pte, so
> > instead we have to clear the soft-dirty bits for the entire process.
> > That requires us to process all the soft-dirty bits on every other pte
> > immediately to avoid data loss.
> > 2. We need to clear the soft-dirty bits after the userfaultfd
> > operation, but in order to avoid racing with the task that triggered
> > the page fault we have to do a non-waking copy, then clear the bits,
> > and then separately wake up the task.
> >
> > To work around all of this, we currently have a 4 step process:
> > 1. Read /proc/pid/pagemap and note all ptes that are soft-dirty.
> > 2. Do the UFFDIO_COPY with UFFDIO_COPY_MODE_DONTWAKE.
> > 3. Write to /proc/pid/clear_refs to clear soft-dirty bits across the process.
> > 4. Do a UFFDIO_WAKE.
> >
> > The overhead of all of this (particularly step 1) is a millisecond or
> > two *per page* that we lazily materialize, and while that's not
> > crippling for our purposes, it is rather undesirable. What I would
> > like to have instead is a UFFDIO_COPY mode that leaves the soft-dirty
> > bit unchanged, i.e. a UFFDIO_COPY_MODE_DONTSOFTDIRTY. Since we clear
> > all the soft-dirty bits once after setting up all the mmaps in the
> > process the relevant ptes would then "just do the right thing" from
> > our perspective.
> >
> > But I do want to get some feedback on this before I spend time writing
> > any code. Is there a reason not to do this? Or an alternate way to
> > achieve the same goal?
>
> Have you looked at the wr-protect mode, and UFFDIO_COPY_MODE_WP for _COPY?
>
> If sync fault is a perf concern for frequent writes, just to mention at
> least latest Linux also supports async tracking (UFFD_FEATURE_WP_ASYNC),
> which is almost exactly soft dirty bits to me, though it solves a few
> issues it has on e.g. false positives over vma merging and swapping, or
> like you said missing of finer granule reset mechanisms.
>
> Maybe you also want to have a look at the pagemap ioctl introduced some
> time ago ("Pagemap Scan IOCTL", which, IIRC was trying to use uffd-wp in
> a soft-dirty-like way):
>
> https://www.kernel.org/doc/Documentation/admin-guide/mm/pagemap.rst

Thanks. This is all very helpful and I think I can construct what I
need out of these building blocks.

- Kyle

> > If this is generally sensible, then a couple questions:
> > 1. Do I need a UFFD_FEATURE flag for this, or is it enough for a
> > program to be able to detect the existence of a
> > UFFDIO_COPY_MODE_DONTSOFTDIRTY by whether the ioctl accepts the flag
> > or returns EINVAL? I would tend to think the latter.
>
> The latter requires all the setups needed, and a useless ioctl to probe.
> Not a huge issue, but since userfaultfd is extensible, a feature flag might
> be better as long as a new feature is well defined.
>
> > 2. Should I add this mode for the other UFFDIO variants (ZEROPAGE,
> > MOVE, etc) at the same time even if I don't have any use for them?
>
> Probably not. I don't see a need to implement something just to make the
> API look good.. If any chunk of code in the Linux kernel has no plan to be
> used, we should probably not be adding them from the start..
>
> Thanks,
>
> --
> Peter Xu
>
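
For anyone following the thread, steps 1 and 3 of the workaround quoted above (scan /proc/pid/pagemap for soft-dirty ptes, then clear them via /proc/pid/clear_refs) can be sketched roughly as below. This is only an illustrative sketch, not the debugger's actual code. The helper names are hypothetical, and it assumes the documented pagemap layout (one 64-bit entry per page, bit 55 set when the pte is soft-dirty) and that writing "4" to clear_refs resets soft-dirty bits for the whole process. The UFFDIO_COPY and UFFDIO_WAKE steps are left out.

```python
import mmap
import struct

PAGE_SIZE = mmap.PAGESIZE
PM_SOFT_DIRTY = 1 << 55  # pagemap entry bit 55: pte is soft-dirty


def is_soft_dirty(entry):
    """True if a 64-bit pagemap entry has the soft-dirty bit set."""
    return bool(entry & PM_SOFT_DIRTY)


def decode_soft_dirty(raw, first_page):
    """Decode raw pagemap bytes (8 bytes per page, starting at page index
    first_page) and return the addresses of the soft-dirty pages."""
    count = len(raw) // 8
    entries = struct.unpack(f"={count}Q", raw[: count * 8])
    return [(first_page + i) * PAGE_SIZE
            for i, entry in enumerate(entries) if is_soft_dirty(entry)]


def soft_dirty_pages(pid, start, length):
    """Step 1: list soft-dirty pages in [start, start + length).
    Assumes start is page-aligned (hypothetical helper)."""
    first_page = start // PAGE_SIZE
    count = (length + PAGE_SIZE - 1) // PAGE_SIZE
    with open(f"/proc/{pid}/pagemap", "rb", buffering=0) as f:
        f.seek(first_page * 8)   # one 8-byte entry per virtual page
        raw = f.read(count * 8)
    return decode_soft_dirty(raw, first_page)


def clear_soft_dirty(pid):
    """Step 3: clear soft-dirty bits for every pte in the process."""
    with open(f"/proc/{pid}/clear_refs", "w") as f:
        f.write("4")
```

Because clear_soft_dirty() resets the whole process, the result of soft_dirty_pages() has to be recorded for every mapping of interest before calling it, which is where the per-page overhead described in the quoted message comes from.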