From mboxrd@z Thu Jan 1 00:00:00 1970
From: Kyle Huey
Date: Mon, 12 May 2025 10:16:12 -0700
Subject: Re: Suppress pte soft-dirty bit with UFFDIO_COPY?
To: Peter Xu
Cc: Andrew Morton, open list, linux-mm@kvack.org, criu@lists.linux.dev,
 "Robert O'Callahan", Axel Rasmussen, Mike Rapoport, Andrea Arcangeli

On Mon,
May 12, 2025 at 8:54 AM Peter Xu wrote:
>
> On Sun, May 11, 2025 at 08:06:03PM -0700, Kyle Huey wrote:
> > On Mon, May 5, 2025 at 3:15 PM Kyle Huey wrote:
> > >
> > > On Mon, May 5, 2025 at 1:05 PM Peter Xu wrote:
> > > >
> > > > Hi, Kyle,
> > > >
> > > > On Mon, May 05, 2025 at 09:37:01AM -0700, Kyle Huey wrote:
> > > > > tl;dr I'd like to add UFFDIO_COPY_MODE_DONTSOFTDIRTY that does not add
> > > > > the _PAGE_SOFT_DIRTY bit to the relevant pte flags. Any
> > > > > thoughts/objections?
> > > > >
> > > > > The kernel has a "soft-dirty" bit on ptes which tracks if they've been
> > > > > written to since the last time /proc/pid/clear_refs was used to clear
> > > > > the soft-dirty bit. CRIU uses this to track which pages have been
> > > > > modified since a previous checkpoint and reduce the size of the
> > > > > checkpoints taken. I would like to use this in my debugger[0] to track
> > > > > which pages a program function dirties when that function is invoked
> > > > > from the debugger.
> > > > >
> > > > > However, the runtime environment for this function is rather unusual.
> > > > > In my debugger, the process being debugged doesn't actually exist
> > > > > while it's being debugged. Instead, we have a database of all program
> > > > > state (including registers and memory values) from when the process
> > > > > was executed. It's in some sense a giant core dump that spans multiple
> > > > > points in time. To execute a program function from the debugger we
> > > > > rematerialize the program state at the desired point in time from our
> > > > > database.
> > > > >
> > > > > For performance reasons, we fill in the memory lazily[1] via
> > > > > userfaultfd. This makes it difficult to use the soft-dirty bit to
> > > > > track the writes the function triggers, because UFFDIO_COPY (and
> > > > > friends) mark every page they touch as soft-dirty.
> > > > > Because we have the
> > > > > canonical source of truth for the pages we materialize via UFFDIO_COPY
> > > > > we're only interested in what happens after the userfaultfd operation.
> > > > >
> > > > > Clearing the soft-dirty bit is complicated by two things:
> > > > > 1. There's no way to clear the soft-dirty bit on a single pte, so
> > > > > instead we have to clear the soft-dirty bits for the entire process.
> > > > > That requires us to process all the soft-dirty bits on every other pte
> > > > > immediately to avoid data loss.
> > > > > 2. We need to clear the soft-dirty bits after the userfaultfd
> > > > > operation, but in order to avoid racing with the task that triggered
> > > > > the page fault we have to do a non-waking copy, then clear the bits,
> > > > > and then separately wake up the task.
> > > > >
> > > > > To work around all of this, we currently have a 4 step process:
> > > > > 1. Read /proc/pid/pagemap and note all ptes that are soft-dirty.
> > > > > 2. Do the UFFDIO_COPY with UFFDIO_COPY_MODE_DONTWAKE.
> > > > > 3. Write to /proc/pid/clear_refs to clear soft-dirty bits across the process.
> > > > > 4. Do a UFFDIO_WAKE.
> > > > >
> > > > > The overhead of all of this (particularly step 1) is a millisecond or
> > > > > two *per page* that we lazily materialize, and while that's not
> > > > > crippling for our purposes, it is rather undesirable. What I would
> > > > > like to have instead is a UFFDIO_COPY mode that leaves the soft-dirty
> > > > > bit unchanged, i.e. a UFFDIO_COPY_MODE_DONTSOFTDIRTY. Since we clear
> > > > > all the soft-dirty bits once after setting up all the mmaps in the
> > > > > process the relevant ptes would then "just do the right thing" from
> > > > > our perspective.
> > > > >
> > > > > But I do want to get some feedback on this before I spend time writing
> > > > > any code. Is there a reason not to do this? Or an alternate way to
> > > > > achieve the same goal?
> > > >
> > > > Have you looked at the wr-protect mode, and UFFDIO_COPY_MODE_WP for _COPY?
> > > >
> > > > If sync fault is a perf concern for frequent writes, just to mention at
> > > > least latest Linux also supports async tracking (UFFD_FEATURE_WP_ASYNC),
> > > > which is almost exactly soft dirty bits to me, though it solves a few
> > > > issues it has on e.g. false positives over vma merging and swapping, or
> > > > like you said missing of finer granule reset mechanisms.
> > > >
> > > > Maybe you also want to have a look at the pagemap ioctl introduced some
> > > > time ago ("Pagemap Scan IOCTL", which, IIRC was trying to use uffd-wp in
> > > > soft-dirty-like way):
> > > >
> > > > https://www.kernel.org/doc/Documentation/admin-guide/mm/pagemap.rst
> > > >
> > >
> > > Thanks. This is all very helpful and I think I can construct what I
> > > need out of these building blocks.
> > >
> > > - Kyle
> >
> > That works like a charm, thanks.
> >
> > The only problem I ran into is that the man page for userfaultfd(2)
> > claims there's a handshake pattern where you can call UFFDIO_API
> > twice, once with 0 to enumerate all supported features, and then again
> > with the feature mask you want to initialize the API. In reality the
> > API only permits a single UFFDIO_API call because of the internal
> > UFFD_FEATURE_INITIALIZED flag, so doing this handshake requires
> > creating a sacrificial fd.
>
> This is true, almost all apps I'm aware that are using userfaultfd needs
> that. It's indeed confusing.
>
> > If the man page is not just totally wrong then this may have been an
> > unintentional regression from 22e5fe2a2a279.
>
> IMHO 22e5fe2a2a279 was correct, and it fixed a possible race due to
> ctx->state before. The new cmpxchg() plus the INITIALIZED flag should avoid
> the race.
>
> In this case it should be the man page that was wrong since this commit of
> man page, afaict:
>
> commit a252b3345f5b0a4ecafa7d4fb1ac73cb4fd4877f (HEAD)
> Author: Axel Rasmussen
> Date:   Tue Oct 3 12:45:43 2023 -0700
>
>     ioctl_userfaultfd.2: Describe two-step feature handshake
>
> I'll see if Axel / Mike / Andrea has any comment, otherwise I'll propose a
> patch to fix the man-pages and state the fact (that we need a sacrificial
> fd).
>
> Maybe I should really add the UFFDIO_FEATURES ioctl to allow fetching the
> feature flags from kernel separately, considering how much trouble we've
> hit with this whole thing..

Personally I don't think it's a real issue to have to create a
sacrificial fd once at process initialization to see what features are
available. I wouldn't have even said anything if the man page hadn't
explicitly told me there was another way.

- Kyle

> Thanks,
>
> --
> Peter Xu
>
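
For reference, the sacrificial-fd probe discussed in this thread could look
roughly like the sketch below. It is only an illustration under stated
assumptions, not code from any project mentioned here: the helper names
uffd_probe_features() and uffd_open_with() are hypothetical, and it assumes
Linux headers that provide <linux/userfaultfd.h> and SYS_userfaultfd. The
first fd is used for a single UFFDIO_API call with features = 0 to read back
the supported feature mask and is then closed; the second fd is the one that
is actually initialized with the desired features.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/*
 * Hypothetical helper: query the supported feature mask on a throwaway
 * ("sacrificial") userfaultfd. Returns 0 and fills *features on success,
 * or a negative errno value on failure.
 */
static int uffd_probe_features(uint64_t *features)
{
    int fd = (int)syscall(SYS_userfaultfd, O_CLOEXEC);
    if (fd < 0)
        return -errno;

    struct uffdio_api api;
    memset(&api, 0, sizeof(api));
    api.api = UFFD_API;
    api.features = 0; /* request nothing; the kernel reports what it supports */

    int ret = 0;
    if (ioctl(fd, UFFDIO_API, &api) < 0)
        ret = -errno;
    else
        *features = api.features;

    /* This fd has used up its one UFFDIO_API call, so discard it. */
    close(fd);
    return ret;
}

/*
 * Hypothetical helper: open the real userfaultfd, enabling only those
 * wanted features that the probe reported as supported. Returns the fd
 * (>= 0) and fills *enabled, or a negative errno value on failure.
 */
static int uffd_open_with(uint64_t wanted, uint64_t *enabled)
{
    uint64_t supported = 0;
    int ret = uffd_probe_features(&supported);
    if (ret < 0)
        return ret;

    int fd = (int)syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    if (fd < 0)
        return -errno;

    struct uffdio_api api;
    memset(&api, 0, sizeof(api));
    api.api = UFFD_API;
    api.features = wanted & supported;

    if (ioctl(fd, UFFDIO_API, &api) < 0) {
        ret = -errno;
        close(fd);
        return ret;
    }

    *enabled = wanted & supported;
    return fd;
}
```

A caller would do this once at process initialization, e.g. passing
UFFD_FEATURE_WP_ASYNC as `wanted` where the installed headers define it, and
then check `enabled` before relying on the feature. Depending on system
configuration, the userfaultfd() call itself may fail (for example with
EPERM for unprivileged callers), which both helpers report as a negative
errno.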