Date: Mon, 5 May 2025 16:05:32 -0400
From: Peter Xu
To: Kyle Huey
Cc: Andrew Morton, open list, linux-mm@kvack.org, criu@lists.linux.dev,
 Robert O'Callahan
Subject: Re: Suppress pte soft-dirty bit with UFFDIO_COPY?

Hi, Kyle,

On Mon, May 05, 2025 at 09:37:01AM -0700, Kyle Huey wrote:
> tl;dr I'd like to add UFFDIO_COPY_MODE_DONTSOFTDIRTY that does not add
> the _PAGE_SOFT_DIRTY bit to the relevant pte flags. Any
> thoughts/objections?
>
> The kernel has a "soft-dirty" bit on ptes which tracks if they've been
> written to since the last time /proc/pid/clear_refs was used to clear
> the soft-dirty bit. CRIU uses this to track which pages have been
> modified since a previous checkpoint and reduce the size of the
> checkpoints taken. I would like to use this in my debugger[0] to track
> which pages a program function dirties when that function is invoked
> from the debugger.
>
> However, the runtime environment for this function is rather unusual.
> In my debugger, the process being debugged doesn't actually exist
> while it's being debugged. Instead, we have a database of all program
> state (including registers and memory values) from when the process
> was executed. It's in some sense a giant core dump that spans multiple
> points in time. To execute a program function from the debugger we
> rematerialize the program state at the desired point in time from our
> database.
>
> For performance reasons, we fill in the memory lazily[1] via
> userfaultfd. This makes it difficult to use the soft-dirty bit to
> track the writes the function triggers, because UFFDIO_COPY (and
> friends) mark every page they touch as soft-dirty. Because we have the
> canonical source of truth for the pages we materialize via UFFDIO_COPY
> we're only interested in what happens after the userfaultfd operation.
>
> Clearing the soft-dirty bit is complicated by two things:
> 1. There's no way to clear the soft-dirty bit on a single pte, so
> instead we have to clear the soft-dirty bits for the entire process.
> That requires us to process all the soft-dirty bits on every other pte
> immediately to avoid data loss.
> 2. We need to clear the soft-dirty bits after the userfaultfd
> operation, but in order to avoid racing with the task that triggered
> the page fault we have to do a non-waking copy, then clear the bits,
> and then separately wake up the task.
>
> To work around all of this, we currently have a 4 step process:
> 1. Read /proc/pid/pagemap and note all ptes that are soft-dirty.
> 2. Do the UFFDIO_COPY with UFFDIO_COPY_MODE_DONTWAKE.
> 3. Write to /proc/pid/clear_refs to clear soft-dirty bits across the process.
> 4. Do a UFFDIO_WAKE.
>
> The overhead of all of this (particularly step 1) is a millisecond or
> two *per page* that we lazily materialize, and while that's not
> crippling for our purposes, it is rather undesirable. What I would
> like to have instead is a UFFDIO_COPY mode that leaves the soft-dirty
> bit unchanged, i.e. a UFFDIO_COPY_MODE_DONTSOFTDIRTY. Since we clear
> all the soft-dirty bits once after setting up all the mmaps in the
> process the relevant ptes would then "just do the right thing" from
> our perspective.
>
> But I do want to get some feedback on this before I spend time writing
> any code. Is there a reason not to do this? Or an alternate way to
> achieve the same goal?

Have you looked at the write-protect mode, and UFFDIO_COPY_MODE_WP for
UFFDIO_COPY?

If synchronous faults are a performance concern for frequent writes,
note that recent Linux also supports async tracking
(UFFD_FEATURE_WP_ASYNC). To me that is almost exactly the soft-dirty
bit, except that it solves a few of soft-dirty's issues, e.g. false
positives caused by vma merging and swapping, or, as you said, the lack
of a finer-grained reset mechanism.

Maybe you also want to have a look at the pagemap ioctl introduced some
time ago ("Pagemap Scan IOCTL", which, IIRC, was trying to use uffd-wp
in a soft-dirty-like way):

https://www.kernel.org/doc/Documentation/admin-guide/mm/pagemap.rst

>
> If this is generally sensible, then a couple questions:
> 1. Do I need a UFFD_FEATURE flag for this, or is it enough for a
> program to be able to detect the existence of a
> UFFDIO_COPY_MODE_DONTSOFTDIRTY by whether the ioctl accepts the flag
> or returns EINVAL? I would tend to think the latter.

The latter requires all the setup to be done first, plus an otherwise
useless ioctl just to probe. Not a huge issue, but since userfaultfd is
extensible, a feature flag might be better as long as the new feature
is well defined.

> 2. Should I add this mode for the other UFFDIO variants (ZEROPAGE,
> MOVE, etc) at the same time even if I don't have any use for them?

Probably not. I don't see a need to implement something just to make
the API look good. If a chunk of code in the Linux kernel has no
planned user, we should probably not add it in the first place.

Thanks,

-- 
Peter Xu
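
For readers following the thread, here is a rough sketch of steps 1 and 3
of the quoted workaround (reading soft-dirty state from /proc/pid/pagemap
and clearing it through /proc/pid/clear_refs). It is not code from either
poster. It assumes a kernel built with soft-dirty tracking, uses bit 55
(soft-dirty) and bit 63 (present) of each 64-bit pagemap entry as
described in the kernel's pagemap documentation, and writes the value 4 to
clear_refs as described in the soft-dirty documentation. All helper names
are made up for illustration, and the UFFDIO_COPY / UFFDIO_WAKE steps are
deliberately left out.

```python
import os
import struct

PAGEMAP_ENTRY_BYTES = 8   # one 64-bit entry per virtual page
SOFT_DIRTY_BIT = 55       # pte is soft-dirty (per pagemap documentation)
PRESENT_BIT = 63          # page is present in RAM


def pagemap_offset(vaddr, page_size):
    """Byte offset of the pagemap entry describing virtual address vaddr."""
    return (vaddr // page_size) * PAGEMAP_ENTRY_BYTES


def is_soft_dirty(entry):
    """True if the soft-dirty bit is set in a raw pagemap entry."""
    return bool((entry >> SOFT_DIRTY_BIT) & 1)


def is_present(entry):
    """True if the page-present bit is set in a raw pagemap entry."""
    return bool((entry >> PRESENT_BIT) & 1)


def read_pagemap_entries(pid, vaddr, npages):
    """Step 1 (sketch): read raw pagemap entries for npages from vaddr."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    with open(f"/proc/{pid}/pagemap", "rb") as f:
        f.seek(pagemap_offset(vaddr, page_size))
        data = f.read(npages * PAGEMAP_ENTRY_BYTES)
    count = len(data) // PAGEMAP_ENTRY_BYTES
    # Entries are in native byte order.
    return list(struct.unpack(f"={count}Q", data[:count * PAGEMAP_ENTRY_BYTES]))


def clear_soft_dirty(pid):
    """Step 3 (sketch): clear soft-dirty bits for the whole process."""
    with open(f"/proc/{pid}/clear_refs", "w") as f:
        f.write("4")
```

Because clear_soft_dirty() acts on the whole process, any soft-dirty
state that matters has to be collected with read_pagemap_entries() before
calling it, which is the per-page cost the quoted message describes.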