Date: Mon, 12 May 2025 11:54:17 -0400
From: Peter Xu
To: Kyle Huey
Cc: Andrew Morton, open list, linux-mm@kvack.org, criu@lists.linux.dev,
    Robert O'Callahan, Axel Rasmussen, Mike Rapoport, Andrea Arcangeli
Subject: Re: Suppress pte soft-dirty bit with UFFDIO_COPY?

On Sun, May 11, 2025 at 08:06:03PM -0700, Kyle Huey wrote:
> On Mon, May 5, 2025 at 3:15 PM Kyle Huey wrote:
> >
> > On Mon, May 5, 2025 at 1:05 PM Peter Xu wrote:
> > >
> > > Hi, Kyle,
> > >
> > > On Mon, May 05, 2025 at 09:37:01AM -0700, Kyle Huey wrote:
> > > > tl;dr I'd like to add UFFDIO_COPY_MODE_DONTSOFTDIRTY that does not add
> > > > the _PAGE_SOFT_DIRTY bit to the relevant pte flags. Any
> > > > thoughts/objections?
> > > >
> > > > The kernel has a "soft-dirty" bit on ptes which tracks if they've been
> > > > written to since the last time /proc/pid/clear_refs was used to clear
> > > > the soft-dirty bit. CRIU uses this to track which pages have been
> > > > modified since a previous checkpoint and reduce the size of the
> > > > checkpoints taken. I would like to use this in my debugger[0] to track
> > > > which pages a program function dirties when that function is invoked
> > > > from the debugger.
> > > >
> > > > However, the runtime environment for this function is rather unusual.
> > > > In my debugger, the process being debugged doesn't actually exist
> > > > while it's being debugged. Instead, we have a database of all program
> > > > state (including registers and memory values) from when the process
> > > > was executed. It's in some sense a giant core dump that spans multiple
> > > > points in time. To execute a program function from the debugger we
> > > > rematerialize the program state at the desired point in time from our
> > > > database.
> > > >
> > > > For performance reasons, we fill in the memory lazily[1] via
> > > > userfaultfd. This makes it difficult to use the soft-dirty bit to
> > > > track the writes the function triggers, because UFFDIO_COPY (and
> > > > friends) mark every page they touch as soft-dirty. Because we have the
> > > > canonical source of truth for the pages we materialize via UFFDIO_COPY
> > > > we're only interested in what happens after the userfaultfd operation.
> > > >
> > > > Clearing the soft-dirty bit is complicated by two things:
> > > > 1. There's no way to clear the soft-dirty bit on a single pte, so
> > > > instead we have to clear the soft-dirty bits for the entire process.
> > > > That requires us to process all the soft-dirty bits on every other pte
> > > > immediately to avoid data loss.
> > > > 2. We need to clear the soft-dirty bits after the userfaultfd
> > > > operation, but in order to avoid racing with the task that triggered
> > > > the page fault we have to do a non-waking copy, then clear the bits,
> > > > and then separately wake up the task.
> > > >
> > > > To work around all of this, we currently have a 4 step process:
> > > > 1. Read /proc/pid/pagemap and note all ptes that are soft-dirty.
> > > > 2. Do the UFFDIO_COPY with UFFDIO_COPY_MODE_DONTWAKE.
> > > > 3. Write to /proc/pid/clear_refs to clear soft-dirty bits across the process.
> > > > 4. Do a UFFDIO_WAKE.
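
For reference, the four-step workaround listed above might look roughly
like the following in C. This is only a sketch under stated assumptions,
not the debugger's actual code: the helper names (page_is_soft_dirty,
fill_page_without_soft_dirty) are hypothetical, the caller is assumed to
have already opened /proc/pid/pagemap and /proc/pid/clear_refs and to own
a registered userfaultfd, and step 1 is shown for a single page only
whereas the real workflow has to record every soft-dirty pte in the
process before step 3 wipes them.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* Per pagemap.rst, bit 55 of a pagemap entry means "pte is soft-dirty". */
#define PM_SOFT_DIRTY (1ULL << 55)

int pagemap_entry_soft_dirty(uint64_t entry)
{
    return (entry & PM_SOFT_DIRTY) != 0;
}

/* /proc/pid/pagemap holds one 8-byte entry per virtual page. */
off_t pagemap_offset(uintptr_t vaddr, long page_size)
{
    assert(page_size > 0);
    return (off_t)(vaddr / (uintptr_t)page_size) * (off_t)sizeof(uint64_t);
}

/* Step 1, for one page: 1 = soft-dirty, 0 = clean, -1 = read error. */
int page_is_soft_dirty(int pagemap_fd, uintptr_t vaddr, long page_size)
{
    uint64_t entry;
    ssize_t n = pread(pagemap_fd, &entry, sizeof(entry),
                      pagemap_offset(vaddr, page_size));
    if (n != (ssize_t)sizeof(entry))
        return -1;
    return pagemap_entry_soft_dirty(entry);
}

/* Steps 2-4 (hypothetical helper): copy without waking the faulting
 * task, clear soft-dirty process-wide, then wake the task. */
int fill_page_without_soft_dirty(int uffd, int clear_refs_fd,
                                 void *dst, const void *src, size_t len)
{
    struct uffdio_copy copy;

    memset(&copy, 0, sizeof(copy));
    copy.dst = (uintptr_t)dst;
    copy.src = (uintptr_t)src;
    copy.len = len;
    copy.mode = UFFDIO_COPY_MODE_DONTWAKE;          /* step 2 */
    if (ioctl(uffd, UFFDIO_COPY, &copy) == -1)
        return -1;

    /* Step 3: writing "4" to clear_refs clears all soft-dirty bits. */
    if (write(clear_refs_fd, "4", 1) != 1)
        return -1;

    struct uffdio_range range = { .start = (uintptr_t)dst, .len = len };
    return ioctl(uffd, UFFDIO_WAKE, &range);        /* step 4 */
}
```

The process-wide reset in step 3 is what forces the expensive pagemap
pass in step 1; the code makes that dependency visible.
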
> > > >
> > > > The overhead of all of this (particularly step 1) is a millisecond or
> > > > two *per page* that we lazily materialize, and while that's not
> > > > crippling for our purposes, it is rather undesirable. What I would
> > > > like to have instead is a UFFDIO_COPY mode that leaves the soft-dirty
> > > > bit unchanged, i.e. a UFFDIO_COPY_MODE_DONTSOFTDIRTY. Since we clear
> > > > all the soft-dirty bits once after setting up all the mmaps in the
> > > > process the relevant ptes would then "just do the right thing" from
> > > > our perspective.
> > > >
> > > > But I do want to get some feedback on this before I spend time writing
> > > > any code. Is there a reason not to do this? Or an alternate way to
> > > > achieve the same goal?
> > >
> > > Have you looked at the wr-protect mode, and UFFDIO_COPY_MODE_WP for _COPY?
> > >
> > > If sync fault is a perf concern for frequent writes, just to mention at
> > > least latest Linux also supports async tracking (UFFD_FEATURE_WP_ASYNC),
> > > which is almost exactly soft dirty bits to me, though it solves a few
> > > issues it has on e.g. false positives over vma merging and swapping, or
> > > like you said missing of finer granule reset mechanisms.
> > >
> > > Maybe you also want to have a look at the pagemap ioctl introduced some
> > > time ago ("Pagemap Scan IOCTL", which, IIRC was trying to use uffd-wp in
> > > soft-dirty-like way):
> > >
> > > https://www.kernel.org/doc/Documentation/admin-guide/mm/pagemap.rst
> >
> >
> > Thanks. This is all very helpful and I think I can construct what I
> > need out of these building blocks.
> >
> > - Kyle
>
> That works like a charm, thanks.
>
> The only problem I ran into is that the man page for userfaultfd(2)
> claims there's a handshake pattern where you can call UFFDIO_API
> twice, once with 0 to enumerate all supported features, and then again
> with the feature mask you want to initialize the API.
> In reality the API only permits a single UFFDIO_API call because of the
> internal UFFD_FEATURE_INITIALIZED flag, so doing this handshake requires
> creating a sacrificial fd.

This is true; almost every app I'm aware of that uses userfaultfd needs
that.  It's indeed confusing.

>
> If the man page is not just totally wrong then this may have been an
> unintentional regression from 22e5fe2a2a279.

IMHO 22e5fe2a2a279 was correct, and it fixed a possible race due to
ctx->state before.  The new cmpxchg() plus the INITIALIZED flag should
avoid the race.

In this case it should be the man page that has been wrong since this
man-pages commit, afaict:

  commit a252b3345f5b0a4ecafa7d4fb1ac73cb4fd4877f (HEAD)
  Author: Axel Rasmussen
  Date:   Tue Oct 3 12:45:43 2023 -0700

      ioctl_userfaultfd.2: Describe two-step feature handshake

I'll see if Axel / Mike / Andrea have any comments; otherwise I'll propose
a patch to fix the man-pages and state the fact (that we need a
sacrificial fd).

Maybe I should really add the UFFDIO_FEATURES ioctl to allow fetching the
feature flags from the kernel separately, considering how much trouble
we've hit with this whole thing.

Thanks,

-- 
Peter Xu
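
To illustrate the sacrificial-fd pattern discussed in this message, here
is a minimal C sketch. It assumes only the documented UFFDIO_API
behaviour (calling it with features = 0 reports the supported feature
bits, and a given fd accepts just one UFFDIO_API call); the function
names are hypothetical, and error handling is kept to the bare minimum.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* Bits in `wanted` that the kernel did not report as available. */
uint64_t uffd_missing_features(uint64_t available, uint64_t wanted)
{
    return wanted & ~available;
}

int uffd_new(void)
{
    return (int)syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
}

/* Query supported features on a throwaway fd, then close it. */
int uffd_query_features(uint64_t *features)
{
    struct uffdio_api api;
    int fd = uffd_new();
    int ret, saved_errno;

    assert(features != NULL);
    if (fd == -1)
        return -1;
    memset(&api, 0, sizeof(api));
    api.api = UFFD_API;
    api.features = 0;               /* 0 = "tell me what you support" */
    ret = ioctl(fd, UFFDIO_API, &api);
    if (ret == 0)
        *features = api.features;
    saved_errno = errno;
    close(fd);                      /* this fd cannot be re-initialized */
    errno = saved_errno;
    return ret;
}

/* Open a second fd and initialize it with exactly the wanted features. */
int uffd_open_with_features(uint64_t wanted)
{
    struct uffdio_api api;
    uint64_t available = 0;
    int fd;

    if (uffd_query_features(&available) == -1)
        return -1;
    if (uffd_missing_features(available, wanted) != 0) {
        errno = ENOTSUP;
        return -1;
    }
    fd = uffd_new();
    if (fd == -1)
        return -1;
    memset(&api, 0, sizeof(api));
    api.api = UFFD_API;
    api.features = wanted;
    if (ioctl(fd, UFFDIO_API, &api) == -1) {
        int saved_errno = errno;
        close(fd);
        errno = saved_errno;
        return -1;
    }
    return fd;
}
```

A caller would use something like
uffd_open_with_features(UFFD_FEATURE_PAGEFAULT_FLAG_WP) and fall back to
another tracking method if it returns -1.
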