From: Jann Horn <jannh@google.com>
Date: Mon, 14 Oct 2024 17:56:50 +0200
Subject: Re: [RFC PATCH 3/4] mm: madvise: implement lightweight guard page mechanism
To: Lorenzo Stoakes
Cc: Andrew Morton, Suren Baghdasaryan, Liam R. Howlett, Matthew Wilcox,
 Vlastimil Babka, Paul E. McKenney, David Hildenbrand, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, Muchun Song, Richard Henderson,
 Ivan Kokshaysky, Matt Turner, Thomas Bogendoerfer, James E. J. Bottomley,
 Helge Deller, Chris Zankel, Max Filippov, Arnd Bergmann,
 linux-alpha@vger.kernel.org, linux-mips@vger.kernel.org,
 linux-parisc@vger.kernel.org, linux-arch@vger.kernel.org, Shuah Khan,
 Christian Brauner, linux-kselftest@vger.kernel.org, Sidhartha Kumar

On Mon, Oct 14, 2024 at 1:09 PM Lorenzo Stoakes wrote:
> On Fri, Oct 11, 2024 at 08:11:36PM +0200, Jann Horn wrote:
> > On Fri, Sep 27, 2024 at 2:51 PM Lorenzo Stoakes wrote:
> > >                 return 0;
> > >         default:
> > >                 /* be safe, default to 1. list exceptions explicitly */
> > [...]
> > > +static long madvise_guard_poison(struct vm_area_struct *vma,
> > > +                                struct vm_area_struct **prev,
> > > +                                unsigned long start, unsigned long end)
> > > +{
> > > +       long err;
> > > +       bool retried = false;
> > > +
> > > +       *prev = vma;
> > > +       if (!is_valid_guard_vma(vma, /* allow_locked = */false))
> > > +               return -EINVAL;
> > > +
> > > +       /*
> > > +        * Optimistically try to install the guard poison pages first. If any
> > > +        * non-guard pages are encountered, give up and zap the range before
> > > +        * trying again.
> > > +        */
> > > +       while (true) {
> > > +               unsigned long num_installed = 0;
> > > +
> > > +               /* Returns < 0 on error, == 0 if success, > 0 if zap needed. */
> > > +               err = walk_page_range_mm(vma->vm_mm, start, end,
> > > +                                        &guard_poison_walk_ops,
> > > +                                        &num_installed);
> > > +               /*
> > > +                * If we install poison markers, then the range is no longer
> > > +                * empty from a page table perspective and therefore it's
> > > +                * appropriate to have an anon_vma.
> > > +                *
> > > +                * This ensures that on fork, we copy page tables correctly.
> > > +                */
> > > +               if (err >= 0 && num_installed > 0) {
> > > +                       int err_anon = anon_vma_prepare(vma);
> >
> > I'd move this up, to before we create poison PTEs. There's no harm in
> > attaching an anon_vma to the VMA even if the rest of the operation
> > fails; and I think it would be weird to have error paths that don't
> > attach an anon_vma even though they .
>
> I think you didn't finish this sentence :)

Oops...

> I disagree, we might have absolutely no need to do it, and I'd rather only
> do so _if_ we have to.

But there's no downside to erroring out after having installed an
anon_vma, right?

> It feels like the logical spot to do it and, while the cases where it
> wouldn't happen are ones where pages are already poisoned (the
> vma->anon_vma == NULL test will fail so basically a no-op) or error on
> page walk.

My understanding is that some of the MM code basically assumes that a
VMA without an anon_vma and without userfault-WP can't contain any
state that needs to be preserved; or something along those lines.

As you pointed out, fork() is one such case (which maybe doesn't
matter so much because it can't race with this operation). khugepaged
also relies on this assumption in retract_page_tables(), though that
function is not used on anonymous VMAs. If MADVISE_GUARD is extended
to cover file VMAs in the future, then I think we could race with
retract_page_tables() in a functionally relevant way even when
MADVISE_GUARD succeeds: if khugepaged preempts us between the page
walk and installing the anon_vma, retract_page_tables() could observe
that we don't have an anon_vma yet and throw away a page table in
which we just installed guard PTEs.

Though I guess really that's not the main reason why I'm saying this;
my main reason is that almost any other path that has to ensure an
anon_vma is present does that part first (usually because the ordering
matters and this way around is more or less the only possible
ordering). So even if there are some specific reasons why you can do
the ordering the other way around here, it kinda stands out to me as
being weird...
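(For illustration, a rough, untested sketch of the ordering I have in
mind -- it just hoists your existing anon_vma_prepare() call ahead of
the page walk and leaves the zap/retry logic untouched:)

        *prev = vma;
        if (!is_valid_guard_vma(vma, /* allow_locked = */false))
                return -EINVAL;

        /*
         * Attach the anon_vma before any poison PTEs can exist: if this
         * fails, nothing has been touched yet, and no path can leave
         * page table state behind on a VMA without an anon_vma.
         */
        err = anon_vma_prepare(vma);
        if (err)
                return err;

        while (true) {
                unsigned long num_installed = 0;

                /* Returns < 0 on error, == 0 if success, > 0 if zap needed. */
                err = walk_page_range_mm(vma->vm_mm, start, end,
                                         &guard_poison_walk_ops,
                                         &num_installed);
                /* ... zap/retry logic as in your patch ... */
        }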
> > > +                       if (err_anon)
> > > +                               err = err_anon;
> > > +               }
> > > +
> > > +               if (err <= 0)
> > > +                       return err;
> > > +
> > > +               if (!retried)
> > > +                       /*
> > > +                        * OK some of the range have non-guard pages mapped, zap
> > > +                        * them. This leaves existing guard pages in place.
> > > +                        */
> > > +                       zap_page_range_single(vma, start, end - start, NULL);
> > > +               else
> > > +                       /*
> > > +                        * If we reach here, then there is a racing fault that
> > > +                        * has populated the PTE after we zapped. Give up and
> > > +                        * let the user know to try again.
> > > +                        */
> > > +                       return -EAGAIN;
> >
> > Hmm, yeah, it would be nice if we could avoid telling userspace to
> > loop on -EAGAIN but I guess we don't have any particularly good
> > options here? Well, we could bail out with -EINTR if a (fatal?) signal
> > is pending and otherwise keep looping... if we'd tell userspace "try
> > again on -EAGAIN", we might as well do that in the kernel...
>
> The problem is you could conceivably go on for quite some time, while
> holding and contending a HIGHLY contended lock (mm->mmap_lock) so I'd
> really rather let userspace take care of it.

Hmm... so if the retry was handled in-kernel, you'd basically ideally
have the retry happen all the way up in do_madvise(), where the mmap
lock can be dropped and re-taken?
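(And if we keep the userspace-retry contract instead, I'd expect
callers to end up writing something like the hypothetical loop below,
with MADV_GUARD_POISON standing in for whatever the UAPI constant ends
up being named:)

        #include <sys/mman.h>
        #include <errno.h>

        /* Returns 0 on success, -1 with errno set on hard failure. */
        static int guard_poison_range(void *addr, size_t len)
        {
                int ret;

                do {
                        /* Retry while racing faults repopulate the range. */
                        ret = madvise(addr, len, MADV_GUARD_POISON);
                } while (ret == -1 && errno == EAGAIN);

                return ret;
        }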
> You could avoid this by having the walker be a _replace_ operation, that
> is - if we encounter an existing mapping, replace in-place with a poison
> marker rather than install marker/zap.
>
> However doing that would involve either completely abstracting such logic
> from scratch (a significant task in itself) to avoid duplication which be
> hugely off-topic for the patch set or worse, duplicating a whole bunch of
> page walking logic once again.

Mmh, yeah, you'd have to extract the locked part of zap_pte_range() and
add your own copy of all the stuff that happens higher up for setting
up TLB flushes and such... I see how that would be a massive pain and
error-prone.

> By being optimistic and simply having the user having to handle looping
> which seems reasonable (again, it's weird if you're installing poison
> markers and another thread could be racing you) we avoid all of that.

I guess one case in which that could happen legitimately is if you race
a MADV_POISON on the area 0x1ff000-0x200100 (first page is populated,
second page is not, pmd entry corresponding to 0x200000 is clear) with
a page fault at 0x200200? So you could have a scenario like:

1. MADV_POISON starts walk_page_range()
2. MADV_POISON sees non-zero, non-poison PTE at 0x1ff000, stops the walk
3. MADV_POISON does zap_page_range_single()
4. pagefault at 0x200200 happens and populates with a hugepage
5. MADV_POISON enters walk_page_range()
6. MADV_POISON splits the THP
7. MADV_POISON sees a populated PTE

> > > +               update_mmu_cache(walk->vma, addr, pte);
> > > +       }
> > > +
> > > +       return 0;
> > > +}
> > > +
> > > +static const struct mm_walk_ops guard_unpoison_walk_ops = {
> > > +       .pte_entry = guard_unpoison_pte_entry,
> > > +       .walk_lock = PGWALK_RDLOCK,
> > > +};
> >
> > It is a _little_ weird that unpoisoning creates page tables when they
> > don't already exist, which will also prevent creating THP entries on
> > fault in such areas afterwards... but I guess it doesn't really matter
> > given that poisoning has that effect, too, and you probably usually
> > won't call MADV_GUARD_UNPOISON on an area that hasn't been poisoned
> > before... so I guess this is not an actionable comment.
>
> It doesn't? There's no .install_pte so if an entries are non-present we
> ignore.

Ah, right, of course. Nevermind.

> HOWEVER, we do split THP. I don't think there's any way around it unless we
> extended the page walker to handle this more gracefully (pmd level being
> able to hint that we shouldn't do that or something), but that's really out
> of scope here.

I think the `walk->action == ACTION_CONTINUE` check in walk_pmd_range()
would let you do that, see wp_clean_pmd_entry() for an example. But
yeah I guess that might just be unnecessary complexity.

> The idea is that a caller can lazily call MADV_GUARD_UNPOISON on a range
> knowing things stay as they were, I guess we can add to the manpage a note
> that this will split THP?

Yeah, might make sense...
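P.S.: to make the ACTION_CONTINUE idea above concrete, here is an
untested sketch -- guard_unpoison_pmd_entry() is a hypothetical
addition to your guard_unpoison_walk_ops, modeled on
wp_clean_pmd_entry():

        static int guard_unpoison_pmd_entry(pmd_t *pmd, unsigned long addr,
                                            unsigned long next,
                                            struct mm_walk *walk)
        {
                pmd_t pmdval = pmdp_get_lockless(pmd);

                /*
                 * A huge PMD cannot contain per-PTE guard markers, so
                 * there is nothing to unpoison underneath it; tell the
                 * walker to move on instead of splitting the THP.
                 */
                if (pmd_trans_huge(pmdval))
                        walk->action = ACTION_CONTINUE;

                return 0;
        }

        static const struct mm_walk_ops guard_unpoison_walk_ops = {
                .pmd_entry      = guard_unpoison_pmd_entry,
                .pte_entry      = guard_unpoison_pte_entry,
                .walk_lock      = PGWALK_RDLOCK,
        };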