From: Suren Baghdasaryan <surenb@google.com>
Date: Fri, 11 Oct 2024 13:55:42 -0700
Subject: Re: [RFC PATCH 3/4] mm: madvise: implement lightweight guard page mechanism
To: Jann Horn
Cc: Lorenzo Stoakes, Andrew Morton, Liam R. Howlett, Matthew Wilcox,
 Vlastimil Babka, Paul E. McKenney, David Hildenbrand,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org, Muchun Song,
 Richard Henderson, Ivan Kokshaysky, Matt Turner, Thomas Bogendoerfer,
 James E. J. Bottomley, Helge Deller, Chris Zankel, Max Filippov,
 Arnd Bergmann, linux-alpha@vger.kernel.org, linux-mips@vger.kernel.org,
 linux-parisc@vger.kernel.org, linux-arch@vger.kernel.org, Shuah Khan,
 Christian Brauner, linux-kselftest@vger.kernel.org, Sidhartha Kumar
On Fri, Oct 11, 2024 at 11:12 AM Jann Horn wrote:
>
> On Fri, Sep 27, 2024 at 2:51 PM Lorenzo Stoakes wrote:
> > Implement a new lightweight guard page feature, that is regions of
> > userland virtual memory that, when accessed, cause a fatal signal to
> > arise.
> [...]
> > ---
> >  arch/alpha/include/uapi/asm/mman.h     | 3 +
> >  arch/mips/include/uapi/asm/mman.h      | 3 +
> >  arch/parisc/include/uapi/asm/mman.h    | 3 +
> >  arch/xtensa/include/uapi/asm/mman.h    | 3 +
> >  include/uapi/asm-generic/mman-common.h | 3 +
>
> I kinda wonder if we could start moving the parts of those headers
> that are the same for all architectures to include/uapi/linux/mman.h
> instead... but that's maybe out of scope for this series.
>
> [...]
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index e871a72a6c32..7216e10723ae 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -60,6 +60,7 @@ static int madvise_need_mmap_write(int behavior)
> >         case MADV_POPULATE_READ:
> >         case MADV_POPULATE_WRITE:
> >         case MADV_COLLAPSE:
> > +       case MADV_GUARD_UNPOISON: /* Only poisoning needs a write lock. */
>
> What does poisoning need a write lock for? anon_vma_prepare() doesn't
> need it (it only needs mmap_lock held for reading),
> zap_page_range_single() doesn't need it, and pagewalk also doesn't
> need it as long as the range being walked is covered by a VMA, which
> it is...
>
> I see you set PGWALK_WRLOCK in guard_poison_walk_ops with a comment
> saying "We might need to install an anon_vma" - is that referring to
> an older version of the patch where the anon_vma_prepare() call was
> inside the pagewalk callback or something like that? Either way,
> anon_vma_prepare() doesn't need write locks (it can't, it has to work
> from the page fault handling path).

I was wondering about that too, and I can't find any reason for
write-locking the mm for this operation. PGWALK_WRLOCK should also be
changed to PGWALK_RDLOCK, as we are not modifying the VMA.
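Concretely, the suggestion would amount to something like this
(untested sketch; the surrounding context lines are abbreviated from
the RFC):

@@ static int madvise_need_mmap_write(int behavior)
 	case MADV_COLLAPSE:
-	case MADV_GUARD_UNPOISON: /* Only poisoning needs a write lock. */
+	case MADV_GUARD_POISON:   /* Neither operation modifies the VMA. */
+	case MADV_GUARD_UNPOISON:
 		return 0;

@@ static const struct mm_walk_ops guard_poison_walk_ops = {
-	/* We might need to install an anon_vma. */
-	.walk_lock	= PGWALK_WRLOCK,
+	.walk_lock	= PGWALK_RDLOCK,
 };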
BTW, I'm testing your patchset on Android and so far it is stable!

> >                 return 0;
> >         default:
> >                 /* be safe, default to 1. list exceptions explicitly */
> [...]
> > +static long madvise_guard_poison(struct vm_area_struct *vma,
> > +                                struct vm_area_struct **prev,
> > +                                unsigned long start, unsigned long end)
> > +{
> > +       long err;
> > +       bool retried = false;
> > +
> > +       *prev = vma;
> > +       if (!is_valid_guard_vma(vma, /* allow_locked = */false))
> > +               return -EINVAL;
> > +
> > +       /*
> > +        * Optimistically try to install the guard poison pages first. If any
> > +        * non-guard pages are encountered, give up and zap the range before
> > +        * trying again.
> > +        */
> > +       while (true) {
> > +               unsigned long num_installed = 0;
> > +
> > +               /* Returns < 0 on error, == 0 if success, > 0 if zap needed. */
> > +               err = walk_page_range_mm(vma->vm_mm, start, end,
> > +                                        &guard_poison_walk_ops,
> > +                                        &num_installed);
> > +               /*
> > +                * If we install poison markers, then the range is no longer
> > +                * empty from a page table perspective and therefore it's
> > +                * appropriate to have an anon_vma.
> > +                *
> > +                * This ensures that on fork, we copy page tables correctly.
> > +                */
> > +               if (err >= 0 && num_installed > 0) {
> > +                       int err_anon = anon_vma_prepare(vma);
>
> I'd move this up, to before we create poison PTEs. There's no harm in
> attaching an anon_vma to the VMA even if the rest of the operation
> fails; and I think it would be weird to have error paths that don't
> attach an anon_vma even though they created PTEs.
>
> > +                       if (err_anon)
> > +                               err = err_anon;
> > +               }
> > +
> > +               if (err <= 0)
> > +                       return err;
> > +
> > +               if (!retried)
> > +                       /*
> > +                        * OK some of the range have non-guard pages mapped, zap
> > +                        * them. This leaves existing guard pages in place.
> > +                        */
> > +                       zap_page_range_single(vma, start, end - start, NULL);
> > +               else
> > +                       /*
> > +                        * If we reach here, then there is a racing fault that
> > +                        * has populated the PTE after we zapped. Give up and
> > +                        * let the user know to try again.
> > +                        */
> > +                       return -EAGAIN;
>
> Hmm, yeah, it would be nice if we could avoid telling userspace to
> loop on -EAGAIN but I guess we don't have any particularly good
> options here? Well, we could bail out with -EINTR if a (fatal?) signal
> is pending and otherwise keep looping... if we'd tell userspace "try
> again on -EAGAIN", we might as well do that in the kernel...
>
> (Personally I would put curly braces around these branches because
> they occupy multiple lines, though the coding style doesn't explicitly
> say that, so I guess maybe it's a matter of personal preference...
> adding curly braces here would match what is done, for example, in
> relocate_vma_down().)
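FWIW, an in-kernel retry along the lines Jann suggests, with
anon_vma_prepare() also hoisted before the walk per the comment above,
might look roughly like this (untested sketch, reusing the helpers
from the RFC):

static long madvise_guard_poison(struct vm_area_struct *vma,
				 struct vm_area_struct **prev,
				 unsigned long start, unsigned long end)
{
	long err;

	*prev = vma;
	if (!is_valid_guard_vma(vma, /* allow_locked = */false))
		return -EINVAL;

	/*
	 * Attach the anon_vma up front, before any guard markers exist,
	 * so that no error path can leave installed markers without one.
	 */
	err = anon_vma_prepare(vma);
	if (err)
		return err;

	while (true) {
		unsigned long num_installed = 0;

		/* Returns < 0 on error, == 0 on success, > 0 if zap needed. */
		err = walk_page_range_mm(vma->vm_mm, start, end,
					 &guard_poison_walk_ops,
					 &num_installed);
		if (err <= 0)
			return err;

		/*
		 * Part of the range has non-guard pages mapped, possibly
		 * repopulated by a racing fault after an earlier zap.
		 * Retry in the kernel instead of returning -EAGAIN,
		 * bailing out only on a pending fatal signal.
		 */
		if (fatal_signal_pending(current))
			return -EINTR;

		zap_page_range_single(vma, start, end - start, NULL);
		cond_resched();
	}
}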
> > +               retried = true;
> > +       }
> > +}
> > +
> > +static int guard_unpoison_pte_entry(pte_t *pte, unsigned long addr,
> > +                                   unsigned long next, struct mm_walk *walk)
> > +{
> > +       pte_t ptent = ptep_get(pte);
> > +
> > +       if (is_guard_pte_marker(ptent)) {
> > +               /* Simply clear the PTE marker. */
> > +               pte_clear_not_present_full(walk->mm, addr, pte, true);
>
> I think that last parameter probably should be "false"? The sparc code
> calls it "fullmm", which is a term the MM code uses when talking about
> operations that remove all mappings in the entire mm_struct because
> the process has died, which allows using some faster special-case
> version of TLB shootdown or something along those lines.
>
> > +               update_mmu_cache(walk->vma, addr, pte);
> > +       }
> > +
> > +       return 0;
> > +}
> > +
> > +static const struct mm_walk_ops guard_unpoison_walk_ops = {
> > +       .pte_entry              = guard_unpoison_pte_entry,
> > +       .walk_lock              = PGWALK_RDLOCK,
> > +};
>
> It is a _little_ weird that unpoisoning creates page tables when they
> don't already exist, which will also prevent creating THP entries on
> fault in such areas afterwards... but I guess it doesn't really matter
> given that poisoning has that effect, too, and you probably usually
> won't call MADV_GUARD_UNPOISON on an area that hasn't been poisoned
> before... so I guess this is not an actionable comment.
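For anyone following along, usage of the proposed advice values from
userspace would look something like the sketch below. The MADV_GUARD_*
names and the 102/103 fallback values are taken from this RFC's
asm-generic/mman-common.h additions and may well change before merging:

#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Values as proposed in the RFC's asm-generic header (may change). */
#ifndef MADV_GUARD_POISON
#define MADV_GUARD_POISON   102
#define MADV_GUARD_UNPOISON 103
#endif

int main(void)
{
	size_t pgsz = (size_t)sysconf(_SC_PAGESIZE);

	/* Three anonymous pages; the middle one becomes a guard page. */
	char *buf = mmap(NULL, 3 * pgsz, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return EXIT_FAILURE;

	if (madvise(buf + pgsz, pgsz, MADV_GUARD_POISON))
		return EXIT_FAILURE;

	buf[0] = 'a';           /* first page: fine */
	buf[2 * pgsz] = 'b';    /* third page: fine */
	/* buf[pgsz] = 'x';        guard page: would raise SIGSEGV */

	/* Remove the guard marker; the page becomes usable again. */
	if (madvise(buf + pgsz, pgsz, MADV_GUARD_UNPOISON))
		return EXIT_FAILURE;
	buf[pgsz] = 'x';

	munmap(buf, 3 * pgsz);
	return EXIT_SUCCESS;
}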