From: Jann Horn
Date: Tue, 12 Nov 2024 13:58:20 +0100
Subject: Re: [PATCH v6 1/2] rust: mm: add abstractions for mm_struct and vm_area_struct
To: Alice Ryhl
Cc: Miguel Ojeda, Matthew Wilcox, Lorenzo Stoakes, Vlastimil Babka,
	John Hubbard, "Liam R. Howlett", Andrew Morton, Greg Kroah-Hartman,
	Arnd Bergmann, Alex Gaynor, Boqun Feng, Gary Guo, Björn Roy Baron,
	Benno Lossin, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	rust-for-linux@vger.kernel.org, Andreas Hindborg, Wedson Almeida Filho
References: <20241010-vma-v6-0-d89039b6f573@google.com>
	<20241010-vma-v6-1-d89039b6f573@google.com>
Howlett" , Andrew Morton , Greg Kroah-Hartman , Arnd Bergmann , Alex Gaynor , Boqun Feng , Gary Guo , =?UTF-8?Q?Bj=C3=B6rn_Roy_Baron?= , Benno Lossin , linux-kernel@vger.kernel.org, linux-mm@kvack.org, rust-for-linux@vger.kernel.org, Andreas Hindborg , Wedson Almeida Filho Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspam-User: X-Rspamd-Queue-Id: 09F17140005 X-Rspamd-Server: rspam01 X-Stat-Signature: 88oud1qkya44cn8imxmncafs93reg5nu X-HE-Tag: 1731416308-866187 X-HE-Meta: U2FsdGVkX1/oir5nEgXbOocsDjVRcvYJlPUda8nAIE2XNWSzpeLdL+JJsIVTprtLABz0Shk3FKPr3D3URxjQ+kWiSuEJ2B2dMuLmICDx18bsVKj4JT2+oHbslhMfUTWbag7wuEl1TDs99zprMSLMIovAUX5Wg5RJAWyOR3fQE5Gv9wdcQgJe2vhV0NZK1rC3pya5E0PyJ6IxE27HUgcchpNtZi2UczX8BLjQL9CJCGCOfSIHTgAG0VxTUulnx+WoCbofH2hbwoh6+VSiq8n9R2sC2l61AaUoF9vuOsLcaF8AdQNO6sIvE9d50xsXSM1Z1Tmji1aQKBLxDAHdDV4rNcf+H5VDH7m4TOAyHW8BQ9X/GvxKcvIUl4LeRM7lG5ExWkP2F4nJbUFct3E/gQ4sRfI0EW0DcSU1a3oj6JtLHKUlZRafKaz0tk7Y9s6+xbNrzoMk+3Gx8YBrUz+viSCE5e/0asNeEiNzYIXk75e601WCB4QRVx9A/DAjHHQ3PApre8BIxbWKZoqsRzlm7mQqPr5KkK/JYB4g+9dE42awRXLA8a5zZe4B/m8kylDBm09Z/TPaT3IzfwVszw8S1kpHjo/6FnPvhqdX/OEQlLMtLqc0eUxc9xkZ1Pc92AVpAOr0Tlj/gRh5chdMHx053JCK7LnC0emuUUSEIGrMPo/Oh3p+d4/kvyqwXtNMbX9n/9E3QIujWJd7hCf+Zfsr+c8sT17AnkLGcvyefFbX25krxs+lxa7pdyd2aD+8lUwnFMn3wDuvFNocmxbjVEMWa+2zl7Q1o2V+BEDtiq6lIhubMsPFt6+1ZDiNEor2cC8YkgMzzV0BRz6uLO+UYfd0kWOzYin+KYtZ9vCM3XuLLEHZdwKbzmtBHwztz0fzfLzM/olIpZNx4keDczsRMBxlanJHvzgbqTJqA6KcEqfiU0Wml/34lcmJY6Cve5X/13uCvZfclSYpoI0kRxffZlpr88m nWPSBp9N LBkR4MMRw3hE5XaNhLYp/yT42849K0AnHDctXZvvmgTTuYoajm35j7g9U8yg5EmedH9JfxDtUCYWeXwaI9l8VSZZuDmjB7GYchA4egDp7qfuze5acNdXOmGpcnuUn14/H+F9UIkK9JGqB0ohafCHi9jq0685ZxQ4tZ1yFl0QsJtd2ro8ClP/Z8NViFTlvzVRi1xZ4vHGkaJkPtAHBrRqVvjRSPYpp6Rrokc6odqXy6gREqkkcum1DX5xytdIsV9KaD0otoHZcxKz7seuPcyFnXUjt46vhQBwjYvi43JN7dCEhB9fjXCPQ/n4VxNg9EDSnq7CTA2lWBm1axBydRCt1KVnTarCq0lz//0OlfuHdpRvpUJdoHxHCB/J+97+9umS84NqB37p/EtjzuKdDJ01UYOtSFRpaAbzr8Tmz X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Tue, Nov 12, 2024 at 1:12=E2=80=AFPM Alice Ryhl w= rote: > On Mon, Nov 11, 2024 at 5:51=E2=80=AFPM Jann Horn wrot= e: > > > > On Mon, Nov 11, 2024 at 12:41=E2=80=AFPM Alice Ryhl wrote: > > > Thanks a lot for your review! > > > > > > Note that there is a v7 already: > > > https://lore.kernel.org/all/20241014-vma-v7-0-01e32f861195@google.com= / > > > > Yeah, oops, somehow I only realized I was looking at an older version > > of the series after sending my mail... > > > > > On Fri, Nov 8, 2024 at 8:02=E2=80=AFPM Jann Horn w= rote: > > > > On Thu, Oct 10, 2024 at 2:56=E2=80=AFPM Alice Ryhl wrote: > > > >> These abstractions allow you to manipulate vmas. Rust Binder will = uses > > > >> these in a few different ways. > > > >> > > > >> In the mmap implementation, a VmAreaNew will be provided to the mm= ap > > > >> call which allows it to modify the vma in ways that are only okay = during > > > >> initial setup. This is the case where the most methods are availab= le. > > > >> > > > >> However, Rust Binder needs to insert and remove pages from the vma= as > > > >> time passes. When incoming messages arrive, pages may need to be > > > >> inserted if space is missing, and in this case that is done by usi= ng a > > > >> stashed ARef and calling mmget_not_zero followed by mmap_write= _lock > > > >> followed by vma_lookup followed by vm_insert_page. 
> > > >
> > > > FYI, the way the C binder implementation uses vma_lookup() and
> > > > vm_insert_page() is not very idiomatic. The standard way of
> > > > implementing binder_alloc_free_page() would be to use something like
> > > > unmap_mapping_range() instead of using
> > > > vma_lookup()+zap_page_range_single(); though to set that up, you'd
> > > > have to create one inode per binder file, maybe with something like
> > > > the DRM subsystem's drm_fs_inode_new(). And instead of having
> > > > binder_install_single_page(), the standard way would be to let
> > > > binder_vm_fault() install PTEs lazily on fault. That way you'd never
> > > > have to take mmap locks or grab MM references yourself.
> > >
> > > Would that actually work? All writes happen from kernel space, so it's
> > > not clear to me that we can trigger the fault handler from there. We
> > > can't use copy_to_user because the write happens from a different
> > > process than the one that owns the vma.
> > >
> > > That said, one alternative is to let Binder store an array of `struct
> > > page` and just write directly to those pages when writing from kernel
> > > space. Then binder_vm_fault() can look up the relevant page from the
> > > array when userspace attempts to read the data.
> >
> > Right, that's what I was thinking of - keep allocating pages at the
> > same point in time when binder currently does it, only defer mapping
> > them into userspace.
> >
> > > Though we could also
> > > just vm_insert_page() right before returning the address to userspace,
> > > since we know that userspace will read it right away after the syscall
> > > returns.
> >
> > I think that is basically what binder does now?
>
> Right now binder calls vm_insert_page() right after calling
> alloc_page(), which means that vm_insert_page() happens in the process
> sending the message. But we could delay the call so that it happens in
> the process that receives the message instead (which is also the
> process that owns the mapping).

Ah, I see. I don't think that would make much of a difference in terms
of the complexity of MM locking, except that you'd save an
mmget_not_zero()...

> > > > Let me know if you think it would be helpful to see a prototype of
> > > > that in C - I think I can cobble that together, but doing it nicely
> > > > will require some work to convert at least some of the binder_alloc
> > > > locking to mutexes, and some more work to switch the ->f_mapping of
> > > > the binder file or something like that. (I guess modeling that in Rust
> > > > might be a bit of a pain though...)
> > >
> > > It would be useful to hear about what the advantages of
> > > unmap_mapping_range() are. I don't need a full prototype, I should be
> > > able to understand given a rough description of what the required
> > > changes are.
> >
> > The nice part is that basically, if you have a pointer to the file or
> > the inode, you can just do something like the following to zap a PTE:
> >
> > unmap_mapping_range(file->f_mapping, index, 1, 1);
> >
> > You don't have to take any locks yourself to make that work, you don't
> > even have to hold a reference to the mm_struct or anything like that,
> > and you don't need any of that logic you currently have in the C
> > binder driver to look up the VMA by address and revalidate it is still
> > the VMA you expect. The MM subsystem will automatically iterate
> > through all VMAs that overlap the specified range of the file's
> > address_space, and it will zap PTEs in the specified range (and it
> > even works fine if the VMAs have been moved or split or exist in
> > multiple processes or stuff like that). It's a similar story on the
> > allocation path. The only locks you need to explicitly take are
> > whatever locks the driver internally requires.
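
(To sketch what that means for the freeing path: the whole "remove one
page from the mapping" operation more or less collapses into a single
call. Hypothetical helper name, offsets in bytes, untested:)

    #include <linux/fs.h>
    #include <linux/mm.h>

    /* Zap the PTEs backing one page-sized range of the mapping, in every
     * VMA of every process that maps that range of file->f_mapping. No
     * mmap lock, mm_struct reference or vma_lookup() is needed here. */
    static void sketch_zap_page(struct file *file, pgoff_t index)
    {
            unmap_mapping_range(file->f_mapping,
                                (loff_t)index << PAGE_SHIFT, PAGE_SIZE,
                                /* even_cows */ 1);
    }
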
> >
> > Going through the fault handler and/or the address_space for
> > installing/removing PTEs, instead of using vma_lookup(), is also safer
> > because it implicitly ensures you can only operate on your own
> > driver's VMAs. From a glance at the Rust binder RFC
> > (https://lore.kernel.org/all/20231101-rust-binder-v1-19-08ba9197f637@google.com/),
> > it looks like use_page_slow() just looks up the VMA at the expected
> > address and calls insert_page() on it. I don't immediately see
> > anything in the Rust Binder RFC that would prevent calling
> > insert_page() on a newly created VMA of a different type that has
> > appeared at the same address - which could cause the page to
> > inadvertently become writable by userspace (if the new VMA is
> > writable), could cause refcounted pages to be installed in VM_PFNMAP
> > regions that are supposed to only contain non-refcounted entries,
> > could probably cause type confusion when trying to install 4K pages in
> > hugetlb regions that can't contain 4K pages, and so on.
>
> Right ... I guess I'm missing an equivalent to binder_vma_close to
> ensure that once the vma is closed, we don't try to insert pages.

Yeah, I think that would work. (I think there currently is no way to
shrink a VMA without first splitting it, so you should see ->open()
and ->close() invocations when that happens.)
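
(As a rough sketch of that kind of bookkeeping, with hypothetical
names: the driver's mmap handler would set the flag, and the insert
path would check it under the same lock before calling
vm_insert_page().)

    #include <linux/mm.h>
    #include <linux/mutex.h>

    struct sketch_alloc {
            struct mutex lock;
            bool vma_alive;         /* set by the driver's mmap handler */
    };

    static void sketch_vma_close(struct vm_area_struct *vma)
    {
            struct sketch_alloc *alloc = vma->vm_private_data;

            mutex_lock(&alloc->lock);
            alloc->vma_alive = false;   /* refuse vm_insert_page() from now on */
            mutex_unlock(&alloc->lock);
    }

    static const struct vm_operations_struct sketch_vm_ops = {
            .close = sketch_vma_close,
            /* ->open() would need similar care if the VMA can be split/copied */
    };
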
>
> I gave a suggestion to enforce that vm_insert_page is only called on
> MIXEDMAP vmas in my previous email. I guess that would prevent the
> type confusion you mention, but it still seems dangerous ... you risk
> that some other driver is storing special data in the private data of
> pages in the new vma, and if you insert a random page there, there
> could maybe be type confusion on the private data in the `struct
> page`?

Hmm, yeah, maybe...

> > Though I just realized, you're only doing this in the shrinker
> > callback where you're not supposed to sleep, but unmap_mapping_range()
> > requires sleeping locks. So I guess directly using
> > unmap_mapping_range() wouldn't work so well. I guess one way to
> > address that could be to add a trylock version of
> > unmap_mapping_range().
> >
> > Another more fancy solution to that would be to stop using shrinkers
> > entirely, and instead make binder pages show up as clean file pages on
> > the kernel's page allocation LRU lists, and let the kernel take care
> > of removing page mappings - basically similar to how reclaim works for
> > normal file pages. I'm a bit fuzzy on how this area of the kernel
> > works exactly; one option might be to do something similar to
> > aio_private_file(), creating a new anonymous inode - but unlike in
> > AIO, you'd then install that anonymous inode's address_space as the
> > i_mapping of the existing file in binder_open(), rather than creating
> > a new file. Then you could pin the inode's pages into a page pointer
> > array in the kernel (sort of like in aio_setup_ring(), except you'd
> > only do it on demand in binder_install_buffer_pages()), and then have
> > a "release_folio" operation in your address_space_operations that
> > drops your page reference if the page is currently unused. At that
> > point, the driver wouldn't really be participating in creating or
> > removing userspace PTEs at all, the kernel would mostly manage it for
> > you, except that you'd have to tell the kernel when it is okay to
> > reclaim pages, or something like that.
>
> Whether it's okay to reclaim a given page can flip-flop very quickly
> under some workflows. Changing that setting has to be a pretty fast
> operation.

I think one fast way to do that would be to track this internally in
binder (as is the case now), and then have address_space_operations
callbacks that the kernel invokes when it wants to know whether a
page can be discarded or not.
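
(Very rough sketch of what such a callback could look like - the
sketch_* names are hypothetical, and whether ->release_folio is really
the right hook for this scheme would need checking:)

    #include <linux/fs.h>
    #include <linux/pagemap.h>

    /* Called by reclaim when it wants to drop this folio; say yes only
     * if the driver's own bookkeeping says the page is currently unused. */
    static bool sketch_release_folio(struct folio *folio, gfp_t gfp)
    {
            /* placeholder: consult binder's internal per-page state here */
            return true;
    }

    static const struct address_space_operations sketch_aops = {
            .release_folio = sketch_release_folio,
    };
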
> > (I think in the Linux kernel, you might theoretically be able to cause
> > memory safety issues by deadlocking in particular ways - if you are
> > running on a CONFIG_PREEMPT kernel without panic_on_warn set, and
> > you're in a non-preemptible context because you're holding a spinlock
> > or such, and then you sleep because you try to wait on a mutex or
> > kmalloc() with GFP_KERNEL, the scheduler will print a "scheduling
> > while atomic" error message and then try to recover from that
> > situation by resetting the preempt_count and scheduling anyway. I
> > think that theoretically breaks the RCU read-side critical sections
> > normally implied by spinlocks once you return to the broken context,
> > though IIRC this situation will get fixed up at the next syscall
> > return or something along those lines. I haven't tried this though.)
>
> Sleeping in atomic context is a known area of concern. We're working
> on it. For now, just assume that it's one of the allowed bad things.
> Eventually we would like to handle it properly with this tool:
> https://www.memorysafety.org/blog/gary-guo-klint-rust-tools/

Ah, thanks for the pointer.

> > > >> +    /// Maps a single page at the given address within the virtual memory area.
> > > >> +    ///
> > > >> +    /// This operation does not take ownership of the page.
> > > >> +    #[inline]
> > > >> +    pub fn vm_insert_page(&self, address: usize, page: &Page) -> Result {
> > > >> +        // SAFETY: By the type invariants, the caller holds the mmap write lock, so this access is
> > > >> +        // not a data race. The page is guaranteed to be valid and of order 0. The range of
> > > >> +        // `address` is already checked by `vm_insert_page`.
> > > >> +        to_result(unsafe { bindings::vm_insert_page(self.as_ptr(), address as _, page.as_ptr()) })
> > > >
> > > > vm_insert_page() has a kinda weird contract because there are two
> > > > contexts from which you can call it cleanly:
> > > >
> > > > 1. You can call it on a VmAreaRef (just like zap_page_range_single(),
> > > > you only need to hold an mmap read lock or VMA read lock, no write
> > > > lock is required); however, you must ensure that the VMA is already
> > > > marked VM_MIXEDMAP. This is the API contract under which you'd call
> > > > this from a fault handler.
> > > >
> > > > 2. You can call it on a VmAreaNew (and it will take care of setting
> > > > VM_MIXEDMAP for you).
> > > >
> > > > I think nothing would immediately crash if you called vm_insert_page()
> > > > on a VMA that does not yet have VM_MIXEDMAP while holding the mmap
> > > > lock in write mode; but that would permit weird scenarios where you
> > > > could, for example, have a userfaultfd context associated with a VMA
> > > > which becomes VM_MIXEDMAP, while normally you can't attach userfaultfd
> > > > contexts to VM_MIXEDMAP VMAs. I don't know if that actually leads to
> > > > anything bad, but it would be a weird state that probably shouldn't be
> > > > permitted.
> > > >
> > > > There are also safety requirements for the page being installed, but I
> > > > guess the checks that vm_insert_page() already does via
> > > > validate_page_before_insert() might be enough to make this safe...
> > >
> > > One way to handle this is to make a VmAreaRef::check_mixedmap that
> > > returns a VmAreaMixedMapRef after checking the flag. That type can then
> > > have a vm_insert_page method.
> >
> > Sounds reasonable.
> >
> > > As for VmAreaNew, I'm not sure we should have it there. If we're not
> > > careful, it would be a way to set VM_MIXEDMAP on something that already
> > > has the VM_PFNMAP flag. We can probably just tell you to set VM_MIXEDMAP
> > > directly and then go through the method on VmAreaRef.
> >
> > Makes sense.
> >
> > I guess one tricky part is that it might be bad if you could
>
> Seems like this sentence is incomplete?

Whoops, guess I got distracted while writing this... I guess it could
be bad if you could install page mappings before changing the VMA
flags in a way that makes the already-installed page mappings
invalid. But as long as you don't change the VM_READ/VM_WRITE/VM_EXEC
flags, and you never clear VM_MIXEDMAP/VM_PFNMAP, I think that
probably can't happen, so that should be fine...
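
(For completeness, the "check the flags before inserting" guard that a
check_mixedmap-style API would encode looks roughly like this in C
terms - purely illustrative, hypothetical helper name:)

    #include <linux/mm.h>

    /* Only insert into a VMA that is already VM_MIXEDMAP and not
     * VM_PFNMAP, so a differently-typed VMA that later shows up at the
     * same address gets rejected instead of having pages inserted. */
    static int sketch_insert_page_checked(struct vm_area_struct *vma,
                                          unsigned long addr,
                                          struct page *page)
    {
            if (!(vma->vm_flags & VM_MIXEDMAP) || (vma->vm_flags & VM_PFNMAP))
                    return -EINVAL;
            return vm_insert_page(vma, addr, page);
    }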