From: Daniel Almeida <daniel.almeida@collabora.com>
To: Alice Ryhl <aliceryhl@google.com>
Cc: "Miguel Ojeda" <ojeda@kernel.org>,
"Will Deacon" <will@kernel.org>,
"Boris Brezillon" <boris.brezillon@collabora.com>,
"Robin Murphy" <robin.murphy@arm.com>,
"Jason Gunthorpe" <jgg@ziepe.ca>,
"Boqun Feng" <boqun.feng@gmail.com>,
"Gary Guo" <gary@garyguo.net>,
"Björn Roy Baron" <bjorn3_gh@protonmail.com>,
"Benno Lossin" <lossin@kernel.org>,
"Andreas Hindborg" <a.hindborg@kernel.org>,
"Trevor Gross" <tmgross@umich.edu>,
"Danilo Krummrich" <dakr@kernel.org>,
"Joerg Roedel" <joro@8bytes.org>,
"Lorenzo Stoakes" <lorenzo.stoakes@oracle.com>,
"Liam R. Howlett" <Liam.Howlett@oracle.com>,
"Asahi Lina" <lina+kernel@asahilina.net>,
linux-kernel@vger.kernel.org, rust-for-linux@vger.kernel.org,
iommu@lists.linux.dev, linux-mm@kvack.org
Subject: Re: [PATCH v4] io: add io_pgtable abstraction
Date: Fri, 19 Dec 2025 08:04:17 -0300 [thread overview]
Message-ID: <63063977-BA16-4F00-AFBA-8DD6409902E1@collabora.com> (raw)
In-Reply-To: <20251219-io-pgtable-v4-1-68aaa7a40380@google.com>
Hi Alice,
> On 19 Dec 2025, at 07:50, Alice Ryhl <aliceryhl@google.com> wrote:
>
> From: Asahi Lina <lina+kernel@asahilina.net>
>
> This will be used by the Tyr driver to create and modify the page table
> of each address space on the GPU. Each time a mapping gets created or
> removed by userspace, Tyr will call into GPUVM, which will figure out
> which calls to map_pages and unmap_pages are required to map the data in
> question in the page table so that the GPU may access those pages when
> using that address space.
>
> The Rust type wraps the struct using a raw pointer rather than the usual
> Opaque+ARef approach because Opaque+ARef requires the target type to be
> refcounted.
>
> Signed-off-by: Asahi Lina <lina+kernel@asahilina.net>
> Acked-by: Boris Brezillon <boris.brezillon@collabora.com>
> Co-developed-by: Alice Ryhl <aliceryhl@google.com>
> Signed-off-by: Alice Ryhl <aliceryhl@google.com>
> ---
> Changes in v4:
> - Rename prot::PRIV to prot::PRIVILEGED
> - Adjust map_pages to return the length even on error.
> - Explain return value in docs of map_pages and unmap_pages.
> - Explain in map_pages that the caller must explicitly flush the TLB
> before accessing the resulting mapping.
> - Add a safety requirement that access to a given range is required to
> be exclusive.
> - Reword comment on NOOP_FLUSH_OPS.
> - Rebase on v6.19-rc1 and pick up tags.
> - Link to v3: https://lore.kernel.org/r/20251112-io-pgtable-v3-1-b00c2e6b951a@google.com
>
> Changes in v3:
> - Almost entirely rewritten from scratch.
> - Link to v2: https://lore.kernel.org/all/20250623-io_pgtable-v2-1-fd72daac75f1@collabora.com/
> ---
> rust/bindings/bindings_helper.h | 3 +-
> rust/kernel/io.rs | 1 +
> rust/kernel/io/pgtable.rs | 278 ++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 281 insertions(+), 1 deletion(-)
>
> diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h
> index a067038b4b422b4256f4a2b75fe644d47e6e82c8..1b05a5e4cfb4780fdc27813d708a8f1a6a2d9913 100644
> --- a/rust/bindings/bindings_helper.h
> +++ b/rust/bindings/bindings_helper.h
> @@ -56,9 +56,10 @@
> #include <linux/fdtable.h>
> #include <linux/file.h>
> #include <linux/firmware.h>
> -#include <linux/interrupt.h>
> #include <linux/fs.h>
> #include <linux/i2c.h>
> +#include <linux/interrupt.h>
> +#include <linux/io-pgtable.h>
> #include <linux/ioport.h>
> #include <linux/jiffies.h>
> #include <linux/jump_label.h>
> diff --git a/rust/kernel/io.rs b/rust/kernel/io.rs
> index 98e8b84e68d11ef74b2026d8c3d847a127f4672d..88253158448cbf493ca200a87ef9ba958255e761 100644
> --- a/rust/kernel/io.rs
> +++ b/rust/kernel/io.rs
> @@ -10,6 +10,7 @@
> };
>
> pub mod mem;
> +pub mod pgtable;
> pub mod poll;
> pub mod resource;
>
> diff --git a/rust/kernel/io/pgtable.rs b/rust/kernel/io/pgtable.rs
> new file mode 100644
> index 0000000000000000000000000000000000000000..11096acfa41d45125e866876e41459a347e9afe6
> --- /dev/null
> +++ b/rust/kernel/io/pgtable.rs
> @@ -0,0 +1,278 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +//! IOMMU page table management.
> +//!
> +//! C header: [`include/linux/io-pgtable.h`](srctree/include/linux/io-pgtable.h)
> +
> +use core::{
> + marker::PhantomData,
> + ptr::NonNull, //
> +};
> +
> +use crate::{
> + alloc,
> + bindings,
> + device::{Bound, Device},
> + devres::Devres,
> + error::to_result,
> + io::PhysAddr,
> + prelude::*, //
> +};
> +
> +use bindings::io_pgtable_fmt;
> +
> +/// Protection flags used with IOMMU mappings.
> +pub mod prot {
> + /// Read access.
> + pub const READ: u32 = bindings::IOMMU_READ;
> + /// Write access.
> + pub const WRITE: u32 = bindings::IOMMU_WRITE;
> + /// Request cache coherency.
> + pub const CACHE: u32 = bindings::IOMMU_CACHE;
> + /// Request no-execute permission.
> + pub const NOEXEC: u32 = bindings::IOMMU_NOEXEC;
> + /// MMIO peripheral mapping.
> + pub const MMIO: u32 = bindings::IOMMU_MMIO;
> + /// Privileged mapping.
> + pub const PRIVILEGED: u32 = bindings::IOMMU_PRIV;
> +}
> +
> +/// Represents a requested `io_pgtable` configuration.
> +pub struct Config {
> + /// Quirk bitmask (type-specific).
> + pub quirks: usize,
> + /// Valid page sizes, as a bitmask of powers of two.
> + pub pgsize_bitmap: usize,
> + /// Input address space size in bits.
> + pub ias: u32,
> + /// Output address space size in bits.
> + pub oas: u32,
> + /// IOMMU uses coherent accesses for page table walks.
> + pub coherent_walk: bool,
> +}
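As an aside for readers new to this API: `pgsize_bitmap` is literally an OR of the supported page sizes, since each power-of-two size occupies a distinct bit. A quick userspace sketch (mine, not from the patch) of how a driver might build one:

```rust
// Build a pgsize_bitmap from a list of supported page sizes.
// Each size must be a power of two, so each contributes exactly one bit.
fn pgsize_bitmap(sizes: &[usize]) -> usize {
    let mut mask = 0;
    for &s in sizes {
        assert!(s.is_power_of_two(), "page sizes must be powers of two");
        mask |= s;
    }
    mask
}
```

E.g. supporting 4 KiB, 2 MiB, and 1 GiB pages yields a three-bit mask.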
> +
> +/// An io page table using a specific format.
> +///
> +/// # Invariants
> +///
> +/// The pointer references a valid io page table.
> +pub struct IoPageTable<F: IoPageTableFmt> {
> + ptr: NonNull<bindings::io_pgtable_ops>,
> + _marker: PhantomData<F>,
> +}
> +
> +// SAFETY: `struct io_pgtable_ops` is not restricted to a single thread.
> +unsafe impl<F: IoPageTableFmt> Send for IoPageTable<F> {}
> +// SAFETY: `struct io_pgtable_ops` may be accessed concurrently.
> +unsafe impl<F: IoPageTableFmt> Sync for IoPageTable<F> {}
> +
> +/// The format used by this page table.
> +pub trait IoPageTableFmt: 'static {
> + /// The value representing this format.
> + const FORMAT: io_pgtable_fmt;
> +}
> +
> +impl<F: IoPageTableFmt> IoPageTable<F> {
I don’t see a reason to keep `struct Foo` and its `impl Foo` block separate.
IMHO, they should always be adjacent, since the first thing one wants to
read after a type declaration is its implementation.
> + /// Create a new `IoPageTable` as a device resource.
> + #[inline]
> + pub fn new(
> + dev: &Device<Bound>,
> + config: Config,
> + ) -> impl PinInit<Devres<IoPageTable<F>>, Error> + '_ {
> + // SAFETY: Devres ensures that the value is dropped during device unbind.
> + Devres::new(dev, unsafe { Self::new_raw(dev, config) })
> + }
> +
> + /// Create a new `IoPageTable`.
> + ///
> + /// # Safety
> + ///
> + /// If successful, then the returned value must be dropped before the device is unbound.
> + #[inline]
> + pub unsafe fn new_raw(dev: &Device<Bound>, config: Config) -> Result<IoPageTable<F>> {
> + let mut raw_cfg = bindings::io_pgtable_cfg {
> + quirks: config.quirks,
> + pgsize_bitmap: config.pgsize_bitmap,
> + ias: config.ias,
> + oas: config.oas,
> + coherent_walk: config.coherent_walk,
> + tlb: &raw const NOOP_FLUSH_OPS,
> + iommu_dev: dev.as_raw(),
> + // SAFETY: All zeroes is a valid value for `struct io_pgtable_cfg`.
> + ..unsafe { core::mem::zeroed() }
> + };
> +
> + // SAFETY:
> + // * The raw_cfg pointer is valid for the duration of this call.
> + // * The provided `FLUSH_OPS` contains valid function pointers that accept a null pointer
> + // as cookie.
> + // * The caller ensures that the io pgtable does not outlive the device.
Should we tailor the sentence above for the `Devres` case? When going
through `new()`, it’s `Devres` rather than the caller that ensures this.
> + let ops = unsafe {
> + bindings::alloc_io_pgtable_ops(F::FORMAT, &mut raw_cfg, core::ptr::null_mut())
> + };
I’d add a blank line here.
> + // INVARIANT: We successfully created a valid page table.
> + Ok(IoPageTable {
> + ptr: NonNull::new(ops).ok_or(ENOMEM)?,
> + _marker: PhantomData,
> + })
> + }
> +
> + /// Obtain a raw pointer to the underlying `struct io_pgtable_ops`.
> + #[inline]
> + pub fn raw_ops(&self) -> *mut bindings::io_pgtable_ops {
> + self.ptr.as_ptr()
> + }
> +
> + /// Obtain a raw pointer to the underlying `struct io_pgtable`.
> + #[inline]
> + pub fn raw_pgtable(&self) -> *mut bindings::io_pgtable {
> + // SAFETY: The io_pgtable_ops of an io-pgtable is always the ops field of an io_pgtable.
> + unsafe { kernel::container_of!(self.raw_ops(), bindings::io_pgtable, ops) }
> + }
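For readers less familiar with `container_of`: the call above recovers a pointer to the outer `struct io_pgtable` from a pointer to its embedded `ops` field by subtracting the field’s byte offset. A userspace model of the same arithmetic (my sketch, with stand-in types, not the kernel macro):

```rust
use core::mem::offset_of;

// Stand-in for `struct io_pgtable`: `ops` is embedded at some offset.
#[repr(C)]
struct IoPgtable {
    cookie: usize,
    ops: u32, // stand-in for the embedded `struct io_pgtable_ops`
}

// Recover the outer struct pointer from a pointer to its `ops` field.
// This is only sound if `ops_ptr` really points at the `ops` field of
// a live `IoPgtable`, which is exactly the invariant the SAFETY comment
// above relies on.
fn container_of_ops(ops_ptr: *mut u32) -> *mut IoPgtable {
    ops_ptr.wrapping_byte_sub(offset_of!(IoPgtable, ops)) as *mut IoPgtable
}
```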
> +
> + /// Obtain a raw pointer to the underlying `struct io_pgtable_cfg`.
> + #[inline]
> + pub fn raw_cfg(&self) -> *mut bindings::io_pgtable_cfg {
> + // SAFETY: The `raw_pgtable()` method returns a valid pointer.
> + unsafe { &raw mut (*self.raw_pgtable()).cfg }
> + }
> +
> + /// Map a physically contiguous range of pages of the same size.
> + ///
> + /// Even if successful, this operation may not map the entire range. In that case, only a
> + /// prefix of the range is mapped, and the returned integer indicates its length in bytes. In
> + /// this case, the caller will usually call `map_pages` again for the remaining range.
> + ///
> + /// The returned [`Result`] indicates whether an error was encountered while mapping pages.
> + /// Note that this may return a non-zero length even if an error was encountered. The caller
> + /// will usually [unmap the relevant pages](Self::unmap_pages) on error.
> + ///
> + /// The caller must flush the TLB before using the pgtable to access the newly created mapping.
> + ///
> + /// # Safety
> + ///
> + /// * No other io-pgtable operation may access the range `iova .. iova+pgsize*pgcount` while
> + /// this `map_pages` operation executes.
> + /// * This page table must not contain any mapping that overlaps with the mapping created by
> + /// this call.
> + /// * If this page table is live, then the caller must ensure that it's okay to access the
> + /// physical address being mapped for the duration in which it is mapped.
> + #[inline]
> + #[must_use]
> + pub unsafe fn map_pages(
> + &self,
> + iova: usize,
> + paddr: PhysAddr,
> + pgsize: usize,
> + pgcount: usize,
> + prot: u32,
> + flags: alloc::Flags,
> + ) -> (usize, Result) {
> + let mut mapped: usize = 0;
> +
> + // SAFETY: The `map_pages` function in `io_pgtable_ops` is never null.
> + let map_pages = unsafe { (*self.raw_ops()).map_pages.unwrap_unchecked() };
> +
> + // SAFETY: The safety requirements of this method are sufficient to call `map_pages`.
> + let ret = to_result(unsafe {
> + (map_pages)(
> + self.raw_ops(),
> + iova,
> + paddr,
> + pgsize,
> + pgcount,
> + prot as i32,
> + flags.as_raw(),
> + &mut mapped,
> + )
> + });
> +
> + (mapped, ret)
> + }
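For what it’s worth, the partial-success contract documented above implies a retry loop on the caller side: advance `iova` and `paddr` by the mapped length and call again for the remaining suffix. A userspace mock of that pattern (my sketch, not the real API):

```rust
const PGSIZE: usize = 4096;

// Mock page table: maps at most 3 pages per call and returns the number
// of bytes actually mapped, mimicking a partial-success return.
fn mock_map_pages(pgcount: usize) -> usize {
    pgcount.min(3) * PGSIZE
}

// Caller-side loop: keep mapping the remaining suffix until done.
fn map_all(mut iova: usize, mut paddr: usize, mut pgcount: usize) -> usize {
    let mut total = 0;
    while pgcount > 0 {
        let mapped = mock_map_pages(pgcount);
        assert!(mapped > 0 && mapped % PGSIZE == 0);
        iova += mapped;  // next call continues where this one stopped
        paddr += mapped;
        total += mapped;
        pgcount -= mapped / PGSIZE;
    }
    let _ = (iova, paddr); // final positions, unused in this sketch
    total
}
```

The real caller would additionally check the `Result` each iteration and unmap the already-mapped prefix on error, per the docs above.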
> +
> + /// Unmap a range of virtually contiguous pages of the same size.
> + ///
> + /// This may not unmap the entire range, and returns the length of the unmapped prefix in
> + /// bytes.
> + ///
> + /// # Safety
> + ///
> + /// * No other io-pgtable operation may access the range `iova .. iova+pgsize*pgcount` while
> + /// this `unmap_pages` operation executes.
> + /// * This page table must contain one or more consecutive mappings starting at `iova` whose
> + /// total size is `pgcount * pgsize`.
> + #[inline]
> + #[must_use]
> + pub unsafe fn unmap_pages(&self, iova: usize, pgsize: usize, pgcount: usize) -> usize {
> + // SAFETY: The `unmap_pages` function in `io_pgtable_ops` is never null.
> + let unmap_pages = unsafe { (*self.raw_ops()).unmap_pages.unwrap_unchecked() };
> +
> + // SAFETY: The safety requirements of this method are sufficient to call `unmap_pages`.
> + unsafe { (unmap_pages)(self.raw_ops(), iova, pgsize, pgcount, core::ptr::null_mut()) }
> + }
> +}
> +
> +// For now, we do not provide the ability to flush the TLB via the built-in callback mechanism.
> +// Instead, the `map_pages` function requires the caller to explicitly flush the TLB before the
> +// pgtable is used to access the newly created range.
> +//
> +// This is done because the initial user of this abstraction may perform many calls to `map_pages`
> +// in a single batched operation, and wishes to only flush the TLB once after performing the entire
> +// batch of mappings. These callbacks would flush too often for that use-case.
> +//
> +// Support for flushing the TLB in these callbacks may be added in the future.
> +static NOOP_FLUSH_OPS: bindings::iommu_flush_ops = bindings::iommu_flush_ops {
> + tlb_flush_all: Some(rust_tlb_flush_all_noop),
> + tlb_flush_walk: Some(rust_tlb_flush_walk_noop),
> + tlb_add_page: None,
> +};
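The batching rationale in the comment above is easy to quantify: with callback-driven flushing, N map calls cost N flushes, while the explicit-flush contract lets a whole batch pay for one. A toy illustration (mine, counting flushes on a mock TLB):

```rust
// Toy TLB that just counts how many times it gets flushed.
struct Tlb { flushes: usize }
impl Tlb {
    fn flush_all(&mut self) { self.flushes += 1; }
}

// Callback style: the page-table code flushes after every map call.
fn map_batch_with_callbacks(tlb: &mut Tlb, n: usize) {
    for _ in 0..n {
        // ... map one range ...
        tlb.flush_all(); // callback fires on each call
    }
}

// Explicit style (what this abstraction chose): map the whole batch,
// then the caller performs a single flush.
fn map_batch_explicit(tlb: &mut Tlb, n: usize) {
    for _ in 0..n {
        // ... map one range ...
    }
    tlb.flush_all();
}
```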
> +
> +#[no_mangle]
> +extern "C" fn rust_tlb_flush_all_noop(_cookie: *mut core::ffi::c_void) {}
> +
> +#[no_mangle]
> +extern "C" fn rust_tlb_flush_walk_noop(
> + _iova: usize,
> + _size: usize,
> + _granule: usize,
> + _cookie: *mut core::ffi::c_void,
> +) {
> +}
> +
> +impl<F: IoPageTableFmt> Drop for IoPageTable<F> {
> + fn drop(&mut self) {
> + // SAFETY: The caller of `ttbr` promised that the page table is not live when this
> + // destructor runs.
I’m not sure I understand this sentence. Perhaps we should remove the word “ttbr” from here? `ttbr` is a register.
> + unsafe { bindings::free_io_pgtable_ops(self.raw_ops()) };
> + }
> +}
> +
> +/// The `ARM_64_LPAE_S1` page table format.
> +pub enum ARM64LPAES1 {}
> +
> +impl IoPageTableFmt for ARM64LPAES1 {
> + const FORMAT: io_pgtable_fmt = bindings::io_pgtable_fmt_ARM_64_LPAE_S1 as io_pgtable_fmt;
> +}
> +
> +impl IoPageTable<ARM64LPAES1> {
> + /// Access the `ttbr` field of the configuration.
> + ///
> + /// This is the physical address of the page table, which may be passed to the device that
> + /// needs to use it.
> + ///
> + /// # Safety
> + ///
> + /// The caller must ensure that the device stops using the page table before dropping it.
> + #[inline]
> + pub unsafe fn ttbr(&self) -> u64 {
> + // SAFETY: `arm_lpae_s1_cfg` is the right cfg type for `ARM64LPAES1`.
> + unsafe { (*self.raw_cfg()).__bindgen_anon_1.arm_lpae_s1_cfg.ttbr }
> + }
> +
> + /// Access the `mair` field of the configuration.
> + #[inline]
> + pub fn mair(&self) -> u64 {
> + // SAFETY: `arm_lpae_s1_cfg` is the right cfg type for `ARM64LPAES1`.
> + unsafe { (*self.raw_cfg()).__bindgen_anon_1.arm_lpae_s1_cfg.mair }
> + }
> +}
>
> ---
> base-commit: 3e7f562e20ee87a25e104ef4fce557d39d62fa85
> change-id: 20251111-io-pgtable-fe0822b4ebdd
>
> Best regards,
> --
> Alice Ryhl <aliceryhl@google.com>
>
Looks good to me. Please wait for Deborah Brouwer’s Tested-by tag if
there are no further comments from others.

Reviewed-by: Daniel Almeida <daniel.almeida@collabora.com>