linux-mm.kvack.org archive mirror
From: Bagas Sanjaya <bagasdotme@gmail.com>
To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	Andrew Morton <akpm@linux-foundation.org>
Cc: Suren Baghdasaryan <surenb@google.com>,
	"Liam R . Howlett" <Liam.Howlett@oracle.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	Shakeel Butt <shakeel.butt@linux.dev>,
	Jonathan Corbet <corbet@lwn.net>, Jann Horn <jannh@google.com>,
	Qi Zheng <zhengqi.arch@bytedance.com>,
	linux-mm@kvack.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2] docs/mm: expand vma doc to highlight pte freeing, non-vma traversal
Date: Fri, 6 Jun 2025 06:30:06 +0700
Message-ID: <aEIofgUdu5z5USfD@archie.me>
In-Reply-To: <20250604180308.137116-1-lorenzo.stoakes@oracle.com>

On Wed, Jun 04, 2025 at 07:03:08PM +0100, Lorenzo Stoakes wrote:
> diff --git a/Documentation/mm/process_addrs.rst b/Documentation/mm/process_addrs.rst
> index e6756e78b476..be49e2a269e4 100644
> --- a/Documentation/mm/process_addrs.rst
> +++ b/Documentation/mm/process_addrs.rst
> @@ -303,7 +303,9 @@ There are four key operations typically performed on page tables:
>  1. **Traversing** page tables - Simply reading page tables in order to traverse
>     them. This only requires that the VMA is kept stable, so a lock which
>     establishes this suffices for traversal (there are also lockless variants
> -   which eliminate even this requirement, such as :c:func:`!gup_fast`).
> +   which eliminate even this requirement, such as :c:func:`!gup_fast`). There is
> +   also a special case of page table traversal for non-VMA regions which we
> +   consider separately below.
>  2. **Installing** page table mappings - Whether creating a new mapping or
>     modifying an existing one in such a way as to change its identity. This
>     requires that the VMA is kept stable via an mmap or VMA lock (explicitly not
> @@ -335,15 +337,13 @@ ahead and perform these operations on page tables (though internally, kernel
>  operations that perform writes also acquire internal page table locks to
>  serialise - see the page table implementation detail section for more details).
> 
> +.. note:: We free empty PTE tables on zap under the RCU lock - this does not
> +          change the aforementioned locking requirements around zapping.
> +
>  When **installing** page table entries, the mmap or VMA lock must be held to
>  keep the VMA stable. We explore why this is in the page table locking details
>  section below.
> 
> -.. warning:: Page tables are normally only traversed in regions covered by VMAs.
> -             If you want to traverse page tables in areas that might not be
> -             covered by VMAs, heavier locking is required.
> -             See :c:func:`!walk_page_range_novma` for details.
> -
>  **Freeing** page tables is an entirely internal memory management operation and
>  has special requirements (see the page freeing section below for more details).
> 
> @@ -355,6 +355,44 @@ has special requirements (see the page freeing section below for more details).
>               from the reverse mappings, but no other VMAs can be permitted to be
>               accessible and span the specified range.
> 
> +Traversing non-VMA page tables
> +------------------------------
> +
> +We've focused above on traversal of page tables belonging to VMAs. It is also
> +possible to traverse page tables which are not represented by VMAs.
> +
> +Kernel page table mappings themselves are generally managed by whatever part of
> +the kernel established them and the aforementioned locking rules do not apply -
> +for instance vmalloc has its own set of locks which are utilised for
> +establishing and tearing down its page tables.
> +
> +However, for convenience we provide the :c:func:`!walk_kernel_page_table_range`
> +function which is synchronised via the mmap lock on the :c:macro:`!init_mm`
> +kernel instantiation of the :c:struct:`!struct mm_struct` metadata object.
> +
> +If an operation requires exclusive access, a write lock is used, but if not, a
> +read lock suffices - we assert only that at least a read lock has been acquired.
> +
> +Since, aside from vmalloc and memory hot plug, kernel page tables are not torn
> +down all that often, this usually suffices. However, any caller of this
> +functionality must ensure that any additionally required locks are acquired in
> +advance.
> +
> +We also permit a truly unusual case: the traversal of non-VMA ranges in
> +**userland**, as provided for by :c:func:`!walk_page_range_debug`.
> +
> +This has only one user - the general page table dumping logic (implemented in
> +:c:macro:`!mm/ptdump.c`) - which seeks to expose all mappings for debug purposes
> +even if they are highly unusual (possibly architecture-specific) and are not
> +backed by a VMA.
> +
> +We must take great care in this case, as the :c:func:`!munmap` implementation
> +detaches VMAs under an mmap write lock before tearing down page tables under a
> +downgraded mmap read lock.
> +
> +This means a non-VMA traversal could race with that teardown, and thus an mmap
> +**write** lock is required.
> +
>  Lock ordering
>  -------------
> 
> @@ -461,6 +499,10 @@ Locking Implementation Details
>  Page table locking details
>  --------------------------
> 
> +.. note:: This section explores page table locking requirements for page tables
> +          encompassed by a VMA. See the above section on non-VMA page table
> +          traversal for details on how we handle that case.
> +
>  In addition to the locks described in the terminology section above, we have
>  additional locks dedicated to page tables:
> 

The wording looks good, thanks!

Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>

-- 
An old man doll... just what I always wanted! - Clara
