linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Jann Horn <jannh@google.com>
To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Jonathan Corbet <corbet@lwn.net>,
	 "Liam R . Howlett" <Liam.Howlett@oracle.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	 Alice Ryhl <aliceryhl@google.com>,
	Boqun Feng <boqun.feng@gmail.com>,
	 Matthew Wilcox <willy@infradead.org>,
	Mike Rapoport <rppt@kernel.org>,
	linux-mm@kvack.org,  linux-kernel@vger.kernel.org,
	linux-doc@vger.kernel.org,
	 Suren Baghdasaryan <surenb@google.com>,
	Hillf Danton <hdanton@sina.com>,
	 Qi Zheng <zhengqi.arch@bytedance.com>,
	SeongJae Park <sj@kernel.org>,
	 Bagas Sanjaya <bagasdotme@gmail.com>
Subject: Re: [PATCH v2] docs/mm: add VMA locks documentation
Date: Wed, 13 Nov 2024 20:46:39 +0100	[thread overview]
Message-ID: <CAG48ez2=oP15_X_MqtDG22P6TZYpTr07-TZBk3Z_DvuwB6nJFQ@mail.gmail.com> (raw)
In-Reply-To: <20241108135708.48567-1-lorenzo.stoakes@oracle.com>

On Fri, Nov 8, 2024 at 2:57 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
> Locking around VMAs is complicated and confusing. While we have a number of
> disparate comments scattered around the place, we seem to be reaching a
> level of complexity that justifies a serious effort at clearly documenting
> how locks are expected to be used when it comes to interacting with
> mm_struct and vm_area_struct objects.
>
> This is especially pertinent as regards the efforts to find sensible
> abstractions for these fundamental objects in kernel rust code whose
> compiler strictly requires some means of expressing these rules (and
> through this expression, self-document these requirements as well as
> enforce them).
>
> The document limits scope to mmap and VMA locks and those that are
> immediately adjacent and relevant to them - so additionally covers page
> table locking as this is so very closely tied to VMA operations (and relies
> upon us handling these correctly).
>
> The document tries to cover some of the nastier and more confusing edge
> cases and concerns especially around lock ordering and page table teardown.
>
> The document is split between generally useful information for users of mm
> interfaces, and separately a section intended for mm kernel developers
> providing a discussion around internal implementation details.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Reviewed-by: Jann Horn <jannh@google.com>

except some typos and one inaccuracy:

> +* **mmap locks** - Each MM has a read/write semaphore :c:member:`!mmap_lock`
> +  which locks at a process address space granularity which can be acquired via
> +  :c:func:`!mmap_read_lock`, :c:func:`!mmap_write_lock` and variants.
> +* **VMA locks** - The VMA lock is at VMA granularity (of course) which behaves
> +  as a read/write semaphore in practice. A VMA read lock is obtained via
> +  :c:func:`!lock_vma_under_rcu` (and unlocked via :c:func:`!vma_end_read`) and a
> +  write lock via :c:func:`!vma_start_write` (all VMA write locks are unlocked
> +  automatically when the mmap write lock is released). To take a VMA write lock
> +  you **must** have already acquired an :c:func:`!mmap_write_lock`.
> +* **rmap locks** - When trying to access VMAs through the reverse mapping via a
> +  :c:struct:`!struct address_space` or :c:struct:`!struct anon_vma` object
> +  (reachable from a folio via :c:member:`!folio->mapping`) VMAs must be stabilised via

missing dot between sentences?

> +These fields describes the size, start and end of the VMA, and as such cannot be
> +modified without first being hidden from the reverse mapping since these fields
> +are used to locate VMAs within the reverse mapping interval trees.

still a typo here, "these fields describes"

> +.. note:: In instances where the architecture supports fewer page tables than
> +         five the kernel cleverly 'folds' page table levels, that is stubbing
> +         out functions related to the skipped levels. This allows us to
> +         conceptually act is if there were always five levels, even if the

typo: s/is if/as if/

> +1. **Traversing** page tables - Simply reading page tables in order to traverse
> +   them. This only requires that the VMA is kept stable, so a lock which
> +   establishes this suffices for traversal (there are also lockless variants
> +   which eliminate even this requirement, such as :c:func:`!gup_fast`).
> +2. **Installing** page table mappings - Whether creating a new mapping or
> +   modifying an existing one. This requires that the VMA is kept stable via an
> +   mmap or VMA lock (explicitly not rmap locks).

Arguably clearing A/D bits through the rmap, and switching PTEs
between present entries and migration entries, counts as "modifying an
existing one", but the locking for that is like for zapping/unmapping
(both page_idle_clear_pte_refs and try_to_migrate go through the
rmap). So "modifying an existing one" either needs more caveats or
more likely should just be moved to point three.

> +3. **Zapping/unmapping** page table entries - This is what the kernel calls
> +   clearing page table mappings at the leaf level only, whilst leaving all page
> +   tables in place. This is a very common operation in the kernel performed on
> +   file truncation, the :c:macro:`!MADV_DONTNEED` operation via
> +   :c:func:`!madvise`, and others. This is performed by a number of functions
> +   including :c:func:`!unmap_mapping_range` and :c:func:`!unmap_mapping_pages`
> +   among others. The VMA need only be kept stable for this operation.
> +4. **Freeing** page tables - When finally the kernel removes page tables from a
> +   userland process (typically via :c:func:`!free_pgtables`) extreme care must
> +   be taken to ensure this is done safely, as this logic finally frees all page
> +   tables in the specified range, ignoring existing leaf entries (it assumes the
> +   caller has both zapped the range and prevented any further faults or
> +   modifications within it).
> +
> +**Traversing** and **zapping** ranges can be performed holding any one of the
> +locks described in the terminology section above - that is the mmap lock, the
> +VMA lock or either of the reverse mapping locks.
> +
> +That is - as long as you keep the relevant VMA **stable** - you are good to go
> +ahead and perform these operations on page tables (though internally, kernel
> +operations that perform writes also acquire internal page table locks to
> +serialise - see the page table implementation detail section for more details).
> +
> +When **installing** page table entries, the mmap or VMA lock mut be held to keep

typo: "must be held"

> +When performing a page table traversal and keeping the VMA stable, whether a
> +read must be performed once and only once or not depends on the architecture
> +(for instance x86-64 does not require any special precautions).

Nitpick: In theory that'd still be a data race with other threads
concurrently installing page tables, though in practice compilers
don't seem to do anything bad with stuff like that.

> +A typical pattern taken when traversing page table entries to install a new
> +mapping is to optimistically determine whether the page table entry in the table
> +above is empty, if so, only then acquiring the page table lock and checking
> +again to see if it was allocated underneath is.

s/ is// ?

> +This is why :c:func:`!__pte_offset_map_lock` locklessly retrieves the PMD entry
> +for the PTE, carefully checking it is as expected, before acquiring the
> +PTE-specific lock, and then *again* checking that the PMD lock is as expected.

s/PMD lock is/PMD entry is/ ?

> +In these instances, it is required that **all** locks are taken, that is
> +the mmap lock, the VMA lock and the relevant rmap lock.

s/rmap lock/rmap locks/

> +VMA read locking is entirely optimistic - if the lock is contended or a competing
> +write has started, then we do not obtain a read lock.
> +
> +A VMA **read** lock is obtained by :c:func:`!lock_vma_under_rcu` function, which

"is obtained by lock_vma_under_rcu function" sounds weird, maybe
either "is obtained by lock_vma_under_rcu" or "is obtained by the
lock_vma_under_rcu function"?


  parent reply	other threads:[~2024-11-13 19:47 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-11-08 13:57 Lorenzo Stoakes
2024-11-09 16:01 ` Lorenzo Stoakes
2024-11-11  7:33 ` Mike Rapoport
2024-11-11  8:39 ` Qi Zheng
2024-11-12  5:39 ` Bagas Sanjaya
2024-11-12 15:15 ` Liam R. Howlett
2024-11-13 15:44   ` Lorenzo Stoakes
2024-11-13 17:43     ` Jann Horn
2024-11-14 20:45   ` Lorenzo Stoakes
2024-11-13 19:46 ` Jann Horn [this message]
2024-11-14 20:23   ` Lorenzo Stoakes

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAG48ez2=oP15_X_MqtDG22P6TZYpTr07-TZBk3Z_DvuwB6nJFQ@mail.gmail.com' \
    --to=jannh@google.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=aliceryhl@google.com \
    --cc=bagasdotme@gmail.com \
    --cc=boqun.feng@gmail.com \
    --cc=corbet@lwn.net \
    --cc=hdanton@sina.com \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=rppt@kernel.org \
    --cc=sj@kernel.org \
    --cc=surenb@google.com \
    --cc=vbabka@suse.cz \
    --cc=willy@infradead.org \
    --cc=zhengqi.arch@bytedance.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox