From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jann Horn <jannh@google.com>
Date: Thu, 7 Nov 2024 22:15:31 +0100
Subject: Re: [PATCH] docs/mm: add VMA locks documentation
To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Andrew Morton, Jonathan Corbet, "Liam R. Howlett", Vlastimil Babka,
 Alice Ryhl, Boqun Feng, Matthew Wilcox, Mike Rapoport, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
 Suren Baghdasaryan, Hillf Danton, Qi Zheng, SeongJae Park
In-Reply-To: <20241107190137.58000-1-lorenzo.stoakes@oracle.com>
References: <20241107190137.58000-1-lorenzo.stoakes@oracle.com>
Content-Type: text/plain; charset="UTF-8"

On Thu, Nov 7,
2024 at 8:02 PM Lorenzo Stoakes wrote:
> Locking around VMAs is complicated and confusing. While we have a number of
> disparate comments scattered around the place, we seem to be reaching a
> level of complexity that justifies a serious effort at clearly documenting
> how locks are expected to be used when it comes to interacting with
> mm_struct and vm_area_struct objects.

Thanks, I think this is looking pretty good now.

> +VMA fields
> +^^^^^^^^^^
> +
> +We can subdivide :c:struct:`!struct vm_area_struct` fields by their purpose, which makes it
> +easier to explore their locking characteristics:
> +
> +.. note:: We exclude VMA lock-specific fields here to avoid confusion, as these
> +   are in effect an internal implementation detail.
> +
> +.. table:: Virtual layout fields
> +
> +   ===================== ======================================== ===========
> +   Field                 Description                              Write lock
> +   ===================== ======================================== ===========

[...]

> +   :c:member:`!vm_pgoff` Describes the page offset into the file, rmap write.
> +                         the original page offset within the      mmap write,
> +                         virtual address space (prior to any      rmap write.
> +                         :c:func:`!mremap`), or PFN if a PFN map
> +                         and the architecture does not support
> +                         :c:macro:`!CONFIG_ARCH_HAS_PTE_SPECIAL`.

Is that a typo in the rightmost column? "rmap write. mmap write, rmap write."
lists rmap twice

> +   ===================== ======================================== ===========
> +
> +These fields describes the size, start and end of the VMA, and as such cannot be

s/describes/describe/

> +modified without first being hidden from the reverse mapping since these fields
> +are used to locate VMAs within the reverse mapping interval trees.

[...]

> +.. table:: Reverse mapping fields
> +
> +   =================================== ======================================== ================
> +   Field                               Description                              Write lock
> +   =================================== ======================================== ================
> +   :c:member:`!shared.rb`              A red/black tree node used, if the       mmap write,
> +                                       mapping is file-backed, to place the VMA VMA write,
> +                                       in the                                   i_mmap write.
> +                                       :c:member:`!struct address_space->i_mmap`
> +                                       red/black interval tree.

This list of write locks is correct regarding what locks you need to
make changes to the VMA's tree membership. Technically at a lower
level, the contents of vma->shared.rb are written while holding only
the file rmap lock when the surrounding rbtree nodes change, but
that's because the rbtree basically takes ownership of the node once
it's been inserted into the tree. But I don't know how to concisely
express that here, and it's kind of a nitpicky detail, so I think the
current version looks good.
Maybe you could add "For changing the VMA's association with the
rbtree:" on top of the list of write locks for this one?

> +   :c:member:`!shared.rb_subtree_last` Metadata used for management of the
> +                                       interval tree if the VMA is file-backed. mmap write,
> +                                                                                VMA write,
> +                                                                                i_mmap write.

For this one, I think it would be more accurate to say that it is
written under just the i_mmap write lock. Though maybe that looks
kinda inconsistent with the previous one...

> +   :c:member:`!anon_vma_chain`         List of links to forked/CoW’d anon_vma   mmap read,
> +                                       objects.                                 anon_vma write.

Technically not just forked/CoW'd ones, but also the current one.
Maybe something like "List of links to anon_vma objects (including
inherited ones) that anonymous pages can be associated with"?

> +   :c:member:`!anon_vma`               :c:type:`!anon_vma` object used by       mmap_read,
> +                                       anonymous folios mapped exclusively to   page_table_lock.
> +                                       this VMA.

move_vma() uses unlink_anon_vmas() to change ->anon_vma from non-NULL
to NULL. There we hold:

- mmap lock (exclusive, from sys_mremap)
- VMA write lock (from move_vma)
- anon_vma lock (from unlink_anon_vmas)

So it's not true that we always hold the page_table_lock for this.
Should this maybe have two separate parts, one for "for changing NULL
-> non-NULL" and one for "changing non-NULL to NULL"? Where the
NULL->non-NULL scenario uses the locks you listed and non-NULL->NULL
relies on write-locking the VMA and the anon_vma?
> +   =================================== ======================================== ================
> +
> +These fields are used to both place the VMA within the reverse mapping, and for
> +anonymous mappings, to be able to access both related :c:struct:`!struct anon_vma` objects
> +and the :c:struct:`!struct anon_vma` which folios mapped exclusively to this VMA should

typo: s/which folios/in which folios/

> +reside.
> +
> +Page tables
> +-----------
> +
> +We won't speak exhaustively on the subject but broadly speaking, page tables map
> +virtual addresses to physical ones through a series of page tables, each of
> +which contain entries with physical addresses for the next page table level
> +(along with flags), and at the leaf level the physical addresses of the
> +underlying physical data pages (with offsets into these pages provided by the
> +virtual address itself).
> +
> +In Linux these are divided into five levels - PGD, P4D, PUD, PMD and PTE. Huge
> +pages might eliminate one or two of these levels, but when this is the case we
> +typically refer to the leaf level as the PTE level regardless.

(That last sentence doesn't match my headcanon but I also don't have
any reference for what is common Linux kernel phrasing around this so
this isn't really an actionable comment.)

> +.. note:: In instances where the architecture supports fewer page tables than
> +   five the kernel cleverly 'folds' page table levels, that is skips them within
> +   the logic, regardless we can act as if there were always five.
> +
> +There are three key operations typically performed on page tables:
> +
> +1. **Installing** page table mappings - whether creating a new mapping or
> +   modifying an existing one.
> +2.
> +   **Zapping/unmapping** page tables - This is what the kernel calls clearing page

bikeshedding, feel free to ignore: Maybe "Zapping/unmapping page table
entries"? At least that's how I always read "zap_pte_range()" in my
head - "zap page table entry range". Though I don't know if that's the
canonical interpretation.

> +   table mappings at the leaf level only, whilst leaving all page tables in
> +   place. This is a very common operation in the kernel performed on file
> +   truncation, the :c:macro:`!MADV_DONTNEED` operation via :c:func:`!madvise`,
> +   and others. This is performed by a number of functions including
> +   :c:func:`!unmap_mapping_range`, :c:func:`!unmap_mapping_pages` and reverse
> +   mapping logic.

[...]

> +Locking rules
> +^^^^^^^^^^^^^
> +
> +We establish basic locking rules when interacting with page tables:
> +
> +* When changing a page table entry the page table lock for that page table
> +  **must** be held.

(except, as you described below, in free_pgtables() when changing page
table entries pointing to lower-level page tables)

> +* Reads from and writes to page table entries must be appropriately atomic. See
> +  the section on atomicity below.

[...]

> +Page table installation
> +^^^^^^^^^^^^^^^^^^^^^^^
> +
> +When allocating a P4D, PUD or PMD and setting the relevant entry in the above
> +PGD, P4D or PUD, the :c:member:`!mm->page_table_lock` must be held. This is
> +acquired in :c:func:`!__p4d_alloc`, :c:func:`!__pud_alloc` and
> +:c:func:`!__pmd_alloc` respectively.
> +
> +.. note:: :c:func:`!__pmd_alloc` actually invokes :c:func:`!pud_lock` and
> +   :c:func:`!pud_lockptr` in turn, however at the time of writing it ultimately
> +   references the :c:member:`!mm->page_table_lock`.
> +
> +Allocating a PTE will either use the :c:member:`!mm->page_table_lock` or, if
> +:c:macro:`!USE_SPLIT_PMD_PTLOCKS` is defined, used a lock embedded in the PMD

typo: s/used/use/

> +physical page metadata in the form of a :c:struct:`!struct ptdesc`, acquired by
> +:c:func:`!pmd_ptdesc` called from :c:func:`!pmd_lock` and ultimately
> +:c:func:`!__pte_alloc`.
> +
> +Finally, modifying the contents of the PTE has special treatment, as this is a

nit: unclear what "this" refers to here - it looks like it refers to
"the PTE", but "the PTE is a lock" wouldn't make grammatical sense

> +lock that we must acquire whenever we want stable and exclusive access to
> +entries pointing to data pages within a PTE, especially when we wish to modify
> +them.

I don't think "entries pointing to data pages" need any more locking
than other entries, like swap entries or migration markers?

> +This is performed via :c:func:`!pte_offset_map_lock` which carefully checks to
> +ensure that the PTE hasn't changed from under us, ultimately invoking
> +:c:func:`!pte_lockptr` to obtain a spin lock at PTE granularity contained within
> +the :c:struct:`!struct ptdesc` associated with the physical PTE page. The lock
> +must be released via :c:func:`!pte_unmap_unlock`.
> +
> +.. note:: There are some variants on this, such as
> +   :c:func:`!pte_offset_map_rw_nolock` when we know we hold the PTE stable but
> +   for brevity we do not explore this. See the comment for
> +   :c:func:`!__pte_offset_map_lock` for more details.
> +
> +When modifying data in ranges we typically only wish to allocate higher page
> +tables as necessary, using these locks to avoid races or overwriting anything,
> +and set/clear data at the PTE level as required (for instance when page faulting
> +or zapping).

[...]

> +Page table moving
> +^^^^^^^^^^^^^^^^^
> +
> +Some functions manipulate page table levels above PMD (that is PUD, P4D and PGD
> +page tables).
> +Most notable of these is :c:func:`!mremap`, which is capable of
> +moving higher level page tables.
> +
> +In these instances, it is either required that **all** locks are taken, that is
> +the mmap lock, the VMA lock and the relevant rmap lock, or that the mmap lock
> +and VMA locks are taken and some other measure is taken to avoid rmap races (see
> +the comment in :c:func:`!move_ptes` in the :c:func:`!mremap` implementation for
> +details of how this is handled in this instance).

mremap() always takes the rmap locks when moving entire page tables,
and AFAIK that is necessary to avoid races that lead to TLB flushes
going to the wrong address. mremap() sometimes moves *leaf entries*
without holding rmap locks, but never entire tables. move_pgt_entry()
is confusingly written - need_rmap_locks is actually always true in
the NORMAL_* cases that move non-leaf entries.

> +You can observe that in the :c:func:`!mremap` implementation in the functions
> +:c:func:`!take_rmap_locks` and :c:func:`!drop_rmap_locks` which perform the rmap
> +side of lock acquisition, invoked ultimately by :c:func:`!move_page_tables`.
> +
> +VMA lock internals
> +------------------
> +
> +This kind of locking is entirely optimistic - if the lock is contended or a
> +competing write has started, then we do not obtain a read lock.
> +
> +The :c:func:`!lock_vma_under_rcu` function first calls :c:func:`!rcu_read_lock`
> +to ensure that the VMA is acquired in an RCU critical section, then attempts to

Maybe s/is acquired in/is looked up in/, to make it clearer that
you're talking about a VMA lookup?

> +VMA lock it via :c:func:`!vma_start_read`, before releasing the RCU lock via
> +:c:func:`!rcu_read_unlock`.
> +
> +VMA read locks hold the read lock on the :c:member:`!vma->vm_lock` semaphore for
> +their duration and the caller of :c:func:`!lock_vma_under_rcu` must release it
> +via :c:func:`!vma_end_read`.
> +
> +VMA **write** locks are acquired via :c:func:`!vma_start_write` in instances where a
> +VMA is about to be modified, unlike :c:func:`!vma_start_read` the lock is always
> +acquired. An mmap write lock **must** be held for the duration of the VMA write
> +lock, releasing or downgrading the mmap write lock also releases the VMA write
> +lock so there is no :c:func:`!vma_end_write` function.
> +
> +Note that a semaphore write lock is not held across a VMA lock. Rather, a
> +sequence number is used for serialisation, and the write semaphore is only
> +acquired at the point of write lock to update this.
> +
> +This ensures the semantics we require - VMA write locks provide exclusive write
> +access to the VMA.
> +
> +The VMA lock mechanism is designed to be a lightweight means of avoiding the use
> +of the heavily contended mmap lock. It is implemented using a combination of a
> +read/write semaphore and sequence numbers belonging to the containing
> +:c:struct:`!struct mm_struct` and the VMA.
> +
> +Read locks are acquired via :c:func:`!vma_start_read`, which is an optimistic
> +operation, i.e. it tries to acquire a read lock but returns false if it is
> +unable to do so. At the end of the read operation, :c:func:`!vma_end_read` is
> +called to release the VMA read lock. This can be done under RCU alone.

Please clarify what "This" refers to, and whether the part about RCU
is explaining an implementation detail or the API contract.

> +
> +Writing requires the mmap to be write-locked and the VMA lock to be acquired via
> +:c:func:`!vma_start_write`, however the write lock is released by the termination or
> +downgrade of the mmap write lock so no :c:func:`!vma_end_write` is required.
> +
> +All this is achieved by the use of per-mm and per-VMA sequence counts, which are
> +used in order to reduce complexity, especially for operations which write-lock
> +multiple VMAs at once.
> +
> +If the mm sequence count, :c:member:`!mm->mm_lock_seq` is equal to the VMA
> +sequence count :c:member:`!vma->vm_lock_seq` then the VMA is write-locked. If
> +they differ, then they are not.

nit: "it is not"?

> +
> +Each time an mmap write lock is acquired in :c:func:`!mmap_write_lock`,
> +:c:func:`!mmap_write_lock_nested`, :c:func:`!mmap_write_lock_killable`, the
> +:c:member:`!mm->mm_lock_seq` sequence number is incremented via
> +:c:func:`!mm_lock_seqcount_begin`.
> +
> +Each time the mmap write lock is released in :c:func:`!mmap_write_unlock` or
> +:c:func:`!mmap_write_downgrade`, :c:func:`!vma_end_write_all` is invoked which
> +also increments :c:member:`!mm->mm_lock_seq` via
> +:c:func:`!mm_lock_seqcount_end`.
> +
> +This way, we ensure regardless of the VMA's sequence number count, that a write
> +lock is not incorrectly indicated (since we increment the sequence counter on
> +acquiring the mmap write lock, which is required in order to obtain a VMA write
> +lock), and that when we release an mmap write lock, we efficiently release
> +**all** VMA write locks contained within the mmap at the same time.

Incrementing on mmap_write_lock() is not necessary for VMA locks; that
part is for future seqlock-style users of the MM sequence count that
want to work without even taking the VMA lock, with the new
mmap_lock_speculation_{begin,end} API. See commit db8f33d3b7698 and
the thread
https://lore.kernel.org/linux-mm/20241010205644.3831427-5-andrii@kernel.org/
.