From: Suren Baghdasaryan
Date: Mon, 4 Nov 2024 09:01:46 -0800
Subject: Re: [RFC PATCH] docs/mm: add VMA locks documentation
In-Reply-To: <20241101185033.131880-1-lorenzo.stoakes@oracle.com>
References: <20241101185033.131880-1-lorenzo.stoakes@oracle.com>
To: Lorenzo Stoakes
Cc: Jonathan Corbet, Andrew Morton, Liam R.
Howlett" , Vlastimil Babka , Jann Horn , Alice Ryhl , Boqun Feng , Matthew Wilcox , Mike Rapoport , linux-mm@kvack.org, linux-kernel@vger.kernel.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Server: rspam09 X-Rspamd-Queue-Id: 1119312000E X-Stat-Signature: iigp8simbzcmbyg9zrqxsfb8defbo3aq X-Rspam-User: X-HE-Tag: 1730739676-338564 X-HE-Meta: U2FsdGVkX1953PQc+eUbKOPdeQowUGPJJ0XwAfY7q5//sM+pdYGMaScfR1jiy/VTBhqwzWmenYmpmrBAtW0PvI5s1/z73Qs87j5KW2d2b4KBnUvu3GAvq88RR66cQKTGhSZSJZmwxEfKZzjm3dTKt5PzMRZDnF2t6pdBHguGzkoBKFrn3dXRDcC0odRTrZezRyDTqAwoALUnXbkEX6giClRA1GwU0Z0BTTxnAULeXKE5kB/YXl7+g/8z1/viIBYcCnJp8DoLv0BEFFoRhGN6Q39ixiAorqpw8i006h3F9tVod1zGllXpUMVGaQut/KntPvTuK+JdfiYFED1kDIO13x5mLc+87ZDOR0rSCQZgoH0eyOiHnXW7Ra0h/5FLUTzeZmr6DjhV8WsaL7ajeiF5kKVYha4S1SGpiUCkv9j6n1U/aJT0Pd3oIG4bELJ7iVyaO9wgfcoWfIMMd2N+t92CsGr9Mic5drPWNseY64t+oQl/sOKsW+F/9LnBBXmmrlsBVVhy8UHUiSjFsXI3PrUb4j9EW95m8uuhRIEnQbdp5qeVSD45ufU4n3UnpHnSDeRWgicIMXvCLg+7fo2Z2xImOeEttFgmPLNQTTwkBl+WHV7FdRAgiylQEPLP0uAGYFiZne0BUnzy2HazOADkr51adM68wFs3FtVxZ6dbRmPGiH9O405eXIVxsbF6bG4GqH3BhQ/yZ39BQOhayth9kTQlurABiutHVSTSv2tMP4jynlT2r2L7yz5zYyvndXMa7m3LPF3DQfhaLF63Jix68xs2WZqRp0Dol/IipSTxEmsu/1MzA/21Pm3lQVP2TBhP8wZ03QwwzQ7ABxgZPwZzwHDCkPmowe4j4WgJjB6zz+6dX5TImAWUaRmOvFTieZAGSnu2EGdPwiGPJ0aZJYdNdd0t5GIEQ7Adp2w8tGw5AwbCV6+tI9HL4mFh1paqosXy9OZPwnn0bs80YFSOamia8wt wQQ9RZhP m5r4Wvx5MxewCFGRw/EjsSPcKA83+dQ6hZYh4HKEegysN/WGRF40TS+25TpGBrUIJ6w0+g6mx19t4ijIEwSoqb7Ux3oA0mcqPfxoYgjgK91qCJFAF+MdIW2SpsVW3C7U2lMUqqfqztpBr/bgmNxBNSljZlUi3l+1WZjInorPJj5G6fzU32aDF2PZz0tSb7/GDpP72oP6JGD2AtrORyvrPUOFq9i9il/NXYl5AtjbBaBNBoO8Sz6Ce1VqYt3amBA/7sM3iQe3Tb/5V8YVKVZ5OJrkUWnhge4aLjg6lGnINjY7Lcz5ZnoKvzC+OjC35LSHvCaXG7l77WZsdDL6/A1RtWFOBGjoSD0PqMQYmMGOfjPV21Q6EQfNYsmZePyr9ZZEx+aSth+OhgGpZNxcb0aCycveFXA== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Fri, Nov 1, 2024 at 11:51=E2=80=AFAM Lorenzo Stoakes wrote: > > Locking around VMAs is complicated and confusing. While we have a number = of > disparate comments scattered around the place, we seem to be reaching a > level of complexity that justifies a serious effort at clearly documentin= g > how locks are expected to be interacted with when it comes to interacting > with mm_struct and vm_area_struct objects. > > This is especially pertinent as regards efforts to find sensible > abstractions for these fundamental objects within the kernel rust > abstraction whose compiler strictly requires some means of expressing the= se > rules (and through this expression can help self-document these > requirements as well as enforce them which is an exciting concept). > > The document limits scope to mmap and VMA locks and those that are > immediately adjacent and relevant to them - so additionally covers page > table locking as this is so very closely tied to VMA operations (and reli= es > upon us handling these correctly). > > The document tries to cover some of the nastier and more confusing edge > cases and concerns especially around lock ordering and page table teardow= n. > > The document also provides some VMA lock internals, which are up to date > and inclusive of recent changes to recent sequence number changes. > > Signed-off-by: Lorenzo Stoakes Thanks for documenting this, Lorenzo! Just heads-up, I'm working on changing some of the implementation details (removing vma->detached, moving vm_lock into vm_area_struct, etc.). 
I should be able to post those changes sometime later this week if
testing does not reveal any issues.

> ---
>
> REVIEWERS NOTES:
>    You can speed up doc builds by running `make SPHINXDIRS=mm htmldocs`. I
>    also uploaded a copy of this to my website at
>    https://ljs.io/output/mm/vma_locks to make it easier to have a quick
>    read through.

Thanks!

>
>  Documentation/mm/index.rst     |   1 +
>  Documentation/mm/vma_locks.rst | 527 +++++++++++++++++++++++++++++++++
>  2 files changed, 528 insertions(+)
>  create mode 100644 Documentation/mm/vma_locks.rst
>
> diff --git a/Documentation/mm/index.rst b/Documentation/mm/index.rst
> index 0be1c7503a01..da5f30acaca5 100644
> --- a/Documentation/mm/index.rst
> +++ b/Documentation/mm/index.rst
> @@ -64,3 +64,4 @@ documentation, or deleted if it has served its purpose.
>     vmemmap_dedup
>     z3fold
>     zsmalloc
> +   vma_locks
> diff --git a/Documentation/mm/vma_locks.rst b/Documentation/mm/vma_locks.rst
> new file mode 100644
> index 000000000000..52b9d484376a
> --- /dev/null
> +++ b/Documentation/mm/vma_locks.rst
> @@ -0,0 +1,527 @@
> +VMA Locking
> +===========
> +
> +Overview
> +--------
> +
> +Userland memory ranges are tracked by the kernel via Virtual Memory Areas or
> +'VMAs', each of which is described by a `struct vm_area_struct` object.
> +
> +Each VMA describes a virtually contiguous memory range with identical
> +attributes. Userland access outside of VMAs is invalid except in the case where an
> +adjacent stack VMA could be extended to contain the accessed address.
> +
> +All VMAs are contained within one and only one virtual address space, described
> +by a `struct mm_struct` object which is referenced by all tasks (that is,
> +threads) which share the virtual address space. We refer to this as the `mm`.
> +
> +Each mm object contains a maple tree data structure which describes all VMAs
> +within the virtual address space.
> +
> +The kernel is designed to be highly scalable against concurrent access to
> +userland memory, so a complicated set of locks is required to ensure no data
> +races or memory corruption occur.
> +
> +This document explores this locking in detail.
> +
> +.. note::
> +
> +   There are three different things that a user might want to achieve via
> +   locks - the first of which is **stability**. That is - ensuring that the VMA
> +   won't be freed or modified in any way from underneath us.
> +
> +   All MM and VMA locks ensure stability.
> +
> +   Secondly we have locks which allow **reads** but not writes (and which might
> +   be held concurrently with other CPUs which also hold the read lock).
> +
> +   Finally, we have locks which permit exclusive access to the VMA to allow for
> +   **writes** to the VMA.
> +
> +MM and VMA locks
> +----------------
> +
> +There are two key classes of lock utilised when reading and manipulating VMAs -
> +the `mmap_lock`, which is a read/write semaphore maintained at the `mm_struct`
> +level of granularity, and, if CONFIG_PER_VMA_LOCK is set, a per-VMA lock at the
> +VMA level of granularity.
> +
> +.. note::
> +
> +   Generally speaking, a read/write semaphore is a class of lock which permits
> +   concurrent readers. However a write lock can only be obtained once all
> +   readers have left the critical region (and pending readers are made to wait).
> +
> +   This renders read locks on a read/write semaphore concurrent with other
> +   readers and write locks exclusive against all others holding the semaphore.
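
Since this is the first place readers meet these locks, a tiny usage
sketch might help here. Something like the below (illustrative only -
given a `struct mm_struct *mm`, error handling omitted):

	struct vm_area_struct *vma;
	VMA_ITERATOR(vmi, mm, 0);

	/* Read side: many tasks may hold the mmap read lock at once. */
	mmap_read_lock(mm);
	for_each_vma(vmi, vma) {
		/* VMA fields may be read here, but not modified. */
	}
	mmap_read_unlock(mm);

	/* Write side: excludes all readers and other writers. */
	mmap_write_lock(mm);
	/* VMAs may be modified here (with VMA write locks, see below). */
	mmap_write_unlock(mm);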
> +
> +If CONFIG_PER_VMA_LOCK is not set, then things are relatively simple - a write
> +mmap lock gives you exclusive write access to a VMA, and a read lock gives you
> +concurrent read-only access.
> +
> +In the presence of CONFIG_PER_VMA_LOCK, i.e. VMA locks, things are more
> +complicated. In this instance, a write semaphore is no longer enough to gain
> +exclusive access to a VMA, a VMA write lock is also required.

I think "exclusive access to a VMA" should be "exclusive access to mm"
if you are talking about mmap_lock.

I think it's worth adding here:
1. to take a VMA write-lock you need to be holding an mmap_lock;
2. write-unlocking mmap_lock drops all VMA write locks in that mm.

I see that you touch on this in the below "Note" section but that's
not an implementation detail but the designed behavior, so I think
these should not be mere side-notes.

> +
> +The VMA lock is implemented via the use of both a read/write semaphore and
> +per-VMA and per-mm sequence numbers. We go into detail on this in the VMA lock
> +internals section below, so for the time being it is important only to note that
> +we can obtain either a VMA read or write lock.
> +
> +.. note::
> +
> +   VMAs under VMA **read** lock are obtained by the `lock_vma_under_rcu()`
> +   function, and **no** existing mmap or VMA lock must be held. This function

"no existing mmap or VMA lock must be held" did you mean to say "no
exclusive mmap or VMA locks must be held"? Because one can certainly
hold a read-lock on them.

> +   either returns a read-locked VMA, or NULL if the lock could not be
> +   acquired. As the name suggests, the VMA will be acquired under RCU, though
> +   once obtained, it remains stable.
> +
> +   This kind of locking is entirely optimistic - if the lock is contended or a
> +   competing write has started, then we do not obtain a read lock.
> +
> +   The `lock_vma_under_rcu()` function first calls `rcu_read_lock()` to ensure
> +   that the VMA is acquired in an RCU critical section, then attempts to take the
> +   VMA read lock via `vma_start_read()`, before releasing the RCU lock via
> +   `rcu_read_unlock()`.
> +
> +   VMA read locks hold a read lock on the `vma->vm_lock` semaphore for their
> +   duration and the caller of `lock_vma_under_rcu()` must release it via
> +   `vma_end_read()`.
> +
> +   VMA **write** locks are acquired via `vma_start_write()` in instances where a
> +   VMA is about to be modified; unlike `vma_start_read()`, the lock is always
> +   acquired. An mmap write lock **must** be held for the duration of the VMA
> +   write lock, and releasing or downgrading the mmap write lock also releases the
> +   VMA write lock, so there is no `vma_end_write()` function.
> +
> +   Note that a semaphore write lock is not held across a VMA lock. Rather, a
> +   sequence number is used for serialisation, and the write semaphore is only
> +   acquired at the point of write lock to update this (we explore this in detail
> +   in the VMA lock internals section below).
> +
> +   This ensures the semantics we require - VMA write locks provide exclusive
> +   write access to the VMA.
> +
> +Examining all valid lock states and what each implies:
> +
> +.. list-table::
> +   :header-rows: 1
> +
> +   * - mmap lock
> +     - VMA lock
> +     - Stable?
> +     - Can read safely?
> +     - Can write safely?
> +   * - \-
> +     - \-
> +     - N
> +     - N
> +     - N
> +   * - R
> +     - \-
> +     - Y
> +     - Y
> +     - N
> +   * - \-
> +     - R
> +     - Y
> +     - Y
> +     - N
> +   * - W
> +     - \-
> +     - Y
> +     - Y
> +     - N
> +   * - W
> +     - W
> +     - Y
> +     - Y
> +     - Y
> +
> +Note that there are some exceptions to this - the `anon_vma` field is permitted
> +to be written to under mmap read lock and is instead serialised by the `struct
> +mm_struct` field `page_table_lock`. In addition, the `vm_mm` and all
> +lock-specific fields are permitted to be read under RCU alone (though stability
> +cannot be expected in this instance).
> +
> +.. note::
> +   The most notable place to use the VMA read lock is on page table faults on
> +   the x86-64 architecture, which importantly means that without a VMA write

As Jann mentioned, CONFIG_PER_VMA_LOCK is supported on many more
architectures.

> +   lock, page faults can race against you even if you hold an mmap write lock.
> +
> +VMA Fields
> +----------
> +
> +We examine each field of the `struct vm_area_struct` type in detail in the table
> +below.
> +
> +Reading of each field requires either an mmap read lock or a VMA read lock to be
> +held, except where 'unstable RCU read' is specified, in which case unstable
> +access to the field is permitted under RCU alone.
> +
> +The table specifies which write locks must be held to write to the field.
> +
> +.. list-table::
> +   :widths: 20 10 22 5 20
> +   :header-rows: 1
> +
> +   * - Field
> +     - Config
> +     - Description
> +     - Unstable RCU read?
> +     - Write Lock
> +   * - vm_start
> +     -
> +     - Inclusive start virtual address of range VMA describes.
> +     -
> +     - mmap write, VMA write
> +   * - vm_end
> +     -
> +     - Exclusive end virtual address of range VMA describes.
> +     -
> +     - mmap write, VMA write
> +   * - vm_rcu
> +     - vma lock
> +     - RCU list head, in union with vm_start, vm_end. RCU implementation detail.
> +     - N/A
> +     - N/A
> +   * - vm_mm
> +     -
> +     - Containing mm_struct.
> +     - Y
> +     - (Static)
> +   * - vm_page_prot
> +     -
> +     - Architecture-specific page table protection bits determined from VMA
> +       flags.
> +     -
> +     - mmap write, VMA write
> +   * - vm_flags
> +     -
> +     - Read-only access to VMA flags describing attributes of VMA, in union with
> +       private writable `__vm_flags`.
> +     -
> +     - N/A
> +   * - __vm_flags
> +     -
> +     - Private, writable access to VMA flags field, updated by vm_flags_*()
> +       functions.
> +     -
> +     - mmap write, VMA write
> +   * - detached
> +     - vma lock
> +     - VMA lock implementation detail - indicates whether the VMA has been
> +       detached from the tree.
> +     - Y
> +     - mmap write, VMA write
> +   * - vm_lock_seq
> +     - vma lock
> +     - VMA lock implementation detail - A sequence number used to serialise the
> +       VMA lock, see the VMA lock section below.
> +     - Y
> +     - mmap write, VMA write

It's a bit weird to state that VMA write-lock is required when talking
about vm_lock_seq/vm_lock themselves being parts of that lock. I would
simply say N/A for both of them since they should not be modified
directly.

> +   * - vm_lock
> +     - vma lock
> +     - VMA lock implementation detail - A pointer to the VMA lock read/write
> +       semaphore.
> +     - Y
> +     - None required
> +   * - shared.rb
> +     -
> +     - A red/black tree node used, if the mapping is file-backed, to place the
> +       VMA in the `struct address_space->i_mmap` red/black interval tree.
> +     -
> +     - mmap write, VMA write, i_mmap write
> +   * - shared.rb_subtree_last
> +     -
> +     - Metadata used for management of the interval tree if the VMA is
> +       file-backed.
> +     -
> +     - mmap write, VMA write, i_mmap write
> +   * - anon_vma_chain
> +     -
> +     - List of links to forked/CoW'd `anon_vma` objects.
> +     -
> +     - mmap read or above, anon_vma write lock

nit: I would spell it out for clarity: mmap read or write

> +   * - anon_vma
> +     -
> +     - `anon_vma` object used by anonymous folios mapped exclusively to this VMA.
> +     -
> +     - mmap read or above, page_table_lock
> +   * - vm_ops
> +     -
> +     - If the VMA is file-backed, then either the driver or file-system provides
> +       a `struct vm_operations_struct` object describing callbacks to be invoked
> +       on specific VMA lifetime events.
> +     -
> +     - (Static)
> +   * - vm_pgoff
> +     -
> +     - Describes the page offset into the file, the original page offset within
> +       the virtual address space (prior to any `mremap()`), or PFN if a PFN map.
> +     -
> +     - mmap write, VMA write
> +   * - vm_file
> +     -
> +     - If the VMA is file-backed, points to a `struct file` object describing
> +       the underlying file, if anonymous then `NULL`.
> +     -
> +     - (Static)
> +   * - vm_private_data
> +     -
> +     - A `void *` field for driver-specific metadata.
> +     -
> +     - Driver-mandated.
> +   * - anon_name
> +     - anon name
> +     - A field for storing a `struct anon_vma_name` object providing a name for
> +       anonymous mappings, or `NULL` if none is set or the VMA is file-backed.
> +     -
> +     - mmap write, VMA write
> +   * - swap_readahead_info
> +     - swap
> +     - Metadata used by the swap mechanism to perform readahead.
> +     -
> +     - mmap read
> +   * - vm_region
> +     - nommu
> +     - The containing region for the VMA for architectures which do not
> +       possess an MMU.
> +     - N/A
> +     - N/A
> +   * - vm_policy
> +     - numa
> +     - `mempolicy` object which describes NUMA behaviour of the VMA.
> +     -
> +     - mmap write, VMA write
> +   * - numab_state
> +     - numab
> +     - `vma_numab_state` object which describes the current state of NUMA
> +       balancing in relation to this VMA.
> +     -
> +     - mmap write, VMA write
> +   * - vm_userfaultfd_ctx
> +     -
> +     - Userfaultfd context wrapper object of type `vm_userfaultfd_ctx`, either
> +       of zero size if userfaultfd is disabled, or containing a pointer to an
> +       underlying `userfaultfd_ctx` object which describes userfaultfd metadata.
> +     -
> +     - mmap write, VMA write
> +
> +.. note::
> +
> +   In the config column, 'vma lock' means CONFIG_PER_VMA_LOCK,
> +   'anon name' means CONFIG_ANON_VMA_NAME, 'swap' means CONFIG_SWAP, 'nommu'
> +   means that CONFIG_MMU is not set, 'numa' means CONFIG_NUMA and 'numab' means
> +   CONFIG_NUMA_BALANCING.
> +
> +   In the write lock column, '(Static)' means that the field is set only once
> +   upon initialisation of the VMA and not changed after this; the VMA would
> +   either have been under an mmap write and VMA write lock at the time or not
> +   yet inserted into any tree.
> +
> +Page table locks
> +----------------
> +
> +When allocating a P4D, PUD or PMD and setting the relevant entry in the above
> +PGD, P4D or PUD, the `mm->page_table_lock` is acquired to do so. This is
> +acquired in `__p4d_alloc()`, `__pud_alloc()` and `__pmd_alloc()` respectively.
> +
> +.. note::
> +   `__pmd_alloc()` actually invokes `pud_lock()` and `pud_lockptr()` in turn,
> +   however at the time of writing it ultimately references the
> +   `mm->page_table_lock`.
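
It might be worth making the "allocate, then install under the lock"
pattern concrete here. A rough sketch loosely modelled on __pud_alloc()
(simplified - the accounting and memory-barrier details are omitted):

	int example_pud_alloc(struct mm_struct *mm, p4d_t *p4d,
			      unsigned long address)
	{
		pud_t *new = pud_alloc_one(mm, address);

		if (!new)
			return -ENOMEM;

		spin_lock(&mm->page_table_lock);
		if (!p4d_present(*p4d))
			p4d_populate(mm, p4d, new); /* install the new PUD page */
		else
			pud_free(mm, new); /* raced with another thread; discard */
		spin_unlock(&mm->page_table_lock);
		return 0;
	}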
> +
> +Allocating a PTE will either use the `mm->page_table_lock` or, if
> +`USE_SPLIT_PMD_PTLOCKS` is defined, use a lock embedded in the PMD physical
> +page metadata in the form of a `struct ptdesc`, acquired by `pmd_ptdesc()`
> +called from `pmd_lock()` and ultimately `__pte_alloc()`.
> +
> +Finally, modifying the contents of PTEs has special treatment: a lock must be
> +acquired whenever we want stable and exclusive access to entries pointing to
> +data pages within a PTE page, especially when we wish to modify them.
> +
> +This is performed via `pte_offset_map_lock()`, which carefully checks to ensure
> +that the PTE hasn't changed from under us, ultimately invoking `pte_lockptr()`
> +to obtain a spin lock at PTE granularity contained within the `struct ptdesc`
> +associated with the physical PTE page. The lock must be released via
> +`pte_unmap_unlock()`.
> +
> +.. note::
> +   There are some variants on this, such as `pte_offset_map_rw_nolock()` for
> +   when we know we hold the PTE stable, but for brevity we do not explore this.
> +   See the comment for `__pte_offset_map_lock()` for more details.
> +
> +When modifying data in ranges we typically only wish to allocate higher page
> +tables as necessary, using these locks to avoid races or overwriting anything,
> +and set/clear data at the PTE level as required (for instance when page faulting
> +or zapping).
> +
> +Page table teardown
> +-------------------
> +
> +Tearing down page tables themselves is something that requires significant
> +care. There must be no way that page tables designated for removal can be
> +traversed or referenced by concurrent tasks.
> +
> +It is insufficient to simply hold an mmap write lock and VMA lock (which will
> +prevent racing faults and rmap operations), as a file-backed mapping can be
> +truncated under the `struct address_space` i_mmap_lock alone.
> +
> +As a result, no VMA which can be accessed via the reverse mapping (either
> +anon_vma or the `struct address_space->i_mmap` interval tree) can have its page
> +tables torn down.
> +
> +The operation is typically performed via `free_pgtables()`, which assumes either
> +the mmap write lock has been taken (as specified by its `mm_wr_locked`
> +parameter), or that the VMA is fully detached.
> +
> +It carefully removes the VMA from all reverse mappings; however, it is important
> +that no new ones overlap these and that no route remains to permit access to
> +addresses within the range whose page tables are being torn down.
> +
> +As a result of these careful conditions, note that page table entries are
> +cleared without page table locks, as it is assumed that all of these precautions
> +have already been taken.
> +
> +mmap write lock downgrading
> +---------------------------
> +
> +While it is possible to obtain an mmap write or read lock using the
> +`mm->mmap_lock` read/write semaphore, it is also possible to **downgrade** from
> +a write lock to a read lock via `mmap_write_downgrade()`.
> +
> +Similar to `mmap_write_unlock()`, this implicitly terminates all VMA write locks
> +via `vma_end_write_all()` (more on this behaviour in the VMA lock internals
> +section below), but importantly does not relinquish the mmap lock while
> +downgrading, therefore keeping the locked virtual address space stable.
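
The typical shape of this might also be worth showing - a rough sketch
(illustrative only, given a `struct mm_struct *mm`):

	mmap_write_lock(mm);
	/* Modify the address space, taking VMA write locks as needed. */

	/*
	 * Downgrade: all VMA write locks in this mm are dropped, but we
	 * still hold the mmap lock for read, so the address space stays
	 * stable while we complete the remaining (read-only) work.
	 */
	mmap_write_downgrade(mm);
	/* e.g. populate or fault in the newly mapped range. */

	mmap_read_unlock(mm);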
> +
> +A subtlety here is that callers can assume, if they invoke an
> +mmap_write_downgrade() operation, that they still have exclusive access to the
> +virtual address space (excluding VMA read lock holders), as for another task to
> +have downgraded they would have had to have exclusive access to the semaphore
> +which can't be the case until the current task completes what it is doing.

I can't decipher the above paragraph. Could you please dumb it down
for the likes of me?

> +
> +Stack expansion
> +---------------
> +
> +Stack expansion throws up additional complexities in that we cannot permit there
> +to be racing page faults; as a result we invoke `vma_start_write()` to prevent
> +this in `expand_downwards()` or `expand_upwards()`.
> +
> +Lock ordering
> +-------------
> +
> +As we have multiple locks across the kernel which may or may not be taken at the
> +same time as explicit mm or VMA locks, we have to be wary of lock inversion, and
> +the **order** in which locks are acquired and released becomes very important.
> +
> +.. note::
> +
> +   Lock inversion occurs when two threads need to acquire multiple locks,
> +   but in doing so inadvertently cause a mutual deadlock.
> +
> +   For example, consider thread 1 which holds lock A and tries to acquire lock B,
> +   while thread 2 holds lock B and tries to acquire lock A.
> +
> +   Both threads are now deadlocked on each other. However, had they attempted to
> +   acquire locks in the same order, one would have waited for the other to
> +   complete its work and no deadlock would have occurred.
> +
> +The opening comment in `mm/rmap.c` describes in detail the required ordering of
> +locks within memory management code:
> +
> +.. code-block::
> +
> +  inode->i_rwsem        (while writing or truncating, not reading or faulting)
> +    mm->mmap_lock
> +      mapping->invalidate_lock (in filemap_fault)
> +        folio_lock
> +          hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share, see hugetlbfs below)
> +            vma_start_write
> +              mapping->i_mmap_rwsem
> +                anon_vma->rwsem
> +                  mm->page_table_lock or pte_lock
> +                    swap_lock (in swap_duplicate, swap_info_get)
> +                      mmlist_lock (in mmput, drain_mmlist and others)
> +                      mapping->private_lock (in block_dirty_folio)
> +                        i_pages lock (widely used)
> +                          lruvec->lru_lock (in folio_lruvec_lock_irq)
> +                      inode->i_lock (in set_page_dirty's __mark_inode_dirty)
> +                      bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
> +                        sb_lock (within inode_lock in fs/fs-writeback.c)
> +                        i_pages lock (widely used, in set_page_dirty,
> +                                 in arch-dependent flush_dcache_mmap_lock,
> +                                 within bdi.wb->list_lock in __sync_single_inode)
> +
> +Please check the current state of this comment, which may have changed since the
> +time of writing of this document.
> +
> +VMA lock internals
> +------------------
> +
> +The VMA lock mechanism is designed to be a lightweight means of avoiding the use
> +of the heavily contended mmap lock. It is implemented using a combination of a
> +read/write semaphore and sequence numbers belonging to the containing `struct
> +mm_struct` and the VMA.
> +
> +Read locks are acquired via `vma_start_read()`, which is an optimistic
> +operation, i.e. it tries to acquire a read lock but returns false if it is
> +unable to do so. At the end of the read operation, `vma_end_read()` is called to
> +release the VMA read lock. This can be done under RCU alone.
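
It might also help to sketch how the fault path consumes this -
roughly what the arch fault handlers do (heavily simplified; the access
checks and the fallback path are omitted):

	struct vm_area_struct *vma;
	vm_fault_t fault;

	vma = lock_vma_under_rcu(mm, address);
	if (!vma)
		goto lock_mmap; /* contended: fall back to mmap_read_lock() */

	fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
	if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)))
		vma_end_read(vma); /* otherwise the lock was already released */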
> +
> +Writing requires the mmap to be write-locked and the VMA lock to be acquired via
> +`vma_start_write()`, however the write lock is released by the termination or
> +downgrade of the mmap write lock so no `vma_end_write()` is required.
> +
> +All this is achieved by the use of per-mm and per-VMA sequence counts. This is
> +used to reduce complexity, and potential especially around operations which

potential?

> +write-lock multiple VMAs at once.
> +
> +If the mm sequence count, `mm->mm_lock_seq`, is equal to the VMA sequence count
> +`vma->vm_lock_seq` then the VMA is write-locked. If they differ, then they are
> +not.
> +
> +Each time an mmap write lock is acquired in `mmap_write_lock()`,
> +`mmap_write_lock_nested()` or `mmap_write_lock_killable()`, the `mm->mm_lock_seq`
> +sequence number is incremented via `mm_lock_seqcount_begin()`.
> +
> +Each time the mmap write lock is released in `mmap_write_unlock()` or
> +`mmap_write_downgrade()`, `vma_end_write_all()` is invoked which also increments
> +`mm->mm_lock_seq` via `mm_lock_seqcount_end()`.
> +
> +This way, we ensure, regardless of the VMA's sequence number count, that a write
> +lock is not incorrectly indicated (since we increment the sequence counter on
> +acquiring the mmap write lock, which is required in order to obtain a VMA write
> +lock), and that when we release an mmap write lock, we efficiently release
> +**all** VMA write locks contained within the mmap at the same time.

Ok, I see that you describe some of the rules I mentioned before here.
Up to you where to place them.

> +
> +The exclusivity of the mmap write lock ensures this is what we want, as there
> +would never be a reason to persist per-VMA write locks across multiple mmap
> +write lock acquisitions.
> +
> +Each time a VMA read lock is acquired, we acquire a read lock on the
> +`vma->vm_lock` read/write semaphore and hold it, while checking that the
> +sequence count of the VMA does not match that of the mm.
> +
> +If it does, the read lock fails. If it does not, we hold the lock, excluding
> +writers, but permitting other readers, who will also obtain this lock under RCU.
> +
> +Importantly, maple tree operations performed in `lock_vma_under_rcu()` are also
> +RCU safe, so the whole read lock operation is guaranteed to function correctly.
> +
> +On the write side, we acquire a write lock on the `vma->vm_lock` read/write
> +semaphore, before setting the VMA's sequence number under this lock, also
> +simultaneously holding the mmap write lock.
> +
> +This way, if any read locks are in effect, `vma_start_write()` will sleep until
> +these are finished and mutual exclusion is achieved.
> +
> +After setting the VMA's sequence number, the lock is released, avoiding
> +complexity with a long-term held write lock.
> +
> +This clever combination of a read/write semaphore and sequence count allows for
> +fast RCU-based per-VMA lock acquisition (especially on x86-64 page fault, though
> +utilised elsewhere) with minimal complexity around lock ordering.
> --
> 2.47.0
>