From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 079FBD637AC for ; Wed, 13 Nov 2024 19:47:21 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 4DBC36B0093; Wed, 13 Nov 2024 14:47:21 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 48A8A6B0095; Wed, 13 Nov 2024 14:47:21 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 3795D6B0096; Wed, 13 Nov 2024 14:47:21 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0016.hostedemail.com [216.40.44.16]) by kanga.kvack.org (Postfix) with ESMTP id 1A7786B0093 for ; Wed, 13 Nov 2024 14:47:21 -0500 (EST) Received: from smtpin26.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay01.hostedemail.com (Postfix) with ESMTP id 8A23E1C6AC3 for ; Wed, 13 Nov 2024 19:47:20 +0000 (UTC) X-FDA: 82782104538.26.1C7865A Received: from mail-lf1-f43.google.com (mail-lf1-f43.google.com [209.85.167.43]) by imf12.hostedemail.com (Postfix) with ESMTP id 1486E4000B for ; Wed, 13 Nov 2024 19:46:58 +0000 (UTC) Authentication-Results: imf12.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=mz7bhO7H; spf=pass (imf12.hostedemail.com: domain of jannh@google.com designates 209.85.167.43 as permitted sender) smtp.mailfrom=jannh@google.com; dmarc=pass (policy=reject) header.from=google.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1731527095; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=pSJWU8y6bHQxtW/mJOGtqf8Fg9KxuncXsLFp1xbzR6g=; b=mmMY7Xy85T0mzaQkU5juUfn/U82l6ZSQ3wSVyKVNLQWWddB3fBAv0XNgRbJCYoNphe5iK0 Bze0bTWIUqF8IQfUZmTCpGw7A5FBHa7hORJv2KNkxclqi7RILvRiqbS8l8kM1Prr08Z28p eu5tnFI/W9b43I6hSo9thKbVol6vY9U= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1731527095; a=rsa-sha256; cv=none; b=2v9L8OrIyWE1BxNk0RATMbgKCHKXzM42nhiyi1RdWIXv5t5w+QOIHe6Yh26RcDm8FADPeE soJj9gnSZDqsEsn3n/C8cbGuWG1vx+NDY1fWFrvRqfwj0Uew6fclcUz18YgIQg/VC61K5e 2N7srUvwbjBDQ/j70Ctd4FTRXYED1lo= ARC-Authentication-Results: i=1; imf12.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=mz7bhO7H; spf=pass (imf12.hostedemail.com: domain of jannh@google.com designates 209.85.167.43 as permitted sender) smtp.mailfrom=jannh@google.com; dmarc=pass (policy=reject) header.from=google.com Received: by mail-lf1-f43.google.com with SMTP id 2adb3069b0e04-53da266cf37so872e87.1 for ; Wed, 13 Nov 2024 11:47:18 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1731527237; x=1732132037; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=pSJWU8y6bHQxtW/mJOGtqf8Fg9KxuncXsLFp1xbzR6g=; b=mz7bhO7HeweUNInuYc3irtmjomG4l/hwit4MWwi5tqdE/ypygFidzCldv1qRRKLbkq XBUtNXWoYBw4qr549pX5/EJt1MGoOG81SKy8yW9p3+7wBqM6cNqEqPAj6qVbMcx6cEHL 3rqfYp6xqLLqtL36U63caGigD58yZNWXw1lJzSx/s1XTmla7F8V0rUjEMUzVzIi/2ZYy wa7bvybQ7VMfOLAVBrDEqFcJrCES5eufpdOKXmP5JBUKvtpJH60j3/CLADZRNVVbNlIU tF+NYPghU007XuLOsyZeCxzPFqvY4B2Um+ptqAeS+K2DrUB4ZPQ/x+KZYuypwqgrW/cJ ZOVQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1731527237; x=1732132037; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=pSJWU8y6bHQxtW/mJOGtqf8Fg9KxuncXsLFp1xbzR6g=; b=iL2AueL3uBY5jmrZW+3k/V7f8CYGzwewT7h0N8kUsdphsv6GoGLSVpiHoJP7smaVQN ROETaQ24O9tfEt2rbbk3bQ1ocOPrcEFuDuaqYkSktw3M1XS6AwnWgbTm2I06o5CFXqh6 TSHMo23v5Vkk3aDjBykNefxIkfOSzoIcEmZ/cKH1LWFdOT0tA71WxeVRC5DA9Xp1jMV3 pe9l7sAH/Haf06b6h/PrbmAS2RS8VdnN+dl5AhwR8rMiBgGZlJAGdelQDlvcb9lqwivU D38H1Wx5HsQj5BQZgrse10XJ4LTC/XGC8YjQHj4YzzEnwNDfrqSjiG6s1QMxTfXnLuJI 4LIA== X-Forwarded-Encrypted: i=1; AJvYcCWl2yICRLdqffXKjPOSA6LZTIDydMYr4VtETLpz0Q39UCe5zKZQ2hGWFke7nd82JHCrjjtBrcQ5hA==@kvack.org X-Gm-Message-State: AOJu0YzZcXigbnvKGxdSpqO6ZWtHgDZwlXQN3nxYa9Zv6Fd9A0xeFWes IMVfOF5MdLownbmoNlbjzRuaIjHzeCWnAU111EiL4UP4Ptv1PmHH/gEY0FflGRT6MeEWxFQKUNW 3ivgoj865aqaO6yEbYUjpHZks29mMkOhsjF1c X-Gm-Gg: ASbGncsZAmhep5FOIeHYwSyZQ/L9tzXgPSiLqXiH+5pK+xKPnVOZtM6qxms0N0BGai6 LFz5B44pnGE9TSUK/XS4/KCH+XPMywM5RwC2jYIQEux1RJ7vW0AU31625ARQs X-Google-Smtp-Source: AGHT+IE3ppARTnzlYrCDB5JZuDPBOlh5drJ+grQRZjPODoYVls/AM2IZL036eqnlrknFAL86fkQryAkCrlIkacqzw7I= X-Received: by 2002:ac2:5213:0:b0:530:ae18:810e with SMTP id 2adb3069b0e04-53da4f492dfmr33731e87.5.1731527236243; Wed, 13 Nov 2024 11:47:16 -0800 (PST) MIME-Version: 1.0 References: <20241108135708.48567-1-lorenzo.stoakes@oracle.com> In-Reply-To: <20241108135708.48567-1-lorenzo.stoakes@oracle.com> From: Jann Horn Date: Wed, 13 Nov 2024 20:46:39 +0100 Message-ID: Subject: Re: [PATCH v2] docs/mm: add VMA locks documentation To: Lorenzo Stoakes Cc: Andrew Morton , Jonathan Corbet , "Liam R . Howlett" , Vlastimil Babka , Alice Ryhl , Boqun Feng , Matthew Wilcox , Mike Rapoport , linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, Suren Baghdasaryan , Hillf Danton , Qi Zheng , SeongJae Park , Bagas Sanjaya Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspam-User: X-Rspamd-Server: rspam12 X-Rspamd-Queue-Id: 1486E4000B X-Stat-Signature: nmpe3d8wxgw3k9sm5zicf44axoc5eq74 X-HE-Tag: 1731527218-724184 X-HE-Meta: U2FsdGVkX1/3jpl0dmoEj80O/faDVYY0E2RvYbR/IZwGcJZFd6UNnUuV/OHgn/cgA6llvB9+549rSqTSeQnFG6VDRdMJAINm0SvjLkZsxEmcGkV7y0DOrZOm4Ss5udyey+/eVpzJg56JGDEi+mzMlsNDj/KKevmOB2LvmfEef0UPRi6ktsIRSyDlJVSvrOGP/Lem6ilVlx3ohClXVuOcA/c98W21b6lQINYgyPbFy30Zc0M2GNsLebIYbBOnVNdeeQ9cVzjT/pz7whGO3RcC02jBbHY5eVhXw+Up6qARCE/8csfj9oGZ3FMGLz5dX720APzGA5TU8mziKcjYtX6NE/YiJFM6OKvBp6VPK3aWLIMlAdYpGaboNL2Z1ys1XPb3G/p8ZRbCnIHl7kj0FPEjkO3fbK/4bOBAgeiPdVLEV1p12T+HTPrL54QyTyoaiXBBXCjbWaZiOhuM03g/mRSvmgWOKv0aFaE+gSKB5oiPBJwI/trkybaOSH5Gf3mb4yhJ1xfu0QkYmECK0hxZrcdjqtDp4eUKu3/SwuCJOz6Ip5tNQmp4AkoWMMxrRm/TJVEIlHwyj0wPTW9HtkjexQ4XXEO9qPs1KsMWMyH2rVWWT/8B1KZyXK8CkcRgFp0LYZpqxIvW4C3wdAUtomtEq7ZNt2jrXMJ/8z3TsqwMqz0Ewb9Mf/4no5cibpgk+9Y3+OD7Jl0f8grDDntQGzEb34UESRn0PDaogXLOyzbo8j09Mfhfpx9MW/m6UfFgtWbRIqNCr+QuJDtmv/DEDEmELt96bEJpM5lkmVHGMuae2pDveYIRbhk2kM46yArD7H58UFowH/L0lnH/ulDZnsIGwKVHcHc1ZJPj5vr8UrKStknBoEio4+4S3Xd9fJAtE3yMnTOmidWpWsGRDEtFWb5qxYGHow/GgysEODGNeaqoFbK557ngiCrwk/RZQzm2+UhcDR98uTTxP9jEdbG49ouhzkB LKFah0N0 kf/73FSFYOUXIZb9sYUzDp/ZZmaj4ltV56lHaSh31SgyCF7ZgcH4sYN8bxVw4v2KR1RJCkvdAkjBsFoHi4rE1ZCF06TKnYsqYriFvp/9X3jwBGlfteVKzbJ7baBf7Q64ABFWFpCxNXo9GakXMSfnl8p8vlWKsOEg942WBZDDHpu6+pE6uNFwcNfAralMpMHZ0kOLjXlE002MljtRnfuLL7pnjl91LqF8f4JT1DevCscZzvqvN1f00gKl+8G3w6wH3kmao10houz1fVkdb+XazbSDeMn5UOFy2gdSIQ1G+mWsIgKhpJiDE81AOcdWaODW4nJ2m X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Fri, Nov 8, 2024 at 2:57=E2=80=AFPM Lorenzo Stoakes wrote: > Locking around VMAs is complicated and confusing. While we have a number = of > disparate comments scattered around the place, we seem to be reaching a > level of complexity that justifies a serious effort at clearly documentin= g > how locks are expected to be used when it comes to interacting with > mm_struct and vm_area_struct objects. > > This is especially pertinent as regards the efforts to find sensible > abstractions for these fundamental objects in kernel rust code whose > compiler strictly requires some means of expressing these rules (and > through this expression, self-document these requirements as well as > enforce them). > > The document limits scope to mmap and VMA locks and those that are > immediately adjacent and relevant to them - so additionally covers page > table locking as this is so very closely tied to VMA operations (and reli= es > upon us handling these correctly). > > The document tries to cover some of the nastier and more confusing edge > cases and concerns especially around lock ordering and page table teardow= n. > > The document is split between generally useful information for users of m= m > interfaces, and separately a section intended for mm kernel developers > providing a discussion around internal implementation details. > > Signed-off-by: Lorenzo Stoakes Reviewed-by: Jann Horn except some typos and one inaccuracy: > +* **mmap locks** - Each MM has a read/write semaphore :c:member:`!mmap_l= ock` > + which locks at a process address space granularity which can be acquir= ed via > + :c:func:`!mmap_read_lock`, :c:func:`!mmap_write_lock` and variants. > +* **VMA locks** - The VMA lock is at VMA granularity (of course) which b= ehaves > + as a read/write semaphore in practice. A VMA read lock is obtained via > + :c:func:`!lock_vma_under_rcu` (and unlocked via :c:func:`!vma_end_read= `) and a > + write lock via :c:func:`!vma_start_write` (all VMA write locks are unl= ocked > + automatically when the mmap write lock is released). To take a VMA wri= te lock > + you **must** have already acquired an :c:func:`!mmap_write_lock`. > +* **rmap locks** - When trying to access VMAs through the reverse mappin= g via a > + :c:struct:`!struct address_space` or :c:struct:`!struct anon_vma` obje= ct > + (reachable from a folio via :c:member:`!folio->mapping`) VMAs must be = stabilised via missing dot between sentences? > +These fields describes the size, start and end of the VMA, and as such c= annot be > +modified without first being hidden from the reverse mapping since these= fields > +are used to locate VMAs within the reverse mapping interval trees. still a typo here, "these fields describes" > +.. note:: In instances where the architecture supports fewer page tables= than > + five the kernel cleverly 'folds' page table levels, that is stu= bbing > + out functions related to the skipped levels. This allows us to > + conceptually act is if there were always five levels, even if t= he typo: s/is if/as if/ > +1. **Traversing** page tables - Simply reading page tables in order to t= raverse > + them. This only requires that the VMA is kept stable, so a lock which > + establishes this suffices for traversal (there are also lockless vari= ants > + which eliminate even this requirement, such as :c:func:`!gup_fast`). > +2. **Installing** page table mappings - Whether creating a new mapping o= r > + modifying an existing one. This requires that the VMA is kept stable = via an > + mmap or VMA lock (explicitly not rmap locks). Arguably clearing A/D bits through the rmap, and switching PTEs between present entries and migration entries, counts as "modifying an existing one", but the locking for that is like for zapping/unmapping (both page_idle_clear_pte_refs and try_to_migrate go through the rmap). So "modifying an existing one" either needs more caveats or more likely should just be moved to point three. > +3. **Zapping/unmapping** page table entries - This is what the kernel ca= lls > + clearing page table mappings at the leaf level only, whilst leaving a= ll page > + tables in place. This is a very common operation in the kernel perfor= med on > + file truncation, the :c:macro:`!MADV_DONTNEED` operation via > + :c:func:`!madvise`, and others. This is performed by a number of func= tions > + including :c:func:`!unmap_mapping_range` and :c:func:`!unmap_mapping_= pages` > + among others. The VMA need only be kept stable for this operation. > +4. **Freeing** page tables - When finally the kernel removes page tables= from a > + userland process (typically via :c:func:`!free_pgtables`) extreme car= e must > + be taken to ensure this is done safely, as this logic finally frees a= ll page > + tables in the specified range, ignoring existing leaf entries (it ass= umes the > + caller has both zapped the range and prevented any further faults or > + modifications within it). > + > +**Traversing** and **zapping** ranges can be performed holding any one o= f the > +locks described in the terminology section above - that is the mmap lock= , the > +VMA lock or either of the reverse mapping locks. > + > +That is - as long as you keep the relevant VMA **stable** - you are good= to go > +ahead and perform these operations on page tables (though internally, ke= rnel > +operations that perform writes also acquire internal page table locks to > +serialise - see the page table implementation detail section for more de= tails). > + > +When **installing** page table entries, the mmap or VMA lock mut be held= to keep typo: "must be held" > +When performing a page table traversal and keeping the VMA stable, wheth= er a > +read must be performed once and only once or not depends on the architec= ture > +(for instance x86-64 does not require any special precautions). Nitpick: In theory that'd still be a data race with other threads concurrently installing page tables, though in practice compilers don't seem to do anything bad with stuff like that. > +A typical pattern taken when traversing page table entries to install a = new > +mapping is to optimistically determine whether the page table entry in t= he table > +above is empty, if so, only then acquiring the page table lock and check= ing > +again to see if it was allocated underneath is. s/ is// ? > +This is why :c:func:`!__pte_offset_map_lock` locklessly retrieves the PM= D entry > +for the PTE, carefully checking it is as expected, before acquiring the > +PTE-specific lock, and then *again* checking that the PMD lock is as exp= ected. s/PMD lock is/PMD entry is/ ? > +In these instances, it is required that **all** locks are taken, that is > +the mmap lock, the VMA lock and the relevant rmap lock. s/rmap lock/rmap locks/ > +VMA read locking is entirely optimistic - if the lock is contended or a = competing > +write has started, then we do not obtain a read lock. > + > +A VMA **read** lock is obtained by :c:func:`!lock_vma_under_rcu` functio= n, which "is obtained by lock_vma_under_rcu function" sounds weird, maybe either "is obtained by lock_vma_under_rcu" or "is obtained by the lock_vma_under_rcu function"?