From: Jann Horn
Date: Tue, 3 Jun 2025 00:25:36 +0200
Subject: Re: [PATCH] docs/mm: expand vma doc to highlight pte freeing, non-vma traversal
To: Lorenzo Stoakes
Cc: Andrew Morton, Suren Baghdasaryan, "Liam R. Howlett", Vlastimil Babka, Shakeel Butt, Jonathan Corbet, Qi Zheng, linux-mm@kvack.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20250602210710.106159-1-lorenzo.stoakes@oracle.com>
References: <20250602210710.106159-1-lorenzo.stoakes@oracle.com>

On Mon, Jun 2, 2025 at 11:07 PM Lorenzo Stoakes wrote:
> The process addresses documentation already contains a great deal of
> information about mmap/VMA locking and page table traversal and
> manipulation.
>
> However it waves it hands about non-VMA traversal. Add a section for this
> and explain the caveats around this kind of traversal.
>
> Additionally, commit 6375e95f381e ("mm: pgtable: reclaim empty PTE page in
> madvise(MADV_DONTNEED)") caused zapping to also free empty PTE page
> tables. Highlight this and reference how this impacts ptdump non-VMA
> traversal of userland mappings.
>
> Signed-off-by: Lorenzo Stoakes
> ---
>  Documentation/mm/process_addrs.rst | 58 ++++++++++++++++++++++++++----
>  1 file changed, 52 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/mm/process_addrs.rst b/Documentation/mm/process_addrs.rst
> index e6756e78b476..83166c2b47dc 100644
> --- a/Documentation/mm/process_addrs.rst
> +++ b/Documentation/mm/process_addrs.rst
> @@ -303,7 +303,9 @@ There are four key operations typically performed on page tables:
>  1. **Traversing** page tables - Simply reading page tables in order to traverse
>     them. This only requires that the VMA is kept stable, so a lock which
>     establishes this suffices for traversal (there are also lockless variants
> -   which eliminate even this requirement, such as :c:func:`!gup_fast`).
> +   which eliminate even this requirement, such as :c:func:`!gup_fast`). There is
> +   also a special case of page table traversal for non-VMA regions which we
> +   consider separately below.
>  2. **Installing** page table mappings - Whether creating a new mapping or
>     modifying an existing one in such a way as to change its identity. This
>     requires that the VMA is kept stable via an mmap or VMA lock (explicitly not
> @@ -335,15 +337,14 @@ ahead and perform these operations on page tables (though internally, kernel
>  operations that perform writes also acquire internal page table locks to
>  serialise - see the page table implementation detail section for more details).
>
> +.. note:: Since v6.14 and commit 6375e95f381e ("mm: pgtable: reclaim empty PTE
> +          page in madvise (MADV_DONTNEED)"), we now also free empty PTE tables
> +          on zap. This does not change zapping locking requirements.
> +
>  When **installing** page table entries, the mmap or VMA lock must be held to
>  keep the VMA stable. We explore why this is in the page table locking details
>  section below.
>
> -.. warning:: Page tables are normally only traversed in regions covered by VMAs.
> -             If you want to traverse page tables in areas that might not be
> -             covered by VMAs, heavier locking is required.
> -             See :c:func:`!walk_page_range_novma` for details.
> -
>  **Freeing** page tables is an entirely internal memory management operation and
>  has special requirements (see the page freeing section below for more details).
>
> @@ -355,6 +356,47 @@ has special requirements (see the page freeing section below for more details).
>               from the reverse mappings, but no other VMAs can be permitted to be
>               accessible and span the specified range.
>
> +Traversing non-VMA page tables
> +------------------------------
> +
> +We've focused above on traversal of page tables belonging to VMAs. It is also
> +possible to traverse page tables which are not represented by VMAs.
> +
> +Primarily this is used to traverse kernel page table mappings. In which case one
> +must hold an mmap **read** lock on the :c:macro:`!init_mm` kernel instantiation
> +of the :c:struct:`!struct mm_struct` metadata object, as performed in
> +:c:func:`walk_page_range_novma`.

My understanding is that kernel page tables are walked with no MM
locks held all the time.
See for example:

 - vmalloc_to_page()
 - vmap()
 - KASAN's shadow_mapped()
 - apply_to_page_range() called from kasan_populate_vmalloc() or
   arm64's set_direct_map_invalid_noflush()

This is possible because kernel-internal page tables are used for
allocations managed by kernel-internal users, and so things like the
lifetimes of page tables can be guaranteed by higher-level logic.
(Like "I own a vmalloc allocation in this range, so the page tables
can't change until I call vfree().")

The one way in which I think this is currently kinda yolo/broken is
that vmap_try_huge_pud() can end up freeing page tables via
pud_free_pmd_page(), while holding no MM locks AFAICS, so that could
race with the ptdump debug logic such that ptdump walks into freed
page tables?

I think the current rules for kernel page tables can be summarized as
"every kernel subsystem can make up its own rules for its regions of
virtual address space", which makes ptdump buggy because it can't
follow the different rules of all subsystems; and we should probably
change the rules to "every kernel subsystem can make up its own rules
except please take the init_mm's mmap lock when you delete page
tables".

> +This is generally sufficient to preclude other page table walkers (excluding
> +vmalloc regions and memory hot plug) as the intermediate kernel page tables are
> +not usually freed.
> +
> +For cases where they might be then the caller has to acquire the appropriate
> +additional locks.
> +
> +The truly unusual case is the traversal of non-VMA ranges in **userland**
> +ranges.
> +
> +This has only one user - the general page table dumping logic (implemented in
> +:c:macro:`!mm/ptdump.c`) - which seeks to expose all mappings for debug purposes
> +even if they are highly unusual (possibly architecture-specific) and are not
> +backed by a VMA.
> +
> +We must take great care in this case, as the :c:func:`!munmap` implementation
> +detaches VMAs under an mmap write lock before tearing down page tables under a
> +downgraded mmap read lock.
> +
> +This means such an operation could race with this, and thus an mmap **write**
> +lock is required.
> +
> +.. warning:: A racing zap operation is problematic if it is performed without an
> +             exclusive lock held - since v6.14 and commit 6375e95f381e PTEs may
> +             be freed upon zap, so if this occurs the traversal might encounter
> +             the same issue seen due to :c:func:`!munmap`'s use of a downgraded
> +             mmap lock.
> +
> +             In this instance, additional appropriate locking is required.

(I think we should take all the vma locks in that ptdump code and get
rid of this weird exception instead of documenting it.)
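
To make the rule proposed earlier a bit more concrete ("every kernel
subsystem can make up its own rules except please take the init_mm's
mmap lock when you delete page tables"), here is a toy userspace sketch
in C. It is not kernel code: the lock is a single-threaded state
tracker standing in for the init_mm mmap lock, and every model_* name
is hypothetical. It only shows the shape of the rule: a ptdump-like
walker holds the lock shared, and the one operation that frees a table
must hold it exclusively.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy stand-in for init_mm's mmap lock: tracks state only, single-threaded. */
struct model_lock {
	int readers;	/* number of shared holders */
	bool writer;	/* true while held exclusively */
};

/* Hypothetical model of one PMD entry pointing at a PTE table. */
struct model_pte_table { int entries[4]; };
struct model_pmd { struct model_pte_table *table; };

static void model_read_lock(struct model_lock *l)
{
	assert(!l->writer);
	l->readers++;
}

static void model_read_unlock(struct model_lock *l)
{
	assert(l->readers > 0);
	l->readers--;
}

static void model_write_lock(struct model_lock *l)
{
	assert(!l->writer && l->readers == 0);
	l->writer = true;
}

static void model_write_unlock(struct model_lock *l)
{
	assert(l->writer);
	l->writer = false;
}

/* Owner installs a table under its own subsystem-specific rules (no lock). */
static bool model_install_table(struct model_pmd *pmd)
{
	pmd->table = calloc(1, sizeof(*pmd->table));
	return pmd->table != NULL;
}

/* Deleting a table is the one operation that must hold the lock exclusively. */
static void model_free_table(struct model_lock *l, struct model_pmd *pmd)
{
	assert(l->writer);
	free(pmd->table);
	pmd->table = NULL;
}

/* ptdump-like walker: holds the lock shared; returns -1 if no table exists. */
static int model_walk(struct model_lock *l, struct model_pmd *pmd)
{
	int sum = -1;

	model_read_lock(l);
	if (pmd->table) {
		sum = 0;
		for (int i = 0; i < 4; i++)
			sum += pmd->table->entries[i];
	}
	model_read_unlock(l);
	return sum;
}
```

Calling model_free_table() without model_write_lock() trips the
assertion, which is roughly the role a lockdep-style check would play
in the real thing.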