From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id A4DC9D31769 for ; Tue, 5 Nov 2024 17:22:00 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 09AD06B007B; Tue, 5 Nov 2024 12:22:00 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 048AB6B0083; Tue, 5 Nov 2024 12:21:59 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id E2BA56B0085; Tue, 5 Nov 2024 12:21:59 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0013.hostedemail.com [216.40.44.13]) by kanga.kvack.org (Postfix) with ESMTP id C5E486B007B for ; Tue, 5 Nov 2024 12:21:59 -0500 (EST) Received: from smtpin07.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay07.hostedemail.com (Postfix) with ESMTP id 6A1A01614F4 for ; Tue, 5 Nov 2024 17:21:59 +0000 (UTC) X-FDA: 82752707478.07.8D4486A Received: from mail-lf1-f54.google.com (mail-lf1-f54.google.com [209.85.167.54]) by imf04.hostedemail.com (Postfix) with ESMTP id 1C6054001D for ; Tue, 5 Nov 2024 17:21:14 +0000 (UTC) Authentication-Results: imf04.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=iPfzUsZJ; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf04.hostedemail.com: domain of jannh@google.com designates 209.85.167.54 as permitted sender) smtp.mailfrom=jannh@google.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1730827194; a=rsa-sha256; cv=none; b=3d8UFfwF+Fs7ZJA4bsmjZTsFjeDXGzKgfDigSc3juqVLkcFko1+bOjXqKPWuJ7SenQN4gU JeOiU1O9mstXztm3iGDk14+HzAhzx3+9a5awtqcmXDzPNKIawu4vrU5qu6iab/1YBv7eAZ vnGPF9RJrE0SUT/rrb91UfkNC0XX/NU= ARC-Authentication-Results: i=1; imf04.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=iPfzUsZJ; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf04.hostedemail.com: domain of jannh@google.com designates 209.85.167.54 as permitted sender) smtp.mailfrom=jannh@google.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1730827194; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=5Oa954kD6XCaRt5fGl4rXY5k6zboqDRGjgfVfwZcrQs=; b=B1V71z31apf5rYtpZo9OeuWhTOzKNo08ks/vb+3cQ/fFtkiKlFih5EjS6LS5I4xtzyBa5D zWUnTpaFGK2Umcn1ivRx2JCclqeWwWLOzMh9Q6TZkNvjhIkOAWfMLxWe57EtoIZv4QnsD8 JygrW5sP5R+HkI+U1JXhvqkwmpVGJgo= Received: by mail-lf1-f54.google.com with SMTP id 2adb3069b0e04-539e19d8c41so19634e87.0 for ; Tue, 05 Nov 2024 09:21:57 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1730827315; x=1731432115; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=5Oa954kD6XCaRt5fGl4rXY5k6zboqDRGjgfVfwZcrQs=; b=iPfzUsZJmUMPf9Pw3OdTeP0WlS/k05GmhFHhqi60CxrzgpCIhd7Dg2cWFRAUGXdHIh yT5TENDe9qvmZSNgdkq2vIzphkEHHWqUVYu29Ii5ESy4/uXP6XI9VDR2Gkp/b5GT3wKm Gu8b8MDFtj4JO+jFBctEcJwUUDmuKP68PIzRpi6uiF6crqmpflo7kpthp/Wh+7b0Dl3/ zs7PV7jQJNi+76/hXcvml0UV4Jhyje9xlVEiuzqFYSgmOmyc0wQ/e2eWbyk0Hl+lWTtv 0WaY6mxF3eYlNQaawbB+Svy87Zn9zifF6PwlufevCazoz1DoCvSo43gXPpG53jItRsf/ L5BA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1730827315; x=1731432115; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=5Oa954kD6XCaRt5fGl4rXY5k6zboqDRGjgfVfwZcrQs=; b=GYZ2hxeVRlJ2c1f/h81cLCGsUmnFbVqNP7NnEFzC93Xp5CrpgQNxBcl44PureV1TSN y7JC4bKzizlHQKipbBuMijfI+1jsDTDSr/7L95vhwJy1KbYPV4rN50y+EU+LenEYn/33 0G/G3StqYnKZ67nYp7KqClDGFbnT5TVChx0/1lWKp9/G2HOjHGib0BK0rXy0CmcJwWMY SBdvTXy6ZXChpYzQORp8Ft3lSDju2a1kbX46YRdI/+odeP8BjLMmND2HrpoyXo1Vdoag 90W657UVQiuV2/cibtRkqKup2G7ohyGfFnso+BLRFXfebVGddKIIH+e6D8c1OE5pGDAW cFeA== X-Forwarded-Encrypted: i=1; AJvYcCUlkVgIuHJdNU9Cg4VlK1iVsUMnbQ4B75dboCfsQrK6XBzIRYStaSCGXqsYO1sEjoAGgzrh6mcnEQ==@kvack.org X-Gm-Message-State: AOJu0Ywrrlezrhsks92QYqGOIf3dNC4VpAcBzvIcsCueYhi+yUsbh5eo OuTTkKYBd6HQDZ72ImzeXqurioA3GYBPEvOhg9VLtGRjiYidZmza3UgVrGTsmpVUCyygcMGI7mx ygB42jHY5Sd3gj3zcSOiLaLy543FJCpIjEzzk X-Gm-Gg: ASbGncvYwSEgE/yyeQo2HG0BmEkMl5jMCve51ECiQhVVUcwm7tYoOrjtQBYDjNh7eiH KjwUYbYNHqV4ZfQ4QsO4XE0rV9lxwjWZFUPhlg6QIqW+SgeFmbSYbAIkMPp4= X-Google-Smtp-Source: AGHT+IGm3F52/nNqqcOqRdxFJuhV5jymVu9gPUHEg4JCykt5fGL3gHiQJj1l8LCsSMEDvzdiWg89RqYTDaz/6NYopa4= X-Received: by 2002:a19:385c:0:b0:52c:dd42:cf57 with SMTP id 2adb3069b0e04-53d78288128mr270429e87.0.1730827314967; Tue, 05 Nov 2024 09:21:54 -0800 (PST) MIME-Version: 1.0 References: <20241101185033.131880-1-lorenzo.stoakes@oracle.com> <2bf6329e-eb3b-4c5e-bd3a-b519eefffd63@lucifer.local> <94e4ba07-42bb-4567-b0c9-b719f81dc61b@lucifer.local> In-Reply-To: <94e4ba07-42bb-4567-b0c9-b719f81dc61b@lucifer.local> From: Jann Horn Date: Tue, 5 Nov 2024 18:21:17 +0100 Message-ID: Subject: Re: [RFC PATCH] docs/mm: add VMA locks documentation To: Lorenzo Stoakes Cc: Jonathan Corbet , Andrew Morton , "Liam R . Howlett" , Vlastimil Babka , Alice Ryhl , Boqun Feng , Matthew Wilcox , Mike Rapoport , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Suren Baghdasaryan , "open list:DOCUMENTATION" Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Queue-Id: 1C6054001D X-Stat-Signature: jdc9e5arjqcd98p9d15tgbnpxmekpd4i X-Rspam-User: X-Rspamd-Server: rspam05 X-HE-Tag: 1730827274-4555 X-HE-Meta: U2FsdGVkX19fZslNWOygjYGbJY3YlAG5nj0G6CMbhPVEd7Va4NJ5wtrTDZ5a93uh50pRtwQmlsYyg+dWtiykReN10pQBo01F8Hob1/x/TYX4rxjZu3U1rXyoxNavq6ke2l9Wub7BUqmS86GK4vtDwoHyqj2Vzk/eOLlGsnLcnBJPTWgM1EkzwHA6N8oqJMdqNNbH7fPgUfNn/xiBo1UQ4TeqRY1ChbO4mc//bHS071NyrKn0jCekYkGH/nJl22yQhPJyP/Hi7qVCUYyeH6G4KJPNaSgoGuz7iBFEHDzKCntQzqUAFL2MgCzIYnH91HFn2BGoZBWt5+qCTjzrsIHN0Rf7QNBOSw9Rq3MLuozAugrfIIJgcsC6enCSccGd7R2g6Yjj3UynFpjDUuQZ48kkGj84LCengYx3znOJdssIg3yLz2QNChNzny7c/pGRwqNiz13xVApfT47+sbW+fX2FxZifcDP9C4lxr7+Y8mIVNB6nz9teMXgqJb+j+wXSE5b7ge1omes9+BgCR6ZQqvOL+ODIbD/T2WbD8I/dcAVv+brID+kUFMuS2DSe8akossRy/Il1B4RacLOPwEhUubK6/lcITodXviO7D+sEKKPV9G1A4dkm8hhv7Q5n94ItsXod5qHD8d3fbdrtuxxsk+P/1Md5TCot+bFaRlGdVYh9Md/w/L0dmuJ6A6IfU0YH1sOkcIajE1QETSboP1bfLia2wTKfMmZBUw1fBw8Vs2MbzsV1gM6BWNmMYSYTO34MQFunES0jBs3j/FRnZl6Zc7PTHDZE9cqg4p3ncqBU7mm+X2FiRxcfBas9O+rNjtJy+Q5dk/ZEdWv/646UnDnjK1VG7a6FSkU0ZeHOgmJeZ2hM5u3EKiPvQJs9Hi+k9IpA46AokYooDpQfDZZXy2R0jDRXNGpwe8V0LA5a9h3p8X0Qqv6q0DCy7U7azaG06gEsJ/Ub8PmhWUgjnBjwhK7USu2 WPBGspvi rhCdDYdSQ8gGLPQDKhPQd0OpxAqLloHj5oV/yEMcvrfYu9LidiefwJ5AsqY1IziW6R2sXCRuY8c4yhQIx5brnPpOnQ20MmmFPL+WU5B2EVjxMBmyBhlosxp/VTgGUAz/42SjR5KGLHXV0Nz/li6W4Qag8iJbKj/DSJzcItdXAFW67VXQbrV8exgdrQNimQ3BIiVRplgU5RBRdvlQN9zDqBEUBRCyn32UeLpsXYPSaI/C2q7lWrXWUPQeDx3gg1thUdn8muGbJ6ba2pbyyloOUneE9kB2gqaVrb0bNFuczR51DSX+D4NHzpV1l1+FZBbUwD8XuRJqmlFiUVXYen/sYIENA8psm7JsP3LqpaEb5n13qIps= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Tue, Nov 5, 2024 at 5:10=E2=80=AFPM Lorenzo Stoakes wrote: > On Mon, Nov 04, 2024 at 10:29:47PM +0100, Jann Horn wrote: > > On Mon, Nov 4, 2024 at 5:42=E2=80=AFPM Lorenzo Stoakes > > wrote: > > > On Sat, Nov 02, 2024 at 02:45:35AM +0100, Jann Horn wrote: > > > > On Fri, Nov 1, 2024 at 7:50=E2=80=AFPM Lorenzo Stoakes > > > > wrote: > > > > 3. PMD entries that point to page tables can be changed while holdi= ng > > > > the page table spinlocks for the entry and the table it points to > > > > > > Hm wut? When you say 'entry' what do you mean? Obviously a page table= in > > > > By "PMD entry" I mean a pmd_t (located in a Page Middle Directory), > > and by "that point to page tables" I mean "that point to a PTE-level > > page table". > > > > In other words, from the reader perspective (as I think you already > > mentioned further down): > > > > Rule 2 means: From the perspective of a reader who is holding the VMA > > lock in read mode, once you have seen that e.g. a PUD entry > > (overlapping the VMA's virtual address region) points to a PMD page > > table, you know that this PUD entry will keep pointing to that PMD > > table. > > > > Rule 3 means: From the perspective of a reader who is holding the VMA > > lock in read mode, once you have seen that a PMD entry (overlapping > > the VMA's virtual address region) points to a page table, you don't > > know whether this PMD entry will keep pointing to the same page table > > unless you're also holding a spinlock on either the PMD or the page > > table (because khugepaged). > > Thanks right I see what you mean. > > Might be worth having an explicit THP (thus khugepaged) section? And > perhaps even KSM... Maybe, yeah - I think it's important to roughly know what they do, but I would still focus on what rules other parts of MM, or users of MM, have to follow to not break in their interaction with things like THP and KSM. So maybe kinda do it in the direction of "these are the rules (and here is the detail of why we have those arbitrary-looking rules)"? > > But you're right, I was being imprecise - as you pointed out, it's not > > just used for zapping. Maybe the right version of 6 is something like: > > > > Leaf entries that are not in "none" state can > > be changed while holding any one of [...]. > > > > Though I'm not sure if that is overly broad - I think in practice the > > changes we make under the rmap locks are something like the following, > > though that might be missing some: > > > > - zapping leaf entries > > - zapping PMD entries pointing to page tables > > - clearing A/D bits > > - migration > > > > > OK so interestingly this really aligns with what Alice said as to thi= s not > > > giving a clear indicator from a user's perspective as to 'what lock d= o I > > > need to hold'. > > > > > > So I will absolutely address all this and try to get the fundamentals > > > boiled down. > > > > > > Also obviously the exception to your rules are - _freeing_ of higher = level > > > page tables because we assume we are in a state where nothing can acc= ess > > > them so no such locks are required. But I cover that below. > > > > > > > > > > > And then the rules for readers mostly follow from that: > > > > 1 =3D> holding the appropriate page table lock makes the contents o= f a > > > > page table stable, except for A/D updates > > > > 2 =3D> page table entries higher than PMD level that point to lower= page > > > > tables can be followed without taking page table locks > > > > > > Yeah this is true actually, might be worth mentioning page table walk= ers > > > here and how they operate as they're instructive on page table lockin= g > > > requirements. > > > > > > > 3+4 =3D> following PMD entries pointing to page tables requires car= eful > > > > locking, and pte_offset_map_lock() does that for you > > > > > > Well, pte_offset_map_lock() is obtained at the PTE level right? > > > > pte_offset_map_lock() is given a pointer to a PMD entry, and it > > follows the PMD entry to a PTE-level page table. My point here is that > > you can't just simply start a "lock this PTE-level page table" > > operation at the PTE level because by the time you've locked the page > > table, the PMD entry may have changed, and the page table you just > > locked may be empty and doomed to be deleted after RCU delay. So you > > have to use __pte_offset_map_lock(), which takes a pointer to a PMD > > entry, and in a loop, looks up the page table from the PMD entry, > > locks the referenced page table, rechecks that the PMD entry still > > points to the locked page table, and if not, retries all these steps > > until it manages to lock a stable page table. > > Right yeah, I mean this is kind of a standard pattern in the kernel thoug= h > like: > > 1. Grab some pointer to something > 2. Lock > 3. Really make sure it hasn't disappeared from under us > 4. If so, unlock and try again > 5. Otherwise proceed > > You have this pattern with folios too... Yeah, I agree the pattern you need for the access is not that weird, it's just weird that you need it for page tables at one specific level. > > > pmd_lock() at the PMD level (pud_lock() ostensibly at PUD level but t= his > > > amounts to an mm->page_table_lock anyway there) > > > > > > I think something like > > > > https://www.kernel.org/doc/html/latest/doc-guide/kernel-doc.html#ov= erview-documentation-comments > > > > is supposed to let you include the current version of the comment i= nto > > > > the rendered documentation HTML without having to manually keep thi= ngs > > > > in sync. I've never used that myself, but there are a bunch of > > > > examples in the tree; for example, grep for "DMA fences overview". > > > > > > Ah, but this isn't a kernel doc is just a raw comment :) so I'm not s= ure there > > > is a great way of grabbing just that, reliably. Maybe can turn that i= nto a > > > kernel doc comment in a follow up patch or something? > > > > Ah, yeah, sounds reasonable. > > Thanks. > > > I think all this makes me think that we should actually have entirely > separate top level descriptions and internals sections in this document, > which align's again with Alice's comments. > > As the level of detail and caveats here mean that if you provide > implementation details everywhere you end up constantly on a tangent > (important, relevant internal details but to a _user_ of the functionalit= y > not so important). Hmm, yeah.