Date: Mon, 4 Sep 2023 21:01:11 +0300
From: Mike Rapoport
To: "Fabio M. De Francesco"
Cc: Jonathan Corbet, Jonathan Cameron, Linus Walleij,
    linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
    linux-mm@kvack.org, Andrew Morton, Ira Weiny, Matthew Wilcox,
    Randy Dunlap
Subject: Re: [PATCH v3] Documentation/page_tables: Add info about MMU/TLB and Page Faults
Message-ID: <20230904180111.GG3223@kernel.org>
References: <20230818112726.6156-1-fmdefrancesco@gmail.com>
In-Reply-To: <20230818112726.6156-1-fmdefrancesco@gmail.com>

On Fri, Aug 18, 2023 at 01:19:34PM +0200, Fabio M. De Francesco wrote:
> Extend page_tables.rst by adding a section about the role of MMU and TLB
> in translating between virtual addresses and physical page frames.
> Furthermore, explain the concept behind Page Faults and how the Linux
> kernel handles TLB misses. Finally, briefly explain how and why to disable
> the page fault handler.
>
> Cc: Andrew Morton
> Cc: Ira Weiny
> Cc: Jonathan Cameron
> Cc: Jonathan Corbet
> Cc: Linus Walleij
> Cc: Matthew Wilcox
> Cc: Mike Rapoport
> Cc: Randy Dunlap
> Reviewed-by: Linus Walleij
> Signed-off-by: Fabio M. De Francesco

Acked-by: Mike Rapoport (IBM)

> ---
>
> v2 -> v3: This version fixes the grammar mistakes found by Linus
> and forwards his "Reviewed-by" tag (thanks!).
> https://lore.kernel.org/all/CACRpkdbq8UCtvtRH7FZUEqvTxPQcoGbrKvf_mT5QHMAfVoYNNQ@mail.gmail.com/
>
> v1 -> v2: This version takes into account the comments provided by Mike
> (thanks!).
> I hope I haven't overlooked anything he suggested :-)
> https://lore.kernel.org/all/20230807105010.GK2607694@kernel.org/
>
> Furthermore, v2 adds some more information about swapping, which was not present
> in v1.
>
> Before the "real" patch, this had been an RFC PATCH in its 2nd version for a week
> or so until I received comments and suggestions from Jonathan Cameron (thanks!),
> and then it morphed into a real patch.
>
> The thread with the RFC PATCH v2 and the messages between Jonathan
> and me starts at https://lore.kernel.org/all/20230723120721.7139-1-fmdefrancesco@gmail.com/#r
>
>
>  Documentation/mm/page_tables.rst | 127 +++++++++++++++++++++++++++++++
>  1 file changed, 127 insertions(+)
>
> diff --git a/Documentation/mm/page_tables.rst b/Documentation/mm/page_tables.rst
> index 7840c1891751..be47b192a596 100644
> --- a/Documentation/mm/page_tables.rst
> +++ b/Documentation/mm/page_tables.rst
> @@ -152,3 +152,130 @@ Page table handling code that wishes to be architecture-neutral, such as the
>  virtual memory manager, will need to be written so that it traverses all of the
>  currently five levels. This style should also be preferred for
>  architecture-specific code, so as to be robust to future changes.
> +
> +
> +MMU, TLB, and Page Faults
> +=========================
> +
> +The `Memory Management Unit (MMU)` is a hardware component that handles virtual
> +to physical address translations. It may use relatively small caches in hardware
> +called `Translation Lookaside Buffers (TLBs)` and `Page Walk Caches` to speed up
> +these translations.
> +
> +When the CPU accesses a memory location, it provides a virtual address to the MMU,
> +which checks if there is an existing translation in the TLB or in the Page
> +Walk Caches (on architectures that support them). If no translation is found,
> +the MMU uses page walks to determine the physical address and create the mapping.
> +
> +The dirty bit for a page is set (i.e., turned on) when the page is written to.
> +Each page of memory has associated permission and dirty bits. The latter
> +indicate that the page has been modified since it was loaded into memory.
> +
> +If nothing prevents it, eventually the physical memory can be accessed and the
> +requested operation on the physical frame is performed.
> +
> +There are several reasons why the MMU can't find certain translations. It could
> +happen because the CPU is trying to access memory that the current task is not
> +permitted to access, or because the data is not present in physical memory.
> +
> +When these conditions happen, the MMU triggers page faults, which are types of
> +exceptions that signal the CPU to pause the current execution and run a special
> +function to handle the mentioned exceptions.
> +
> +There are common and expected causes of page faults. These are triggered by
> +process management optimization techniques called "Lazy Allocation" and
> +"Copy-on-Write". Page faults may also happen when frames have been swapped out
> +to persistent storage (swap partition or file) and evicted from their physical
> +locations.
> +
> +These techniques improve memory efficiency, reduce latency, and minimize space
> +occupation. This document won't go deeper into the details of "Lazy Allocation"
> +and "Copy-on-Write" because these subjects are out of scope as they belong to
> +Process Address Management.
> +
> +Swapping differs from the other mentioned techniques because it is
> +undesirable: it is performed as a means to free memory under heavy
> +pressure.
> +
> +Swapping can't work for memory mapped by kernel logical addresses. These are a
> +subset of the kernel virtual space that directly maps a contiguous range of
> +physical memory. Given any logical address, its physical address is determined
> +with simple arithmetic on an offset. Accesses to logical addresses are fast
> +because they avoid the need for complex page table lookups, at the expense of
> +the frames being neither evictable nor pageable.
> +
> +If the kernel fails to make room for the data that must be present in the
> +physical frames, the kernel invokes the out-of-memory (OOM) killer to make room
> +by terminating lower priority processes until pressure drops below a safe
> +threshold.
> +
> +Additionally, page faults may also be caused by code bugs or by maliciously
> +crafted addresses that the CPU is instructed to access. A thread of a process
> +could use instructions to address (non-shared) memory which does not belong to
> +its own address space, or could try to execute an instruction that wants to write
> +to a read-only location.
> +
> +If the above-mentioned conditions happen in user space, the kernel sends a
> +`Segmentation Fault` (SIGSEGV) signal to the current thread. That signal usually
> +causes the termination of the thread and of the process it belongs to.
> +
> +This document simplifies and shows a high-altitude view of how the
> +Linux kernel handles these page faults, creates tables and table entries,
> +checks if memory is present and, if not, requests to load data from persistent
> +storage or from other devices, and updates the MMU and its caches.
> +
> +The first steps are architecture dependent. Most architectures jump to
> +`do_page_fault()`, whereas the x86 interrupt handler is defined by the
> +`DEFINE_IDTENTRY_RAW_ERRORCODE()` macro which calls `handle_page_fault()`.
> +
> +Whatever the route, all architectures end up invoking
> +`handle_mm_fault()` which, in turn, (likely) ends up calling
> +`__handle_mm_fault()` to carry out the actual work of allocating the page
> +tables.
> +
> +The unfortunate case of not being able to call `__handle_mm_fault()` means
> +that the virtual address is pointing to areas of physical memory which are not
> +permitted to be accessed (at least from the current context). This
> +condition results in the kernel sending the above-mentioned SIGSEGV signal
> +to the process and leads to the consequences already explained.
> +
> +`__handle_mm_fault()` carries out its work by calling several functions to
> +find the entry offsets of the upper layers of the page tables and allocate
> +the tables that it may need.
> +
> +The functions that look for the offset have names like `*_offset()`, where the
> +"*" stands for pgd, p4d, pud, pmd, pte; the functions that allocate the
> +corresponding tables, layer by layer, are instead called `*_alloc`, using the
> +above-mentioned convention to name them after the corresponding types of tables
> +in the hierarchy.
> +
> +The page table walk may end at one of the middle or upper layers (PMD, PUD).
> +
> +Linux supports larger page sizes than the usual 4KB (i.e., the so-called
> +`huge pages`). When using these kinds of larger pages, higher-level entries can
> +directly map them, with no need to use lower level page entries (PTE). Huge
> +pages contain large contiguous physical regions that usually span from 2MB to
> +1GB. They are respectively mapped by the PMD and PUD page entries.
> +
> +Huge pages bring with them several benefits like reduced TLB pressure,
> +reduced page table overhead, memory allocation efficiency, and performance
> +improvement for certain workloads. However, these benefits come with
> +trade-offs, like wasted memory and allocation challenges.
> +
> +At the very end of the walk with allocations, if it didn't return errors,
> +`__handle_mm_fault()` finally calls `handle_pte_fault()`, which via `do_fault()`
> +performs one of `do_read_fault()`, `do_cow_fault()`, `do_shared_fault()`.
> +"read", "cow", "shared" give hints about the reasons and the kind of fault it's
> +handling.
> +
> +The actual implementation of the workflow is very complex. Its design allows
> +Linux to handle page faults in a way that is tailored to the specific
> +characteristics of each architecture, while still sharing a common overall
> +structure.
> +
> +To conclude this high-altitude view of how Linux handles page faults, let's
> +add that the page fault handler can be disabled and enabled respectively with
> +`pagefault_disable()` and `pagefault_enable()`.
> +
> +Several code paths make use of the latter two functions because they need to
> +disable traps into the page fault handler, mostly to prevent deadlocks.
> --
> 2.41.0
>

--
Sincerely yours,
Mike.