From: "Fabio M. De Francesco"
To: Jonathan Corbet, Jonathan Cameron, Linus Walleij, Mike Rapoport,
    linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: Andrew Morton, Ira Weiny, Matthew Wilcox, Randy Dunlap
Subject: Re: [PATCH] Documentation/page_tables: Add info about MMU/TLB and Page Faults
Date: Thu, 03 Aug 2023 19:08:35 +0200
Message-ID: <4824798.GXAFRqVoOG@suse>
In-Reply-To: <20230728120054.12306-1-fmdefrancesco@gmail.com>
References: <20230728120054.12306-1-fmdefrancesco@gmail.com>

On Friday, 28 July 2023 13:53:01 CEST Fabio M. De Francesco wrote:
> Extend page_tables.rst by adding a section about the role of MMU and TLB
> in translating between virtual addresses and physical page frames.
> Furthermore explain the concept behind Page Faults and how the Linux
> kernel handles TLB misses. Finally briefly explain how and why to disable
> the page faults handler.

Hello everyone,

I'd be grateful to anyone willing to comment on or formally review this
patch. So far I've only had comments from Jonathan Cameron on RFC v2
(https://lore.kernel.org/all/20230723120721.7139-1-fmdefrancesco@gmail.com/#t).

Does anybody else want to contribute?

Thanks in advance,

Fabio

> Cc: Andrew Morton
> Cc: Ira Weiny
> Cc: Jonathan Cameron
> Cc: Jonathan Corbet
> Cc: Linus Walleij
> Cc: Matthew Wilcox
> Cc: Mike Rapoport
> Cc: Randy Dunlap
> Signed-off-by: Fabio M. De Francesco
> ---
>
> This has been an RFC PATCH in its 2nd version for a week or so. I received
> comments and suggestions on it from Jonathan Cameron (thanks!), and so it has
> now been turned into a real patch. I hope that other people want to add their
> comments on this document in order to further improve and extend it.
>
> The thread with the RFC PATCH v2 and the messages between Jonathan and me
> starts at
> https://lore.kernel.org/all/20230723120721.7139-1-fmdefrancesco@gmail.com/#r
>
>  Documentation/mm/page_tables.rst | 105 +++++++++++++++++++++++++++++++
>  1 file changed, 105 insertions(+)
>
> diff --git a/Documentation/mm/page_tables.rst b/Documentation/mm/page_tables.rst
> index 7840c1891751..6ecfd6d2f1f3 100644
> --- a/Documentation/mm/page_tables.rst
> +++ b/Documentation/mm/page_tables.rst
> @@ -152,3 +152,108 @@ Page table handling code that wishes to be
>  architecture-neutral, such as the virtual memory manager, will need to be
>  written so that it traverses all of the currently five levels.
>  This style should also be preferred for architecture-specific code, so as
>  to be robust to future changes.
> +
> +
> +MMU, TLB, and Page Faults
> +=========================
> +
> +The `Memory Management Unit (MMU)` is a hardware component that handles
> +virtual to physical address translations. It may use relatively small
> +caches in hardware called `Translation Lookaside Buffers (TLBs)` and
> +`Page Walk Caches` to speed up these translations.
> +
> +When a process wants to access a memory location, the CPU provides a virtual
> +address to the MMU, which then consults the TLB to check access permissions
> +and dirty bits and, if possible, resolves the physical address and permits
> +the requested type of access to the corresponding physical address.
> +
> +If the TLBs do not yet hold a translation for the address, the MMU may use
> +the Page Walk Caches and complete or restart the page table walks until a
> +physical address can finally be resolved. Permissions and dirty bits are
> +checked.
> +
> +In the context of a virtual memory system, like the one used by the Linux
> +kernel, each page of memory has associated permission and dirty bits.
> +
> +The dirty bit for a page is set (i.e., turned on) when the page is written
> +to. This indicates that the page has been modified since it was loaded into
> +memory. It probably needs to be written to disk, or other cores may need to
> +be informed about previous changes before allowing further operations.
> +
> +If nothing prevents it, eventually the physical memory can be accessed and
> +the requested operation on the physical frame is performed.
> +
> +There are several reasons why the MMU can't find certain translations. It
> +could happen because the process is trying to access a range of memory that
> +it is not allowed to access, or because the data is not present in RAM.
> +
> +When one of these conditions occurs, the MMU triggers a page fault, an
> +exception that makes the CPU pause the current process and run a special
> +function to handle the fault.
> +
> +One cause of page faults is bugs (or maliciously crafted addresses): a
> +process tries to access a range of memory that it doesn't have permission
> +to access. This could be because the memory is reserved for the kernel or
> +for another process, or because the process is trying to write to a
> +read-only section of memory. When this happens, the kernel sends a
> +Segmentation Fault (SIGSEGV) signal to the process, which usually causes
> +the process to terminate.
> +
> +An expected and more common cause of page faults is an optimization called
> +"lazy allocation". This is a technique used by the kernel to improve memory
> +efficiency and reduce footprint. Instead of allocating physical memory to a
> +process as soon as it's requested, the kernel waits until the process
> +actually tries to use the memory. This can save a significant amount of
> +memory in cases where a process requests a large block but only uses a
> +small portion of it.
> +
> +A related technique is called "Copy-on-Write" (CoW), where the kernel allows
> +multiple processes to share the same physical memory as long as they're only
> +reading from it. If a process tries to write to the shared memory, the write
> +triggers a page fault and the kernel allocates a separate copy of the memory
> +for the process. This allows the kernel to save memory and avoid unnecessary
> +data copying and, by doing so, it reduces latency and space occupation.
> +
> +Now, let's see how the Linux kernel handles these page faults:
> +
> +1. For most architectures, `do_page_fault()` is the primary interrupt handler
> +   for page faults. It delegates the actual handling of the page fault to
> +   `handle_mm_fault()`.
> +   This function checks the cause of the page fault and takes the
> +   appropriate action, such as loading the required page into memory,
> +   granting the process the necessary permissions, or sending a SIGSEGV
> +   signal to the process.
> +
> +2. In the specific case of the x86 architecture, the interrupt handler is
> +   defined by the `DEFINE_IDTENTRY_RAW_ERRORCODE()` macro, which calls
> +   `handle_page_fault()`. This function then calls either
> +   `do_user_addr_fault()` or `do_kern_addr_fault()`, depending on whether
> +   the fault occurred in user space or kernel space. Both of these functions
> +   eventually lead to `handle_mm_fault()`, similar to the workflow in other
> +   architectures.
> +
> +`handle_mm_fault()` (likely) ends up calling `__handle_mm_fault()` to carry
> +out the actual work of allocating the page tables. It uses several
> +functions to find the entry offsets in the four or five levels of tables
> +and to allocate the tables it needs. The functions that look for the offset
> +have names like `*_offset()`, where the "*" stands for pgd, p4d, pud, pmd,
> +pte; the functions that allocate the corresponding tables, level by level,
> +are named `*_alloc()`, following the same convention of naming them after
> +the corresponding types of tables in the hierarchy.
> +
> +At the very end of the walk with allocations, if it didn't return errors,
> +`__handle_mm_fault()` finally calls `handle_pte_fault()`, which via
> +`do_fault()` performs one of `do_read_fault()`, `do_cow_fault()`,
> +`do_shared_fault()`. "read", "cow", "shared" give hints about the reason
> +for, and the kind of, fault being handled.
> +
> +The actual implementation of the workflow is very complex. Its design allows
> +Linux to handle page faults in a way that is tailored to the specific
> +characteristics of each architecture, while still sharing a common overall
> +structure.
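
An inline note on the `*_offset()` helpers described above: below is a rough
user-space sketch, not kernel code, of how a virtual address can be split into
one table index per level. It assumes x86-64 with 5-level paging and 4 KiB
pages, i.e. 9 index bits per level above a 12-bit page offset. The names
`pt_index()`, `page_offset()` and the enum constants are made up for
illustration; the kernel's real helpers are architecture-specific and also
take table pointers.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: x86-64, 5-level paging, 4 KiB pages. */
#define PAGE_SHIFT     12
#define BITS_PER_LEVEL 9
#define PTRS_PER_TABLE (1u << BITS_PER_LEVEL)

/* Hypothetical level numbering, lowest table first. */
enum pt_level { PTE_LVL, PMD_LVL, PUD_LVL, P4D_LVL, PGD_LVL, NR_LEVELS };

/* Index of the entry covering 'vaddr' in the table at 'level'. */
unsigned int pt_index(uint64_t vaddr, enum pt_level level)
{
	assert(level < NR_LEVELS);
	unsigned int shift = PAGE_SHIFT + BITS_PER_LEVEL * (unsigned int)level;

	return (unsigned int)((vaddr >> shift) & (PTRS_PER_TABLE - 1));
}

/* Byte offset of 'vaddr' inside its 4 KiB page. */
unsigned int page_offset(uint64_t vaddr)
{
	return (unsigned int)(vaddr & ((1u << PAGE_SHIFT) - 1));
}
```

Under these assumptions the shifts are 48 (pgd), 39 (p4d), 30 (pud), 21 (pmd)
and 12 (pte). A walk would use the pgd index to pick an entry in the top-level
table, follow it to the next table, and repeat down to the pte.
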
> +
> +To conclude this brief, high-level overview of how Linux handles page
> +faults, let's add that the page fault handler can be disabled and enabled
> +with `pagefault_disable()` and `pagefault_enable()`, respectively.
> +
> +Several code paths make use of these two functions because they need to
> +disable traps into the page fault handler, mostly to prevent deadlocks.[1]
> +
> +[1] mm/userfaultfd: Replace kmap/kmap_atomic() with kmap_local_page()
> +https://lore.kernel.org/all/20221025220136.2366143-1-ira.weiny@intel.com/
> --
> 2.41.0
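
Coming back to the "lazy allocation" paragraph in the patch: the effect can
usually be observed from user space. The sketch below is only an illustration
under assumptions, not part of the patch. It maps anonymous memory with
`mmap()`, writes one byte to each page, and reports how many minor page faults
`getrusage()` counted in between. The function names `minor_faults()` and
`faults_when_touching()` are hypothetical.

```c
#define _DEFAULT_SOURCE		/* for MAP_ANONYMOUS on strict compilers */
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/* Minor faults taken so far by this process, or -1 on error. */
long minor_faults(void)
{
	struct rusage ru;

	return getrusage(RUSAGE_SELF, &ru) == 0 ? ru.ru_minflt : -1;
}

/*
 * Map 'npages' anonymous pages, write one byte to each of them, and return
 * the number of minor faults counted while touching them (-1 on error).
 */
long faults_when_touching(size_t npages)
{
	long page = sysconf(_SC_PAGESIZE);

	if (page <= 0 || npages == 0)
		return -1;

	size_t len = npages * (size_t)page;
	void *map = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (map == MAP_FAILED)
		return -1;

	/* volatile keeps the compiler from dropping the stores. */
	volatile char *buf = map;
	long before = minor_faults();

	for (size_t i = 0; i < npages; i++)
		buf[i * (size_t)page] = 1;	/* first write to each page */

	long after = minor_faults();

	munmap(map, len);
	return after - before;
}
```

On a typical Linux system I would expect the returned count to be close to the
number of pages touched, since no physical page is assigned before the first
write. The exact number is not guaranteed: it depends on the kernel
configuration (huge pages, for example) and on how the environment accounts
for faults.
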