From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 25 May 2023 20:00:59 -0700 (PDT)
From: David Rientjes
To: James Houghton, Naoya Horiguchi, Miaohe Lin
cc: Mike Kravetz, lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org,
    Peter Xu, Michal Hocko, Matthew Wilcox, David Hildenbrand,
    Axel Rasmussen, Jiaqi Yan
Subject: Re: [LSF/MM/BPF TOPIC] HGM for hugetlbfs
References: <20230306191944.GA15773@monkey>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="2003089352-1374704305-1685070060=:464781"

This message is in MIME format. The first part should be readable text,
while the remaining parts are likely unreadable without MIME-aware tools.

--2003089352-1374704305-1685070060=:464781
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8BIT

On Wed, 24 May 2023, James Houghton wrote:

> Hi everyone,
>
> If you came to the HGM session at LSF/MM/BPF, thank you!

Thank you, James, for putting together such a detailed discussion and
soliciting some great feedback.

> I want to
> address some of the feedback I got and restate the importance of HGM,
> especially as it relates to handling memory poison.
>

Thanks for bringing this up, I think it's a very important use case.
Adding in Naoya Horiguchi and Miaohe Lin as well.

> ## Memory poison is a problem
>
> HGM allows us to unmap poison at 4K instead of unmapping the entire
> hugetlb page. For applications that use HugeTLB, losing the entire
> hugepage can be catastrophic. For example, if a hypervisor is using 1G
> pages for guest memory, the VM will lose 1G of its physical address
> space, which is catastrophic (even 2M will most likely kill the VM).
> If we can limit the poisoning to only 4K, the VM will most likely be
> able to recover. This improved recoverability applies to other HugeTLB
> users as well, like databases.
>

Mike, do you have feedback on how useful this would be, especially for use
cases beyond what cloud providers would find helpful?

> ## Adding a new filesystem has risks, and unification will take years
>
> Most of the feedback I got from the HGM session was to simply avoid
> adding new code to HugeTLB, and instead to make a new device or
> filesystem. Creating a new device or filesystem could work, but it
> leaves existing HugeTLB users with no answer for memory poison. Users
> would need to switch to the new device/filesystem if they want better
> hwpoison handling, and it will probably take years for the new
> device/filesystem to support all the features that HugeTLB supports
> today (so beyond PUD+ mappings, we would need page table sharing, page
> struct freeing, and even private mappings/CoW).
>
> If we make a new filesystem and are unable to completely implement the
> HugeTLB uapi exactly with that filesystem, we will be stuck unable to
> remove HugeTLB.
> We would strongly like to avoid coexisting HugeTLB
> implementations (similar to cgroup v1 and cgroup v2) if at all
> possible.
>
> Instead of making a new filesystem, we could add HugeTLB-like features
> to tmpfs, such as support for gigantic page allocations (from bootmem or
> CMA, like HugeTLB), for example. This path would work to mostly unify
> HugeTLB with tmpfs, but existing HugeTLB users will still have to wait
> for many years before poison can be handled more efficiently. (And
> some users care about things like hugetlb_cgroup!)
>
> ## HGM doesn’t hinder future unification
>
> HGM doesn’t add any new special cases into mm code; it takes advantage
> of the existing special cases that already exist to support HugeTLB.
> HGM also isn’t adding a completely novel feature that can’t be
> replicated by THPs: PTE-mapping of THPs is already supported.
>

I think this is important: there are deficiencies that HGM can fully
address (like the aforementioned smaller granularity page poisoning, as
well as optimized live migration) while not posing an obstacle for future
unification if possible.

If not for HGM, it would be great to get alignment on what needs to be
done so that we can support memory poisoning in smaller sizes for users of
1GB pages *and* optimized live migration for VMs backed by 1GB pages
without requiring a full unification of the HugeTLB subsystem with the
rest of core MM. While that unification has been discussed for several
years, it would be a shame if that became a full blocker to address these
real deficiencies that are actively causing pain.

> HGM solves a problem that HugeTLB users have right now: unnecessarily
> large portions of memory are poisoned. Unless we fix HugeTLB itself,
> we will have to spend years effectively rewriting HugeTLB and telling
> users to switch to the new system that gets built.
>
> Given all this, I think we should continue to move forward with HGM
> unless there is another feasible way to solve poisoning for existing
> HugeTLB users. Also, I encourage everyone to read the series itself
> (it's not all that complicated!).
>
> - James
>
--2003089352-1374704305-1685070060=:464781--