From: Yang Shi
Date: Thu, 8 Jun 2023 11:50:44 -0700
Subject: Re: [LSF/MM/BPF TOPIC] HGM for hugetlbfs
To: David Hildenbrand
Cc: David Rientjes, Mike Kravetz, Yosry Ahmed, James Houghton,
 Naoya Horiguchi, Miaohe Lin, lsf-pc@lists.linux-foundation.org,
 linux-mm@kvack.org, Peter Xu, Michal Hocko, Matthew Wilcox,
 Axel Rasmussen, Jiaqi Yan

On Wed, Jun 7, 2023 at 11:34 PM David Hildenbrand wrote:
>
> On 08.06.23 02:02, David Rientjes wrote:
> > On Wed, 7 Jun 2023, Mike Kravetz wrote:
> >
> >>>>>> Are there strong objections to extending hugetlb for this support?
> >>>>>
> >>>>> I don't want to get too involved in this discussion (busy), but I
> >>>>> absolutely agree on the points that were raised at LSF/MM that
> >>>>>
> >>>>> (A) hugetlb is complicated and very special (many things not integrated
> >>>>> with core-mm, so we need special-casing all over the place). [example:
> >>>>> what is a pte?]
> >>>>>
> >>>>> (B) We added a bunch of complexity in the past that some people
> >>>>> considered very important (and it was not feature frozen, right? ;)).
> >>>>> Looking back, we might just not have done some of that, or done it
> >>>>> differently/cleaner -- better integrated in the core. (PMD sharing,
> >>>>> MAP_PRIVATE, a reservation mechanism that still requires preallocation
> >>>>> because it fails with NUMA/fork, ...)
> >>>>>
> >>>>> (C) Unifying hugetlb and the core looks like it's getting more and more
> >>>>> out of reach, maybe even impossible with all the complexity we added
> >>>>> over the years (well, and keep adding).
> >>>>>
> >>>>> Sure, HGM for the purpose of better hwpoison handling makes sense. But
> >>>>> hugetlb is probably 20 years old and hwpoison handling probably 13 years
> >>>>> old. So we managed to get quite far without that optimization.
> >>>>>
> >
> > Sane handling for memory poisoning and optimizations for live migration
> > are both much more important for the real-world 1GB hugetlb user, so it
> > doesn't quite have that lengthy of a history.
> >
> > Unfortunately, cloud providers receive complaints about both of these from
> > customers. They are one of the most significant causes for poor customer
> > experience.
> >
> > While people have proposed 1GB THP support in the past, it was nacked, in
> > part, because of the suggestion to just use existing 1GB support in
> > hugetlb instead :)

Yes, but that was before HGM was proposed; we may want to revisit it.
>
> Yes, because I still think that the use for "transparent" (for the user)
> nowadays is very limited and not worth the complexity.
>
> IMHO, what you really want is a pool of large pages (with guarantees
> about availability and nodes) and fine control over who gets these
> pages. That's what hugetlb provides.

The main concern with 1G THP is allocation time. But I don't think that
rules out allocating THP from a preallocated pool, for example, CMA.

>
> In contrast to THP, you don't want to allow for
> * Partially mmap, mremap, munmap, mprotect them
> * Partially sharing them / COW'ing them
> * Partially mixing them with other anon pages (MADV_DONTNEED + refault)

IIRC, QEMU treats hugetlbfs as having a 2M block size; we should be able
to teach QEMU to treat tmpfs + THP as having a 2M block size too. I used
to have a patch to make stat.st_blksize return the THP size for tmpfs
(89fdcd262fd4 mm: shmem: make stat.st_blksize return huge page size if
THP is on). So when applications are aware of the 2M or 1G page/block
size, that may help reduce partial mappings. But I'm not an expert on
QEMU, so I may be missing something.

> * Exclude them from some features KSM/swap
> * (swap them out and eventually split them for that)

We have the "noswap" mount option for tmpfs now, so swap is not a
problem.

But we may lose some features, for example, PMD sharing, hugetlb cgroup,
etc. I'm not sure whether those are showstoppers.

So having 1G THP sounds easier than HGM, IMHO, unless I'm missing
something vital.

>
> Because you don't want to get these pages PTE-mapped by the system
> *unless* there is a real reason (HGM, hwpoison) -- you want guarantees.
> Once such a page is PTE-mapped, you only want to collapse in place.
>
> But you don't want special-HGM, you simply want the core to PTE-map them
> like a (file) THP.
>
> IMHO, getting that realized much easier would be if we wouldn't have to
> care about some of the hugetlb complexity I raised (MAP_PRIVATE, PMD
> sharing), but maybe there is a way ...
>
> >
> >>>>> Absolutely, HGM for better postcopy live migration also makes sense, I
> >>>>> guess nobody disagrees on that.
> >>>>>
> >>>>>
> >>>>> But as discussed in that session, maybe we should just start anew and
> >>>>> implement something that integrates nicely with the core, instead of
> >>>>> making hugetlb more complicated and even more special.
> >>>>>
> >
> > Certainly an ideal would be where we could support everybody's use cases
> > in a much more cohesive way with the rest of the core MM. I'm
> > particularly concerned about how long it will take to get to that state
> > even if we had kernel developers committed to doing the work. Even if we
> > had a design for this new subsystem that was more tightly coupled with the
> > core MM, it would take O(years) to implement, test, extend for other
> > architectures, and that's before any existing users of hugetlb could
> > make the changes in the rest of their software stack to support it.
>
> One interesting experiment would be, to just take hugetlb and remove all
> complexity (strip it to its core: a pooling of large pages without
> special MAP_PRIVATE support, PMD sharing, reservations, ...). Then, see
> how to get core-mm to just treat them like PUD/PMD-mapped folios that
> can get PTE-mapped -- just like we have with FS-level THP.
>
> Maybe we could then factor out what's shared with the old hugetlb
> implementations (e.g., pooling) and have both co-exist (e.g., configured
> at runtime).
>
> The user-space interface for hugetlb would not change (well, except fail
> MAP_PRIVATE for now)
>
> (especially, no messing with anon hugetlb pages)
>
>
> Again, the spirit would be "teach the core to just treat them like
> folios that can get PTE-mapped" instead of "add HGM to hugetlb".
> If we can achieve that without a hugetlb v2, great. But I think that will be
> harder .... but I might be just wrong.
>
> --
> Cheers,
>
> David / dhildenb
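
To make the st_blksize remark above a bit more concrete, here is a
minimal userspace sketch of how an application might read the block size
a filesystem reports and keep its map/unmap ranges to whole blocks. It
is only an illustration: the helper names (preferred_block_size,
is_block_aligned, round_to_blocks) are hypothetical, and it does not
claim to reflect what QEMU actually does. What stat() returns for a
given tmpfs mount depends on the kernel and the mount's THP settings.

```python
import os


def preferred_block_size(path):
    """Return st_blksize, the preferred I/O block size reported for path."""
    return os.stat(path).st_blksize


def is_block_aligned(offset, length, block_size):
    """True if [offset, offset + length) consists only of whole blocks."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return length > 0 and offset % block_size == 0 and length % block_size == 0


def round_to_blocks(offset, length, block_size):
    """Expand a byte range outward to block boundaries.

    Returns (new_offset, new_length), both multiples of block_size.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    start = (offset // block_size) * block_size
    # Ceiling division, so the end lands on the next block boundary.
    end = -(-(offset + length) // block_size) * block_size
    return start, end - start
```

An application that only ever maps, unmaps, or discards ranges for which
is_block_aligned() holds (with block_size taken from stat() on the
backing file) would never create a partial mapping of a huge page, which
is the behaviour hoped for above.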