From: James Houghton
Date: Thu, 15 Jun 2023 10:24:24 -0700
Subject: Re: [Invitation] Linux MM Alignment Session on HugeTLB Core MM Convergence on Wednesday
To: David Hildenbrand
Cc: Michal Hocko, Mike Kravetz, David Rientjes, linux-mm@kvack.org, John Hubbard, Matthew Wilcox, Peter Xu, Vlastimil Babka, Zi Yan
References: <20230614230458.GB3559@monkey> <141b7088-684b-32dc-efe4-03713d38ae28@redhat.com>
In-Reply-To: <141b7088-684b-32dc-efe4-03713d38ae28@redhat.com>

On Thu, Jun 15, 2023 at 1:30 AM David Hildenbrand wrote:
>
> On 15.06.23 10:04, Michal Hocko wrote:
> > On Wed 14-06-23 16:04:58, Mike Kravetz wrote:
> >> On 06/12/23 18:59, David Rientjes wrote:
> >>> This week's topic will be a technical brainstorming session on HugeTLB
> >>> convergence with the core MM.
> >>> This has been discussed most recently in
> >>> this thread:
> >>> https://lore.kernel.org/linux-mm/ZIOEDTUBrBg6tepk@casper.infradead.org/T/
> >>
> >> Thank you David for putting this session together! And, thanks to everyone
> >> who participated.
> >>
> >> Following up on linux-mm with most active participants on Cc (sorry if I
> >> missed someone). If it makes more sense to continue the above thread,
> >> please move there.
> >>
> >> Even though everyone knows that hugetlb is special cased throughout the
> >> core mm, it came to a head with the proposed introduction of HGM. TBH,
> >> few people in the core mm community paid much attention to HGM when first
> >> introduced. An LSF/MM session was then dedicated to the discussion of
> >> HGM with the outcome being the suggestion to create a new filesystem/driver
> >> (hugetlb2 if you will) that would satisfy the use cases requiring HGM.
> >> One thing that was not emphasized at LSF/MM is that there are existing
> >> hugetlb users experiencing major issues that could be addressed with HGM:
> >> specifically the issues of memory errors and live migration. That was
> >> the starting point for recent discussion in the above thread.
> >>
> >> I may be wrong, but it appeared the direction of that thread was to
> >> first try and unify some of the hugetlb and core mm code. Eliminate
> >> some of the special casing. If hugetlb was less of a special case, then
> >> perhaps HGM would be more acceptable. That is the impression I (perhaps
> >> incorrectly) had going into today's session.
> >
> > My impression from the discussion yesterday was that the level of
> > unification would need to be really large and time consuming in order to
> > be useful for the HGM patchset to be in a more maintainable form. The
> > final outcome is quite hard to predict at this stage.
> >
> >> During today's session, we often discussed what would/could be introduced
> >> in a hugetlb v2.
> >> The idea is that this would be the ideal place for HGM.
> >> However, people also made the comparisons to cgroup v1 - v2. Such a
> >> redesign provides the needed 'clean slate' to do things right, but it
> >> does little for existing users who would be unwilling to quickly move off
> >> existing hugetlb.
> >>
> >> We did spend a good chunk of time on hugetlb/core mm unification and
> >> removing special casing. In some (most) of these cases, the benefit of
> >> removing special cases from core mm would result in adding more code to
> >> hugetlb. For example: proper type'ing so that hugetlb does not treat
> >> all page table entries as PTEs. Again, I may be wrong but I think
> >> people were OK with adding more code (and even complexity) to hugetlb
> >> if it eliminated special casing in the core mm. But, there did not
> >> seem to be a clear consensus especially with the thought that we may
> >> need to double hugetlb code to get types right.
> >
> > This is primarily your call as a maintainer. If you ask me, hugetlb is
> > over complicated in its current form already. Regressions are not really
> > seldom when code is added which is a signal we are hitting maintenance
> > cost walls. This doesn't mean further development is impossible of
> > course but it is increasingly more costly AFAICS.
> >
> >> Unless I missed something, there was no clear direction at the end of this
> >> session. I was hoping that we could come up with a plan to address the
> >> issues facing today's hugetlb users. IMO, there seem to be two options:
> >> 1) Start work on hugetlb v2 with the intention that customers will need
> >> to move to this to address their issues.
> >> 2) Incorporate functionality like HGM into existing hugetlb.
> >
>
> I fully agree with all that Michal said.
>
> I'm just going to add that I don't see why anyone would look into a
> hugetlbv2 if we're going to use the motivation of "help existing users"
> to make hugetlb ever-more complicated and special. "existing users" here
> even meaning "people use hugetlb for backing VMs. Now they want to get
> postcopy working with less latency." -- which I consider partially a new
> use case.
>
> So working on adding HGM and concurrently starting a hugetlbv2? I don't
> think that will happen if we decide on adding HGM and proceeding with
> that reasoning about existing users.
>
> As expressed yesterday, I don't see a fast and clean way to make hugetlb
> significantly less special (thanks Willy for the list of odd cases).
>
> Sure, we can talk about adding pte_t safety, but I don't really see a
> way forward to unify page table walking code that way -- there are still
> the (PT) locking, PMD sharing, PTE-cont special cases ... but sure, if
> anybody wants to work on that, why not.
>
> Having that said, like Michal, I acknowledge that it is Mike's call
> regarding the hugetlb code. I, for my part, will push back on any added
> core-mm complexity that adds more special casing for hugetlb. Maybe
> there are easy ways to integrate it nicely and that is not really a concern.

HGM is mostly contained in the already-existing HugeTLB special cases.
HGM doesn't really *add* special cases; it just makes the HugeTLB
special cases more complicated.

There are a few small ways that HGM touches non-hugetlb code:
1. Mapcount (to make hugetlb use the THP scheme) [1], newer version here [2]
2. madvise (to add MADV_SPLIT and update MADV_COLLAPSE) [3] and [4]
3. A small non-hugetlb change to page_vma_mapped_walk (provide pte_order) [5]
4. A small special case in try_to_unmap_one and try_to_migrate_one (to
   check the head page for page flags) [6]
5. smaps stats [7]

[1]: https://lore.kernel.org/linux-mm/20230218002819.1486479-6-jthoughton@google.com/
[2]: https://lore.kernel.org/linux-mm/20230306230004.1387007-1-jthoughton@google.com/
[3]: https://lore.kernel.org/linux-mm/20230218002819.1486479-10-jthoughton@google.com/
[4]: https://lore.kernel.org/linux-mm/20230218002819.1486479-35-jthoughton@google.com/
[5]: https://lore.kernel.org/linux-mm/20230218002819.1486479-27-jthoughton@google.com/
[6]: https://lore.kernel.org/linux-mm/20230218002819.1486479-29-jthoughton@google.com/
[7]: https://lore.kernel.org/linux-mm/20230218002819.1486479-39-jthoughton@google.com/

>
> Note that while we've been discussing how HGM would already interfere
> with core-mm, we've not even started discussing how actual
> MADV_SPLIT/MADV_COLLAPSE/page poisoning ... would affect core-mm and
> require special-casing for hugetlb.
>
> I, for my part, will explore a bit the mapcount topic (as time permits)
> and see if we can come up at least with a unified mapcount approach
> (e.g., sub-page mapcount?). But I suspect even figuring that out will
> take quite a while already ...

Thanks!

Simply using the current THP mapcount scheme with HGM isn't great (but
IIUC this isn't blocking HGM). By using this scheme, HugeTLB loses the
vmemmap optimization / page struct freeing when HGM is in use, and, of
course, this scheme gets slow with very large folios.