Date: Wed, 7 Jun 2023 17:02:37 -0700 (PDT)
From: David Rientjes
To: Mike Kravetz
cc: David Hildenbrand, Yosry Ahmed, James Houghton, Naoya Horiguchi,
    Miaohe Lin, lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org,
    Peter Xu, Michal Hocko, Matthew Wilcox, Axel Rasmussen, Jiaqi Yan
Subject: Re: [LSF/MM/BPF TOPIC] HGM for hugetlbfs
In-Reply-To: <20230607220651.GC4122@monkey>
Message-ID: <686e3e61-704e-1258-8a8b-f18399b41668@google.com>
References: <20230306191944.GA15773@monkey> <20230602172723.GA3941@monkey>
    <7e0ce268-f374-8e83-2b32-7c53f025fec5@google.com>
    <7c42a738-d082-3338-dfb5-fd28f75edc58@redhat.com>
    <75d5662a-a901-1e02-4706-66545ad53c5c@redhat.com>
    <20230607220651.GC4122@monkey>

On Wed, 7 Jun 2023, Mike Kravetz wrote:

> > > > > Are there strong objections to extending hugetlb for this support?
> > > >
> > > > I don't want to get too involved in this discussion (busy), but I
> > > > absolutely agree on the points that were raised at LSF/MM that
> > > >
> > > > (A) hugetlb is complicated and very special (many things not integrated
> > > > with core-mm, so we need special-casing all over the place). [example:
> > > > what is a pte?]
> > > >
> > > > (B) We added a bunch of complexity in the past that some people
> > > > considered very important (and it was not feature frozen, right? ;) ).
> > > > Looking back, we might just not have done some of that, or done it
> > > > differently/cleaner -- better integrated in the core. (PMD sharing,
> > > > MAP_PRIVATE, a reservation mechanism that still requires preallocation
> > > > because it fails with NUMA/fork, ...)
> > > >
> > > > (C) Unifying hugetlb and the core looks like it's getting more and more
> > > > out of reach, maybe even impossible with all the complexity we added
> > > > over the years (well, and keep adding).
> > > >
> > > > Sure, HGM for the purpose of better hwpoison handling makes sense. But
> > > > hugetlb is probably 20 years old and hwpoison handling probably 13 years
> > > > old. So we managed to get quite far without that optimization.
> > > >

Sane handling for memory poisoning and optimizations for live migration are
both much more important for the real-world 1GB hugetlb user, so it doesn't
quite have that lengthy of a history. Unfortunately, cloud providers
receive complaints about both of these from customers. They are one of the
most significant causes for poor customer experience. While people have
proposed 1GB THP support in the past, it was nacked, in part, because of
the suggestion to just use existing 1GB support in hugetlb instead :)

> > > > Absolutely, HGM for better postcopy live migration also makes sense, I
> > > > guess nobody disagrees on that.
> > > >
> > > >
> > > > But as discussed in that session, maybe we should just start anew and
> > > > implement something that integrates nicely with the core , instead of
> > > > making hugetlb more complicated and even more special.
> > > >

Certainly an ideal would be where we could support everybody's use cases in
a much more cohesive way with the rest of the core MM.
I'm particularly concerned about how long it will take to get to that state
even if we had kernel developers committed to doing the work. Even if we
had a design for this new subsystem that was more tightly coupled with the
core MM, it would take O(years) to implement, test, extend for other
architectures, and that's before any existing users of hugetlb could make
the changes in the rest of their software stack to support it. We have no
other solution today for 1GB support in Linux, so waiting O(years) for this
yet-to-be-designed future *is* going to cause compounding customer pain in
the real world.

> > > > Now, we all know, nobody wants to do the heavy lifting for that, that's
> > > > why we're discussing how to get in yet another complicated feature.
> > >
> > > If nobody wants to do the heavy lifting and unifying hugetlb with core
> > > MM is becoming impossible as you state, then does adding another
> > > feature to hugetlb (that we are all agreeing is useful for multiple
> > > use cases) really making things worse? In other words, if someone
> >
> > Well, if we (as a community) reject more complexity and outline an
> > alternative of what would be acceptable (rewrite), people that really want
> > these new features will *have to* do the heavy lifting.
> >
> > [and I see many people from employers that might have the capacity to do the
> > heavy lifting if really required being involved in the discussion around HGM
> > :P ]
> >
> > > decides tomorrow to do the heavy lifting, how much harder does this
> > > become because of HGM, if any?
> > >
> > > I am the farthest away from being an expert here, I am just an
> > > observer here, but if the answer to the above question is "HGM doesn't
> > > actually make it worse" or "HGM only slightly makes things harder",
> > > then I naively think that it's something that we should do, from a
> > > pure cost-benefit analysis.
> >
> > Well, there is always the "maintainability" aspect, because upstream has to
> > maintain whatever complexity gets merged. No matter what, we'll have to keep
> > maintaining the current set of hugetlb features until we can eventually
> > deprecate it/some in the far, far future.
> >
> > I, for my part, am happy as long as I can stay away as far as possible from
> > hugetlb code. Again, Mike is the maintainer.
>
> Thanks for the reminder :)
>
> Maintainability is my primary concern with HGM. That is one of the reasons
> I proposed James pitch the topic at LSFMM. Even though I am the 'maintainer'
> changes introduced by HGM will impact others working in mm.
>
> > What I saw so far regarding HGM does not count as "slightly makes things
> > harder".
> >
> > > Again, I don't have a lot of context here, and I understand everyone's
> > > frustration with the current state of hugetlb. Just my 2 cents.
> >
> > The thing is, we all agree that something that hugetlb provides is valuable
> > (i.e., pool of huge/large pages that we can map large), just that after 20
> > years there might be better ways of doing it and integrating it better with
> > core-mm.
>
> I am struggling with how to support existing hugetlb users that are running
> into issues like memory errors on hugetlb pages today. And, yes that is a
> source of real customer issues. They are not really happy with the current
> design that a single error will take out a 1G page, and their VM or
> application. Moving to THP is not likely as they really want a pre-allocated
> pool of 1G pages. I just don't have a good answer for them.

Fully agreed, these customer complaints are a very real and significant
problem that is actively causing pain today for 1GB users. That can't be
overstated. Same for the user who is live migrated because of a disruptive
software update on the host.
We would very much like a future where the hugetlb subsystem is more
closely integrated with the core mm just because of subtle bugs that have
popped up over time in hugetlb, including very complex reservation code.
We've funded an initiative around hugetlb reliability because of a critical
dependency on the subsystem as the *only* way to support 1GB mappings.
Don't get me wrong: integration with core mm is very beneficial from a
reliability and maintenance perspective. I just don't think the right
solution is to mandate O(years) of work *before* we can possibly stop the
very real customer pain.