Date: Thu, 8 Jun 2023 19:59:00 -0700 (PDT)
From: David Rientjes
To: Matthew Wilcox
cc: David Hildenbrand, Mike Kravetz, Yosry Ahmed, James Houghton,
    Naoya Horiguchi, Miaohe Lin, lsf-pc@lists.linux-foundation.org,
    linux-mm@kvack.org, Peter Xu, Michal Hocko, Axel Rasmussen, Jiaqi Yan
Subject: Re: [LSF/MM/BPF TOPIC] HGM for hugetlbfs
Message-ID: <025dec2b-8363-8715-7ed6-470d243d921e@google.com>
References: <20230602172723.GA3941@monkey>
    <7e0ce268-f374-8e83-2b32-7c53f025fec5@google.com>
    <7c42a738-d082-3338-dfb5-fd28f75edc58@redhat.com>
    <75d5662a-a901-1e02-4706-66545ad53c5c@redhat.com>
    <20230607220651.GC4122@monkey>
    <686e3e61-704e-1258-8a8b-f18399b41668@google.com>

On Thu, 8 Jun 2023, Matthew Wilcox wrote:

> On Thu, Jun 08, 2023 at 08:34:10AM +0200, David Hildenbrand wrote:
> > On 08.06.23 02:02, David Rientjes wrote:
> > > While people have proposed 1GB THP support in the past, it was nacked, in
> > > part, because of the suggestion to just use existing 1GB support in
> > > hugetlb instead :)
> >
> > Yes, because I still think that the use for "transparent" (for the user)
> > nowadays is very limited and not worth the complexity.
> >
> > IMHO, what you really want is a pool of large pages that (guarantees about
> > availability and nodes) and fine control about who gets these pages. That's
> > what hugetlb provides.
> >
> > In contrast to THP, you don't want to allow for
> > * Partially mmap, mremap, munmap, mprotect them
> > * Partially sharing then / COW'ing them
> > * Partially mixing them with other anon pages (MADV_DONTNEED + refault)
> > * Exclude them from some features KSM/swap
> > * (swap them out and eventually split them for that)
> >
> > Because you don't want to get these pages PTE-mapped by the system *unless*
> > there is a real reason (HGM, hwpoison) -- you want guarantees. Once such a
> > page is PTE-mapped, you only want to collapse in place.
> >
> > But you don't want special-HGM, you simply want the core to PTE-map them
> > like a (file) THP.
> >
> > IMHO, getting that realized much easier would be if we wouldn't have to care
> > about some of the hugetlb complexity I raised (MAP_PRIVATE, PMD sharing),
> > but maybe there is a way ...
>
> I favour a more evolutionary than revolutionary approach. That is,
> I think it's acceptable to add new features to hugetlbfs _if_ they're
> combined with cleanup work that gets hugetlbfs closer to the main mm.
> This is why I harp on things like pagewalk that currently need special
> handling for hugetlb -- that's pointless; they should just be treated as
> large folios. GUP handles hugetlb separately too, and I'm not sure why.
>
> That's not to be confused with "hugetlb must change to be more like
> the regular mm". Sometimes both are bad, stupid and wrong, and need to
> be changed. The MM has never had to handle 1GB pages before and, eg,
> handling mapcount by iterating over each struct page is not sensible
> because that's 16MB of data just to answer folio_mapcount().
>

Ok, so I'll latch onto this feedback because I think it's (1) a concrete
path forward to solve existing real-world pain by adding support to
hugetlb (to address hwpoison and postcopy live migration latency) and
(2) an overall and long-awaited improvement in maintainability for the
MM subsystem.

Nobody on this thread is interested in substantially increasing the
complexity of hugetlb. That's true from the standpoint of sheer
maintainability, but also reliability. We've been bitten time and time
again by hugetlb-only reliability issues, which are their own class of
customer complaints. These have not only been in hugetlb's reservation
code.

In fact, from my POV, hugetlb *reliability* is the most important topic
discussed so far in this thread and that can be substantially improved
by this evolutionary approach that reduces the "special casing" that is
the hugetlb subsystem today. We would very much want a unified way of
handling page walks, for example. I don't think anybody here is
advocating for making hugetlb more of a snowflake :)

Improving hugetlb maintainability *and* reliability is of critical
importance to us, as is solving memory poisoning and live migration
latency. I don't think that one needs to block the other.

So the work to improve hugetlb reliability and maintainability is
something that can be tractable and we'd definitely like feedback on so
that we can contribute to it. I'd very much prefer that this does not
get in the way of solving real-world problems that HGM addresses just
because it's an active source of real customer issues today. Rest
assured, making forward progress on HGM will not reduce our interest in
improving hugetlb maintainability :)

I know that James is very eager to receive code review for the HGM
series itself from anybody who would be willing to review it. Is there
a way to make forward progress on deciding whether HGM (with any code
review comments addressed) has a path forward?