From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 30 Jan 2025 17:45:39 -0500
From: Peter Xu
To: Oscar Salvador
Cc: lsf-pc@lists.linux-foundation.org, David Hildenbrand, Muchun Song,
    linux-mm@kvack.org
Subject: Re: [LSF/MM/BPF TOPIC] HugeTLB generic pagewalk

On Thu, Jan 30, 2025 at 10:36:51PM +0100, Oscar Salvador wrote:
> Hi,

Hello, Oscar,

>
> last year Peter Xu did a presentation at LSF/MM on how to better integrate hugetlb
> in the mm core.
> There are several reasons we want to do that, but one could say that the two that
> matter the most are 1) code duplication and 2) making hugetlb less special.
>
> During the last year several patches that went in that direction were merged e.g:
> gup hugetlb unify [1], mprotect for dax PUDs [2], hugetlb into generic unmapping
> path [3] to name some.
>
> There was also a concern on how to integrate hugetlb into the generic pagewalk,
> thereby getting rid of a lot of code and having a generic path that could handle
> everything.
> This was first worked on in [4] (very basic draft).
>
> Although a second version is in the works, I would like to present some concerns
> I have wrt. that work.
>
> HugeTLB has its own way of dealing with things.
> E.g: HugeTLB interprets everything as a pte: huge_pte_uffd_wp, huge_pte_clear_uffd_wp,
> huge_pte_dirty, huge_pte_modify, huge_pte_wrprotect etc.
>
> One of the challenges that this raises is that if we want pmd/pud walkers to
> be able to make sense of hugetlb stuff, we need to implement pud/pmd
> (maybe some pmd we already have because of THP) variants of those.
>
> E.g: HugeTLB code uses is_swap_pte and pte_to_swp_entry.
> If we want PUD walkers to be able to handle hugetlb, this means that we would
> need some sort of is_swap_pud and pud_to_swp_entry implementations.
> The same happens with a handful of other functions (e.g: huge_pte_*_uffd_wp,
> hugetlb pte markers, etc.)
>
> This has never been a problem because hugetlb has its way of doing things
> and we implemented code around that logic, but this falls off the cliff as
> soon as we want to make it less special and more generic, because we need to
> start implementing all those pte_* variants for pud/pmd_*
>
> I would like to know how people feel about it, whether this is something worth
> pursuing, or we live with the fact that HugeTLB is special, and so it remains
> this way.
>
> [1]
> https://patchwork.kernel.org/project/linux-mm/cover/20240327152332.950956-1-peterx@redhat.com/
> [2]
> https://patchwork.kernel.org/project/linux-mm/cover/20240812181225.1360970-1-peterx@redhat.com/
> [3]
> https://patchwork.kernel.org/project/linux-mm/cover/20241007075037.267650-1-osalvador@suse.de/
> [4]
> https://patchwork.kernel.org/project/linux-mm/cover/20240704043132.28501-1-osalvador@suse.de/

Thanks for bringing up this topic. I won't be able to apply for LSF/MM
this year due to family plans, but I can share some quick thoughts in case
they're helpful. We have also had some relevant discussions on this, though
I guess most of them were not on the list.

I definitely agree that such a cleanup of hugetlb would always be nice,
especially on the pgtable side. Fundamentally, that's because a huge
mapping in the pgtables looks no different whether it comes from hugetlbfs
or from any other kind of huge mapping, at least on the archs I'm aware of.
In general, the pgtable part only defines the size of a mapping, not its
attributes.

I also once shared with you the concern I have about this work: we have
limited resources not only in developers willing to do this cleanup, but
also in people able to review it properly.
In general, knowing that hugetlbfs may be feature-frozen after the HGM
attempt, I started to evaluate the pros and cons of such a global cleanup,
and whether it will pay off for everyone, given the risk of easily breaking
existing hugetlbfs users.

I wouldn't be surprised if the whole effort took a change of at least five
digits of LOC to complete. Even assuming the idea is 100% workable and the
code is 100% perfect, that is still quite some effort. The gain from all
this work would be a clean code base for pgtable, with no functional change
(hopefully! unless we regress some perf here and there.. normally the
hugetlb API is _slightly_ faster.. even if uglier, per my possibly outdated
impression..).

So that's a major concern I have: whether we should stick with cleaning
everything up, or think about other approaches, e.g.:

  - We could still pick the low-hanging fruit where we see fit: changes
    that are self-contained and have a direct benefit. E.g. I think it may
    still make sense to finish your page walk API rewrite, at least if
    it's already halfway through (which is my gut feeling, but you know
    best..).

  - We could think about refactoring hugetlb in a way that makes it more
    usable and provides new features, rather than reworking a
    feature-frozen base from which we can't get more than "cleanups".

The latter is also why I started looking at using HugeTLB pages / folios
without hugetlbfs being present. So far gmem does look like a good
container for that, as confidential computing will have a similar demand to
allocate 1G pages from somewhere, and that "somewhere" shares a lot of
common issues with what hugetlbfs has to resolve. So it makes sense to me
to rework that part of hugetlbfs to suit more consumers (which I have
started to call "hugetlb pages / folios" vs. "hugetlbfs", just to
differentiate them from the file system).
And if gmem 1G can work with CoCo, it can be pretty simple to extend that
to !CoCo (which fundamentally means consuming gmem folios in place, with no
need for private<->shared conversions). That means there's a chance for a
VM cloud provider (private or public) to move over to gmem, completely
replacing hugetlbfs 1G, for either confidential or normal VMs. Then there
would be a hugetlb-based (not hugetlbfs-based) solution that is not
feature-frozen, and meanwhile whatever we rework in hugetlb from this
perspective would not only be cleanups, but would also pave the way for
anything built on top of hugetlb folios to work properly.

I don't think that justifies giving up on cleaning up the hugetlbfs code,
though. So there's still the third option: we could choose to try to finish
this work. It'll just be challenging in several respects.

Sorry for all these pretty random thoughts. Please ignore them if some (if
not all..) don't apply to the topic you plan to discuss!

Thanks,

-- 
Peter Xu