From: David Hildenbrand
Organization: Red Hat
To: Jason Gunthorpe, Matthew Wilcox
Cc: David Rientjes, Mike Kravetz, Yosry Ahmed, James Houghton, Naoya Horiguchi, Miaohe Lin, lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Peter Xu, Michal Hocko, Axel Rasmussen, Jiaqi Yan
Subject: Re: [LSF/MM/BPF TOPIC] HGM for hugetlbfs
Date: Tue, 13 Jun 2023 17:15:36 +0200
References: <20230602172723.GA3941@monkey> <7e0ce268-f374-8e83-2b32-7c53f025fec5@google.com> <7c42a738-d082-3338-dfb5-fd28f75edc58@redhat.com> <75d5662a-a901-1e02-4706-66545ad53c5c@redhat.com> <20230607220651.GC4122@monkey> <686e3e61-704e-1258-8a8b-f18399b41668@google.com>

On 13.06.23 16:59, Jason Gunthorpe wrote:
> On Thu, Jun 08, 2023 at 09:10:15PM +0100, Matthew Wilcox wrote:
>> On Thu, Jun 08, 2023 at 08:34:10AM +0200, David Hildenbrand wrote:
>>> On 08.06.23 02:02, David Rientjes wrote:
>>>> While people have proposed 1GB THP support in the past, it was nacked, in
>>>> part, because of the suggestion to just use existing 1GB support in
>>>> hugetlb instead :)
>>>
>>> Yes, because I still think that the use for "transparent" (for the user)
>>> nowadays is very limited and not worth the complexity.
>>>
>>> IMHO, what you really want is a pool of large pages with guarantees (about
>>> availability and nodes) and fine control over who gets these pages. That's
>>> what hugetlb provides.
>>>
>>> In contrast to THP, you don't want to allow for
>>> * Partially mmap, mremap, munmap, mprotect them
>>> * Partially sharing them / COW'ing them
>>> * Partially mixing them with other anon pages (MADV_DONTNEED + refault)
>>> * Exclude them from some features KSM/swap
>>> * (swap them out and eventually split them for that)
>>>
>>> Because you don't want to get these pages PTE-mapped by the system *unless*
>>> there is a real reason (HGM, hwpoison) -- you want guarantees. Once such a
>>> page is PTE-mapped, you only want to collapse in place.
>>>
>>> But you don't want special-HGM, you simply want the core to PTE-map them
>>> like a (file) THP.
>>>
>>> IMHO, getting that realized would be much easier if we didn't have to care
>>> about some of the hugetlb complexity I raised (MAP_PRIVATE, PMD sharing),
>>> but maybe there is a way ...
>>
>> I favour a more evolutionary than revolutionary approach. That is,
>> I think it's acceptable to add new features to hugetlbfs _if_ they're
>> combined with cleanup work that gets hugetlbfs closer to the main mm.
>> This is why I harp on things like pagewalk that currently need special
>> handling for hugetlb -- that's pointless; they should just be treated as
>> large folios. GUP handles hugetlb separately too, and I'm not sure why.
>
> Yes, this echoes my feelings too.
>
> Making all the special core-mm cases around hugetlb even more
> complicated with HGM seems like a non-starter.
>
> We need to get to a point where the core-mm handles all the PTE
> programming and supports arbitrary order folios in the page tables
> uniformly for everyone.
>
> hugetlb is just a special high order folio provider.
>
> Get rid of all the special PTE formats, unique arch code, and special
> code in gup.c/pagewalkers/etc that supports hugetlbfs.
>
> I think the general path to do that is to make the core-mm and all the
> hugetlb supporting arches support a core-code path for working with
> high order folios in page tables.
>
> Maybe this is demo'd & tested with a temporary/simplified hugetlbfs
> uAPI. When the core MM and all the arches are ready you switch
> hugetlbfs to use the new core API and delete all the page walk
> special cases.
>
> From there you can then teach the core code to do all the splitting
> and whatever that you want.

Yes, that's my hope. As I said, some existing oddities like PMD sharing
(VM use-cases don't really require that) and MAP_PRIVATE handling (again,
VMs don't really require that) could make the conversion more
problematic ... IMHO

So maybe we should really factor out the core hugetlb pooling logic and
write a simplified v2 implementation that integrates nicely with the VM
without all of these oddities. We can then either port some of these
oddities step by step from v1 to v2 or replace them with something better
(for example: if we really want MAP_PRIVATE, then just do it like with any
other file and use ordinary anon (THP)).

One day, we can then just switch to v2 and remove v1. If we manage
without any uABI changes, great.

Doing all of the conversion in-place could turn out to be extremely painful
and take much longer ... but I might just be taught otherwise.

As you say, hugetlb should just be a special folio provider ...

We can discuss tomorrow.

-- 
Cheers,

David / dhildenb