From mboxrd@z Thu Jan  1 00:00:00 1970
From: David Hildenbrand
Organization: Red Hat GmbH
To: Mike Kravetz, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Michal Hocko, Oscar Salvador, Zi Yan, David Rientjes, Andrew Morton
Subject: Re: [RFC PATCH 0/3] hugetlb: add demote/split page functionality
Date: Tue, 9 Mar 2021 18:50:23 +0100
Message-ID: <6c66c265-c9b9-ffe9-f860-f96f3485477e@redhat.com>
References: <20210309001855.142453-1-mike.kravetz@oracle.com> <29cb78c5-4fca-0f0a-c603-0c75f9f50d05@redhat.com>
On 09.03.21 18:11, Mike Kravetz wrote:
> On 3/9/21 1:01 AM, David Hildenbrand wrote:
>> On 09.03.21 01:18, Mike Kravetz wrote:
>>> To address these issues, introduce the concept of hugetlb page demotion.
>>> Demotion provides a means of 'in place' splitting of a hugetlb page to
>>> pages of a smaller size. For example, on x86 one 1G page can be
>>> demoted to 512 2M pages. Page demotion is controlled via sysfs files:
>>> - demote_size   Read only   target page size for demotion
>>> - demote        Writable    number of hugetlb pages to be demoted
>>>
>>> Only hugetlb pages which are free at the time of the request can be
>>> demoted. Demotion does not add to the complexity of surplus pages.
>>> Demotion also honors reserved huge pages. Therefore, when a value is
>>> written to the sysfs demote file, that value is only the maximum number
>>> of pages which will be demoted. It is possible fewer will actually be
>>> demoted.
>>>
>>> If demote_size is PAGESIZE, demote will simply free pages to the buddy
>>> allocator.
>>
>> With the vmemmap optimizations you will have to rework the vmemmap
>> layout. How is that handled? Couldn't it happen that you are half-way
>> through splitting a PUD into PMDs when you realize that you cannot
>> allocate vmemmap pages for properly handling the remaining PMDs? What
>> would happen then?
>>
>> Or are you planning on making both features mutually exclusive?
>>
>> Of course, one approach would be first completely restoring the vmemmap
>> for the whole PUD (allocating more pages than necessary in the end) and
>> then freeing individual pages again when optimizing the layout per PMD.
>>

> You are right about the need to address this issue. Patch 3 has the
> comment:
>
> +	/*
> +	 * Note for future:
> +	 * When support for reducing vmemmap of huge pages is added, we
> +	 * will need to allocate vmemmap pages here and could fail.
> +	 */
>

I only skimmed over the cover letter so far. :)

> The simplest approach would be to restore the entire vmemmap for the
> larger page and then delete it for the smaller pages after the split.
> We could hook into the existing vmemmap reduction code with just a few
> calls. This would fail to demote/split if the allocation fails.
> However, this is not optimal.
>
> Ideally, the code would compute how many pages for the vmemmap are
> needed after the split, allocate those, and then construct the vmemmap
> appropriately when creating the smaller pages.
>
> I think we would want to always do the allocation of vmemmap pages up
> front and not even start the split process if the allocation fails. No
> sense starting something we may not be able to finish.
>

Makes sense.

Another case might also be interesting: assume you allocated a gigantic
page via CMA and demoted it to huge pages. Theoretically (after Oscar's
series!), we could come back later and re-allocate a gigantic page via
CMA, migrating all now-huge pages out of the CMA region. That would
require telling CMA that the area is effectively no longer allocated via
CMA (adjusting accounting, bitmaps, etc.).

That would actually be a neat use case to form new gigantic pages later
on when necessary :)

But I assume your primary use case is demoting gigantic pages allocated
during boot, not via CMA.
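To make the up-front vmemmap accounting discussed above concrete, here is
the arithmetic for the x86 1G -> 512x2M case as a small sketch. The numbers
are assumptions on my side, not from the patches: 4 KiB base pages, a
64-byte struct page, and an optimization that keeps one vmemmap page per
hugetlb page; the real layout may keep more.

```python
# Illustrative arithmetic only -- not kernel code. Assumptions (mine, not
# from the thread): 4 KiB base pages, 64-byte struct page, and a vmemmap
# optimization keeping one vmemmap page per hugetlb page.
BASE_PAGE = 4096
STRUCT_PAGE = 64

def vmemmap_pages(hugepage_size):
    """Base pages of vmemmap covering one hugetlb page, unoptimized."""
    return (hugepage_size // BASE_PAGE) * STRUCT_PAGE // BASE_PAGE

def extra_vmemmap_for_demote(src_size, dst_size, kept_per_page=1):
    """Vmemmap pages to allocate up front before demoting one src_size
    page into src_size // dst_size pages of dst_size, when the
    optimization keeps kept_per_page vmemmap pages per hugetlb page."""
    n_dst = src_size // dst_size
    return (n_dst - 1) * kept_per_page

GB, MB = 1 << 30, 1 << 20
print(vmemmap_pages(GB))                     # 4096 pages (16 MiB) unoptimized
print(extra_vmemmap_for_demote(GB, 2 * MB))  # 511 pages for one 1G -> 512x2M split
```

Under those assumptions an optimized 1G page holds a single vmemmap page
while the 512 resulting 2M pages need one each, so 511 pages would have to
be allocated before starting the split, which is exactly why failing early
(before touching the PUD) seems attractive.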
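For completeness, the two sysfs files from the cover letter would
presumably be driven along these lines. This is only a sketch: the
hugepages-1048576kB directory path and the demote_size value format are my
assumptions, not something the quoted RFC text confirms.

```python
from pathlib import Path

# Assumed per-size sysfs directory for 1G pages on x86; the file names
# (demote_size, demote) are the ones from the cover letter.
GIGANTIC_DIR = Path("/sys/kernel/mm/hugepages/hugepages-1048576kB")

def demote_hugepages(sysfs_dir, count):
    """Ask the kernel to demote up to `count` free gigantic pages.

    Per the cover letter, the written value is only an upper bound:
    only free, unreserved pages are demoted, so fewer may actually be
    split.  Returns the target size read from demote_size.
    """
    sysfs_dir = Path(sysfs_dir)
    target = (sysfs_dir / "demote_size").read_text().strip()
    (sysfs_dir / "demote").write_text(str(count))
    return target
```

On a real system this would need root and the patched kernel
(`demote_hugepages(GIGANTIC_DIR, 8)`); since it only reads and writes two
files, it can be exercised against any directory containing stand-ins for
them.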
Maybe you addressed that already as well :)

> I purposely did not address that here, as first I wanted to get
> feedback on the usefulness of the demote functionality.
>

Makes sense. I think there could be some value in having this
functionality. Gigantic pages are rare and we might want to keep them as
long as possible (and as long as we have sufficient free memory). But
once we need huge pages (e.g., smaller VMs, different granularity
requirements), we could demote.

If we ever have pre-zeroing of huge/gigantic pages, your approach could
also avoid having to zero huge pages again when the gigantic page was
already zeroed.

-- 
Thanks,

David / dhildenb