From mboxrd@z Thu Jan 1 00:00:00 1970
From: David Hildenbrand
Organization: Red Hat
Date: Mon, 17 Jul 2023 16:55:00 +0200
Subject: Re: [PATCH v3 3/4] mm: FLEXIBLE_THP for improved performance
To: Ryan Roberts, Yu Zhao
Cc: Andrew Morton, Matthew Wilcox, "Kirill A. Shutemov", Yin Fengwei,
 Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, "Huang, Ying",
 Zi Yan, Luis Chamberlain, linux-arm-kernel@lists.infradead.org,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org
References: <20230714160407.4142030-1-ryan.roberts@arm.com>
 <20230714161733.4144503-3-ryan.roberts@arm.com>
 <82c934af-a777-3437-8d87-ff453ad94bfd@redhat.com>
 <2c4b2a41-1c98-0782-ac30-80e65bdb2b0c@arm.com>
 <2e7d5692-8ba7-1e56-a03f-449f1671b100@redhat.com>
 <4f89d7bf-2fe2-fa53-c7ca-e4f152ca0edf@arm.com>
In-Reply-To: <4f89d7bf-2fe2-fa53-c7ca-e4f152ca0edf@arm.com>

On 17.07.23 16:47, Ryan Roberts wrote:
> On 17/07/2023 14:56, David Hildenbrand wrote:
>> On 17.07.23 15:20, Ryan Roberts wrote:
>>> On 17/07/2023 14:06, David Hildenbrand wrote:
>>>> On 14.07.23 19:17, Yu Zhao wrote:
>>>>> On Fri, Jul 14, 2023 at 10:17 AM Ryan Roberts wrote:
>>>>>>
>>>>>> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be
>>>>>> allocated in large folios of a determined order. All pages of the large
>>>>>> folio are pte-mapped during the same page fault, significantly reducing
>>>>>> the number of page faults. The number of per-page operations (e.g. ref
>>>>>> counting, rmap management, lru list management) is also significantly
>>>>>> reduced since those ops now become per-folio.
>>>>>>
>>>>>> The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which
>>>>>> defaults to disabled for now; the long term aim is for this to default to
>>>>>> enabled, but there are some risks around internal fragmentation that
>>>>>> need to be better understood first.
>>>>>>
>>>>>> When enabled, the folio order is determined as such: For a vma, process
>>>>>> or system that has explicitly disabled THP, we continue to allocate
>>>>>> order-0. THP is most likely disabled to avoid any possible internal
>>>>>> fragmentation so we honour that request.
>>>>>>
>>>>>> Otherwise, the return value of arch_wants_pte_order() is used. For vmas
>>>>>> that have not explicitly opted-in to use transparent hugepages (e.g.
>>>>>> where thp=madvise and the vma does not have MADV_HUGEPAGE), then
>>>>>> arch_wants_pte_order() is limited by the new cmdline parameter,
>>>>>> `flexthp_unhinted_max`. This allows for a performance boost without
>>>>>> requiring any explicit opt-in from the workload while allowing the
>>>>>> sysadmin to tune between performance and internal fragmentation.
>>>>>>
>>>>>> arch_wants_pte_order() can be overridden by the architecture if desired.
>>>>>> Some architectures (e.g. arm64) can coalesce TLB entries if a contiguous
>>>>>> set of ptes map physically contiguous, naturally aligned memory, so this
>>>>>> mechanism allows the architecture to optimize as required.
>>>>>>
>>>>>> If the preferred order can't be used (e.g. because the folio would
>>>>>> breach the bounds of the vma, or because ptes in the region are already
>>>>>> mapped) then we fall back to a suitable lower order; first
>>>>>> PAGE_ALLOC_COSTLY_ORDER, then order-0.
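As a reading aid (not code from the patch or from anyone in this thread), the order-selection policy quoted above can be sketched as a small userspace C model. Everything here is hypothetical naming for illustration: `struct fault_ctx`, `enum thp_hint`, `preferred_order()` and `choose_order()` are invented, `arch_order` merely stands in for the return value of arch_wants_pte_order(), and `max_fit_order` collapses the "would breach the vma / ptes already mapped" checks into a single number. Only PAGE_ALLOC_COSTLY_ORDER (3 in mainline) is taken from the kernel.

```c
#include <assert.h>

/* Same value the kernel uses in include/linux/mmzone.h. */
#define PAGE_ALLOC_COSTLY_ORDER 3

/* Hypothetical summary of how the vma/process/system feels about THP. */
enum thp_hint {
	THP_DISABLED,	/* e.g. MADV_NOHUGEPAGE or THP disabled system-wide */
	THP_UNHINTED,	/* neither opted in nor opted out */
	THP_HINTED	/* e.g. MADV_HUGEPAGE */
};

/* Hypothetical bundle of the inputs the description talks about. */
struct fault_ctx {
	enum thp_hint hint;
	int arch_order;		/* models arch_wants_pte_order() */
	int unhinted_max_order;	/* models flexthp_unhinted_max, as an order */
	int max_fit_order;	/* largest order usable at this address (assumed) */
};

/* Preferred order before any fallback is considered. */
static int preferred_order(const struct fault_ctx *c)
{
	int order;

	assert(c->arch_order >= 0 && c->unhinted_max_order >= 0);

	/* Explicitly disabled THP: keep allocating order-0. */
	if (c->hint == THP_DISABLED)
		return 0;

	order = c->arch_order;

	/* Unhinted vmas are capped by the tunable. */
	if (c->hint == THP_UNHINTED && order > c->unhinted_max_order)
		order = c->unhinted_max_order;

	return order;
}

/* Preferred order, else PAGE_ALLOC_COSTLY_ORDER, else order-0. */
static int choose_order(const struct fault_ctx *c)
{
	int order = preferred_order(c);

	if (order <= c->max_fit_order)
		return order;

	if (PAGE_ALLOC_COSTLY_ORDER < order &&
	    PAGE_ALLOC_COSTLY_ORDER <= c->max_fit_order)
		return PAGE_ALLOC_COSTLY_ORDER;

	return 0;
}
```

In this model, an unhinted vma with `arch_order = 9` and `unhinted_max_order = 4` gets order 4 when it fits, while a hinted vma that prefers order 4 but only has room for order 3 falls back to PAGE_ALLOC_COSTLY_ORDER. The real patch makes these decisions inside the anonymous fault path in mm/memory.c; this sketch only mirrors the prose.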
>>>>>>
>>>>>> Signed-off-by: Ryan Roberts
>>>>>> ---
>>>>>>    .../admin-guide/kernel-parameters.txt         |  10 +
>>>>>>    mm/Kconfig                                    |  10 +
>>>>>>    mm/memory.c                                   | 187 ++++++++++++++++--
>>>>>>    3 files changed, 190 insertions(+), 17 deletions(-)
>>>>>>
>>>>>> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
>>>>>> index a1457995fd41..405d624e2191 100644
>>>>>> --- a/Documentation/admin-guide/kernel-parameters.txt
>>>>>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>>>>>> @@ -1497,6 +1497,16 @@
>>>>>>                          See Documentation/admin-guide/sysctl/net.rst for
>>>>>>                          fb_tunnels_only_for_init_ns
>>>>>>
>>>>>> +       flexthp_unhinted_max=
>>>>>> +                       [KNL] Requires CONFIG_FLEXIBLE_THP enabled. The maximum
>>>>>> +                       folio size that will be allocated for an anonymous vma
>>>>>> +                       that has neither explicitly opted in nor out of using
>>>>>> +                       transparent hugepages. The size must be a power-of-2 in
>>>>>> +                       the range [PAGE_SIZE, PMD_SIZE). A larger size improves
>>>>>> +                       performance by reducing page faults, while a smaller
>>>>>> +                       size reduces internal fragmentation. Default: max(64K,
>>>>>> +                       PAGE_SIZE). Format: size[KMG].
>>>>>> +
>>>>>
>>>>> Let's split this parameter into a separate patch.
>>>>>
>>>>
>>>> Just a general comment after stumbling over patch #2, let's not start splitting
>>>> patches into things that don't make any sense on their own; that just makes
>>>> review a lot harder.
>>>
>>> ACK
>>>
>>>>
>>>> For this case here, I'd suggest first adding the general infrastructure and then
>>>> adding tunables we want to have on top.
>>>
>>> OK, so 1 patch for the main infrastructure, then a patch to disable for
>>> MADV_NOHUGEPAGE and friends, then a further patch to set flexthp_unhinted_max
>>> via a sysctl?
>>
>> MADV_NOHUGEPAGE handling for me falls under the category "required for
>> correctness to not break existing workloads" and has to be there initially.
>>
>> Anything that is rather a performance tunable (e.g., a sysctl to optimize) can
>> be added on top and discussed separately.
>>
>> At least IMHO :)
>>
>>>
>>>>
>>>> I agree that toggling that at runtime (for example via sysfs as raised by me
>>>> previously) would be nicer.
>>>
>>> OK, I clearly misunderstood, I thought you were requesting a boot parameter.
>>
>> Oh, sorry about that. I wanted to actually express
>> "/sys/kernel/mm/transparent_hugepage/" sysctls where we can toggle that later at
>> runtime as well.
>>
>>> What's the ABI compat guarantee for sysctls? I assumed that for a boot
>>> parameter it would be easier to remove in future if we wanted, but for sysctl,
>>> it's there forever?
>>
>> sysctl are hard/impossible to remove, yes. So we better make sure what we add
>> has clear semantics.
>>
>> If we ever want some real auto-tunable mode (and can actually implement it
>> without harming performance; and I am skeptical), we might want to allow for
>> setting such a parameter to "auto", for example.
>>
>>>
>>> Also, how do you feel about the naming and behavior of the parameter?
>>
>> Very good question. "flexthp_unhinted_max" naming is a bit suboptimal.
>>
>> For example, I'm not so sure if we should expose the feature to user space as
>> "flexthp" at all. I think we should find a clearer feature name to begin with.
>>
>> ... maybe we can initially get away with dropping that parameter and default to
>> something reasonably small (i.e., 64k as you have above)?
>
> That would certainly get my vote. But it was you who was arguing for a tunable
> previously ;-). I propose we use the following as the "unhinted ceiling" for

Yes, I still think having tunables makes sense. But it's certainly something to
add separately, especially if it makes your work here easier.

As long as it can be disabled, good enough for me for the initial version.

--
Cheers,

David / dhildenb
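
For readers following the tunable discussion, the constraints documented for `flexthp_unhinted_max` in the quoted kernel-parameters.txt hunk (power-of-2, range [PAGE_SIZE, PMD_SIZE), default max(64K, PAGE_SIZE), format size[KMG]) can be sketched in userspace C. This is only an illustration under stated assumptions, not the patch's code: the `SKETCH_*` constants assume 4K base pages and a 2M PMD, all function names are invented, the kernel would normally parse such sizes with memparse() rather than the hand-rolled parser here, and falling back to the default on invalid input is a guess at the behaviour, not something the thread states.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

/* Assumptions for illustration only: 4K base pages, 2M PMD. */
#define SKETCH_PAGE_SIZE 4096ULL
#define SKETCH_PMD_SIZE  (2ULL << 20)

/* Documented default: max(64K, PAGE_SIZE). */
static unsigned long long default_unhinted_max(void)
{
	unsigned long long k64 = 64ULL << 10;

	return k64 > SKETCH_PAGE_SIZE ? k64 : SKETCH_PAGE_SIZE;
}

/* Documented constraint: power-of-2 in [PAGE_SIZE, PMD_SIZE). */
static bool unhinted_max_is_valid(unsigned long long size)
{
	bool pow2 = size != 0 && (size & (size - 1)) == 0;

	return pow2 && size >= SKETCH_PAGE_SIZE && size < SKETCH_PMD_SIZE;
}

/* Parse "size[KMG]"; returns 0 for malformed or overflowing input. */
static unsigned long long parse_size(const char *s)
{
	char *end;
	unsigned long long v = strtoull(s, &end, 10);

	if (end == s || v > (ULLONG_MAX >> 30))
		return 0;

	switch (*end) {
	case 'G': case 'g':
		v <<= 10;
		/* fall through */
	case 'M': case 'm':
		v <<= 10;
		/* fall through */
	case 'K': case 'k':
		v <<= 10;
		end++;
		break;
	default:
		break;
	}

	return *end == '\0' ? v : 0;
}

/* Hypothetical policy: invalid or missing values fall back to the default. */
static unsigned long long unhinted_max_from_cmdline(const char *arg)
{
	unsigned long long def = default_unhinted_max();
	unsigned long long size = arg ? parse_size(arg) : 0;

	/* The default itself must satisfy the documented constraint. */
	assert(unhinted_max_is_valid(def));

	return unhinted_max_is_valid(size) ? size : def;
}
```

Under these assumptions, "16K" and "1M" are accepted as-is, while "48K" (not a power of 2) and "2M" (equal to PMD_SIZE, which the half-open range excludes) are rejected in favour of the 64K default.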