From: Usama Arif
Date: Mon, 9 Jun 2025 12:34:25 +0100
Subject: Re: [RFC] mm: khugepaged: use largest enabled hugepage order for min_free_kbytes
To: David Hildenbrand, Andrew Morton, linux-mm@kvack.org
Cc: hannes@cmpxchg.org, shakeel.butt@linux.dev, riel@surriel.com, ziy@nvidia.com, baolin.wang@linux.alibaba.com, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com, dev.jain@arm.com, hughd@google.com, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, kernel-team@meta.com, Matthew Wilcox
Message-ID: <4adf1f8b-781d-4ab0-b82e-49795ad712cb@gmail.com>
In-Reply-To: <4c1d5033-0c90-4672-84a1-15978ced245d@redhat.com>
References: <20250606143700.3256414-1-usamaarif642@gmail.com> <4c1d5033-0c90-4672-84a1-15978ced245d@redhat.com>

On 06/06/2025 18:37, David Hildenbrand wrote:
> On 06.06.25 16:37, Usama Arif wrote:
>> On arm64 machines with 64K PAGE_SIZE, the min_free_kbytes and hence the
>> watermarks are evaluated to extremely high values. For example, for
>> a server with 480G of memory, only 2M mTHP hugepage size set to madvise, with the rest
>> of the sizes set to never, the min, low and high watermarks evaluate to
>> 11.2G, 14G and 16.8G respectively.
>> In contrast for 4K PAGE_SIZE of the same machine, with only 2M THP hugepage
>> size set to madvise, the min, low and high watermarks evaluate to 86M, 566M
>> and 1G respectively.
>> This is because set_recommended_min_free_kbytes is designed for PMD
>> hugepages (pageblock_order = min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)).
>> Such high watermark values can cause performance and latency issues in
>> memory bound applications on arm servers that use 64K PAGE_SIZE, even though
>> most of them would never actually use a 512M PMD THP.
>>
>> Instead of using HPAGE_PMD_ORDER for pageblock_order use the highest large
>> folio order enabled in set_recommended_min_free_kbytes.
>> With this patch, when only 2M THP hugepage size is set to madvise for the
>> same machine with 64K page size, with the rest of the sizes set to never,
>> the min, low and high watermarks evaluate to 2.08G, 2.6G and 3.1G
>> respectively. When 512M THP hugepage size is set to madvise for the same
>> machine with 64K page size, the min, low and high watermarks evaluate to
>> 11.2G, 14G and 16.8G respectively, the same as without this patch.
>>
>> An alternative solution would be to change PAGE_BLOCK_ORDER by changing
>> ARCH_FORCE_MAX_ORDER to a lower value for ARM64_64K_PAGES. However, this
>> is not dynamic with hugepage size, will need different kernel builds for
>> different hugepage sizes and most users won't know that this needs to be
>> done as it can be difficult to determine that the performance and latency
>> issues are coming from the high watermark values.
>>
>> All watermark numbers are for zones of nodes that had the highest number
>> of pages, i.e.
>> the value for min size for 4K is obtained using:
>> cat /proc/zoneinfo | grep -i min | awk '{print $2}' | sort -n | tail -n 1 | awk '{print $1 * 4096 / 1024 / 1024}';
>> and for 64K using:
>> cat /proc/zoneinfo | grep -i min | awk '{print $2}' | sort -n | tail -n 1 | awk '{print $1 * 65536 / 1024 / 1024}';
>>
>> An arbitrary min of 128 pages is used when no hugepage sizes are
>> enabled.
>>
>> Signed-off-by: Usama Arif
>> ---
>>   include/linux/huge_mm.h | 25 +++++++++++++++++++++++++
>>   mm/khugepaged.c         | 32 ++++++++++++++++++++++++++++----
>>   mm/shmem.c              | 29 +++++------------------------
>>   3 files changed, 58 insertions(+), 28 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 2f190c90192d..fb4e51ef0acb 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -170,6 +170,25 @@ static inline void count_mthp_stat(int order, enum mthp_stat_item item)
>>   }
>>   #endif
>>   +/*
>> + * Definitions for "huge tmpfs": tmpfs mounted with the huge= option
>> + *
>> + * SHMEM_HUGE_NEVER:
>> + *    disables huge pages for the mount;
>> + * SHMEM_HUGE_ALWAYS:
>> + *    enables huge pages for the mount;
>> + * SHMEM_HUGE_WITHIN_SIZE:
>> + *    only allocate huge pages if the page will be fully within i_size,
>> + *    also respect madvise() hints;
>> + * SHMEM_HUGE_ADVISE:
>> + *    only allocate huge pages if requested with madvise();
>> + */
>> +
>> + #define SHMEM_HUGE_NEVER    0
>> + #define SHMEM_HUGE_ALWAYS    1
>> + #define SHMEM_HUGE_WITHIN_SIZE    2
>> + #define SHMEM_HUGE_ADVISE    3
>> +
>>   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>     extern unsigned long transparent_hugepage_flags;
>> @@ -177,6 +196,12 @@ extern unsigned long huge_anon_orders_always;
>>   extern unsigned long huge_anon_orders_madvise;
>>   extern unsigned long huge_anon_orders_inherit;
>>   +extern int shmem_huge __read_mostly;
>> +extern unsigned long huge_shmem_orders_always;
>> +extern unsigned long huge_shmem_orders_madvise;
>> +extern unsigned long huge_shmem_orders_inherit;
>> +extern unsigned long huge_shmem_orders_within_size;
>
> Do really all of these have to be exported?
>

Hi David,

Thanks for the review!

For the RFC, I just did it similarly to the anon ones when I got the build error
trying to use these, but yeah, a much better approach would be to just have a
function in shmem that returns the largest allowable shmem THP order.

>> +
>>   static inline bool hugepage_global_enabled(void)
>>   {
>>       return transparent_hugepage_flags &
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 15203ea7d007..e64cba74eb2a 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -2607,6 +2607,26 @@ static int khugepaged(void *none)
>>       return 0;
>>   }
>>   +static int thp_highest_allowable_order(void)
>
> Did you mean "largest" ?

Yes

>
>> +{
>> +    unsigned long orders = READ_ONCE(huge_anon_orders_always)
>> +                   | READ_ONCE(huge_anon_orders_madvise)
>> +                   | READ_ONCE(huge_shmem_orders_always)
>> +                   | READ_ONCE(huge_shmem_orders_madvise)
>> +                   | READ_ONCE(huge_shmem_orders_within_size);
>> +    if (hugepage_global_enabled())
>> +        orders |= READ_ONCE(huge_anon_orders_inherit);
>> +    if (shmem_huge != SHMEM_HUGE_NEVER)
>> +        orders |= READ_ONCE(huge_shmem_orders_inherit);
>> +
>> +    return orders == 0 ? 0 : fls(orders) - 1;
>> +}
>
> But how does this interact with large folios / THPs in the page cache?
>

Yes, this will be a problem.
From what I see, there doesn't seem to be a max order for pagecache, only
mapping_set_folio_min_order for the min. Does this mean that pagecache can
fault in 128M, 256M, 512M large folios? I think this could increase the OOM
rate significantly when ARM64 servers are used with filesystems that support
large folios. Should there be an upper limit for pagecache?
If so, it would either be a new sysfs entry (which I don't like :( ) or we could
try to reuse the existing entries with something like
thp_highest_allowable_order?

>> +
>> +static unsigned long min_thp_pageblock_nr_pages(void)
>
> Reading the function name, I have no idea what this function is supposed to do.
>
>

Yeah, sorry about that. I knew even before sending the RFC that this was a bad name :(

I think an issue is that pageblock_nr_pages is not really 1 << PAGE_BLOCK_ORDER but is
1 << min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER) when THP is enabled. I wanted to highlight
with the name that it will use the minimum of the largest enabled THP order and
PAGE_BLOCK_ORDER when calculating the number of pages.
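
To illustrate the intent, here is a minimal user-space sketch of that calculation. It is not the patch code: the sketch_* names are hypothetical, sketch_fls() is only a stand-in for the kernel's fls(), the enabled orders are passed in as a plain bitmask instead of being read from the sysfs-backed variables, and the 128-page floor mentioned in the commit message is not modeled.

```c
#include <assert.h>
#include <limits.h>

/*
 * User-space stand-in for the kernel's fls(): 1-based position of the
 * most significant set bit, or 0 when no bit is set.
 */
int sketch_fls(unsigned long mask)
{
	int pos = 0;

	while (mask) {
		pos++;
		mask >>= 1;
	}
	return pos;
}

/*
 * Hypothetical helper: largest order set in a bitmask of enabled orders
 * (bit N set means order N is enabled). Returns 0 when nothing is
 * enabled, mirroring "orders == 0 ? 0 : fls(orders) - 1" from the RFC.
 */
int sketch_largest_enabled_order(unsigned long enabled_orders)
{
	return enabled_orders == 0 ? 0 : sketch_fls(enabled_orders) - 1;
}

/*
 * Hypothetical helper: number of pages given by the smaller of the
 * largest enabled THP order and the pageblock order.
 */
unsigned long sketch_thp_pageblock_nr_pages(unsigned long enabled_orders,
					    int pageblock_order)
{
	int thp_order = sketch_largest_enabled_order(enabled_orders);
	int order = thp_order < pageblock_order ? thp_order : pageblock_order;

	/* Guard the shift below against an out-of-range order. */
	assert(order >= 0 && order < (int)(sizeof(unsigned long) * CHAR_BIT));
	return 1UL << order;
}
```

With 64K pages, 2M is order 5 (2M / 64K = 32 pages) and 512M is order 13. Assuming a pageblock order of 13, sketch_thp_pageblock_nr_pages(1UL << 5, 13) gives 32 pages, while enabling the 512M size as well gives 8192 pages, which is the difference the watermark numbers above come from.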