From mboxrd@z Thu Jan 1 00:00:00 1970
From: Johannes Weiner <hannes@cmpxchg.org>
To: Andrew Morton
Cc: Vlastimil Babka, Mel Gorman, Zi Yan, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: [PATCH 3/5] mm: page_alloc: defrag_mode
Date: Thu, 13 Mar 2025 17:05:34 -0400
Message-ID: <20250313210647.1314586-4-hannes@cmpxchg.org>
X-Mailer: git-send-email 2.48.1
In-Reply-To: <20250313210647.1314586-1-hannes@cmpxchg.org>
References: <20250313210647.1314586-1-hannes@cmpxchg.org>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

The page allocator groups requests by migratetype to stave off
fragmentation. However, in practice this is routinely defeated by the
fact that it gives up *before* invoking reclaim and compaction - which
may well produce suitable pages. As a result, fragmentation of
physical memory is a common ongoing process in many load scenarios.

Fragmentation degrades compaction's ability to produce huge pages.
Depending on the lifetime of the fragmenting allocations, those
effects can be long-lasting or even permanent, requiring drastic
measures like forced idle states or even reboots as the only reliable
ways to recover the address space for THP production.

In a kernel build test with supplemental THP pressure, the THP
allocation rate steadily declines over 15 runs:

  thp_fault_alloc
            61988
            56474
            57258
            50187
            52388
            55409
            52925
            47648
            43669
            40621
            36077
            41721
            36685
            34641
            33215

This is a hurdle in adopting THP in any environment where hosts are
shared between multiple overlapping workloads (e.g. cloud
environments) and rarely experience true idle periods. To make THP a
reliable and predictable optimization, there needs to be a stronger
guarantee to avoid such fragmentation.

Introduce defrag_mode. When enabled, reclaim/compaction is invoked to
its full extent *before* falling back. Specifically, ALLOC_NOFRAGMENT
is enforced on the allocator fastpath and the reclaiming slowpath.

For now, fallbacks are permitted to avert OOMs. There is a plan to add
defrag_mode=2 to prefer OOMs over fragmentation, but this requires
additional prep work in compaction and the reserve management to make
it ready for all possible allocation contexts.
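To make the ordering concrete, here is a rough userspace sketch of the
retry sequence defrag_mode enforces. It is an illustration only, not
kernel code: try_freelists(), reclaim_and_compact() and allocate() are
made-up stand-ins; only the defrag_mode and ALLOC_NOFRAGMENT names
mirror this patch.

	#include <stdbool.h>
	#include <stdio.h>

	#define ALLOC_NOFRAGMENT 0x1		/* stand-in for the real flag */

	static bool defrag_mode = true;		/* i.e. vm.defrag_mode = 1 */

	/* Toy allocator stages; names are made up for this sketch. */
	static bool try_freelists(int flags)
	{
		/* Pretend the native migratetype's blocks are exhausted,
		 * so only a fragmenting fallback can succeed. */
		return !(flags & ALLOC_NOFRAGMENT);
	}

	static bool reclaim_and_compact(void)
	{
		return false;	/* pretend reclaim/compaction found nothing */
	}

	static bool allocate(void)
	{
		int flags = defrag_mode ? ALLOC_NOFRAGMENT : 0;

		if (try_freelists(flags))
			return true;

		/* defrag_mode: hold ALLOC_NOFRAGMENT through reclaim/compaction */
		if (reclaim_and_compact() && try_freelists(flags))
			return true;

		/* ...and only drop it as a last resort, to avert OOM */
		if (flags & ALLOC_NOFRAGMENT) {
			flags &= ~ALLOC_NOFRAGMENT;
			if (try_freelists(flags)) {
				puts("fell back only after reclaim/compaction failed");
				return true;
			}
		}
		return false;	/* would proceed to OOM handling */
	}

	int main(void)
	{
		printf("allocation %s\n", allocate() ? "succeeded" : "failed");
		return 0;
	}

The point of the ordering is that the no-fragment constraint is only
relaxed once reclaim and compaction have had a full chance to satisfy
the request from blocks of the native type.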
The following test results are from a kernel build with periodic
bursts of THP allocations over 15 runs:

                                          vanilla   defrag_mode=1
@claimer[unmovable]:                          189             103
@claimer[movable]:                             92             103
@claimer[reclaimable]:                        207              61
@pollute[unmovable from movable]:              25               0
@pollute[unmovable from reclaimable]:          28               0
@pollute[movable from unmovable]:           38835               0
@pollute[movable from reclaimable]:        147136               0
@pollute[reclaimable from unmovable]:         178               0
@pollute[reclaimable from movable]:            33               0
@steal[unmovable from movable]:                11               0
@steal[unmovable from reclaimable]:             5               0
@steal[reclaimable from unmovable]:           107               0
@steal[reclaimable from movable]:              90               0
@steal[movable from reclaimable]:             354               0
@steal[movable from unmovable]:               130               0

Both types of polluting fallbacks are eliminated in this workload.
Interestingly, whole-block conversions are reduced as well. This is
because once a block is claimed for a type, its empty space remains
available for future allocations, instead of being padded with
fallbacks; this allows the native type to group up instead of
spreading out to new blocks.

The assumption in the allocator has been that pollution from movable
allocations is less harmful than from other types, since they can be
reclaimed or migrated out should the space be needed. However, since
fallbacks occur *before* reclaim/compaction is invoked, movable
pollution will still cause non-movable allocations to spread out and
claim more blocks.

Without fragmentation, THP rates hold steady with defrag_mode=1:

  thp_fault_alloc
            32478
            20725
            45045
            32130
            14018
            21711
            40791
            29134
            34458
            45381
            28305
            17265
            22584
            28454
            30850

While the downward trend is eliminated, the keen reader will of
course notice that the baseline rate is much smaller than the vanilla
kernel's to begin with. This is due to deficiencies in how reclaim
and compaction are currently driven: ALLOC_NOFRAGMENT increases the
extent to which smaller allocations compete with THPs for pageblocks,
while making no effort of their own to reclaim or compact beyond
their own request size. This effect already exists with the current
usage of ALLOC_NOFRAGMENT, but is amplified by defrag_mode insisting
on whole-block stealing much more strongly.

Subsequent patches will address the defrag_mode reclaim strategy to
raise the THP success baseline above the vanilla kernel.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 Documentation/admin-guide/sysctl/vm.rst |  9 +++++++++
 mm/page_alloc.c                         | 27 +++++++++++++++++++++++--
 2 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index ec6343ee4248..e169dbf48180 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -29,6 +29,7 @@ files can be found in mm/swap.c.
 - compaction_proactiveness
 - compaction_proactiveness_leeway
 - compact_unevictable_allowed
+- defrag_mode
 - dirty_background_bytes
 - dirty_background_ratio
 - dirty_bytes
@@ -162,6 +163,14 @@ On CONFIG_PREEMPT_RT the default value is 0 in order to avoid a page fault, due
 to compaction, which would block the task from becoming active until the fault
 is resolved.
 
+defrag_mode
+===========
+
+When set to 1, the page allocator tries harder to avoid fragmentation
+and maintain the ability to produce huge pages / higher-order pages.
+
+It is recommended to enable this right after boot, as fragmentation,
+once it has occurred, can be long-lasting or even permanent.
 dirty_background_bytes
 ======================
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6f0404941886..9a02772c2461 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -273,6 +273,7 @@ int min_free_kbytes = 1024;
 int user_min_free_kbytes = -1;
 static int watermark_boost_factor __read_mostly = 15000;
 static int watermark_scale_factor = 10;
+static int defrag_mode;
 
 /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
 int movable_zone;
@@ -3389,6 +3390,11 @@ alloc_flags_nofragment(struct zone *zone, gfp_t gfp_mask)
	 */
	alloc_flags = (__force int) (gfp_mask & __GFP_KSWAPD_RECLAIM);
 
+	if (defrag_mode) {
+		alloc_flags |= ALLOC_NOFRAGMENT;
+		return alloc_flags;
+	}
+
 #ifdef CONFIG_ZONE_DMA32
	if (!zone)
		return alloc_flags;
@@ -3480,7 +3486,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
				continue;
		}
 
-		if (no_fallback && nr_online_nodes > 1 &&
+		if (no_fallback && !defrag_mode && nr_online_nodes > 1 &&
		    zone != zonelist_zone(ac->preferred_zoneref)) {
			int local_nid;
 
@@ -3591,7 +3597,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
	 * It's possible on a UMA machine to get through all zones that are
	 * fragmented. If avoiding fragmentation, reset and try again.
	 */
-	if (no_fallback) {
+	if (no_fallback && !defrag_mode) {
		alloc_flags &= ~ALLOC_NOFRAGMENT;
		goto retry;
	}
@@ -4128,6 +4134,9 @@ gfp_to_alloc_flags(gfp_t gfp_mask, unsigned int order)
 
	alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, alloc_flags);
 
+	if (defrag_mode)
+		alloc_flags |= ALLOC_NOFRAGMENT;
+
	return alloc_flags;
 }
 
@@ -4510,6 +4519,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
				&compaction_retries))
		goto retry;
 
+	/* Reclaim/compaction failed to prevent the fallback */
+	if (defrag_mode && (alloc_flags & ALLOC_NOFRAGMENT)) {
+		alloc_flags &= ~ALLOC_NOFRAGMENT;
+		goto retry;
+	}
	/*
	 * Deal with possible cpuset update races or zonelist updates to avoid
	 * a unnecessary OOM kill.
@@ -6286,6 +6300,15 @@ static const struct ctl_table page_alloc_sysctl_table[] = {
		.extra1 = SYSCTL_ONE,
		.extra2 = SYSCTL_THREE_THOUSAND,
	},
+	{
+		.procname	= "defrag_mode",
+		.data		= &defrag_mode,
+		.maxlen		= sizeof(defrag_mode),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
	{
		.procname	= "percpu_pagelist_high_fraction",
		.data		= &percpu_pagelist_high_fraction,
-- 
2.48.1
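Usage addendum (illustrative, not part of the patch): assuming the
ctl_table entry above is registered in the existing "vm" sysctl table,
so the knob appears as /proc/sys/vm/defrag_mode, an early-boot helper
along the following lines could enable it before fragmentation sets
in. The program is a hypothetical example, not an existing tool.

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	/* Hypothetical helper: write "1" to the (assumed) sysctl path
	 * /proc/sys/vm/defrag_mode as early as possible after boot. */
	int main(void)
	{
		int fd = open("/proc/sys/vm/defrag_mode", O_WRONLY);

		if (fd < 0) {
			perror("open /proc/sys/vm/defrag_mode");
			return 1;
		}
		if (write(fd, "1\n", 2) != 2) {
			perror("write");
			close(fd);
			return 1;
		}
		close(fd);
		return 0;
	}

Equivalently, a "vm.defrag_mode = 1" entry in the sysctl boot
configuration achieves the same effect, in line with the
documentation's recommendation to enable the mode right after boot.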