From: Johannes Weiner
To: Andrew Morton
Cc: Vlastimil Babka, Mel Gorman, Zi Yan, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [PATCH 0/5] mm: reliable huge page allocator
Date: Thu, 13 Mar 2025 17:05:31 -0400
Message-ID: <20250313210647.1314586-1-hannes@cmpxchg.org>

This series makes changes to the allocator and reclaim/compaction code
to try harder to avoid fragmentation. As a result, this makes huge
page allocations cheaper, more reliable and more sustainable.

It's a subset of the huge page allocator RFC initially proposed here:

  https://lore.kernel.org/lkml/20230418191313.268131-1-hannes@cmpxchg.org/

The following results are from a kernel build test, with additional
concurrent bursts of THP allocations on a memory-constrained system.
Comparing before and after the changes over 15 runs:

                                    before                   after
Hugealloc Time mean               52739.45 (   +0.00%)    28904.00 (  -45.19%)
Hugealloc Time stddev             56541.26 (   +0.00%)    33464.37 (  -40.81%)
Kbuild Real time                    197.47 (   +0.00%)      196.59 (   -0.44%)
Kbuild User time                   1240.49 (   +0.00%)     1231.67 (   -0.71%)
Kbuild System time                   70.08 (   +0.00%)       59.10 (  -15.45%)
THP fault alloc                   46727.07 (   +0.00%)    63223.67 (  +35.30%)
THP fault fallback                21910.60 (   +0.00%)     5412.47 (  -75.29%)
Direct compact fail                 195.80 (   +0.00%)       59.07 (  -69.48%)
Direct compact success                7.93 (   +0.00%)        2.80 (  -57.46%)
Direct compact success rate %         3.51 (   +0.00%)        3.99 (  +10.49%)
Compact daemon scanned migrate  3369601.27 (   +0.00%)  2267500.33 (  -32.71%)
Compact daemon scanned free     5075474.47 (   +0.00%)  2339773.00 (  -53.90%)
Compact direct scanned migrate   161787.27 (   +0.00%)    47659.93 (  -70.54%)
Compact direct scanned free      163467.53 (   +0.00%)    40729.67 (  -75.08%)
Compact total migrate scanned   3531388.53 (   +0.00%)  2315160.27 (  -34.44%)
Compact total free scanned      5238942.00 (   +0.00%)  2380502.67 (  -54.56%)
Alloc stall                        2371.07 (   +0.00%)      638.87 (  -73.02%)
Pages kswapd scanned            2160926.73 (   +0.00%)  4002186.33 (  +85.21%)
Pages kswapd reclaimed           533191.07 (   +0.00%)   718577.80 (  +34.77%)
Pages direct scanned             400450.33 (   +0.00%)   355172.73 (  -11.31%)
Pages direct reclaimed            94441.73 (   +0.00%)    31162.80 (  -67.00%)
Pages total scanned             2561377.07 (   +0.00%)  4357359.07 (  +70.12%)
Pages total reclaimed            627632.80 (   +0.00%)   749740.60 (  +19.46%)
Swap out                          47959.53 (   +0.00%)   110084.33 ( +129.53%)
Swap in                            7276.00 (   +0.00%)    24457.00 ( +236.10%)
File refaults                    138043.00 (   +0.00%)   188226.93 (  +36.35%)

THP latencies are cut in half, and failure rates are cut by 75%.
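
As a rough cross-check of the fallback numbers, the "THP fault alloc"
and "THP fault fallback" rows can be combined into a share of THP
faults that were actually served with a huge page. The sketch below is
illustrative only: the function name is made up, and it assumes that
alloc + fallback approximates all THP fault attempts. It also divides
the 15-run means, which is not the same as averaging per-run ratios.

```python
# Illustrative only: derive the share of THP faults that got a huge page
# from the "THP fault alloc" and "THP fault fallback" rows of the table.
# Assumption: alloc + fallback approximates all THP fault attempts.


def thp_fault_success_share(alloc: float, fallback: float) -> float:
    """Return the fraction of THP fault attempts served with a huge page."""
    attempts = alloc + fallback
    if attempts == 0:
        raise ValueError("no THP fault attempts recorded")
    return alloc / attempts


# Mean counts over the 15 runs, copied from the table.
before = thp_fault_success_share(46727.07, 21910.60)
after = thp_fault_success_share(63223.67, 5412.47)

print(f"before: {before:.1%}")  # before: 68.1%
print(f"after:  {after:.1%}")   # after:  92.1%
```

Read this way, the share of THP faults that fall back to small pages
drops from roughly a third to under a tenth.
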
These metrics also hold up over time, while the vanilla kernel sees a
steady downward trend in success rates with each subsequent run, owing
to the cumulative effects of fragmentation.

A more detailed discussion of results is in the patch changelogs.

The patches first introduce a vm.defrag_mode sysctl, which enforces
the existing ALLOC_NOFRAGMENT alloc flag until after reclaim and
compaction have run. They then change kswapd and kcompactd to target
pageblocks, which boosts success in the ALLOC_NOFRAGMENT hotpaths.

Main differences to the RFC:

- The freelist hygiene patches have since been upstreamed separately.

- The RFC version would prohibit fallbacks entirely, and make
  pageblock reclaim and compaction mandatory for all allocation
  contexts. This opens up a large dependency graph for compaction,
  possibly remaining sources of pollution, and the handling of
  low-memory situations, OOMs and deadlocks.

  This version uses only kswapd & kcompactd to pre-produce pageblocks,
  while still allowing last-ditch fallbacks to avoid memory deadlocks.

  The long-term goal remains converging on the version proposed in the
  RFC and its ~100% THP success rate. But this is reserved for future
  iterations that can build on the changes proposed here.

- The RFC version proposed a new MIGRATE_FREE type as well as
  per-migratetype counters. This allowed making compaction more
  efficient, and the pre-compaction gap checks more precise, but again
  at the cost of complex changes in an already invasive series.

  This series simply uses a new vmstat counter to track the number of
  free pages in whole blocks to base reclaim/compaction goals on.

- The behavior is opt-in and can be toggled at runtime. The risk for
  regressions with any allocator change is sizable, and while many
  users care about huge pages, obviously not all do. A runtime knob is
  warranted to make the behavior optional and provide an escape hatch.

Based on today's akpm/mm-unstable.

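
To picture what "free pages in whole blocks" measures, here is a small
model, not the kernel implementation. It takes per-order free chunk
counts in the layout /proc/buddyinfo reports and sums only the chunks
that are at least one pageblock in size. The function names and the
pageblock order of 9 (2 MiB blocks with 4 KiB base pages) are
assumptions for illustration; the patch changelogs are authoritative
for how the counter is actually maintained.

```python
# Illustrative model, not the kernel implementation.
# free_counts[order] is the number of free chunks of that order in a zone,
# in the layout /proc/buddyinfo reports (order 0 first).
from typing import Sequence

# Assumption: 2 MiB pageblocks with 4 KiB base pages.
PAGEBLOCK_ORDER = 9


def free_pages_total(free_counts: Sequence[int]) -> int:
    """All free pages in the zone, regardless of chunk size."""
    return sum(count << order for order, count in enumerate(free_counts))


def free_pages_in_whole_blocks(
    free_counts: Sequence[int], pageblock_order: int = PAGEBLOCK_ORDER
) -> int:
    """Free pages held in chunks of at least one whole pageblock."""
    return sum(
        count << order
        for order, count in enumerate(free_counts)
        if order >= pageblock_order
    )


# Two hypothetical zones with the same amount of free memory.
intact = [10, 5, 0, 0, 0, 0, 0, 0, 0, 3, 2]
fragmented = [3604, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

for name, zone in (("intact", intact), ("fragmented", fragmented)):
    print(name, free_pages_total(zone), free_pages_in_whole_blocks(zone))
# intact 3604 3584
# fragmented 3604 0
```

Both zones report the same total of free pages, but only the first
could serve a huge page allocation without compaction. That difference
is what a total free page count alone cannot express.
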
Patches #1 and #2 are somewhat unrelated cleanups, but touch the same
code and so are included here to avoid conflicts from re-ordering.

 Documentation/admin-guide/sysctl/vm.rst |  9 ++++
 include/linux/compaction.h              |  5 +-
 include/linux/mmzone.h                  |  1 +
 mm/compaction.c                         | 87 ++++++++++++++++++++-----------
 mm/internal.h                           |  1 +
 mm/page_alloc.c                         | 72 +++++++++++++++++++++----
 mm/vmscan.c                             | 41 ++++++++++-----
 mm/vmstat.c                             |  1 +
 8 files changed, 161 insertions(+), 56 deletions(-)