From: Zi Yan
To: Johannes Weiner
Cc: linux-mm@kvack.org, Vlastimil Babka, David Hildenbrand,
 Lorenzo Stoakes, "Liam R. Howlett", Rik van Riel,
 linux-kernel@vger.kernel.org
Subject: Re: [RFC 0/2] mm: page_alloc: pcp buddy allocator
Date: Mon, 06 Apr 2026 22:42:50 -0400
References: <20260403194526.477775-1-hannes@cmpxchg.org>
 <1C961B84-522F-43AB-ADCB-014B3A4ACD21@nvidia.com>

On 6 Apr 2026, at 11:24, Johannes Weiner wrote:

> On Fri, Apr 03, 2026 at 10:27:36PM -0400, Zi Yan wrote:
>> On 3 Apr 2026, at 15:40, Johannes Weiner wrote:
>>> this is an RFC for making the page allocator scale better with higher
>>> thread counts and larger memory quantities.
>>>
>>> In Meta production, we're seeing increasing zone->lock contention that
>>> was traced back to a few different paths. A prominent one is the
>>> userspace allocator, jemalloc.
>>> Allocations happen from page faults on
>>> all CPUs running the workload. Frees are cached for reuse, but the
>>> caches are periodically purged back to the kernel from a handful of
>>> purger threads. This breaks affinity between allocations and frees:
>>> Both sides use their own PCPs - one side depletes them, the other one
>>> overfills them. Both sides routinely hit the zone->lock slowpath.
>>>
>>> My understanding is that tcmalloc has a similar architecture.
>>>
>>> Another contributor to contention is process exits, where large
>>> numbers of pages are freed at once. The current PCP can only reduce
>>> lock time when pages are reused. Reuse is unlikely because it's an
>>> avalanche of free pages on a CPU busy walking page tables. Every time
>>> the PCP overflows, the drain acquires the zone->lock and frees pages
>>> one by one, trying to merge buddies together.
>>
>> IIUC, zone->lock held time is mostly spent on free page merging.
>> Have you tried to let PCP do the free page merging before holding
>> zone->lock and returning free pages to buddy? That is a much smaller
>> change than what you proposed. This method might not work if
>> physically contiguous free pages are allocated by separate CPUs,
>> so that PCP merging cannot be done. But this might be rare?
>
> On my 32G system, pcp->high_min for zone Normal is 988. That's one
> block and a half. The rmqueue_smallest policy means the next CPU will
> prefer the remainder of that partial block. So if there is
> concurrency, every other block is shared. Not exactly uncommon. The
> effect lessens the larger the machine is, of course.
>
> But let's assume it's not an issue. How do you know you can safely
> merge with a buddy pfn? You need to establish that it's on that same
> PCP's list. Short of *scanning* the list, it seems something like
> PagePCPBuddy() and page->pcp_cpu is inevitably needed. But of course a
> per-page cpu field is tough to come by.
>
> So the block ownership is more natural, and then you might as well use
> that for affinity routing to increase the odds of merges.
>
> IOW, I'm having a hard time seeing what could be taken away and still
> have it work.

You are right. I was assuming that pages that can be merged are freed
by the same CPU. That rarely happens.

>
>>> The idea proposed here is this: instead of single pages, make the PCP
>>> grab entire pageblocks, split them outside the zone->lock. That CPU
>>> then takes ownership of the block, and all frees route back to that
>>> PCP instead of the freeing CPU's local one.
>>
>> This is basically a distributed buddy allocator, right? Instead of
>> relying on a single zone->lock, PCP locks are used. The worst case
>> it can face is that physically contiguous free pages are allocated
>> across all CPUs, so that all CPUs are competing for a single PCP lock.
>
> The worst case is one CPU allocating for everybody else in the system,
> so that all freers route to that PCP.
>
> I've played with microbenchmarks to provoke this, but it looks mostly
> neutral over baseline, at least at the scale of this machine.
>
> In this scenario, baseline will have the affinity mismatch problem:
> the allocating CPU routinely hits zone->lock to refill, and the
> freeing CPUs routinely hit zone->lock to drain and merge.
>
> In the new scheme, they would hit the pcp->lock instead of the
> zone->lock. So not necessarily an improvement in lock breaking. BUT
> because freers refill the allocator's cache, merging is deferred;
> that's a net reduction of work performed under the contended lock.

This makes sense to me.

>
>> It seems that you have not hit this. So I wonder if what I proposed
>> above might work as a simpler approach. Let me know if I miss anything.
>>
>> I wonder how this distributed buddy allocator would work if anyone
>> wants to allocate >pageblock free pages, like alloc_contig_range().
>> Multiple PCP locks need to be taken one by one. Maybe it is better
>> than taking and dropping zone->lock repeatedly. Have you benchmarked
>> alloc_contig_range(), like hugetlb allocation?
>
> I didn't change that aspect.
>
> The PCPs are still the same size, and PCP pages are still skipped by
> the isolation code.
>
> IOW it's not a purely distributed buddy allocator. It's still just a
> per-cpu cache of limited size. The only thing I'm doing is to provide a
> mechanism for splitting and pre-merging at the cache level, and
> setting up affinity/routing rules to increase the chances of
> success. But the impact on alloc_contig should be the same.

Got it. Thanks for the explanation.

>
>>> This has several benefits:
>>>
>>> 1. It's right away coarser/fewer allocation transactions under the
>>> zone->lock.
>>>
>>> 1a. Even if no full free blocks are available (memory pressure or
>>> small zone), splitting at the PCP level means the
>>> PCP can still grab chunks larger than the requested order from the
>>> zone->lock freelists, and dole them out on its own time.
>>>
>>> 2. The pages free back to where the allocations happen, increasing the
>>> odds of reuse and reducing the chances of zone->lock slowpaths.
>>>
>>> 3. The page buddies come back into one place, allowing upfront merging
>>> under the local pcp->lock. This makes coarser/fewer freeing
>>> transactions under the zone->lock.
>>
>> I wonder if we could go more radical by moving the buddy allocator out
>> of zone->lock completely to PCP locks. If one PCP runs out of free
>> pages, it can steal another PCP's whole pageblock. I probably should do
>> some literature investigation on this. Some research must have been
>> done on this.
>
> This is an interesting idea. Make the zone buddy a pure block economy
> and remove all buddy code from it. Slowpath allocs and frees would
> always be in whole blocks.
>
> You'd have to come up with a natural stealing order.
> If one CPU needs
> something it doesn't have, which CPUs, and in which order, do you look
> at for stealing?

One naive idea is to have the zone buddy allocator keep track of the
PCP free lists for stealing.

>
> I think you'd still have to route back frees to the nominal owner of
> the block, or stealing could scatter pages all over the place and we'd
> never be able to merge them back up.

Basically, we want to keep free pages mergeable as much as possible.
Something like free page compaction across all PCPs.

>
> I think you'd also need to pull accounting (NR_FREE_PAGES) to the
> per-cpu level, and inform compaction/isolation to deal with these
> pages, since the majority default is now distributed.
>
> But the scenario where one CPU needs what another one has is an
> interesting one. I didn't invent anything new for this for now, but
> rather rely on how we have been handling this through the zone
> freelists. But I do think it's a little silly: right now, if a CPU
> needs something another CPU might have, we ask EVERY CPU in the system
> to drain their cache into the shared pool - simultaneously - running
> the full buddy merge algorithm on everything that comes in. The CPU
> grabs a small handful of these pages, most likely having to split
> again. All other CPUs are now cache cold on the next request.

Yes, a better way might be that when a CPU wants something, it can ask
the other CPUs to drain only the minimal number of free pages. But I
do not have a good idea of how to do that yet.

It sounds to me like your current approach is a good first step toward
a distributed buddy allocator. I will check the code, think about it
more, and ask questions later.

Thank you for the explanation.

Best Regards,
Yan, Zi