Date: Thu, 18 Jan 2024 22:33:04 +0000
From: Matthew Wilcox <willy@infradead.org>
To: linux-mm@kvack.org
Cc: Johannes Weiner, Mel Gorman
Subject: The buddy allocator in a memdesc world
Lately I've been thinking about the next step after folios -- shrinking
struct page.  In broad strokes, the plan is to turn struct page into

	struct page {
		unsigned long memdesc;
	};

where the bottom 4 bits are a type discriminator [1] and the rest of the
word is usually a pointer to a memory descriptor of some type.  More
details at https://kernelnewbies.org/MatthewWilcox/Memdescs

[1] eg Buddy, File Folio, Anon Folio, Slab, Page Table, ...

Today the buddy allocator (not the PCP allocator; that is a subject for
a different thread) uses very little information in struct page.
It preserves page->flags for the node/zone/section, but does not use
that information itself.  It uses page->buddy_list to store prev/next
pointers to other pages of the same order, and it stores the order of
the page in page->private.

Naively, then, we could allocate a 32-byte memdesc for pages in the
buddy allocator.  But that turns out to be horrendously complicated,
because we need to allocate new memdescs in order to split buddy pages,
leading to a solvable but ugly loop between the slab & page allocators
trying to allocate memory in order to allocate memory.  I wrote it all
out and deleted it in disgust.

It is my strong preference to embed all the information that buddy
needs in the struct page.  That incentivises us to see what we can trim
out and what we can compress in order to shrink the buddy information
as much as possible.  Here are a range of options for that.

First, note that we don't need to preserve page->flags.  We can
reconstruct the node/zone/section when the page is allocated.  We just
need to store the next/prev/order.

Option 1

	struct buddy {
		unsigned long next;
		unsigned long prev;
	};

	struct page {
		union {
			unsigned long memdesc;
			struct buddy buddy;
		};
	};

This gives us a 16 byte struct page.  The 'next' field has its bottom
four bits used for the memdesc type for buddy pages.  The bottom four
bits of 'prev' can be used to store the order (assuming we don't need
to go past order-15 pages).  The remaining bits of 'next' and 'prev'
can be used to store pointers to struct page, since struct page is
aligned to 16 bytes.

But this doesn't work on 32-bit, because struct page would only be
eight bytes.  We could make it work by allocating two memdesc types to
the buddy allocator (eg types 1 and 9) so that it works with merely
8 byte alignment.  But I think we have better options ...

Option 2

The same data structure, but store the PFN of the next/prev pages
instead of the pointer to the struct page.  That gives us a lot more
bits to play with!
On 32-bit, we can use 28 bits to support up to 1TB of memory
(theoretically possible with ARM LPAE).  But we no longer have a limit
of order-15 pages: since we know that PFNs are naturally aligned, we
can use a single bit [2] in prev to record what order the page is.
And we have three bits left over!

On 64-bit, we have space for 60 bits of PFN, which works out to 4096
exabytes of memory (most 64-bit architectures can support PFNs up to
about 51 bits).

Option 3

We can compress option 2 on many 64-bit systems.  For example, my
laptop and my phone have less than 2TB of memory.  Instead of using a
pair of unsigned longs, we can encode next/prev/order into a single
8-byte integer:

	bits	meaning
	0-3	memdesc type buddy
	4	order encoding
	5-33	next
	34-62	prev
	63	unused

That is 29 bits for each PFN, letting us support up to 2TB systems
with an 8 byte memdesc.  Assuming there's a decent Kconfig option to
determine whether it's OK to decline to support memory above 2TB ...

It is tempting to see if we can shrink memdesc to 4 bytes on 32-bit
systems, but we would only get 13 bits each for prev & next, limiting
us to a 32MB system.  If it's worth the Kconfig option, it'll only be
a few extra lines of code to support it.

Option 4

Instead of using an absolute PFN, use a PFN relative to the base of
the zone.  That would mean we need one extra zone per 2TB of memory,
which would expand the number of zones a little.  But we could keep
memdesc at 8 bytes, even on the largest machines, which would save us
a Kconfig option for gargantuan machines.

(What is the practical limit on PFN today?  I see that Oracle Exadata
X10M can be configured with 3TB of DRAM.  Presumably anybody playing
with CXL has plans for more PFNs than that ...)

I'm keeping this document updated at
https://kernelnewbies.org/MatthewWilcox/BuddyAllocator so feedback is
gratefully appreciated (and my thanks to those who provided feedback
on earlier drafts).

[2] https://kernelnewbies.org/MatthewWilcox/NaturallyAlignedOrder