Date: Thu, 25 Oct 2007 19:41:00 -0400
From: Chris Snook
Subject: Toward a generic pooled buddy allocator
To: linux-mm@kvack.org

Hey folks --

In a brief moment of either clarity or insanity, I came up with a possible solution, or rather a framework for a solution, to several different memory management problems.

The current buddy allocator divides the system into nodes (if NUMA) and divides those nodes into zones. Zones may overlap, which causes headaches, and the scheme isn't as flexible as we'd sometimes like it to be, particularly when dealing with strange hardware or optimizing for unusual memory topologies.

I would like to treat nodes and zones as special cases of more generic physical memory pools. By giving physical memory pools various properties (node mask, cpu mask, permissions, priority, owner, etc.) we get more flexibility, and we also shrink the problem size of many specific memory management tasks on large systems. By keeping pools exclusive of each other, and allowing pages or groups of pages to be moved between them when necessary, we reduce the amount of locking needed for common-case operations.
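To make that a bit more concrete, here is a rough sketch of what a pool descriptor and its primitives might look like. None of this exists today -- every name, field, and signature below is purely illustrative, and a real implementation would need per-cpu caching, reclaim hooks, and integration with the existing zone code:

	#include <linux/mmzone.h>	/* struct free_area, MAX_ORDER */
	#include <linux/nodemask.h>
	#include <linux/cpumask.h>
	#include <linux/spinlock.h>
	#include <linux/list.h>

	/*
	 * Hypothetical descriptor for a generic physical memory pool.
	 * Each pool owns a disjoint set of pages and runs its own buddy
	 * allocator over them; today's nodes and zones would become
	 * predefined pools inside this framework.
	 */
	struct mem_pool {
		nodemask_t	nodes;		/* NUMA nodes backing the pool */
		cpumask_t	cpus;		/* CPUs allowed to allocate from it */
		u64		dma_mask;	/* addressing limit, for DMA pools */
		int		priority;	/* ordering for reclaim/overflow */
		uid_t		owner;		/* who may resize or destroy it */
		unsigned long	nr_pages;	/* pages currently in the pool */
		struct free_area free_area[MAX_ORDER];	/* per-order buddy lists */
		spinlock_t	lock;
		struct list_head siblings;	/* all pools, for iteration */
	};

	/* Imaginary primitives, by analogy with the existing page allocator: */
	struct mem_pool *mem_pool_create(nodemask_t nodes, unsigned long nr_pages);
	struct page *mem_pool_alloc_pages(struct mem_pool *pool, gfp_t gfp_mask,
					  unsigned int order);
	void mem_pool_free_pages(struct mem_pool *pool, struct page *page,
				 unsigned int order);
	/* Migrate whole buddies between pools, e.g. to refill a DMA pool: */
	int mem_pool_move(struct mem_pool *from, struct mem_pool *to,
			  unsigned int order, unsigned long count);

With something like that in place, a driver with odd DMA constraints could search the pool list at load time for a pool whose dma_mask and node mask fit, and create one if nothing matches -- which is roughly what points 1 and 2 below need.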
Several problems come to mind that this could help with:

1) special DMA rules

There are many devices that can DMA to 64-bit addresses, but only if the upper 32 (or 34, or 26, etc.) bits are the same at any given time. If, at module load time, a driver can look for an existing DMA pool that follows its rules, or create a new one if necessary, driver writers will have a lot more flexibility.

2) DMA NUMA locality

If I have a 4-node NUMA box with legacy I/O attached to node 0, a RAID controller attached to node 1, a network controller attached to node 2, and an FC HBA attached to node 3, I want each of them DMAing to the closest memory. There is currently no framework to ensure this.

3) preallocation

There are plenty of circumstances in which users want to set aside a pool of memory for one particular purpose, and to do it at boot time. At present we have vm.nr_hugepages, which serves only userspace and only a limited set of applications.

4) containers

This should be obvious.

5) NUMA page replication

Replicating pages across multiple nodes requires coordinated allocation on multiple nodes, something that is not at all straightforward with the current NUMA allocator. A similar strategy would also make NUMA page migration relatively easy, even without going through swap.

6) NUMA allocation policies

If a system is running a workload with multiple different numactl memory policies, the task of optimally allocating pages starts to look like a knapsack problem, and that is not a problem we want the kernel in the business of solving. These sorts of configurations, typical in HPC, are generally hand-tuned, so we should let the application or administrator shift allocations between these pools explicitly, using the knowledge they have of how the application behaves.

7) realtime

Currently, many realtime applications try to use hugetlbfs whenever possible to minimize VM overhead and variability. Unfortunately, hugetlbfs is not very convenient, and using the regular VM sucks for some realtime work. Configurable, resizable, prioritized realtime memory pools would solve this problem.

8) embedded

So far, most of what I've been talking about is scaling up, not down. If we do this right (and I think we can) we can eliminate the overhead of the scale-up code at compile time, or dynamically at boot time. Embedded developers like to manage their resources carefully, since those resources are so precious. If we put memory pool management in the kernel with some userspace hooks, embedded developers won't need to rely so heavily on customized kernel patches, libc re-implementations, etc. There are a lot of excellent performance engineers in the embedded world, and I'd really like them doing more work on the same piece of code that powers my desktop.

9) {anti,de}-fragmentation

By tracking the utilization of large memory chunks (say, MAX_ORDER blocks) we can tell which ones would be cheapest to reclaim to satisfy large physically contiguous allocations. Moreover, by segregating this tracking into multiple pools with different properties, we can avoid wasting cycles on unreclaimable memory, and delay wasting cycles on expensively-reclaimable memory until we absolutely need to.

10) paravirtualization

At present, paravirt implementations either use locked, dedicated RAM for guest memory, which is fast but not space-efficient, or make normal virtual allocations and let the host VM sort it out, which is space-efficient but can be very slow. A hybrid approach would allow the host to provide a guest with a chunk of guaranteed-fast memory for critical work, while still letting it use as much capacity as is available when other guests are idle. This would be particularly good if the guest were also equipped with generic physical memory pool support.

I recognize that what I'm suggesting may sound like a radical change, but it need not be. My intention is to add the framework, move the existing architecture inside it, and then gradually start using it to do new things that can't be done with the current allocator.

I would very much like feedback now, before I start experimenting with this, because my knowledge in this area of OS design is more academic than practical, and I will invariably do stupid things and reinvent many wheels if I work on this in a vacuum until I submit a thousand-line patch. In particular, comments on things I should *not* do would be most welcome.

-- Chris