Date: Tue, 1 May 2007 17:38:59 +0100 (IST)
From: Mel Gorman
Subject: Re: Antifrag patchset comments
In-Reply-To: <463723DE.9030507@yahoo.com.au>
Message-ID:
References: <463723DE.9030507@yahoo.com.au>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Sender: owner-linux-mm@kvack.org
Return-Path:
To: Nick Piggin
Cc: Christoph Lameter, Linux Memory Management List, Andrew Morton
List-ID:

On Tue, 1 May 2007, Nick Piggin wrote:

> Christoph Lameter wrote:
>> On Sat, 28 Apr 2007, Mel Gorman wrote:
>
>>>> 10. Radix tree as reclaimable? radix_tree_node_alloc()
>>>>
>>>> Ummm... Its reclaimable in a sense if all the pages are removed
>>>> but I'd say not in general.
>>>>
>>>
>>> I considered them to be indirectly reclaimable. Maybe it wasn't
>>> the best choice.
>>
>> Maybe we need to ask Nick about this one.
>
> I guess they are as reclaimable as the pagecache they hold is. Of
> course, they are yet another type of object that makes higher order
> reclaim inefficient, regardless of lumpy reclaim etc.
>

That can be said of the reclaimable slab caches as well, which is why
they are grouped together. Unlike page cache and buffer pages, the
pages involved cannot be freed without another subsystem being
involved.

> ... and also there are things besides pagecache that use radix
> trees....
>
> I guess you are faced with conflicting problems here. If you do not
> mark things like radix tree nodes and dcache as reclaimable, then
> your unreclaimable category gets expanded and fragmented more
> quickly.
>

This is understood. It's why the principal mobility types are
UNMOVABLE, RECLAIMABLE and MOVABLE instead of just the UNMOVABLE and
MOVABLE split that was suggested to me in the past.

> On the other hand, if you do mark them (not just radix-trees, but
> also bios, dcache, various other things) as reclaimable, then they
> make it more difficult to reclaim from the reclaimable memory,

This is why dcache and the various other things that are similarly
difficult are in RECLAIMABLE areas, not the MOVABLE area. This is
deliberate: they all get to be difficult together. Care is taken to
group pages appropriately so that only easily movable allocations are
in the MOVABLE area.

> and they also make the reclaimable memory less robust, because you
> could have pinned dentry, or some other radix tree user in there
> that cannot be reclaimed.
>

Which is why the success rate of hugepage allocation under heavy load
depends more on the number of MOVABLE blocks than RECLAIMABLE ones.

> I guess making radix tree nodes reclaimable is probably the best of
> the two options at this stage.
>

The ideal would be for some caches, including the radix tree nodes, to
become directly reclaimable over time; i.e. given a page that belongs
to an inode cache, it would be possible to reclaim all the objects
within that page and free it. If that were the case for all
reclaimable caches, the RECLAIMABLE portion of memory would become
much more useful. Right now, it depends on a certain amount of luck
that randomly freeing cache objects will free contiguous blocks in the
RECLAIMABLE area. There was a similar problem for the MOVABLE area
until lumpy reclaim targeted its reclaim. Similar targeting of slab
pages is desirable.
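To make the grouping concrete, here is a minimal user-space sketch of
the idea. All names (ALLOC_MOVABLE, flags_to_mobility and so on) are
hypothetical, not the patchset's interfaces: an allocation carries a
hint derived from how the memory will be used, and is satisfied from a
per-mobility pool of pageblocks so that unlike types do not share a
block unless a fallback is forced.

/*
 * Illustrative sketch only; the real patchset hooks this kind of
 * grouping into the buddy allocator rather than simple counters.
 */
#include <stdio.h>

enum mobility { UNMOVABLE, RECLAIMABLE, MOVABLE, NR_MOBILITY };

#define ALLOC_RECLAIMABLE 0x1   /* e.g. inode caches, radix tree nodes */
#define ALLOC_MOVABLE     0x2   /* e.g. user pages, page cache */

static enum mobility flags_to_mobility(unsigned int flags)
{
        if (flags & ALLOC_MOVABLE)
                return MOVABLE;
        if (flags & ALLOC_RECLAIMABLE)
                return RECLAIMABLE;
        return UNMOVABLE;       /* default: allocations that pin memory */
}

/* One pool of free pageblocks per mobility type. */
static int free_blocks[NR_MOBILITY] = { 10, 10, 10 };

static int alloc_block(unsigned int flags)
{
        enum mobility type = flags_to_mobility(flags);

        if (free_blocks[type] == 0)
                return -1;      /* fallback to other pools elided */
        return --free_blocks[type];
}

int main(void)
{
        alloc_block(ALLOC_MOVABLE);     /* comes from the MOVABLE pool */
        alloc_block(0);                 /* comes from the UNMOVABLE pool */
        printf("unmovable=%d reclaimable=%d movable=%d\n",
               free_blocks[UNMOVABLE], free_blocks[RECLAIMABLE],
               free_blocks[MOVABLE]);
        return 0;
}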
> But now that I'm asked, I repeat my dislike for the antifrag
> patches, because of the above -- ie. they're just a heuristic that
> slows down the fragmentation of memory rather than avoids it.
>
> I really oppose any code that _depends_ on higher order allocations.
> Even if only used for performance reasons, I think it is sad because
> a system that eventually gets fragmented will end up with worse
> performance over time, which is just lame.

Although performance could degrade if fragmentation avoidance were
ineffective, it seems wrong to forgo that performance improvement out
of a dislike for it. Any use of high-order pages for performance will
be vulnerable to fragmentation and would need to handle it. I would be
interested to see any proposed uses, both to review them and to see
how they interact with fragmentation avoidance, as test cases.

> For those systems that really want a big chunk of memory set aside
> (for hugepages or memory unplugging), I think reservations are
> reasonable because they work and are robust.

Reservations don't really work for memory unplugging at all. Hugepage
reservations have to be made at boot-time, which is a difficult
requirement to meet and impossible on batch-job and shared systems
where reboots do not take place.

> If we ever _really_ needed arbitrary contiguous physical memory for
> some reason, then I think virtual kernel mapping and true
> defragmentation would be the logical step.
>

Breaking the 1:1 phys:virt mapping incurs a performance hit that is
persistent. Minimally, things like page_to_pfn() are no longer a
simple calculation, which is a bad enough hit. Worse, the kernel can
no longer be backed by huge pages because you would have to defragment
at the base-page level. The kernel is backed by hugepage entries at
the moment for a good reason: TLB reach is a real problem.

Continuing on, "true defragmentation" would require that the system be
halted so that the copying can take place with everything disabled,
and that every process's pagetables be updated, as pagetables are not
always shared. Even if shared, all processes would still have to be
halted unless the kernel was fully pageable and we were willing to
handle page faults in the kernel outside of just the vmalloc area.
This is before even considering the problem of how the kernel copies
data between two virtual addresses while it is modifying the very page
tables it depends on to read that data. Even more horribly, virtual
addresses in the kernel would no longer be physically contiguous,
which will likely cause problems for drivers and possibly DMA engines.

The memory compaction mechanism I have in mind operates on MOVABLE
pages only, using the page migration mechanism, with the view to
keeping MOVABLE and RECLAIMABLE pages at opposite ends of the zone. It
doesn't bring the kernel to a halt as if it were a Java virtual
machine or a Lisp interpreter doing garbage collection.
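To illustrate the shape of that mechanism only (a toy model, not the
implementation): think of two scanners working towards each other, one
from the low end of the zone looking for movable pages to migrate and
one from the high end looking for free pages to migrate them into.

/*
 * Toy model of compaction via migration: 'M' is an in-use movable
 * page, '.' is free, 'U' is a pinned unmovable page. Movable pages
 * are "migrated" towards the high end so that free space coalesces
 * at the low end.
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
        char zone[] = "M.U.MM..M.U..M..";
        int mig = 0;                    /* migration scanner, low end */
        int fre = strlen(zone) - 1;     /* free scanner, high end */

        printf("before: %s\n", zone);
        while (mig < fre) {
                if (zone[mig] != 'M')
                        mig++;                  /* nothing to move here */
                else if (zone[fre] != '.')
                        fre--;                  /* no free target here */
                else {
                        zone[fre--] = 'M';      /* migrate the page */
                        zone[mig++] = '.';      /* source is now free */
                }
        }
        printf("after:  %s\n", zone);
        return 0;
}

Note how the pinned 'U' pages still break up the free space in the
output; that is exactly why unmovable allocations are grouped away
from the MOVABLE area in the first place.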
> AFAIK, nobody has tried to do this yet it seems like the
> (conceptually) simplest and most logical way to go if you absolutely
> need contig memory.
>

I believe there was some work at one point to break the 1:1 phys:virt
mapping that Dave Hansen was involved in. It was a non-trivial
breakage and, AFAIK, it made things pretty slow and lost the backing
of the kernel address space with large pages.

Much time has been spent making sure the fragmentation avoidance
patches did not kill performance. As the fragmentation avoidance stuff
improves TLB usage in the kernel portion of the address space, it
improves performance in some cases. That alone should be considered a
positive. Here are test figures from an x86_64 without min_free_kbytes
adjusted, comparing fragmentation avoidance on 2.6.21-rc6-mm1. Newer
figures are being generated but it takes a long time to go through it
all.

KernBench Comparison
--------------------
                    2.6.21-rc6-mm1-clean  2.6.21-rc6-mm1-list-based   %diff
User   CPU time                    85.55                      86.27  -0.84%
System CPU time                    35.85                      33.67   6.08%
Total  CPU time                    121.4                     119.94   1.20%

Complaints about kernbench as a valid benchmark aside, it is dependent
on the page allocator's performance. The figures show a 1.2% overall
improvement in total CPU time. The AIM9 results look like

              2.6.21-rc6-mm1-clean  2.6.21-rc6-mm1-list-based
1 creat-clo            154674.22   171921.35   17247.13  11.15% File Creations and Closes/second
2 page_test            184050.99   188470.25    4419.26   2.40% System Allocations & Pages/second
3 brk_test            1840486.50  2011331.44  170844.94   9.28% System Memory Allocations/second
6 exec_test               224.01      234.71      10.70   4.78% Program Loads/second
7 fork_test              3892.04     4325.22     433.18  11.13% Task Creations/second

More improvements here, although I'll admit aim9 can be unreliable on
some machines. The allocation of hugepages under load and at rest
looks like

HighAlloc Under Load Test Results
                       2.6.21-rc6-mm1-clean  2.6.21-rc6-mm1-list-based
Order                                     9                          9
Allocation type                     HighMem                    HighMem
Attempted allocations                   499                        499
Success allocs                           33                        361
Failed allocs                           466                        138
DMA32 zone allocs                        31                        359
DMA zone allocs                           2                          2
Normal zone allocs                        0                          0
HighMem zone allocs                       0                          0
EasyRclm zone allocs                      0                          0
% Success                                 6                         72

HighAlloc Test Results while Rested
                       2.6.21-rc6-mm1-clean  2.6.21-rc6-mm1-list-based
Order                                     9                          9
Allocation type                     HighMem                    HighMem
Attempted allocations                   499                        499
Success allocs                          154                        366
Failed allocs                           345                        133
DMA32 zone allocs                       152                        364
DMA zone allocs                           2                          2
Normal zone allocs                        0                          0
HighMem zone allocs                       0                          0
EasyRclm zone allocs                      0                          0
% Success                                30                         73

On machines with large TLBs that can fit the entire working set no
matter what, the worst performance regression we've seen is 0.2% in
total CPU time in kernbench, which is comparable to what you'd see
between kernel versions. I didn't spot anything out of the way in the
performance figures on test.kernel.org either since fragmentation
avoidance was merged.

> But firstly, I think we should fight against needing to do that
> step. I don't care what people say, we are in some position to
> influence hardware vendors, and it isn't the end of the world if we
> don't run

This is conflating the large page cache discussion with the
fragmentation avoidance patches. If fragmentation avoidance is merged
and the page cache wants to take advantage of it, it will need to:

a) deal with the lack of availability of contiguous pages if
   fragmentation avoidance is ineffective
b) be reviewed to see what its fragmentation behaviour looks like

Similar comments apply to SLUB if it uses order-1 or order-2
contiguous pages, although SLUB is different because it'll make most
reclaimable allocations the same order. Hence they'll also get freed
at the same order, so it suffers less from external fragmentation
problems due to less mixing of orders than one might initially suspect
(a toy illustration of this follows below).

Ideally, any subsystem using larger pages does better than a
"reasonable job". At minimum, any use of contiguous pages should
continue to work when they are not available and, at *worst*, its
performance should be comparable to base-page usage. Your assertion
seems to be that it's better to always run slowly than to run quickly
in some situations with the possibility of slowing down later. We have
seen some evidence that fragmentation avoidance gives more consistent
results when running kernbench over the lifetime of the system than
without it. Without it, there are slowdowns, probably due to reduced
TLB reach.
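As promised above, a toy illustration of why keeping allocations at a
single order sidesteps external fragmentation (this is not SLUB's
code; the 16-slot arena and the sizes are made up). With mixed sizes,
freeing the small objects leaves free slots that no 4-slot request can
use; with one size, any freed block fits any new request.

/*
 * Toy external fragmentation demo over a 16-slot arena, '.' = free.
 */
#include <stdio.h>
#include <string.h>

#define SLOTS 16

/* First-fit search for 'len' contiguous free slots; -1 if none. */
static int find_run(const char *arena, int len)
{
        int run = 0;

        for (int i = 0; i < SLOTS; i++) {
                run = (arena[i] == '.') ? run + 1 : 0;
                if (run == len)
                        return i - len + 1;
        }
        return -1;
}

int main(void)
{
        /* Mixed orders: 1-slot objects ('s') between 4-slot blocks.
         * Free all the 1-slot objects: 4 slots are free in total but
         * no 4-slot run exists, so a 4-slot request fails (-1). */
        char mixed[] = "sLLLLsLLLLsLLLLs";
        for (int i = 0; i < SLOTS; i++)
                if (mixed[i] == 's')
                        mixed[i] = '.';
        printf("mixed orders: %s -> 4-run at %d\n",
               mixed, find_run(mixed, 4));

        /* Same order throughout: free one whole 4-slot block and the
         * hole fits the next 4-slot request exactly (index 4). */
        char same[] = "AAAABBBBCCCCDDDD";
        memset(same + 4, '.', 4);
        printf("same order:   %s -> 4-run at %d\n",
               same, find_run(same, 4));
        return 0;
}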
> optimally on some hardware today. I say we try to avoid higher order
> allocations. It will be hard to ever remove this large amount of
> machinery once the code is in.
>
> So to answer Andrew's request for review, I have looked through the
> patches at times, and they don't seem to be technically wrong (I
> would have prefered that it use resizable zones rather than new
> sub-zone zones, but hey...).

The resizable zones option was considered as well and it seemed
messier than what the current stuff does. Not only would we have to
deal with overlapping non-contiguous zones, but things like the
page->flags bits identifying which zone a page belongs to would have
to be moved out (not enough bits), and you would get an explosion of
zones like

  ZONE_DMA_UNMOVABLE
  ZONE_DMA_RECLAIMABLE
  ZONE_DMA_MOVABLE
  ZONE_DMA32_UNMOVABLE

etc. Everything else aside, that would interact terribly with reclaim.
In the end, it would also suffer from similar problems with the size
of the RECLAIMABLE areas in comparison to MOVABLE, and resizing zones
would be expensive.

> However I am against the whole direction they go in, so I haven't
> really looked at them lately.
>
> I think the direction we should take is firstly ask whether we can
> do a reasonable job with PAGE_SIZE pages, secondly ask whether we
> can do an acceptable special-case (eg. reserve memory),

Hugepage-wise, memory gets reserved and that is a problem on systems
that have changing requirements for the number of hugepages they need
available; i.e. the current real use cases for the reservation model
have runtime and system-management problems. From what I understand,
some customers have bought bigger machines and not used huge pages
because the reserve model was too difficult to deal with. Base pages
are unusable for memory hot-remove, particularly on ppc64 running
virtual machines, where memory is moved in 16MB chunks between machine
partitions.

> lastly, _actually_ do defragmentation of kernel memory. Anti-frag
> would come somewhere after that last step, as a possible
> optimisation.
>

This is in the wrong order. Defragmentation of memory makes far more
sense when anti-fragmentation is already in place, as there is less
memory that will require moving. Full defragmentation requires
breaking the 1:1 phys:virt mapping or halting the machine to get
useful work done. Anti-fragmentation combined with memory compaction
of MOVABLE pages should handle the situation without breaking 1:1
mappings.

> So I haven't been following where we're at WRT the requirements. Why
> can we not do with PAGE_SIZE pages or memory reserves?

PAGE_SIZE pages cannot grow the hugepage pool, and the size of the
hugepage pool required for the lifetime of the system is not always
known. PPC64 is not able to hot-remove a single page, and the balloon
driver from Xen has its own problems. As already stated, reserves come
with a host of problems of their own that people are not willing to
deal with.

> If it is a matter of efficiency, then how much does it matter, and
> to whom?
>

The kernel already uses huge PTE entries in its portion of the address
space because TLB reach is a real problem. Grouping kernel allocations
together in the same hugepages improves overall performance due to
reduced TLB pressure. This is a general improvement, and how much of
an effect it has depends on the workload and the TLB size.
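To put rough numbers on TLB reach (the 64-entry TLB below is an
assumption for illustration; real TLB geometries vary widely by CPU):

/* Back-of-envelope TLB reach: memory addressable without a miss. */
#include <stdio.h>

int main(void)
{
        const long entries = 64;        /* hypothetical data TLB size */

        printf("4KB pages:  %ld KB\n", entries * 4);    /* 256 KB */
        printf("2MB pages:  %ld MB\n", entries * 2);    /* 128 MB */
        printf("16MB pages: %ld MB\n", entries * 16);   /* 1 GB   */
        return 0;
}

The gap between the first two lines, a factor of 512, is what is being
defended by keeping the kernel portion of the address space backed by
huge page entries.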
From your other mail:

> Oh, and: why won't they get upset if memory does eventually end up
> getting fragmented?

For hugepages, it's annoying because the application has to fall back
to using small pages, which is not always possible, and it loses
performance. I get unhappy emails but the system survives. For memory
hot-remove (be it for virtualisation, power saving or whatever), I get
sent another complaining email because the memory cannot be removed,
but the system again lives. So, for those two use cases, if memory
gets fragmented there is a non-critical bug report and the problem
gets kicked by me with some egg on my face.

Going forward, the large page cache stuff will need to deal with a
situation where contiguous pages are not available. What I see
happening is that an API like buffered_rmqueue() becomes available
that gives back an amount of memory in a list that is as contiguous as
possible. This seems feasible, and it would be best if stats were
maintained on how often contiguous pages were actually used, to
diagnose bug reports that look like "IO performs really well for a few
weeks but then starts slowing up". At worst, it should regress to the
vanilla kernel's performance, at which point I get a complaining email
but, again, the system survives.

SLUB using higher orders needs to be tested but, as it is using lower
orders to begin with, it may not be an issue. If the minimum page size
it uses is fixed, then many blocks within the RECLAIMABLE areas will
be the same size in the vast majority of cases. As they get freed,
they'll be freed at the same minimum order, so it should not hit
external fragmentation problems. This hypothesis will need to be
tested heavily before merging, but the bug reports at least will be
really obvious (system went BANG) and I'm in a position to kick this
quite heavily using the test.kernel.org system.

-- 
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org