* Antifrag patchset comments
@ 2007-04-28 3:46 Christoph Lameter
From: Christoph Lameter @ 2007-04-28 3:46 UTC (permalink / raw)
To: mel; +Cc: linux-mm
I just had a look at the patches in mm....
Ok, so we have the unmovable allocations and then 3 special types:
RECLAIMABLE
Memory can be reclaimed? Ahh this is used for buffer heads
and the like. Allocations that can be reclaimed by some
sort of system action that cannot be directly targeted
at an object?
It seems that you also included temporary allocs here?
MOVABLE
Memory can be moved by going to the page and reclaiming it?
So right now this is only a higher form of RECLAIMABLE.
We currently do not move memory.... so why have it?
MIGRATE_RESERVE
Some atomic reserve to preserve contiguity of allocations?
Or just a fallback if other pools are all used? What is this?
So we have 4 categories. Any additional category causes more overhead on
the pcp lists since we will have to find the correct type on the lists.
Why do we have MIGRATE_RESERVE?
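For reference, my reading is that the GFP mobility bits of an allocation
are translated into one of these free-list types, and that MIGRATE_RESERVE
is never selected directly, only used as a last-resort fallback. A minimal
compilable userspace sketch of that mapping (the flag values and the helper
name are placeholders of mine, not lifted from the patches):

#include <stdio.h>

/* Assumed GFP mobility bits -- placeholders, not the real gfp.h values */
#define __GFP_RECLAIMABLE 0x1u
#define __GFP_MOVABLE     0x2u

/* One free list per type on the pcp/buddy lists */
enum migratetype {
        MIGRATE_UNMOVABLE,      /* kernel allocations that cannot move    */
        MIGRATE_RECLAIMABLE,    /* reclaimable via shrinkers, short-lived */
        MIGRATE_MOVABLE,        /* user pages: reclaim or (later) migrate */
        MIGRATE_RESERVE,        /* min_free_kbytes kept contiguous        */
        MIGRATE_TYPES
};

/* Pick the free list an allocation is served from and freed back to */
static enum migratetype gfpflags_to_migratetype(unsigned int gfp_flags)
{
        if (gfp_flags & __GFP_MOVABLE)
                return MIGRATE_MOVABLE;
        if (gfp_flags & __GFP_RECLAIMABLE)
                return MIGRATE_RECLAIMABLE;
        return MIGRATE_UNMOVABLE;
}

int main(void)
{
        printf("no hint           -> type %d\n", gfpflags_to_migratetype(0));
        printf("__GFP_RECLAIMABLE -> type %d\n",
               gfpflags_to_migratetype(__GFP_RECLAIMABLE));
        printf("__GFP_MOVABLE     -> type %d\n",
               gfpflags_to_migratetype(__GFP_MOVABLE));
        return 0;
}

If that reading is right, every extra type multiplies the per-cpu and buddy
free lists, which is where the pcp overhead mentioned above comes from.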
Then we have ZONE_MOVABLE, whose purpose is to guarantee that a large
portion of memory is always reclaimable and movable. It is carved off
the highest available allocation zone. Very similar to memory policies,
and with the same problems: some nodes do not have the highest zone (many
x86_64 NUMA machines are in that strange situation). Memory policies do
not work quite right there and it seems that the antifrag methods will be
switched off for such a node. Trouble ahead. Why do we need it? To crash
when the kernel does too many unmovable allocs?
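To make the per-node concern concrete, here is roughly how I understand the
kernelcore= carving to work, as a simplified userspace model (the even
per-node split and all the numbers are assumptions of mine; the real code
presumably has to cope with holes and uneven nodes):

#include <stdio.h>

#define NR_NODES 4

int main(void)
{
        /* Pages in the highest zone of each node; node 3 lacks the
         * highest zone entirely, the x86_64 NUMA case above. */
        unsigned long highest_zone[NR_NODES] = { 262144, 262144, 262144, 0 };
        unsigned long kernelcore_pages = 524288;   /* kernelcore= request */
        unsigned long per_node = kernelcore_pages / NR_NODES;

        for (int nid = 0; nid < NR_NODES; nid++) {
                unsigned long kern = per_node < highest_zone[nid]
                                        ? per_node : highest_zone[nid];
                unsigned long movable = highest_zone[nid] - kern;

                printf("node %d: %lu pages stay kernel, %lu become ZONE_MOVABLE\n",
                       nid, kern, movable);
        }
        return 0;
}

A node without the highest zone (node 3 here) simply ends up with no
ZONE_MOVABLE at all, which is the situation I am worried about above.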
Other things:
1. alloc_zeroed_user_highpage is no longer used
It's noted in the patches but it was not removed nor marked
as deprecated.
2. submit_bh allocates bios using __GFP_MOVABLE
How can a bio be moved? Or does that indicate that the
bio can be reclaimed?
3. Highmem pages for user space are marked __GFP_MOVABLE
Looks okay to me. So I guess that __GFP_MOVABLE
implies GFP_RECLAIMABLE? Hmmm... It seems that
mlocked pages are therefore also movable and reclaimable
(not true!). So we still have that problem spot?
4. The default inode alloc mode is set to GFP_HIGH_MOVABLE....
Good.
5. Hugepages are set to movable in some cases.
That is because they are large order allocs and do not
cause fragmentation if all other allocs are smaller. But that
assumption may turn out to be problematic. Huge page allocs marked
as movable may make higher-order allocations problematic if
MAX_ORDER becomes much larger than the huge page order. In
particular on IA64 the huge page order is dynamically settable
on bootup. They can be quite small and thus cause fragmentation
in the movable blocks.
I think it may be possible to make huge pages supported by
page migration in some way, which may justify putting them into
the movable section for all cases. But right now this seems to
be more an x86_64/i386'ism.
6. First, in bdget() we set the mapping for a block device up using
GFP_MOVABLE. However, then in grow_dev_page() for an actual
allocation we will use __GFP_RECLAIMABLE for the block device.
We should use one type, I would think, and it's GFP_MOVABLE as
far as I can tell.
7. dentry allocation uses GFP_KERNEL|__GFP_RECLAIMABLE.
Why not set this by default in the slab allocators if
kmem_cache_create sets up a slab with SLAB_RECLAIM_ACCOUNT?
8. Same occurs for inodes. The reclaim flag should not be specified
for individual allocations since reclaim is a slab-wide
activity. It also has no effect if the object is taken off
a queue.
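In other words, the mobility hint could be derived once at cache-creation
time from SLAB_RECLAIM_ACCOUNT and applied to every backing-page allocation,
instead of each caller passing __GFP_RECLAIMABLE by hand. A rough userspace
sketch of the idea (the structure layout and flag values are invented for
illustration, not the actual slab code):

#include <stdio.h>

#define SLAB_RECLAIM_ACCOUNT 0x00020000u   /* assumed value */
#define __GFP_RECLAIMABLE    0x00080000u   /* assumed value */

struct kmem_cache {
        const char  *name;
        unsigned int slab_flags;   /* SLAB_* flags from cache creation    */
        unsigned int allocflags;   /* GFP bits ORed into every page alloc */
};

/* Derive the mobility type once, when the cache is created */
static void cache_init(struct kmem_cache *s, const char *name,
                       unsigned int slab_flags)
{
        s->name = name;
        s->slab_flags = slab_flags;
        s->allocflags = 0;
        if (slab_flags & SLAB_RECLAIM_ACCOUNT)
                s->allocflags |= __GFP_RECLAIMABLE;
}

/* ...and apply it on every backing-page allocation automatically */
static unsigned int cache_page_gfp(struct kmem_cache *s, unsigned int gfp)
{
        return gfp | s->allocflags;
}

int main(void)
{
        struct kmem_cache dentry_cache;

        cache_init(&dentry_cache, "dentry", SLAB_RECLAIM_ACCOUNT);
        printf("dentry page gfp adds RECLAIMABLE: %s\n",
               cache_page_gfp(&dentry_cache, 0) & __GFP_RECLAIMABLE ?
               "yes" : "no");
        return 0;
}

That would also make points 7 and 8 a non-issue since individual
kmem_cache_alloc() callers would not need to know about mobility at all.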
9. proc_loginuid_write(), do_proc_readlink(), proc_pid_attr_write(), etc.
Why are these allocations reclaimable? Shouldn't these be plain GFP_KERNEL allocs?
These are temporary allocs. What is the benefit of
__GFP_RECLAIMABLE?
10. Radix tree as reclaimable? radix_tree_node_alloc()
Ummm... It's reclaimable in a sense if all the pages are removed
but I'd say not in general.
11. shmem_alloc_page() shmem pages are only __GFP_RECLAIMABLE? They can be
swapped out and moved by page migration, so GFP_MOVABLE?
12. skb slab allocs are marked __GFP_RECLAIMABLE.
Ok, the queues are temporary. __GFP_RECLAIMABLE means a temporary
alloc that will go away? This is a slab that is not using
SLAB_RECLAIM_ACCOUNT. Do we need a SLAB_RECLAIMABLE flag?
13. In the patches it was mentioned that it is no longer necessary
to set min_free_kbytes? What is the current state of that?
14. I am a bit concerned about an increase in the alloc types. There are
two whose purpose I am not sure of: MIGRATE_RESERVE and
MIGRATE_HIGHATOMIC. HIGHATOMIC seems to have been removed again.
15. Tuning for particular workloads.
Another concern: are there patches here that indicate that new alloc types
were created to accommodate certain workloads? The exceptions worry me.
16. Both memory policies and antifrag seem to
determine the highest zone. Memory policies call this the policy
zone. Could you consolidate that code?
17. MAX_ORDER issues. At least on IA64 the antifrag measures will
require a reduction in max order. However, we currently have MAX_ORDER
blocks of 1GB because there are applications using huge pages of 1 gigabyte
(TLB pressure issues on IA64). OTOH, machines exist that only have 1GB of
RAM per node, so it may be difficult to create multiple MAX_ORDER blocks
as needed.
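To put rough numbers on this: if mobility grouping is done per MAX_ORDER
block, a 1GB node has exactly one block to group, while a smaller grouping
unit leaves the allocator something to work with. A quick back-of-the-envelope
sketch (the 16K base page, the block orders and the two bits of per-block
metadata are all assumptions for illustration):

#include <stdio.h>

int main(void)
{
        unsigned long page_size  = 16384;        /* assumed 16K base page */
        unsigned long node_bytes = 1UL << 30;    /* 1GB node              */
        unsigned long node_pages = node_bytes / page_size;

        /* Compare grouping per 1GB block with a smaller block size */
        int orders[] = { 16, 10 };               /* assumed block orders  */

        for (int i = 0; i < 2; i++) {
                unsigned long block_pages = 1UL << orders[i];
                unsigned long blocks = node_pages / block_pages;
                unsigned long bits = blocks * 2; /* assumed 2 bits/block  */

                printf("order %2d block (%lu MB): %lu blocks/node, %lu bits of flags\n",
                       orders[i], block_pages * page_size >> 20, blocks, bits);
        }
        return 0;
}

Grouping on a unit smaller than MAX_ORDER would of course need the mobility
metadata to be tracked at that smaller granularity too.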
I have not gotten my head around how the code in page_alloc.c actually
works. This is just from reviewing comments.
^ permalink raw reply [flat|nested] 16+ messages in thread* Re: Antifrag patchset comments 2007-04-28 3:46 Antifrag patchset comments Christoph Lameter @ 2007-04-28 13:21 ` Mel Gorman 2007-04-28 21:44 ` Christoph Lameter 0 siblings, 1 reply; 16+ messages in thread From: Mel Gorman @ 2007-04-28 13:21 UTC (permalink / raw) To: Christoph Lameter; +Cc: Linux Memory Management List On Fri, 27 Apr 2007, Christoph Lameter wrote: > I just had a look at the patches in mm.... > > Ok so we have the unmovable allocations and then 3 special types > > RECLAIMABLE > Memory can be reclaimed? Ahh this is used for buffer heads > and the like. Allocations that can be reclaimed by some > sort of system action that cannot be directly targeted > at an object? > Exactly. Inode caches currently fall into the same category. When shrink_slab() is called the amount of memory in RECLAIMABLE areas will be reduced. > It seems that you also included temporary allocs here? > Temporary and short-lived allocations are also treated as reclaimable to stop more areas than necessary being marked UNMOVABLE. The fewer UNMOVABLE blocks there are, the better. > MOVABLE > Memory can be moved by going to the page and reclaiming it? > Or potentially with page migration although that code does not exist. MOVABLE memory means just that - it can be moved while the data is still preserved. Moving it to swap is still moving. > So right now this is only a higher form of RECLAIMABLE. > The names used to be RCLM_NORCLM, RCLM_EASY and RCLM_KERN which confused more people, hence the current naming. > We currently do not move memory.... so why have it? > Because I wanted to build memory compaction on top of this when movable memory is not just memory that can go to swap but includes mlocked pages as well > MIGRATE_RESERVE > > Some atomic reserve to preserve contiguity of allocations? > Or just a fallback if other pools are all used? What is this? > The standard allocator keeps high-order pages free until memory pressure forces them to be split. In practice, this means that pages for min_free_kbytes are kept as contiguous pages for quite a long time but once split never become contiguous again. This lets short-lived high-order atomic allocations to work for quite a while which is why setting min_free_kbytes to 16384 seems to let jumbo frames work for a long time. Grouping by mobility is more concerned with the type of page so it breaks up the min_free_kbytes pages early removing a desirable property of the standard allocator for high-order atomic allocations. MIGRATE_RESERVE brings that desirable property back. The number of blocks marked MIGRATE_RESERVE depends on min_free_kbytes and the area is only used when the alternative is to fail the allocation. The effect is that pages kept free for min_free_kbytes tend to exist in these MIGRATE_RESERVE areas as contiguous areas. This is an improvement over what the standard allocator does because it makes no effort to keep the minimum number of free pages contiguous. > So have 4 categories. Any additional category causes more overhead on > the pcp lists since we will have to find the correct type on the lists. > Why do we have MIGRATE_RESERVE? > It resolved a problem with order-1 atomic allocations used by a network adapter when it was using bittorrent heavily. They affected user hasn't complained since. > Then we have ZONE_MOVABLE whose purpose is to guarantee that a large > portion of memory is always reclaimable and movable. Which is pawned off > the highest available allocation zone. Right. 
This is a separate issue to grouping pages by mobility. The memory partition does not require grouping pages by mobility to be available and vice-versa. All they share is the marking of allocations __GFP_MOVABLE. > Very similar to memory policies > same problems. Some nodes do not have the highest zone (many x86_64 > NUMA are in that strange situation). yep. Dealing with only the highest zone made the code manageable, particularly where HIGHMEM was involved although the issue between NORMAL and DMA32 isn't much better. > Memory policies do not work quite > right there and it seems that the antifrag methods will be switched off > for such a node. Not quite. If the zone doesn't exist in a node, it will not be in the zonelists and things plod along as normal. Grouping pages by mobility works independent of memory partitioning so it'll still work in these nodes whether the zone is there is not. > Trouble ahead. Why do we need it? To crash when the > kernel does too many unmovable allocs? > It's needed for a few reasons but the two main ones are; a) grouping pages by mobility does not give guaranteed bounds on how much contiguous memory will be movable. While it could, it would be very complex and would replicate the behavior of zones to the extent I'll get a slap in the head for even trying. Partitioning memory gives hard guarantees on memory availability b) Early feedback was that grouping pages by mobility should be done only with zones but that is very restrictive. Different people liked each approach for different reasons so it constantly went in circles. That is why both can sit side-by-side now The zone is also of interest to the memory hot-remove people. Granted, if kernelcore= is given too small a value, it'll cause problems. > > Other things: > > > 1. alloc_zeroed_user_highpage is no longer used > > Its noted in the patches but it was not removed nor marked > as depreciated. > Indeed. Rather than marking it deprecated I was going to wait until it was unused for one cycle and then mark it deprecated and see who complains. > 2. submit_bh allocates bios using __GFP_MOVABLE > > How can a bio be moved? Or does that indicate that the > bio can be reclaimed? > I consider the pages allocated for the buffer to be movable because the buffers can be cleaned and discarded by standard reclaim. When/if page migration is used, this will have to be revisisted but for the moment I believe it's correct. If the RECLAIMABLE areas could be properly targeted, it would make sense to mark these pages RECLAIMABLE instead but that is not the situation today. > 3. Highmem pages for user space are marked __GFP_MOVABLE > > Looks okay to me. So I guess that __GFP_MOVABLE > implies GFP_RECLAIMABLE? Hmmm... It seems that > mlocked pages are therefore also movable and reclaimable > (not true!). So we still have that problem spot? > No, at worst we have a naming ambiguity which has come up before. RECLAIMABLE refers to allocations that are reclaimable via shrink_slab() or short-lived. MOVABLE pages are reclaimable by pageout or movable with page migration. > 4. Default inode alloc mod is set to GFP_HIGH_MOVABLE.... > > Good. > > 5. Hugepages are set to movable in some cases. > Specifically, they are considered movable when they are allowed to be allocated from ZONE_MOVABLE. So for it to really cause fragmentation, there has to be high-order movable allocations in play using ZONE_MOVABLE. This is currently never the case but the large blocksize stuff may change that. 
> That is because they are large order allocs and do not > cause fragmentation if all other allocs are smaller. But that > assumption may turn out to be problematic. Huge pages allocs > as movable may make higher order allocation problematic if > MAX_ORDER becomes much larger than the huge page order. In > particular on IA64 the huge page order is dynamically settable > on bootup. They can be quite small and thus cause fragmentation > in the movable blocks. > You're right here. I have always considered huge page allocations to be the highest order anything in the system will ever care about. I was not aware of any situation except at boot-time where that is different. What sort of situation do you forsee where the huge page size is not the largest high-order allocation used by the system? Even the large blocksize stuff doesn't seem to apply here. > I think it may be possible to make huge pages supported by > page migration in some way which may justify putting it into > the movable section for all cases. That was the long-term aim. I figured there was no reason that hugepages could not be moved just that it was unnecessary to date. > But right now this seems to be more an x86_64/i386'ism. > Depends on whether IA64 really has situations where allocations of a higher-order than hugepage size are common. > 6. First in bdget() we set the mapping for a block device up using > GFP_MOVABLE. However, then in grow_dev_page for an actual > allocation we will use__GFP_RECLAIMABLE for the block device. > We should use one type I would think and its GFP_MOVABLE as > far as I can tell. > I'll revisit this one. I think it should be __GFP_RECLAIMABLE in both cases because I have a vague memory that pages due to grow_dev_page caused problems fragmentation wise because they could not be reclaimed. That might simply have been an unrelated bug at the time. I've put this on the TODO to investigate further. > 7. dentry allocation uses GFP_KERNEL|__GFP_RECLAIMABLE. > Why not set this by default in the slab allocators if > kmem_cache_create sets up a slab with SLAB_RECLAIM_ACCOUNT? > Because .... errr..... it didn't occur to me. /me adds an item to the TODO list This will simplify one of the patches. Are all slabs with SLAB_RECLAIM_ACCOUNT guaranteed to have a shrinker available either directly or indirectly? > 8. Same occurs for inodes. The reclaim flag should not be specified > for individual allocations since reclaim is a slab wide > activity. It also has no effect if the objects is taken off > a queue. > If SLAB_RECLAIM_ACCOUNT always uses __GFP_RECLAIMABLE, this will be caught too, right? > 9. proc_loginuid_write(), do_proc_readlink(), proc_pid_att_write() etc. > > Why are these allocation reclaimable? Should be GFP_KERNEL alloc there? > > These are temporary allocs. What is the benefit of > __GFP_RECLAIMABLE? > Because they are temporary. I didn't want large bursts of proc activity to cause MAX_ORDER_NR_PAGES blocks to be marked unmovable. > > 10. Radix tree as reclaimable? radix_tree_node_alloc() > > Ummm... Its reclaimable in a sense if all the pages are removed > but I'd say not in general. > I considered them to be indirectly reclaimable. Maybe it wasn't the best choice. > 11. shmem_alloc_page() shmem pages are only __GFP_RECLAIMABLE? They can be > swapped out and moved by page migration, so GFP_MOVABLE? > Because they might be ramfs pages which are not movable - http://lkml.org/lkml/2006/11/24/150 > 12. skbs slab allocs marked GFP_RECLAIMABLE. > > Ok the queues are temporary. 
GFP_RECLAIMABLE means temporary > alloc that will go away? This is a slab that is not using > SLAB_ACCOUNT_RECLAIMABLE. Do we need a SLAB_RECLAIMABLE flag? > I'll add it to the TODO to see what it looks like. > 13. In the patches it was mentioned that it is no longer necessary > to set min_free_kbytes? What is the current state of that? > I ran some tests yesterday. If min_free_kbytes is left untouched, the number of hugepages that can be allocated at the end of the test is very variable: +/- 5% of physical memory on x86_64. When it's set to 4*MAX_ORDER_NR_PAGES, it's +/- 1% generally. I think it's safe to leave the min_free_kbytes as-is for the moment and see what happens. If issues are encountered, I'll be asking that min_free_kbytes be increased on that machine to see if it really makes a difference in practice or not. >From the results I have on x86_64 with 1GB of RAM, grouping page by mobility was able to allocate 69% of memory as 2MB hugepages under heavy load. The standard allocator got 2%. At rest at the end of the test when nothing is running, 72% was available as huge pages when grouping pages by mobility in comparison to 30%. On PPC64 with 4GB of RAM when grouping pages by mobility, 11% was available under load and 57% of memory was available as 16MB huge pages at the end of the test in comparison to 0% with the vanilla allocator under load and 8% at rest. With 1GB of RAM, grouping pages by mobility got 35% of memory as huge pages at the end of the test and the vanilla allocator got 0%. I hope to improve this figure more over time. > 14. I am a bit concerned about an increase in the alloc types. There are > two that I am not sure what their purpose is which is > MIGRATION_RESERVE and MIGRATION_HIGHATOMIC. HIGHATOMIC seems to have > been removed again. > HIGHATOMIC has gone out the door for the moment as MIGRATE_RESERVE does the job of having some contiguous blocks available for high-order atomic allocations better. > 15. Tuning for particular workloads. > > Another concern is are patches here that indicate that new alloc types > were created to accomodate certain workloads? The exceptions worry me. > They are not intentionally aimed at certain workloads. The current tests are known to be very hostile for external fragmentation (e.g. 0% success on PPC64 at the end of tests with the standard allocator). Yesterday in preparation for testing large blocksize patches, I added ltp, dbench and fsxlinux into the tool normally used for testing grouping pages by mobility so the workloads will vary more in the future. I hope to get information on other workloads as the patches get more exposure. > 16. Both memory policies and antifrag seem to > determine the highest zone. Memory policies call this the policy > zone. Could you consolidate that code? > Maybe but probably not - I'll look into it. The problem is that at the time kernelcore= is handled the zones are not initialised yet (again, this is indpendent of grouping pages by mobility) bind_zonelist() appears uses z->present_pages for example which isn't even set at the time ZONE_MOVABLE is setup. > 17. MAX_ORDER issues. At least on IA64 the antifrag measures will > require a reduction in max order. However, we currently have MAX_ORDER > of 1G because there are applications using huge pages of 1 Gigabyte size > (TLB pressure issues on IA64). Ok, this explains why MAX_ORDER is so much larger than what appeared to be the huge page size. 
> OTOH, Machines exist that only have 1GB > RAM per node, so it may be difficult to create multiple MAX_ORDER blocks > as needed. > *ponders* This is the trickest feedback from your review so far. However, mobility types are grouping based on MAX_ORDER_NR_PAGES simply because it was the easiest to implement and made sense at the time. Right now, __rmqueue_smallest() searches up to MAX_ORDER-1 and 2 bits are stored per MAX_ORDER_NR_PAGES tracking the mobility of the group. There is nothing to say that it searches up to some other arbitrary order. The pageblock flags would then need 2 bits per ARBITRARY_ORDER_NR_PAGES instead of MAX_ORDER_NR_PAGES. I'll look into how it can be implemented. I have an IA64 box with just 1GB of RAM here that I can use to test the concept. > I have not gotten my head around how the code in page_alloc.c actually > works. This is just from reviewing comments. > Thanks a lot for looking through them. My TODO list so far from this is 1. Check that bdget() is really doing the right thing with respect to __GFP_RECLAIMABLE 2. Use SLAB_ACCOUNT_RECLAIMBLE to set __GFP_RECLAIMABLE instead of setting flags individually 3. Consider adding a SLAB_RECLAIMABLE where sockets make short-lived allocations 4. Group based on blocks smaller than MAX_ORDER_NR_PAGES -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Antifrag patchset comments 2007-04-28 13:21 ` Mel Gorman @ 2007-04-28 21:44 ` Christoph Lameter 2007-04-30 9:37 ` Mel Gorman 2007-05-01 11:26 ` Nick Piggin 0 siblings, 2 replies; 16+ messages in thread From: Christoph Lameter @ 2007-04-28 21:44 UTC (permalink / raw) To: Mel Gorman, Nick Piggin; +Cc: Linux Memory Management List On Sat, 28 Apr 2007, Mel Gorman wrote: > Because I wanted to build memory compaction on top of this when movable memory > is not just memory that can go to swap but includes mlocked pages as well Ahh. Ok. > > MIGRATE_RESERVE > The standard allocator keeps high-order pages free until memory pressure > forces them to be split. In practice, this means that pages for > min_free_kbytes are kept as contiguous pages for quite a long time but once > split never become contiguous again. This lets short-lived high-order atomic > allocations to work for quite a while which is why setting min_free_kbytes to > 16384 seems to let jumbo frames work for a long time. Grouping by mobility is > more concerned with the type of page so it breaks up the min_free_kbytes pages > early removing a desirable property of the standard allocator for high-order > atomic allocations. MIGRATE_RESERVE brings that desirable property back. Hmmmm... A special pool for atomic allocs... > > Trouble ahead. Why do we need it? To crash when the > > kernel does too many unmovable allocs? > It's needed for a few reasons but the two main ones are; > > a) grouping pages by mobility does not give guaranteed bounds on how much > contiguous memory will be movable. While it could, it would be very > complex and would replicate the behavior of zones to the extent I'll > get a slap in the head for even trying. Partitioning memory gives hard > guarantees on memory availability And crashes the kernel if the availability is no longer guaranteed? > b) Early feedback was that grouping pages by mobility should be > done only with zones but that is very restrictive. Different people > liked each approach for different reasons so it constantly went in > circles. That is why both can sit side-by-side now > > The zone is also of interest to the memory hot-remove people. Indeed that is a good thing.... It would be good if a movable area would be a dynamic split of a zone and not be a separate zone that has to be configured on the kernel command line. > Granted, if kernelcore= is given too small a value, it'll cause problems. That is what I thought. > > 1. alloc_zeroed_user_highpage is no longer used > > Its noted in the patches but it was not removed nor marked > > as depreciated. > Indeed. Rather than marking it deprecated I was going to wait until it was > unused for one cycle and then mark it deprecated and see who complains. I'd say remove it immediately. This is confusing. > > 2. submit_bh allocates bios using __GFP_MOVABLE > > > > How can a bio be moved? Or does that indicate that the > > bio can be reclaimed? > > > > I consider the pages allocated for the buffer to be movable because the > buffers can be cleaned and discarded by standard reclaim. When/if page > migration is used, this will have to be revisisted but for the moment I > believe it's correct. This would make it __GFP_RECLAIMABLE. The same is true for the caches that can be reclaimed. They are not marked __GFP_MOVABLE. > If the RECLAIMABLE areas could be properly targeted, it would make sense to > mark these pages RECLAIMABLE instead but that is not the situation today. What is the problem with targeting? 
> > That is because they are large order allocs and do not > > cause fragmentation if all other allocs are smaller. But that > > assumption may turn out to be problematic. Huge pages allocs > > as movable may make higher order allocation problematic if > > MAX_ORDER becomes much larger than the huge page order. In > > particular on IA64 the huge page order is dynamically settable > > on bootup. They can be quite small and thus cause fragmentation > > in the movable blocks. > You're right here. I have always considered huge page allocations to be the > highest order anything in the system will ever care about. I was not aware of > any situation except at boot-time where that is different. What sort of > situation do you forsee where the huge page size is not the largest high-order > allocation used by the system? Even the large blocksize stuff doesn't seem to > apply here. Boot an IA64 box with the parameter hugepagesz=64k for example. That will give you a huge page size of 64k on a system with MAX_ORDER = 1G. The default for the huge page size is 256k which is a quarter of max order. But some people boot with 1G huge pages. > > 6. First in bdget() we set the mapping for a block device up using > > GFP_MOVABLE. However, then in grow_dev_page for an actual > > allocation we will use__GFP_RECLAIMABLE for the block device. > > We should use one type I would think and its GFP_MOVABLE as > > far as I can tell. > > > > I'll revisit this one. I think it should be __GFP_RECLAIMABLE in both cases > because I have a vague memory that pages due to grow_dev_page caused problems > fragmentation wise because they could not be reclaimed. That might simply have > been an unrelated bug at the time. It depends on who allocates these pages. If they are mapped by the user then they are movable. If a filesystem gets them for metadata then they are reclaimable. > This will simplify one of the patches. Are all slabs with SLAB_RECLAIM_ACCOUNT > guaranteed to have a shrinker available either directly or indirectly? I have not checked that recently but historically yes. There is no point in accounting slabs for reclaim if you cannot reclaim them. > > 8. Same occurs for inodes. The reclaim flag should not be specified > > for individual allocations since reclaim is a slab wide > > activity. It also has no effect if the objects is taken off > > a queue. > > > > If SLAB_RECLAIM_ACCOUNT always uses __GFP_RECLAIMABLE, this will be caught > too, right? Correct. > > 10. Radix tree as reclaimable? radix_tree_node_alloc() > > > > Ummm... Its reclaimable in a sense if all the pages are removed > > but I'd say not in general. > > > > I considered them to be indirectly reclaimable. Maybe it wasn't the best > choice. Maybe we need to ask Nick about this one. > > 11. shmem_alloc_page() shmem pages are only __GFP_RECLAIMABLE? They can be > > swapped out and moved by page migration, so GFP_MOVABLE? > > > > Because they might be ramfs pages which are not movable - > http://lkml.org/lkml/2006/11/24/150 URL does not provide any useful information regarding the issue. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Antifrag patchset comments 2007-04-28 21:44 ` Christoph Lameter @ 2007-04-30 9:37 ` Mel Gorman 2007-04-30 12:35 ` Peter Zijlstra ` (2 more replies) 2007-05-01 11:26 ` Nick Piggin 1 sibling, 3 replies; 16+ messages in thread From: Mel Gorman @ 2007-04-30 9:37 UTC (permalink / raw) To: Christoph Lameter; +Cc: Nick Piggin, Linux Memory Management List On Sat, 28 Apr 2007, Christoph Lameter wrote: > On Sat, 28 Apr 2007, Mel Gorman wrote: > >> Because I wanted to build memory compaction on top of this when movable memory >> is not just memory that can go to swap but includes mlocked pages as well > > Ahh. Ok. > >>> MIGRATE_RESERVE >> The standard allocator keeps high-order pages free until memory pressure >> forces them to be split. In practice, this means that pages for >> min_free_kbytes are kept as contiguous pages for quite a long time but once >> split never become contiguous again. This lets short-lived high-order atomic >> allocations to work for quite a while which is why setting min_free_kbytes to >> 16384 seems to let jumbo frames work for a long time. Grouping by mobility is >> more concerned with the type of page so it breaks up the min_free_kbytes pages >> early removing a desirable property of the standard allocator for high-order >> atomic allocations. MIGRATE_RESERVE brings that desirable property back. > > Hmmmm... A special pool for atomic allocs... > That is not it's intention although it doubles up at that. The intention is to preserve free pages kept for min_free_kbytes as contiguous pages because it's a property of the current allocator that atomic allocations depend on today. >>> Trouble ahead. Why do we need it? To crash when the >>> kernel does too many unmovable allocs? >> It's needed for a few reasons but the two main ones are; >> >> a) grouping pages by mobility does not give guaranteed bounds on how much >> contiguous memory will be movable. While it could, it would be very >> complex and would replicate the behavior of zones to the extent I'll >> get a slap in the head for even trying. Partitioning memory gives hard >> guarantees on memory availability > > And crashes the kernel if the availability is no longer guaranteed? > OOM. >> b) Early feedback was that grouping pages by mobility should be >> done only with zones but that is very restrictive. Different people >> liked each approach for different reasons so it constantly went in >> circles. That is why both can sit side-by-side now >> >> The zone is also of interest to the memory hot-remove people. > > Indeed that is a good thing.... It would be good if a movable area > would be a dynamic split of a zone and not be a separate zone that has to > be configured on the kernel command line. > There are problems with doing that. In particular, the zone can only be sized on one direction and can only be sized at the zone boundary because zones do not currently overlap and I believe there will be assumptions made about them not overlapping within a node. It's worth looking into in the future but I'm putting it at the bottom of the TODO list. >> Granted, if kernelcore= is given too small a value, it'll cause problems. > > That is what I thought. > >>> 1. alloc_zeroed_user_highpage is no longer used >>> Its noted in the patches but it was not removed nor marked >>> as depreciated. >> Indeed. Rather than marking it deprecated I was going to wait until it was >> unused for one cycle and then mark it deprecated and see who complains. > > I'd say remove it immediately. This is confusing. > Ok. >>> 2. 
submit_bh allocates bios using __GFP_MOVABLE >>> >>> How can a bio be moved? Or does that indicate that the >>> bio can be reclaimed? >>> >> >> I consider the pages allocated for the buffer to be movable because the >> buffers can be cleaned and discarded by standard reclaim. When/if page >> migration is used, this will have to be revisisted but for the moment I >> believe it's correct. > > This would make it __GFP_RECLAIMABLE. The same is true for the caches that > can be reclaimed. They are not marked __GFP_MOVABLE. > As we are currently depend on reclaim to free contiguous pages, it works out better *at the moment* to have buffers with other pages reclaimed via the LRU. >> If the RECLAIMABLE areas could be properly targeted, it would make sense to >> mark these pages RECLAIMABLE instead but that is not the situation today. > > What is the problem with targeting? > It's currently not possible to target effectively. >>> That is because they are large order allocs and do not >>> cause fragmentation if all other allocs are smaller. But that >>> assumption may turn out to be problematic. Huge pages allocs >>> as movable may make higher order allocation problematic if >>> MAX_ORDER becomes much larger than the huge page order. In >>> particular on IA64 the huge page order is dynamically settable >>> on bootup. They can be quite small and thus cause fragmentation >>> in the movable blocks. >> >> You're right here. I have always considered huge page allocations to be the >> highest order anything in the system will ever care about. I was not aware of >> any situation except at boot-time where that is different. What sort of >> situation do you forsee where the huge page size is not the largest high-order >> allocation used by the system? Even the large blocksize stuff doesn't seem to >> apply here. > > Boot an IA64 box with the parameter hugepagesz=64k for example. That will > give you a huge page size of 64k on a system with MAX_ORDER = 1G. The > default for the huge page size is 256k which is a quarter of max order. > But some people boot with 1G huge pages. > Right, that's fair enough. Now that I recognise the problem, I can start kicking it. >>> 6. First in bdget() we set the mapping for a block device up using >>> GFP_MOVABLE. However, then in grow_dev_page for an actual >>> allocation we will use__GFP_RECLAIMABLE for the block device. >>> We should use one type I would think and its GFP_MOVABLE as >>> far as I can tell. >>> >> >> I'll revisit this one. I think it should be __GFP_RECLAIMABLE in both cases >> because I have a vague memory that pages due to grow_dev_page caused problems >> fragmentation wise because they could not be reclaimed. That might simply have >> been an unrelated bug at the time. > > It depends on who allocates these pages. If they are mapped by the user > then they are movable. If a filesystem gets them for metadata then they > are reclaimable. > >> This will simplify one of the patches. Are all slabs with SLAB_RECLAIM_ACCOUNT >> guaranteed to have a shrinker available either directly or indirectly? > > I have not checked that recently but historically yes. There is no point > in accounting slabs for reclaim if you cannot reclaim them. > Right, I'll go with the assumption that they somehow all get reclaimed via shrink_icache_memory() for the moment. >>> 8. Same occurs for inodes. The reclaim flag should not be specified >>> for individual allocations since reclaim is a slab wide >>> activity. It also has no effect if the objects is taken off >>> a queue. 
>>> >> >> If SLAB_RECLAIM_ACCOUNT always uses __GFP_RECLAIMABLE, this will be caught >> too, right? > > Correct. > >>> 10. Radix tree as reclaimable? radix_tree_node_alloc() >>> >>> Ummm... Its reclaimable in a sense if all the pages are removed >>> but I'd say not in general. >>> >> >> I considered them to be indirectly reclaimable. Maybe it wasn't the best >> choice. > > Maybe we need to ask Nick about this one. Nick, at what point are nodes allocated with radix_tree_node_alloc() freed? My current understanding is that some get freed when pages are removed from the page cache but I haven't looked closely enough to be certain. >>> 11. shmem_alloc_page() shmem pages are only __GFP_RECLAIMABLE? They can be >>> swapped out and moved by page migration, so GFP_MOVABLE? >>> >> >> Because they might be ramfs pages which are not movable - >> http://lkml.org/lkml/2006/11/24/150 > > URL does not provide any useful information regarding the issue. > Not all pages allocated via shmem_alloc_page() are movable because they may pages for ramfs. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Antifrag patchset comments 2007-04-30 9:37 ` Mel Gorman @ 2007-04-30 12:35 ` Peter Zijlstra 2007-04-30 17:30 ` Christoph Lameter 2007-05-01 13:31 ` Hugh Dickins 2 siblings, 0 replies; 16+ messages in thread From: Peter Zijlstra @ 2007-04-30 12:35 UTC (permalink / raw) To: Mel Gorman; +Cc: Christoph Lameter, Nick Piggin, Linux Memory Management List On Mon, 2007-04-30 at 10:37 +0100, Mel Gorman wrote: > >>> 10. Radix tree as reclaimable? radix_tree_node_alloc() > >>> > >>> Ummm... Its reclaimable in a sense if all the pages are removed > >>> but I'd say not in general. > >>> > >> > >> I considered them to be indirectly reclaimable. Maybe it wasn't the best > >> choice. > > > > Maybe we need to ask Nick about this one. > > Nick, at what point are nodes allocated with radix_tree_node_alloc() > freed? > > My current understanding is that some get freed when pages are removed > from the page cache but I haven't looked closely enough to be certain. Indeed, radix tree nodes are freed when the tree loses elements. Both through freeing nodes that have no elements left, and shrinking the tree when the top node has only the first entry in use. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Antifrag patchset comments 2007-04-30 9:37 ` Mel Gorman 2007-04-30 12:35 ` Peter Zijlstra @ 2007-04-30 17:30 ` Christoph Lameter 2007-04-30 18:33 ` Mel Gorman 2007-05-01 13:31 ` Hugh Dickins 2 siblings, 1 reply; 16+ messages in thread From: Christoph Lameter @ 2007-04-30 17:30 UTC (permalink / raw) To: Mel Gorman; +Cc: Nick Piggin, Linux Memory Management List On Mon, 30 Apr 2007, Mel Gorman wrote: > > Indeed that is a good thing.... It would be good if a movable area > > would be a dynamic split of a zone and not be a separate zone that has to > > be configured on the kernel command line. > There are problems with doing that. In particular, the zone can only be sized > on one direction and can only be sized at the zone boundary because zones do > not currently overlap and I believe there will be assumptions made about them > not overlapping within a node. It's worth looking into in the future but I'm > putting it at the bottom of the TODO list. Its is better to have a dynamic limit rather than OOMing. > > > If the RECLAIMABLE areas could be properly targeted, it would make sense > > > to > > > mark these pages RECLAIMABLE instead but that is not the situation today. > > What is the problem with targeting? > It's currently not possible to target effectively. Could you be more specific? > > > Because they might be ramfs pages which are not movable - > > > http://lkml.org/lkml/2006/11/24/150 > > > > URL does not provide any useful information regarding the issue. > > > > Not all pages allocated via shmem_alloc_page() are movable because they may > pages for ramfs. Not familiar with ramfs. There would have to be work on ramfs to make them movable? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Antifrag patchset comments 2007-04-30 17:30 ` Christoph Lameter @ 2007-04-30 18:33 ` Mel Gorman 0 siblings, 0 replies; 16+ messages in thread From: Mel Gorman @ 2007-04-30 18:33 UTC (permalink / raw) To: Christoph Lameter; +Cc: Nick Piggin, Linux Memory Management List On Mon, 30 Apr 2007, Christoph Lameter wrote: > On Mon, 30 Apr 2007, Mel Gorman wrote: > >>> Indeed that is a good thing.... It would be good if a movable area >>> would be a dynamic split of a zone and not be a separate zone that has to >>> be configured on the kernel command line. >> There are problems with doing that. In particular, the zone can only be sized >> on one direction and can only be sized at the zone boundary because zones do >> not currently overlap and I believe there will be assumptions made about them >> not overlapping within a node. It's worth looking into in the future but I'm >> putting it at the bottom of the TODO list. > > Its is better to have a dynamic limit rather than OOMing. > I'll certainly give the problem a kick. I simply have a strong feeling that dynamically resizing zones will not be very straight-forward and as the zone is manually sized by the administrator, I didn't feel strongly about it being possible for an admin to put his machine in an OOM-able situation. >>>> If the RECLAIMABLE areas could be properly targeted, it would make sense >>>> to >>>> mark these pages RECLAIMABLE instead but that is not the situation today. >>> What is the problem with targeting? >> It's currently not possible to target effectively. > > Could you be more specific? > The situation I wanted to end up with was that a percentage of memory could be reclaimed or moved so that contiguous allocations would succeed. When reclaiming __GFP_MOVABLE, we can use lumpy reclaim to find a suitable area of pages to reclaim. Some of the pages there are buffer pages even though they are not movable in the page migration sense of the word. Given a page allocated for an inode slab cache, we can't reclaim the objects in there in the same way as a buffer page can be cleaned and discared. Hence, to increase the amount of memory that can be reclaimed for contiguous allocations, I group the buffer pages with other movable pages instead of putting them in with __GFP_RECLAIMABLE pages like slab where they are not as useful from a future contiguous allocation perspective. In the event that given a page of slab objects I could be sure of reclaiming all the objects in that page and freeing it, then it would make sense to group buffer pages with those. Does that make sense? >>>> Because they might be ramfs pages which are not movable - >>>> http://lkml.org/lkml/2006/11/24/150 >>> >>> URL does not provide any useful information regarding the issue. >>> >> >> Not all pages allocated via shmem_alloc_page() are movable because they may >> pages for ramfs. > > Not familiar with ramfs. There would have to be work on ramfs to make them > movable? Minimally yes. I haven't looked too closely at the issue yet because to start with, it was enough to know that the pages were not always movable or reclaimable in any way other than deleting files. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Antifrag patchset comments 2007-04-30 9:37 ` Mel Gorman 2007-04-30 12:35 ` Peter Zijlstra 2007-04-30 17:30 ` Christoph Lameter @ 2007-05-01 13:31 ` Hugh Dickins 2 siblings, 0 replies; 16+ messages in thread From: Hugh Dickins @ 2007-05-01 13:31 UTC (permalink / raw) To: Mel Gorman; +Cc: Christoph Lameter, Nick Piggin, Linux Memory Management List On Mon, 30 Apr 2007, Mel Gorman wrote: > On Sat, 28 Apr 2007, Christoph Lameter wrote: > > > > > 11. shmem_alloc_page() shmem pages are only __GFP_RECLAIMABLE? > > > > They can be swapped out and moved by page migration, so GFP_MOVABLE? > > > > > > Because they might be ramfs pages which are not movable - > > > http://lkml.org/lkml/2006/11/24/150 > > > > URL does not provide any useful information regarding the issue. > > Not all pages allocated via shmem_alloc_page() are movable because they may > pages for ramfs. We seem to have a miscommunication here. shmem_alloc_page() is static to mm/shmem.c, is used for all shm/tmpfs data pages (unless CONFIG_TINY_SHMEM), and all those data pages may be swapped out (while not locked in use). ramfs pages cannot be swapped out; but shmem_alloc_page() is not used to allocate them. CONFIG_TINY_SHMEM uses mm/tiny-shmem.c instead of mm/shmem.c, redirecting all shm/tmpfs requests to the simpler but unswappable ramfs. Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Antifrag patchset comments 2007-04-28 21:44 ` Christoph Lameter 2007-04-30 9:37 ` Mel Gorman @ 2007-05-01 11:26 ` Nick Piggin 2007-05-01 12:22 ` Nick Piggin 2007-05-01 16:38 ` Mel Gorman 1 sibling, 2 replies; 16+ messages in thread From: Nick Piggin @ 2007-05-01 11:26 UTC (permalink / raw) To: Christoph Lameter; +Cc: Mel Gorman, Linux Memory Management List, Andrew Morton Christoph Lameter wrote: > On Sat, 28 Apr 2007, Mel Gorman wrote: >>>10. Radix tree as reclaimable? radix_tree_node_alloc() >>> >>> Ummm... Its reclaimable in a sense if all the pages are removed >>> but I'd say not in general. >>> >> >>I considered them to be indirectly reclaimable. Maybe it wasn't the best >>choice. > > > Maybe we need to ask Nick about this one. I guess they are as reclaimable as the pagecache they hold is. Of course, they are yet another type of object that makes higher order reclaim inefficient, regardless of lumpy reclaim etc. ... and also there are things besides pagecache that use radix trees.... I guess you are faced with conflicting problems here. If you do not mark things like radix tree nodes and dcache as reclaimable, then your unreclaimable category gets expanded and fragmented more quickly. On the other hand, if you do mark them (not just radix-trees, but also bios, dcache, various other things) as reclaimable, then they make it more difficult to reclaim from the reclaimable memory, and they also make the reclaimable memory less robust, because you could have pinned dentry, or some other radix tree user in there that cannot be reclaimed. I guess making radix tree nodes reclaimable is probably the best of the two options at this stage. But now that I'm asked, I repeat my dislike for the antifrag patches, because of the above -- ie. they're just a heuristic that slows down the fragmentation of memory rather than avoids it. I really oppose any code that _depends_ on higher order allocations. Even if only used for performance reasons, I think it is sad because a system that eventually gets fragmented will end up with worse performance over time, which is just lame. For those systems that really want a big chunk of memory set aside (for hugepages or memory unplugging), I think reservations are reasonable because they work and are robust. If we ever _really_ needed arbitrary contiguous physical memory for some reason, then I think virtual kernel mapping and true defragmentation would be the logical step. AFAIK, nobody has tried to do this yet it seems like the (conceptually) simplest and most logical way to go if you absolutely need contig memory. But firstly, I think we should fight against needing to do that step. I don't care what people say, we are in some position to influence hardware vendors, and it isn't the end of the world if we don't run optimally on some hardware today. I say we try to avoid higher order allocations. It will be hard to ever remove this large amount of machinery once the code is in. So to answer Andrew's request for review, I have looked through the patches at times, and they don't seem to be technically wrong (I would have prefered that it use resizable zones rather than new sub-zone zones, but hey...). However I am against the whole direction they go in, so I haven't really looked at them lately. I think the direction we should take is firstly ask whether we can do a reasonable job with PAGE_SIZE pages, secondly ask whether we can do an acceptable special-case (eg. reserve memory), lastly, _actually_ do defragmentation of kernel memory. 
Anti-frag would come somewhere after that last step, as a possible optimisation. So I haven't been following where we're at WRT the requirements. Why can we not do with PAGE_SIZE pages or memory reserves? If it is a matter of efficiency, then how much does it matter, and to whom? -- SUSE Labs, Novell Inc. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Antifrag patchset comments 2007-05-01 11:26 ` Nick Piggin @ 2007-05-01 12:22 ` Nick Piggin 2007-05-01 16:38 ` Mel Gorman 1 sibling, 0 replies; 16+ messages in thread From: Nick Piggin @ 2007-05-01 12:22 UTC (permalink / raw) To: Nick Piggin Cc: Christoph Lameter, Mel Gorman, Linux Memory Management List, Andrew Morton Nick Piggin wrote: > So I haven't been following where we're at WRT the requirements. Why > can we not do with PAGE_SIZE pages or memory reserves? If it is a > matter of efficiency, then how much does it matter, and to whom? Oh, and: why won't they get upset if memory does eventually end up getting fragmented? -- SUSE Labs, Novell Inc. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Antifrag patchset comments 2007-05-01 11:26 ` Nick Piggin 2007-05-01 12:22 ` Nick Piggin @ 2007-05-01 16:38 ` Mel Gorman 2007-05-02 2:43 ` Nick Piggin 1 sibling, 1 reply; 16+ messages in thread From: Mel Gorman @ 2007-05-01 16:38 UTC (permalink / raw) To: Nick Piggin Cc: Christoph Lameter, Linux Memory Management List, Andrew Morton On Tue, 1 May 2007, Nick Piggin wrote: > Christoph Lameter wrote: >> On Sat, 28 Apr 2007, Mel Gorman wrote: > >>>> 10. Radix tree as reclaimable? radix_tree_node_alloc() >>>> >>>> Ummm... Its reclaimable in a sense if all the pages are removed >>>> but I'd say not in general. >>>> >>> >>> I considered them to be indirectly reclaimable. Maybe it wasn't the best >>> choice. >> >> >> Maybe we need to ask Nick about this one. > > I guess they are as reclaimable as the pagecache they hold is. Of > course, they are yet another type of object that makes higher order > reclaim inefficient, regardless of lumpy reclaim etc. > That can be said of the reclaimable slab caches as well. That is why they are grouped together. Unlike page cache and buffer pages, the pages involved cannot be freed without another subsystem being involved. > ... and also there are things besides pagecache that use radix trees.... > > I guess you are faced with conflicting problems here. If you do not > mark things like radix tree nodes and dcache as reclaimable, then your > unreclaimable category gets expanded and fragmented more quickly. > This is understood. It's why the principal mobility types are UNMOVABLE, RECLAIMABLE and MOVABLE instead of UNMOVABLE and MOVABLE which was suggested to me in the past. > On the other hand, if you do mark them (not just radix-trees, but also > bios, dcache, various other things) as reclaimable, then they make it > more difficult to reclaim from the reclaimable memory, This is why dcache and various other things with similar difficulty are in the RECLAIMABLE areas, not the MOVABLE area. This is deliberate as they all get to be difficult together. Care is taken to group pages appropriately so that only easily movable allocations are in the MOVABLE area. > and they also > make the reclaimable memory less robust, because you could have pinned > dentry, or some other radix tree user in there that cannot be reclaimed. > Which is why the success rates of hugepage allocation under heavy load depends more on the number of MOVABLE blocks than RECLAIMABLE. > I guess making radix tree nodes reclaimable is probably the best of the > two options at this stage. > The ideal would be that some caches would become directly reclaimable over time including the radix tree nodes. i.e. given a page that belongs to an inode cache that it would be possible to reclaim all the objects within that page and free it. If that was the case for all reclaimable caches, then the RECLAIMABLE portion of memory becomes much more useful. Right now, it depends on a certain amount of luck that randomly freeing cache objects will free contiguous blocks in the RECLAIMABLE area. There was a similar problem for the MOVABLE area until lumpy reclaim targetted its reclaim. Similar targetting of slab pages is desirable. > But now that I'm asked, I repeat my dislike for the antifrag patches, > because of the above -- ie. they're just a heuristic that slows down > the fragmentation of memory rather than avoids it. > > I really oppose any code that _depends_ on higher order allocations. 
> Even if only used for performance reasons, I think it is sad because > a system that eventually gets fragmented will end up with worse > performance over time, which is just lame. Although performance could degrade were fragmentation avoidance ineffective, it seems wrong to miss out on that performance improvement through a dislike of it. Any use of high-order pages for performance will be vunerable to fragmentation and would need to handle it. I would be interested to see any proposed uses both to review them and to see how they interact with fragmentation avoidance, as test cases. > For those systems that really want a big chunk of memory set aside (for > hugepages or memory unplugging), I think reservations are reasonable > because they work and are robust. Reservations don't really work for memory unplugging at all. Hugepage reservations have to be done at boot-time which is a difficult requirement to meet and impossible on batch job and shared systems where reboots do not take place. > If we ever _really_ needed arbitrary > contiguous physical memory for some reason, then I think virtual kernel > mapping and true defragmentation would be the logical step. > Breaking the 1:1 phys:virtual mapping incurs a performance hit that is persistent. Minimally, things like page_to_pfn() are no longer a simply calculation which is a bad enough hit. Worse, the kernel can no longer backed by huge pages because you would have to defragment at the base-page level. The kernel is backed by huge page entries at the moment for a good reason, TLB reach is a real problem. Continuing on, "true defragmentation" would require that the system be halted so that the defragmentation can take place with everything disabled so that the copy can take place and every processes pagetables be updated as pagetables are not always shared. Even if shared, all processes would still have to be halted unless the kernel was fully pagable and we were willing to handle page faults in kernel outside of just the vmalloc area. This is before even considering the problem of how the kernel copies the data between two virtual addresses while it's modifing the page tables it's depending on to read the data. Even more horribly, virtual addresses in the kernel are no longer physically contiguous which will likely cause some problems for drivers and possibly DMA engines. The memory compaction mechanism I have in mind operates on MOVABLE pages only using the page migration mechanism with the view to keeping MOVABLE and RECLAIMABLE pages at opposite end of the zone. It doesn't bring the kernel to the halt like it was a Java Virtual Machine or a lisp interpreter doing garbage collection. > AFAIK, nobody has tried to do this yet it seems like the (conceptually) > simplest and most logical way to go if you absolutely need contig > memory. > I believe there was some work at one point to break the 1:1 phys:virt mapping that Dave Hansen was involved it. It was a non-trivial breakage and AFAIK, it made things pretty slow and lost the backing of the kernel address space with large pages. Much time has been spent making sure the fragmentation avoidance patches did not kill performance. As the fragmentation avoidance stuff improves the TLB usage in the kernel portion of the address space, it improves performance in some cases. That alone should be considered a positive. Here are test figures from an x86_64 without min_free_kbytes adjusted comparing fragmentation avoidance on 2.6.21-rc6-mm1. 
Newer figures are being generated but it takes a long time to go through
it all.

KernBench Comparison
--------------------
                   2.6.21-rc6-mm1-clean  2.6.21-rc6-mm1-list-based   %diff
User   CPU time                   85.55                      86.27  -0.84%
System CPU time                   35.85                      33.67   6.08%
Total  CPU time                   121.4                     119.94   1.20%

Complaints about kernbench as a valid benchmark aside, it is dependent on
the page allocator's performance. The figures show a 1.2% overall
improvement in total CPU time.

The AIM9 results look like

              2.6.21-rc6-mm1-clean  2.6.21-rc6-mm1-list-based
 1 creat-clo            154674.22    171921.35   17247.13  11.15%  File Creations and Closes/second
 2 page_test            184050.99    188470.25    4419.26   2.40%  System Allocations & Pages/second
 3 brk_test            1840486.50   2011331.44  170844.94   9.28%  System Memory Allocations/second
 6 exec_test               224.01       234.71      10.70   4.78%  Program Loads/second
 7 fork_test              3892.04      4325.22     433.18  11.13%  Task Creations/second

More improvements here, although I'll admit aim9 can be unreliable on some
machines.

The allocation of hugepages under load and at rest looks like

HighAlloc Under Load Test Results
                       2.6.21-rc6-mm1-clean  2.6.21-rc6-mm1-list-based
Order                                     9                          9
Allocation type                     HighMem                    HighMem
Attempted allocations                   499                        499
Success allocs                           33                        361
Failed allocs                           466                        138
DMA32 zone allocs                        31                        359
DMA zone allocs                           2                          2
Normal zone allocs                        0                          0
HighMem zone allocs                       0                          0
EasyRclm zone allocs                      0                          0
% Success                                 6                         72

HighAlloc Test Results while Rested
                       2.6.21-rc6-mm1-clean  2.6.21-rc6-mm1-list-based
Order                                     9                          9
Allocation type                     HighMem                    HighMem
Attempted allocations                   499                        499
Success allocs                          154                        366
Failed allocs                           345                        133
DMA32 zone allocs                       152                        364
DMA zone allocs                           2                          2
Normal zone allocs                        0                          0
HighMem zone allocs                       0                          0
EasyRclm zone allocs                      0                          0
% Success                                30                         73

On machines with large TLBs that can fit the entire working set no matter
what, the worst performance regression we've seen is 0.2% in total CPU time
in kernbench, which is comparable to what you'd see between kernel versions.
I didn't spot anything out of the way in the performance figures on
test.kernel.org either since fragmentation avoidance was merged.

> But firstly, I think we should fight against needing to do that step.
> I don't care what people say, we are in some position to influence
> hardware vendors, and it isn't the end of the world if we don't run

This is conflating the large page cache discussion with the fragmentation
avoidance patches. If fragmentation avoidance is merged and the page cache
wants to take advantage of it, it will need to:

a) deal with the lack of availability of contiguous pages if fragmentation
   avoidance is ineffective
b) be reviewed to see what its fragmentation behaviour looks like

Similar comments apply to SLUB if it uses order-1 or order-2 contiguous
pages, although SLUB is different because it'll make most reclaimable
allocations the same order. Hence they'll also get freed at the same order,
so it suffers less from external fragmentation problems due to less mixing
of orders than one might initially suspect.

Ideally, any subsystem using larger pages does a better job than a
"reasonable job". At worst, any use of contiguous pages should continue to
work if they are not available and, at *worst*, its performance should be
comparable to base page usage.

Your assertion seems to be that it's better to always run slow than run
quickly in some situations with the possibility it might slow down later.
We have seen some evidence that fragmentation avoidance gives more
consistent results when running kernbench during the lifetime of the system
than without it.
Without it, there are slowdowns, probably due to reduced TLB reach.

> optimally on some hardware today. I say we try to avoid higher order
> allocations. It will be hard to ever remove this large amount of
> machinery once the code is in.
>
> So to answer Andrew's request for review, I have looked through the
> patches at times, and they don't seem to be technically wrong (I would
> have preferred that it use resizable zones rather than new sub-zone
> zones, but hey...).

The resizable zones option was considered as well and it seemed messier than
what the current stuff does. Not only do we have to deal with overlapping
non-contiguous zones, but things like the page->flags identifying which
zone a page belongs to have to be moved out (not enough bits) and you get
an explosion of zones like

ZONE_DMA_UNMOVABLE
ZONE_DMA_RECLAIMABLE
ZONE_DMA_MOVABLE
ZONE_DMA32_UNMOVABLE

etc. Everything else aside, that will interact terribly with reclaim. In the
end, it would also suffer from similar problems with the size of the
RECLAIMABLE areas in comparison to MOVABLE, and resizing zones would be
expensive.

> However I am against the whole direction they go
> in, so I haven't really looked at them lately.
>
> I think the direction we should take is firstly ask whether we can do
> a reasonable job with PAGE_SIZE pages, secondly ask whether we can do
> an acceptable special-case (eg. reserve memory),

Hugepage-wise, memory gets reserved and it's a problem on systems that have
changing requirements for the number of hugepages they need available, i.e.
the current real use cases for the reservation model have runtime and system
management problems. From what I understand, some customers have bought
bigger machines and not used huge pages because the reserve model was too
difficult to deal with. Base pages are unusable for memory hot-remove,
particularly on ppc64 running virtual machines where it wants to move memory
in 16MB chunks between machine partitions.

> lastly, _actually_
> do defragmentation of kernel memory. Anti-frag would come somewhere
> after that last step, as a possible optimisation.
>

This is in the wrong order. Defragmentation of memory makes way more sense
when anti-fragmentation is already in place; there is less memory that will
require moving. Full defragmentation requires breaking the 1:1 phys:virt
mapping or halting the machine to get useful work done. Anti-fragmentation
using memory compaction of MOVABLE pages should handle the situation without
breaking 1:1 mappings.

> So I haven't been following where we're at WRT the requirements. Why
> can we not do with PAGE_SIZE pages or memory reserves?

PAGE_SIZE pages cannot grow the hugepage pool. The size of the hugepage
pool required for the lifetime of the system is not always known. PPC64 is
not able to hot-remove a single page and the balloon driver from Xen has
its own problems. As already stated, reserves come with their own host of
problems that people are not willing to deal with.

> If it is a
> matter of efficiency, then how much does it matter, and to whom?
>

The kernel already uses huge PTE entries in its portion of the address space
because TLB reach is a real problem. Grouping kernel allocations together in
the same hugepages improves overall performance due to reduced TLB pressure.
This is a general improvement, and how much of an effect it has depends on
the workload and the TLB size.

From your other mail:

> Oh, and: why won't they get upset if memory does eventually end up
> getting fragmented?
For hugepages, it's annoying because the application will have to fallback to using small pages which is not always possible and it loses performance. I get bad emails but the system survives. For memory hot-remove (be it virtualisation, power saving or whatever), I get sent another complaining email because the memory can not be removed but the system again lives. So, for those two use cases, if memory gets fragmented there is a non-critical bug report and the problem gets kicked by me with some egg on my face. Going forward, the large page cache stuff will need to deal with a situation where contiguous pages are not available. What I see happening is that an API like buffered_rmqueue() is available that gives back an amount of memory in a list that is as contiguous as possible. This seems feasible and it would be best if stats were maintained on how often contiguous pages were actually used to diagnose bug reports that look like "IO performs really well for a few weeks but then starts slowing up". At worst it should regress to the vanilla kernels performance at which point I get a complaining email but again, the system survives. SLUB using higher orders needs to be tested but as it is using lower orders to begin with, it may not be an issue. If the minimum page size it uses is fixed, then many blocks within the RECLAIMABLE areas will be the same size in the vast majority of cases. As they get freed, they'll be freeing at the same minimum order so it should not hit external fragmentation problems. This hypothesis will need to be tested heavily before merging but the bug reports at least will be really obvious (system went BANG) and I'm in the position to kick this quite heavily using the test.kernel.org system. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
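To make the "as contiguous as possible" interface Mel sketches above a bit
more concrete, here is a small standalone C model of one plausible fallback
policy: try the largest order that still fits, drop to smaller orders when
an allocation fails, and hand back whatever runs were obtained as a list the
caller copes with. The names (struct run, alloc_chunk(), alloc_page_list())
and the policy itself are invented for illustration; this is not the
kernel's buffered_rmqueue().

/*
 * Standalone model of a "give me nr_pages, as physically contiguous as
 * possible" interface.  alloc_chunk() stands in for an order-N page
 * allocation and fails artificially to simulate fragmentation.
 */
#include <stdio.h>
#include <stdlib.h>

struct run {
	void *base;		/* start of one contiguous run */
	unsigned int order;	/* run length is 1 << order pages */
	struct run *next;
};

/* Stand-in for alloc_pages(order); pretend larger orders often fail. */
static void *alloc_chunk(unsigned int order)
{
	if (order > 2 && (rand() % 4))
		return NULL;
	return malloc((size_t)4096 << order);
}

static struct run *alloc_page_list(unsigned long nr_pages, unsigned int max_order)
{
	struct run *head = NULL;
	unsigned int order = max_order;

	while (nr_pages) {
		void *chunk = NULL;
		struct run *r;

		/* Find the largest order that fits and can be allocated. */
		while ((1UL << order) > nr_pages || !(chunk = alloc_chunk(order))) {
			if (order == 0)
				return head;	/* short list; caller handles it */
			order--;
		}

		r = malloc(sizeof(*r));
		if (!r)
			return head;
		r->base = chunk;
		r->order = order;
		r->next = head;
		head = r;
		nr_pages -= 1UL << order;
	}
	return head;
}

int main(void)
{
	struct run *r;

	for (r = alloc_page_list(64, 5); r; r = r->next)
		printf("run of %2u pages at %p\n", 1u << r->order, r->base);
	return 0;
}

The important property, as described in the mail above, is that the caller
may receive shorter runs (or fewer pages) than it asked for and must degrade
gracefully rather than fail outright.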
* Re: Antifrag patchset comments 2007-05-01 16:38 ` Mel Gorman @ 2007-05-02 2:43 ` Nick Piggin 2007-05-02 12:41 ` Mel Gorman 0 siblings, 1 reply; 16+ messages in thread From: Nick Piggin @ 2007-05-02 2:43 UTC (permalink / raw) To: Mel Gorman; +Cc: Christoph Lameter, Linux Memory Management List, Andrew Morton Mel Gorman wrote: > On Tue, 1 May 2007, Nick Piggin wrote: >> But now that I'm asked, I repeat my dislike for the antifrag patches, >> because of the above -- ie. they're just a heuristic that slows down >> the fragmentation of memory rather than avoids it. >> >> I really oppose any code that _depends_ on higher order allocations. >> Even if only used for performance reasons, I think it is sad because >> a system that eventually gets fragmented will end up with worse >> performance over time, which is just lame. > > > Although performance could degrade were fragmentation avoidance > ineffective, > it seems wrong to miss out on that performance improvement through a > dislike > of it. Any use of high-order pages for performance will be vunerable to > fragmentation and would need to handle it. I would be interested to see > any proposed uses both to review them and to see how they interact with > fragmentation avoidance, as test cases. Not miss out, but use something robust, or try to get the performance some other way. >> For those systems that really want a big chunk of memory set aside (for >> hugepages or memory unplugging), I think reservations are reasonable >> because they work and are robust. > > > Reservations don't really work for memory unplugging at all. Hugepage > reservations have to be done at boot-time which is a difficult requirement > to meet and impossible on batch job and shared systems where reboots do > not take place. You just have to make a tradeoff about how much memory you want to set aside. Note that this memory is not wasted, because it is used for user allocations. So I think the downsides of reservations are really overstated. Note that in a batch environment where reboots do not take place, the anti-frag patches can eventually stop working, but the reservations will not. AFAIK, reservations work for hypervisor type memory unplugging. For arbitrary physical memory unplug, I doubt the anti-frag patches work either. You'd need hardware support or virtually mapped kernel for that. >> If we ever _really_ needed arbitrary >> contiguous physical memory for some reason, then I think virtual kernel >> mapping and true defragmentation would be the logical step. >> > > Breaking the 1:1 phys:virtual mapping incurs a performance hit that is > persistent. Minimally, things like page_to_pfn() are no longer a simply > calculation which is a bad enough hit. Worse, the kernel can no longer > backed by > huge pages because you would have to defragment at the base-page level. The > kernel is backed by huge page entries at the moment for a good reason, > TLB reach is a real problem. Yet this is what you _have_ to do if you must use arbitrary physical memory. And I haven't seen any numbers posted. > Continuing on, "true defragmentation" would require that the system be > halted so that the defragmentation can take place with everything disabled > so that the copy can take place and every processes pagetables be updated > as pagetables are not always shared. Even if shared, all processes would > still have to be halted unless the kernel was fully pagable and we were > willing to handle page faults in kernel outside of just the vmalloc area. 
vunmap doesn't need to run with the system halted, so I don't see why unmapping the source page would need to. I don't know why we'd need to handle a full page fault in the kernel if the critical part of the defrag code runs atomically and replaces the pte when it is done. > This is before even considering the problem of how the kernel copies the > data between two virtual addresses while it's modifing the page tables > it's depending on to read the data. What's the problem: map the source page into a special area, unmap it from its normal address, allocate a new page, copy the data, swap the mapping. > Even more horribly, virtual addresses > in the kernel are no longer physically contiguous which will likely cause > some problems for drivers and possibly DMA engines. Of course it is trivial to _get_ physically contiguous, virtually contiguous pages, because now you actually have a mechanism to do so. >> AFAIK, nobody has tried to do this yet it seems like the (conceptually) >> simplest and most logical way to go if you absolutely need contig >> memory. >> > > I believe there was some work at one point to break the 1:1 phys:virt > mapping > that Dave Hansen was involved it. It was a non-trivial breakage and AFAIK, > it made things pretty slow and lost the backing of the kernel address space > with large pages. Much time has been spent making sure the fragmentation > avoidance patches did not kill performance. As the fragmentation avoidance > stuff improves the TLB usage in the kernel portion of the address space, it > improves performance in some cases. That alone should be considered a > positive. > > Here are test figures from an x86_64 without min_free_kbytes adjusted > comparing fragmentation avoidance on 2.6.21-rc6-mm1. Newer figures are > being generated but it takes a long time to go through it all. It isn't performance of your patches I'm so worried about. It is that they only slow down the rate of fragmentation, so why do we want to add them and why can't we use something more robust? hugepages are a good example of where you can use reservations. You could even use reservations for higher order pagecache (rather than crapping the whole thing up with small-pages fallbacks everywhere). >> But firstly, I think we should fight against needing to do that step. >> I don't care what people say, we are in some position to influence >> hardware vendors, and it isn't the end of the world if we don't run > > > This is conflating the large page cache discussion with the fragmentation > avoidance patches. If fragmentation avoidance is merged and the page cache > wants to take advantage of it, it will need to; I don't think it is. Because the only reason to need more than a couple of physically contiguous pages is to work around hardware limitations or inefficiency. > a) deal with the lack of availability of contiguous pages if fragmentation > avoidance is ineffective > b) be reviewed to see what its fragmentation behaviour looks like > > Similar comments apply to SLUB if it uses order-1 or order-2 contiguous > pages although SLUB is different because as it'll make most reclaimable > allocations the same order. Hence they'll also get freed at the same order > so it suffers less from external fragmentation problems due to less mixing > of orders than one might initially suspect. Surely you can still have failure cases where you get fragmentation in your unmovable thingy. > Ideally, any subsystem using larger pages does a better job than a > "reasonable > job". 
At worst, any use of contiguous pages should continue to work if they > are not available and at *worst*, it's performance should comparable to > base > page usage. > > Your assertion seems to be that it's better to always run slow than run > quickly in some situations with the possibility it might slow down > later. We > have seen some evidence that fragmentation avoidance gives more consistent > results when running kernbench during the lifetime of the system than > without > it. Without it, there are slowdowns probably due to reduced TLB reach. No. My assertion is that we should speed things up in other ways, eg. by making the small pages case faster or by using something robust like reservations. On a lot of systems it is actually quite a problem if performance slows down over time, regardless of whether the base performance is about the same as a non-slowing kernel. >> optimally on some hardware today. I say we try to avoid higher order >> allocations. It will be hard to ever remove this large amount of >> machinery once the code is in. >> >> So to answer Andrew's request for review, I have looked through the >> patches at times, and they don't seem to be technically wrong (I would >> have prefered that it use resizable zones rather than new sub-zone >> zones, but hey...). > > > The resizable zones option was considered as well and it seemed messier > than > what the current stuff does. Not only do we have to deal with overlapping > non-contiguous zones, We have to do that anyway, don't we? > but things like the page->flags identifying which > zone a page belongs to have to be moved out (not enough bits) Another 2 bits? I think on most architectures that should be OK, shouldn't it? > and you get > an explosion of zones like > > ZONE_DMA_UNMOVABLE > ZONE_DMA_RECLAIMABLE > ZONE_DMA_MOVABLE > ZONE_DMA32_UNMOVABLE So of course you don't make them visible to the API. Just select them based on your GFP_ movable flags. > etc. What is etc? Are those the best reasons why this wasn't made to use zones? > Everything else aside, that will interact terribly with reclaim. Why? And why does the current scheme not? Doesn't seem like it would have to be a given. You _are_ allowed to change some things. > In the end, it would also suffer from similar problems with the size of > the RECLAIMABLE areas in comparison to MOVABLE and resizing zones would > be expensive. Why is it expensive but resizing your other things is not? And if you already have non-contiguous overlapping zones, you _could_ even just make them all the same size and just move pages between them. > This is in the wrong order. Defragmentation of memory makes way more sense > when anti-fragmentation is already in place. There is less memory that > will require moving. Full defragmentation requires breaking 1:1 phys:virt > mapping or halting the machine to get useful work done. Anti-fragmentation > using memory compaction of MOVABLE pages should handle the situation > without > breaking 1:1 mappings. My arguments are about anti-fragmentation _not_ making sense without defragmentation. >> So I haven't been following where we're at WRT the requirements. Why >> can we not do with PAGE_SIZE pages or memory reserves? > > > PAGE_SIZE pages cannot grow the hugepage pool. The size of the hugepage > pool required for the lifetime of the system is not always known. PPC64 is > not able to hot-remove a single page and the balloon driver from Xen has > it's own problems. 
As already stated, reserves come with their own host of > problems that people are not willing to deal with. I don't understand exactly what you mean? You don't have a hugepage pool, but an always-reclaimable pool. So you can use this for any kind of pagecache and even anonymous and mlocked memory assuming you account for it correctly so it can be moved away if needed. -- SUSE Labs, Novell Inc. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
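The copy-and-swap step Nick describes ("map the source page into a special
area, unmap it from its normal address, allocate a new page, copy the data,
swap the mapping") can be pictured with the toy C model below, where an
array of pointers stands in for kernel PTEs. It is only a sketch of the data
movement; the synchronisation against concurrent writers and the TLB
flushing that the thread argues about are exactly the parts it leaves out.

/*
 * Toy model of the copy-and-remap step: the "page table" is an array of
 * pointers standing in for kernel PTEs, and relocating a virtual page means
 * allocating a new backing page, copying, and swapping the pointer.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE	4096
#define NR_VPAGES	16

static void *pte[NR_VPAGES];	/* vpage index -> backing "physical" page */

static int relocate_vpage(unsigned int vpage)
{
	void *old = pte[vpage];
	void *new = malloc(PAGE_SIZE);	/* destination page */

	if (!new)
		return -1;

	memcpy(new, old, PAGE_SIZE);	/* 1. copy while old mapping is readable */
	pte[vpage] = new;		/* 2. swap the mapping (TLB flush elided) */
	free(old);			/* 3. old physical page is now free */
	return 0;
}

int main(void)
{
	unsigned int i;

	for (i = 0; i < NR_VPAGES; i++) {
		pte[i] = malloc(PAGE_SIZE);
		memset(pte[i], i, PAGE_SIZE);
	}

	relocate_vpage(3);
	printf("vpage 3 still reads back %d after relocation\n",
	       *(unsigned char *)pte[3]);

	for (i = 0; i < NR_VPAGES; i++)
		free(pte[i]);
	return 0;
}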
* Re: Antifrag patchset comments 2007-05-02 2:43 ` Nick Piggin @ 2007-05-02 12:41 ` Mel Gorman 2007-05-04 6:16 ` Nick Piggin 0 siblings, 1 reply; 16+ messages in thread From: Mel Gorman @ 2007-05-02 12:41 UTC (permalink / raw) To: Nick Piggin Cc: Christoph Lameter, Linux Memory Management List, Andrew Morton On Wed, 2 May 2007, Nick Piggin wrote: > Mel Gorman wrote: > > On Tue, 1 May 2007, Nick Piggin wrote: > > > > But now that I'm asked, I repeat my dislike for the antifrag patches, > > > because of the above -- ie. they're just a heuristic that slows down > > > the fragmentation of memory rather than avoids it. > > > > > > I really oppose any code that _depends_ on higher order allocations. > > > Even if only used for performance reasons, I think it is sad because > > > a system that eventually gets fragmented will end up with worse > > > performance over time, which is just lame. > > > > > > Although performance could degrade were fragmentation avoidance > > ineffective, > > it seems wrong to miss out on that performance improvement through a > > dislike > > of it. Any use of high-order pages for performance will be vunerable to > > fragmentation and would need to handle it. I would be interested to see > > any proposed uses both to review them and to see how they interact with > > fragmentation avoidance, as test cases. > > Not miss out, but use something robust, or try to get the performance > some other way. > > > > > For those systems that really want a big chunk of memory set aside > > > (for > > > hugepages or memory unplugging), I think reservations are reasonable > > > because they work and are robust. > > > > > > Reservations don't really work for memory unplugging at all. Hugepage > > reservations have to be done at boot-time which is a difficult > > requirement > > to meet and impossible on batch job and shared systems where reboots do > > not take place. > > You just have to make a tradeoff about how much memory you want to set > aside. This tradeoff in sizing the reservation is something that users of shared systems have real problems with because a hugepool once sized can only be used for hugepage allocations. One compromise lead to the development of ZONE_MOVABLE where a portion of memory could be set aside that was usable for small pages but that the huge page pool could borrow from. > Note that this memory is not wasted, because it is used for user > allocations. So I think the downsides of reservations are really > overstated. > This does sound like you think the first step here would be a zone based reservation system. Would you support inclusion of the ZONE_MOVABLE part of the patch set? > Note that in a batch environment where reboots do not take place, the > anti-frag patches can eventually stop working, but the reservations will > not. > Again, does this imply that you're happy for ZONE_MOVABLE to go through? > AFAIK, reservations work for hypervisor type memory unplugging. For > arbitrary physical memory unplug, I doubt the anti-frag patches work > either. You'd need hardware support or virtually mapped kernel for that. > The grouping pages by mobility pages alone helps unplugging of 16MB memory sections on ppc64 (the minimum hypervisor allocation), where all sections are interchangable so any section will do. I believe that Yasunori Goto is looking at using ZONE_MOVABLE for unplugging larger regions. 
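As a rough picture of the ZONE_MOVABLE arrangement being discussed here,
the sketch below shows how zone selection could be keyed off the allocation
flags so that only allocations explicitly flagged as movable are placed in
the movable zone, while everything else stays in the ordinary zones. The
flag values and the exact rule are simplifying assumptions for illustration
and do not reproduce the patch set's actual gfp_zone() code.

/*
 * Simplified zone selection with a movable zone: only allocations that are
 * both highmem-capable and flagged movable may be satisfied from
 * ZONE_MOVABLE.
 */
#include <stdio.h>

enum zone_type { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM, ZONE_MOVABLE };

#define __GFP_DMA	0x01u
#define __GFP_HIGHMEM	0x02u
#define __GFP_MOVABLE	0x08u	/* allocation can be migrated or reclaimed */

static enum zone_type gfp_zone(unsigned int flags)
{
	if ((flags & (__GFP_HIGHMEM | __GFP_MOVABLE)) ==
	    (__GFP_HIGHMEM | __GFP_MOVABLE))
		return ZONE_MOVABLE;
	if (flags & __GFP_HIGHMEM)
		return ZONE_HIGHMEM;
	if (flags & __GFP_DMA)
		return ZONE_DMA;
	return ZONE_NORMAL;
}

int main(void)
{
	printf("user page cache -> zone %d\n",
	       gfp_zone(__GFP_HIGHMEM | __GFP_MOVABLE));
	printf("kernel slab     -> zone %d\n", gfp_zone(0));
	return 0;
}

The "borrowing" described above falls out of this: the huge page pool can be
grown from ZONE_MOVABLE because hugepages flagged movable select it, while
unflagged kernel allocations can never spill into it.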
> > > If we ever _really_ needed arbitrary > > > contiguous physical memory for some reason, then I think virtual > > > kernel > > > mapping and true defragmentation would be the logical step. > > > > > > > Breaking the 1:1 phys:virtual mapping incurs a performance hit that is > > persistent. Minimally, things like page_to_pfn() are no longer a simply > > calculation which is a bad enough hit. Worse, the kernel can no longer > > backed by > > huge pages because you would have to defragment at the base-page level. > > The > > kernel is backed by huge page entries at the moment for a good reason, > > TLB reach is a real problem. > > Yet this is what you _have_ to do if you must use arbitrary physical > memory. And I haven't seen any numbers posted. > Numbers require an implementation and that is a non-trivial undertaking. I've cc'd Dave Hansen who I believe tried breaking 1:1 phys:virtual mapping some time in the past. He might have further comments to make. > > Continuing on, "true defragmentation" would require that the system be > > halted so that the defragmentation can take place with everything > > disabled > > so that the copy can take place and every processes pagetables be > > updated > > as pagetables are not always shared. Even if shared, all processes > > would > > still have to be halted unless the kernel was fully pagable and we were > > willing to handle page faults in kernel outside of just the vmalloc > > area. > > vunmap doesn't need to run with the system halted, so I don't see why > unmapping the source page would need to. > vunmap() is freeing an address range where it knows it is the only accessor of any data in that range. It's not the same when there are other processes potentially memory in the same area at the same time expecting it to exist. > I don't know why we'd need to handle a full page fault in the kernel if > the critical part of the defrag code runs atomically and replaces the > pte when it is done. > And how exactly would one atomically copy a page of data, update the page tables and flush the TLB without stalling all writers? The setup would have to mark the PTE for that area read-only and flush the TLB so that other processes will fault on write and wait until the migration has completed before retrying the fault. That would allow the data to be safely read and copied to somewhere else. It would be at least feasible to back SLAB_RECLAIM_ACCOUNT slabs by a virtual map for the purposes of defragmenting it like this. However, it would work better in conjunction with fragmentation avoidance instead of replacing it because the fragmentation avoidance mechanism could be easily used to group virtually-backed allocations together in the same physical blocks as much as possible to reduce future migration work. > > This is before even considering the problem of how the kernel copies the > > data between two virtual addresses while it's modifing the page tables > > it's depending on to read the data. > > What's the problem: map the source page into a special area, unmap it > from its normal address, allocate a new page, copy the data, swap the > mapping. > You'd have to do something like I described above to handle synchronous writes to the area during defragmentation. > > > Even more horribly, virtual addresses > > in the kernel are no longer physically contiguous which will likely > > cause > > some problems for drivers and possibly DMA engines. 
> > Of course it is trivial to _get_ physically contiguous, virtually > contiguous pages, because now you actually have a mechanism to do so. > I think that would require that the kernel portion have a split between the vmap() like area and a 1:1 virt:phys area - i.e. similar to today except the vmalloc() region is bigger. It is difficult to predict what the impact of a much expanded use of the vmalloc area would be. > > > AFAIK, nobody has tried to do this yet it seems like the > > > (conceptually) > > > simplest and most logical way to go if you absolutely need contig > > > memory. > > > > > > > I believe there was some work at one point to break the 1:1 phys:virt > > mapping > > that Dave Hansen was involved it. It was a non-trivial breakage and > > AFAIK, > > it made things pretty slow and lost the backing of the kernel address > > space > > with large pages. Much time has been spent making sure the > > fragmentation > > avoidance patches did not kill performance. As the fragmentation > > avoidance > > stuff improves the TLB usage in the kernel portion of the address space, > > it > > improves performance in some cases. That alone should be considered a > > positive. > > > > Here are test figures from an x86_64 without min_free_kbytes adjusted > > comparing fragmentation avoidance on 2.6.21-rc6-mm1. Newer figures are > > being generated but it takes a long time to go through it all. > > It isn't performance of your patches I'm so worried about. It is that > they only slow down the rate of fragmentation, so why do we want to add > them and why can't we use something more robust? > Because as I've maintained for quite some time, I see the patches as a pre-requisite for a more complete and robust solution for dealing with external fragmentation. I see the merits of what you are suggesting but feel it can be built up incrementally starting with the fragmentation avoidance stuff, then compacting MOVABLE pages towards the end of the zone before finally dealing with full defragmentation. But I am reluctant to built large bodies of work on top of a foundation with an uncertain future. > hugepages are a good example of where you can use reservations. > Except that it has to be sized at boot-time, can never grow and users find it very inflexible in the real world where requirements change over time and a reboot is required to effectively change these reservations. > You could even use reservations for higher order pagecache (rather than > crapping the whole thing up with small-pages fallbacks everywhere). > True, although that means that an administrator is then required to size their buffer cache at boot time if they are using high order pagecache. I doubt they'll like that any more than sizing a hugepage pool. > > > But firstly, I think we should fight against needing to do that step. > > > I don't care what people say, we are in some position to influence > > > hardware vendors, and it isn't the end of the world if we don't run > > > > > > This is conflating the large page cache discussion with the > > fragmentation > > avoidance patches. If fragmentation avoidance is merged and the page > > cache > > wants to take advantage of it, it will need to; > > I don't think it is. Because the only reason to need more than a couple > of physically contiguous pages is to work around hardware limitations or > inefficiency. > A low TLB reach with base page size is a real problem that some classes of users have to deal with. 
Sometimes there just is no easy way around having to deal with large amounts of data at the same time. > > > a) deal with the lack of availability of contiguous pages if > > fragmentation > > avoidance is ineffective > > b) be reviewed to see what its fragmentation behaviour looks like > > > > Similar comments apply to SLUB if it uses order-1 or order-2 contiguous > > pages although SLUB is different because as it'll make most reclaimable > > allocations the same order. Hence they'll also get freed at the same > > order > > so it suffers less from external fragmentation problems due to less > > mixing > > of orders than one might initially suspect. > > Surely you can still have failure cases where you get fragmentation in > your > unmovable thingy. > Possibly, it's simply not known what those failure cases are. I'll be writing some statistics gathering code as suggested by Christoph to identify situations where it broke down. In the SLUB case, I would push that only SLAB_RECLAIM_ACCOUNT allocations use higher orders to start with to avoid unmovable allocations spilling out due to them requiring higher orders. > > Ideally, any subsystem using larger pages does a better job than a > > "reasonable > > job". At worst, any use of contiguous pages should continue to work if > > they > > are not available and at *worst*, it's performance should comparable to > > base > > page usage. > > > > Your assertion seems to be that it's better to always run slow than run > > quickly in some situations with the possibility it might slow down > > later. We > > have seen some evidence that fragmentation avoidance gives more > > consistent > > results when running kernbench during the lifetime of the system than > > without > > it. Without it, there are slowdowns probably due to reduced TLB reach. > > No. My assertion is that we should speed things up in other ways, eg. The principal reason I developed fragmentation avoidance was to relax restrictions on the resizing of the huge page pool where it's not a question of poor performance, it's a question of simply not working. The large page cache stuff arrived later as a potential additional benefiticary of lower fragmentation as well as SLUB. > by making the small pages case faster or by using something robust > like reservations. On a lot of systems it is actually quite a problem > if performance slows down over time, regardless of whether the base > performance is about the same as a non-slowing kernel. > > > > > optimally on some hardware today. I say we try to avoid higher order > > > allocations. It will be hard to ever remove this large amount of > > > machinery once the code is in. > > > > > > So to answer Andrew's request for review, I have looked through the > > > patches at times, and they don't seem to be technically wrong (I would > > > have prefered that it use resizable zones rather than new sub-zone > > > zones, but hey...). > > > > > > The resizable zones option was considered as well and it seemed messier > > than > > what the current stuff does. Not only do we have to deal with > > overlapping > > non-contiguous zones, > > We have to do that anyway, don't we? > Where do we deal with overlapping non-contiguous zones within a node today? > > but things like the page->flags identifying which > > zone a page belongs to have to be moved out (not enough bits) > > Another 2 bits? I think on most architectures that should be OK, > shouldn't it? > page->flags is not exactly flush with space. 
The last I heard, there were 3 bits free and there was work being done to remove some of them so more could be used. > > and you get > > an explosion of zones like > > > > ZONE_DMA_UNMOVABLE > > ZONE_DMA_RECLAIMABLE > > ZONE_DMA_MOVABLE > > ZONE_DMA32_UNMOVABLE > > So of course you don't make them visible to the API. Just select them > based on your GFP_ movable flags. > Just because they are invisible to the API does not mean they are invisible to the size of pgdat->node_zones[] and the size of the zone fallback lists. Christoph will eventually complain about the number of zones having doubled or tripled. > > > etc. > > What is etc? Are those the best reasons why this wasn't made to use zones? > No, I simply thought those problems were bad enough without going into additional ones - here's another one. If a block of pages has to move between zones, page->flags has to be updated which means a lock to the page has to be acquired to guard against concurrent use before moving the zone. Grouping pages by mobility updates two bits indicating where all pages in the block should belong to on free and moves the currently free pages leaving the other pages alone. On free, they get placed on the correct list. > > > Everything else aside, that will interact terribly with reclaim. > > Why? Because reclaim is based on zones. Due to zone fallbacks, there will be LRU pages in each of the zones unless strict partitioning is used. That means when reclaiming ZONE_NORMAL pages for example, reclaim may need to be triggered in ZONE_NORMAL_UNMOVABLE, ZONE_NORMAL_MOVABLE and ZONE_NORMAL_RECLAIMABLE. If strict partitioning us used, then the size of the pools has to be carefully balanced or the system goes bang. > And why does the current scheme not? Doesn't seem like it would have > to be a given. You _are_ allowed to change some things. > The current scheme does not impact reclaim because the LRU lists remain exactly as they are. > > > In the end, it would also suffer from similar problems with the size of > > the RECLAIMABLE areas in comparison to MOVABLE and resizing zones would > > be expensive. > > Why is it expensive but resizing your other things is not? Because to resize currently, the bits representing the block are updated and the free pages only are moved. We don't have to deal with the pages already in use, particularly awkward ones like free per-cpu pages which we cannot get a lock on. > And if you > already have non-contiguous overlapping zones, you _could_ even just > make them all the same size and just move pages between them. > The pages in use would have to have their page->flags updated so that page_zone() will resolve correctly and that is not cheap (it might not even be safe in all cases like the per-cpu pages). > > > This is in the wrong order. Defragmentation of memory makes way more > > sense > > when anti-fragmentation is already in place. There is less memory that > > will require moving. Full defragmentation requires breaking 1:1 > > phys:virt > > mapping or halting the machine to get useful work done. > > Anti-fragmentation > > using memory compaction of MOVABLE pages should handle the situation > > without > > breaking 1:1 mappings. > > My arguments are about anti-fragmentation _not_ making sense without > defragmentation. > I have repeatadly asserted that I'm perfectly happy to build defragmentation on top of anti-fragmentation. I consider anti-fragmentation to be a sensible prerequisite to defragmentation. 
I believe defragmentation can be taken a long way before the 1:1 phys:virt mapping is broken when anti-fragmentation is involved. > > > > So I haven't been following where we're at WRT the requirements. Why > > > can we not do with PAGE_SIZE pages or memory reserves? > > > > > > PAGE_SIZE pages cannot grow the hugepage pool. The size of the hugepage > > pool required for the lifetime of the system is not always known. PPC64 > > is > > not able to hot-remove a single page and the balloon driver from Xen has > > it's own problems. As already stated, reserves come with their own host > > of > > problems that people are not willing to deal with. > > I don't understand exactly what you mean? You don't have a hugepage pool, I was referring to /proc/sys/vm/nr_hugepages. I thought you were suggesting that it be refilled with PAGE_SIZE pages. > but an always-reclaimable pool. So you can use this for any kind of > pagecache and even anonymous and mlocked memory assuming you account for > it correctly so it can be moved away if needed. > The always-reclaimable pool is not treated as a static-sized thing although it can be with ZONE_MOVABLE if that is really what the user requires. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
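The "two bits indicating where all pages in the block should belong"
bookkeeping mentioned earlier in this mail can be pictured as a small packed
bitmap indexed by page frame number, as in the sketch below. The block size,
array sizing and helper names are invented for the example; the patches
store the real bits alongside the zone rather than in a global array.

/*
 * Model of per-block mobility bits: every aligned block of pages records
 * which free list (migrate type) its pages return to when freed.
 */
#include <stdio.h>

enum migratetype {
	MIGRATE_UNMOVABLE,
	MIGRATE_RECLAIMABLE,
	MIGRATE_MOVABLE,
	MIGRATE_RESERVE,
};

#define BLOCK_ORDER	10			/* pages per block = 1 << 10 */
#define NR_PAGES	(1u << 20)
#define NR_BLOCKS	(NR_PAGES >> BLOCK_ORDER)

/* Two bits per block, packed 16 blocks per 32-bit word. */
static unsigned int block_bits[NR_BLOCKS / 16];

static void set_block_migratetype(unsigned long pfn, enum migratetype mt)
{
	unsigned long block = pfn >> BLOCK_ORDER;
	unsigned int shift = (block % 16) * 2;

	block_bits[block / 16] &= ~(3u << shift);
	block_bits[block / 16] |= (unsigned int)mt << shift;
}

static enum migratetype get_block_migratetype(unsigned long pfn)
{
	unsigned long block = pfn >> BLOCK_ORDER;
	unsigned int shift = (block % 16) * 2;

	return (block_bits[block / 16] >> shift) & 3u;
}

int main(void)
{
	set_block_migratetype(0x12345, MIGRATE_RECLAIMABLE);
	printf("pfn 0x12345 frees back to migrate type %d\n",
	       get_block_migratetype(0x12345));
	return 0;
}

Because only these bits and the free pages are touched when a block changes
type, resizing is cheap compared with rewriting page->flags for every page
in use, which is the point Mel makes above about moving pages between zones.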
* Re: Antifrag patchset comments 2007-05-02 12:41 ` Mel Gorman @ 2007-05-04 6:16 ` Nick Piggin 2007-05-04 6:55 ` Nick Piggin 2007-05-08 9:23 ` Mel Gorman 0 siblings, 2 replies; 16+ messages in thread From: Nick Piggin @ 2007-05-04 6:16 UTC (permalink / raw) To: Mel Gorman; +Cc: Christoph Lameter, Linux Memory Management List, Andrew Morton Mel Gorman wrote: > On Wed, 2 May 2007, Nick Piggin wrote: > >> Mel Gorman wrote: >> > reservations have to be done at boot-time which is a difficult >> > requirement >> > to meet and impossible on batch job and shared systems where reboots do >> > not take place. >> >> You just have to make a tradeoff about how much memory you want to set >> aside. > > > This tradeoff in sizing the reservation is something that users of shared > systems have real problems with because a hugepool once sized can only be > used for hugepage allocations. One compromise lead to the development of > ZONE_MOVABLE where a portion of memory could be set aside that was usable > for small pages but that the huge page pool could borrow from. What's wrong with that? Do we have any of that stuff upstream yet, and if not, then that probably should be done _first_. From there we can see what is left for the anti-fragmentation patches... >> Note that this memory is not wasted, because it is used for user >> allocations. So I think the downsides of reservations are really >> overstated. >> > > This does sound like you think the first step here would be a zone based > reservation system. Would you support inclusion of the ZONE_MOVABLE part > of the patch set? Ah, that answers my questions. Yes, I don't see why not, if the various people who were interested in that feature are happy with it. Not that I looked at the most recent implementation (which patches are they?) >> > persistent. Minimally, things like page_to_pfn() are no longer a simply >> > calculation which is a bad enough hit. Worse, the kernel can no longer >> > backed by >> > huge pages because you would have to defragment at the base-page level. >> > The >> > kernel is backed by huge page entries at the moment for a good reason, >> > TLB reach is a real problem. >> >> Yet this is what you _have_ to do if you must use arbitrary physical >> memory. And I haven't seen any numbers posted. >> > > Numbers require an implementation and that is a non-trivial undertaking. > I've cc'd Dave Hansen who I believe tried breaking 1:1 phys:virtual mapping > some time in the past. He might have further comments to make. I'm sure it wouldn't be trivial :) TLB's are pretty good, though. Virtualised kernels don't seem to take a huge hit (I had some vague idea that a lot of their performance problems were with IO). >> > Continuing on, "true defragmentation" would require that the system be >> > halted so that the defragmentation can take place with everything >> > disabled >> > so that the copy can take place and every processes pagetables be >> > updated >> > as pagetables are not always shared. Even if shared, all processes >> > would >> > still have to be halted unless the kernel was fully pagable and we were >> > willing to handle page faults in kernel outside of just the vmalloc >> > area. >> >> vunmap doesn't need to run with the system halted, so I don't see why >> unmapping the source page would need to. >> > > vunmap() is freeing an address range where it knows it is the only accessor > of any data in that range. It's not the same when there are other processes > potentially memory in the same area at the same time expecting it to exist. 
I don't see what the distinction is. We obviously wouldn't have multiple processes with different kernel virtual addresses pointing to the same page. It would be managed almost exactly like vmalloc space is today, I'd imagine. >> I don't know why we'd need to handle a full page fault in the kernel if >> the critical part of the defrag code runs atomically and replaces the >> pte when it is done. >> > > And how exactly would one atomically copy a page of data, update the page > tables and flush the TLB without stalling all writers? The setup would > have > to mark the PTE for that area read-only and flush the TLB so that other > processes will fault on write and wait until the migration has completed > before retrying the fault. That would allow the data to be safely read and > copied to somewhere else. Why is there a requirement to prevent stalling of writers? > It would be at least feasible to back SLAB_RECLAIM_ACCOUNT slabs by a > virtual map for the purposes of defragmenting it like this. However, it > would work better in conjunction with fragmentation avoidance instead of > replacing it because the fragmentation avoidance mechanism could be easily > used to group virtually-backed allocations together in the same physical > blocks as much as possible to reduce future migration work. Yeah, maybe. But what I am getting at is that fragmentation avoidance isn't _the_ big ticket (as the name implies). Defragmentation is. With defragmentation in, I think that avoidance makes much more sense. Now I'm still hoping that neither is necessary... my thought process on this is to keep hoping that nothing comes up that _requires_ us to support higher order allocations in the kernel generally. As an aside, it might actually be nice to be able to reduce MAX_ORDER significantly after boot in order to reduce page allocator overhead... >> > This is before even considering the problem of how the kernel copies >> the >> > data between two virtual addresses while it's modifing the page tables >> > it's depending on to read the data. >> >> What's the problem: map the source page into a special area, unmap it >> from its normal address, allocate a new page, copy the data, swap the >> mapping. >> > > You'd have to do something like I described above to handle synchronous > writes to the area during defragmentation. Yeah, that's what the "unmap the source page" is (which would also block reads, and I think would be a better approach to try first, because it would reduce TLB flushing. Although moving and flushing could probably be batched, so mapping them readonly first might be a good optimisation after that). >> > Even more horribly, virtual addresses >> > in the kernel are no longer physically contiguous which will likely >> > cause >> > some problems for drivers and possibly DMA engines. >> >> Of course it is trivial to _get_ physically contiguous, virtually >> contiguous pages, because now you actually have a mechanism to do so. >> > > I think that would require that the kernel portion have a split between the > vmap() like area and a 1:1 virt:phys area - i.e. similar to today except > the > vmalloc() region is bigger. It is difficult to predict what the impact of a > much expanded use of the vmalloc area would be. Yeah that would probably be reasonable. So huge tlbs could still be used for various large boot time structures. Predicting the impact of it? Could we look at how something like KVM performs when using 4K pages for its memory map? >> It isn't performance of your patches I'm so worried about. 
It is that >> they only slow down the rate of fragmentation, so why do we want to add >> them and why can't we use something more robust? >> > > Because as I've maintained for quite some time, I see the patches as > a pre-requisite for a more complete and robust solution for dealing with > external fragmentation. I see the merits of what you are suggesting but > feel > it can be built up incrementally starting with the fragmentation avoidance > stuff, then compacting MOVABLE pages towards the end of the zone before > finally dealing with full defragmentation. But I am reluctant to built > large bodies of work on top of a foundation with an uncertain future. The first thing we need to decide is if there is a big need to support higher order allocations generally in the kernel. I'm still a "no" with that one :) If and when we decide "yes", I don't see how anti-fragmentation does much good for that -- all the new wonderful higher order allocations we add in will need fallbacks, and things can slowly degrade over time which I'm sorry but that really sucks. I think that to decide yes, we have to realise that requires real defragmentation. At that point, OK, I'm not going to split hairs over whether you think anti-frag logically belongs first (I think it doesn't :)). >> hugepages are a good example of where you can use reservations. >> > > Except that it has to be sized at boot-time, can never grow and users find > it very inflexible in the real world where requirements change over time > and a reboot is required to effectively change these reservations. > >> You could even use reservations for higher order pagecache (rather than >> crapping the whole thing up with small-pages fallbacks everywhere). >> > > True, although that means that an administrator is then required to size > their buffer cache at boot time if they are using high order pagecache. I > doubt they'll like that any more than sizing a hugepage pool. > >> I don't think it is. Because the only reason to need more than a couple >> of physically contiguous pages is to work around hardware limitations or >> inefficiency. >> > > A low TLB reach with base page size is a real problem that some classes of > users have to deal with. Sometimes there just is no easy way around having > to deal with large amounts of data at the same time. To the 3 above: yes, I completely know we are not and never will be absolutely optimal for everyone. And the end-game for Linux, if there is one, I don't think is to be in a state that is perfect for everyone either. I don't think any feature can be justified simply because "someone" wants it, even if those someones are people running benchmarks at big companies. >> No. My assertion is that we should speed things up in other ways, eg. > > > The principal reason I developed fragmentation avoidance was to relax > restrictions on the resizing of the huge page pool where it's not a > question > of poor performance, it's a question of simply not working. The large page > cache stuff arrived later as a potential additional benefiticary of lower > fragmentation as well as SLUB. So that's even worse than a purely for performance patch, because it can now work for a while and then randomly stop working eventually. >> > what the current stuff does. Not only do we have to deal with >> > overlapping >> > non-contiguous zones, >> >> We have to do that anyway, don't we? >> > > Where do we deal with overlapping non-contiguous zones within a node today? In the buddy allocator and physical memory models, I guess? 
http://marc.info/?l=linux-mm&m=114774325131397&w=2 Doesn't that imply overlapping non-contiguous zones? >> > but things like the page->flags identifying which >> > zone a page belongs to have to be moved out (not enough bits) >> >> Another 2 bits? I think on most architectures that should be OK, >> shouldn't it? >> > > page->flags is not exactly flush with space. The last I heard, there > were 3 bits free and there was work being done to remove some of them so > more could be used. No, you wouldn't be using that part of the flags, but the other part. AFAIK there is reasonable amount of room on 64-bit, and only on huge NUMA 32-bit (ie. dinosaurs) is it a squeeze... but it falls back to an out of line thingy anyway. >> > and you get >> > an explosion of zones like >> > > ZONE_DMA_UNMOVABLE >> > ZONE_DMA_RECLAIMABLE >> > ZONE_DMA_MOVABLE >> > ZONE_DMA32_UNMOVABLE >> >> So of course you don't make them visible to the API. Just select them >> based on your GFP_ movable flags. >> > > Just because they are invisible to the API does not mean they are invisible > to the size of pgdat->node_zones[] and the size of the zone fallback lists. > Christoph will eventually complain about the number of zones having doubled > or tripled. Well there is already a reasonable amount of duplication, eg pcp lists. And I think it is much better to put up with a couple of complaints from Christoph rather than introduce something entirely new if possible. Hey it might even give people an incentive to improve the existing schemes. >> > etc. >> >> What is etc? Are those the best reasons why this wasn't made to use >> zones? >> > > No, I simply thought those problems were bad enough without going into > additional ones - here's another one. If a block of pages has to move > between zones, page->flags has to be updated which means a lock to the page > has to be acquired to guard against concurrent use before moving the zone. If you're only moving free pages, then the page allocator lock should be fine. There may be a couple of other places that would need help (eg swsusp)... ... but anyway, I'll snip the rest because I didn't want to digress into implementation details so much (now I'm sorry for bringing it up). -- SUSE Labs, Novell Inc. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
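The page->flags exchange above (how many bits a zone-based scheme would
consume, and which part of the word they would come from) rests on the zone
index being encoded in a few high bits of page->flags. A minimal
illustration follows; the bit widths and layout are assumptions for the
example, not the kernel's actual layout, which also has to fit node and, in
some configurations, section numbers into the same word.

/*
 * Minimal illustration of zone information held in page->flags: the zone
 * index lives in a few high bits, so each extra zone (or sub-zone) costs
 * encoding space there.
 */
#include <stdio.h>

#define ZONES_SHIFT	2	/* 2 bits -> up to 4 zones per node */
#define FLAGS_BITS	(sizeof(unsigned long) * 8)
#define ZONES_PGSHIFT	(FLAGS_BITS - ZONES_SHIFT)
#define ZONES_MASK	((1UL << ZONES_SHIFT) - 1)

struct page {
	unsigned long flags;	/* low bits: status flags, high bits: zone id */
};

static void set_page_zone(struct page *page, unsigned long zone)
{
	page->flags &= ~(ZONES_MASK << ZONES_PGSHIFT);
	page->flags |= zone << ZONES_PGSHIFT;
}

static unsigned long page_zonenum(const struct page *page)
{
	return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
}

int main(void)
{
	struct page page = { .flags = 0x29 };	/* unrelated status bits */

	set_page_zone(&page, 3);
	printf("zone id %lu, other flags 0x%lx\n", page_zonenum(&page),
	       page.flags & ~(ZONES_MASK << ZONES_PGSHIFT));
	return 0;
}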
* Re: Antifrag patchset comments 2007-05-04 6:16 ` Nick Piggin @ 2007-05-04 6:55 ` Nick Piggin 2007-05-08 9:23 ` Mel Gorman 1 sibling, 0 replies; 16+ messages in thread From: Nick Piggin @ 2007-05-04 6:55 UTC (permalink / raw) Cc: Mel Gorman, Christoph Lameter, Linux Memory Management List, Andrew Morton Nick Piggin wrote: > ... but anyway, I'll snip the rest because I didn't want to digress into > implementation details so much (now I'm sorry for bringing it up). And in saying this, I'm not implying there _are_ implementation problems; I'm sure you're not just making up difficulties involved with a zone based approach ;) I just wanted to keep the discussion on the higher picture, so I shouldn't have brought up that implementation detail anyway with my initial post :P -- SUSE Labs, Novell Inc. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Antifrag patchset comments 2007-05-04 6:16 ` Nick Piggin 2007-05-04 6:55 ` Nick Piggin @ 2007-05-08 9:23 ` Mel Gorman 1 sibling, 0 replies; 16+ messages in thread From: Mel Gorman @ 2007-05-08 9:23 UTC (permalink / raw) To: Nick Piggin Cc: Christoph Lameter, Linux Memory Management List, Andrew Morton Sorry for the delayed response, I was offline for several days. On Fri, 4 May 2007, Nick Piggin wrote: > Mel Gorman wrote: >> On Wed, 2 May 2007, Nick Piggin wrote: >> >>> Mel Gorman wrote: > >>> > reservations have to be done at boot-time which is a difficult >>> > requirement >>> > to meet and impossible on batch job and shared systems where reboots do >>> > not take place. >>> >>> You just have to make a tradeoff about how much memory you want to set >>> aside. >> >> >> This tradeoff in sizing the reservation is something that users of shared >> systems have real problems with because a hugepool once sized can only be >> used for hugepage allocations. One compromise lead to the development of >> ZONE_MOVABLE where a portion of memory could be set aside that was usable >> for small pages but that the huge page pool could borrow from. > > What's wrong with that? Because it's still something that is configured at boot-time, remains static for the lifetime of the system, has consequences if the admin gets it wrong and the zone does not help stuff like e1000 using jumbo frames. Also, while it's not useless to memory hot-remove, removing 16MB sections on Power is desirable and having a zone was overkill for that purpose. > Do we have any of that stuff upstream yet, and > if not, then that probably should be done _first_. Ok, that can be done. The patch sets complement each other but only actually share one common patch and can be treated separetly. > From there we can see > what is left for the anti-fragmentation patches... > > >>> Note that this memory is not wasted, because it is used for user >>> allocations. So I think the downsides of reservations are really >>> overstated. >>> >> >> This does sound like you think the first step here would be a zone based >> reservation system. Would you support inclusion of the ZONE_MOVABLE part >> of the patch set? > > Ah, that answers my questions. Yes, I don't see why not, if the various > people who were interested in that feature are happy with it. Not that I > looked at the most recent implementation (which patches are they?) > In 2.6.21-mm1, the relevant patches are add-__gfp_movable-for-callers-to-flag-allocations-from-high-memory-that-may-be-migrated.patch create-the-zone_movable-zone.patch allow-huge-page-allocations-to-use-gfp_high_movable.patch handle-kernelcore=-generic.patch The last three patches are ZONE_MOVABLE specific as nicely noted in the series file. The first patch is shared between grouping pages by mobility and ZONE_MOVABLE. A TODO item for this set of patches is to rename GFP_HIGH_MOVABLE to GFP_HIGHUSER_MOVABLE and flag page cache allocations specifically instead of marking them movable which is confusing. A second item is to look at sizing the zone at runtime. >>> > persistent. Minimally, things like page_to_pfn() are no longer a simply >>> > calculation which is a bad enough hit. Worse, the kernel can no longer >>> > backed by >>> > huge pages because you would have to defragment at the base-page level. >>> > The >>> > kernel is backed by huge page entries at the moment for a good reason, >>> > TLB reach is a real problem. >>> >>> Yet this is what you _have_ to do if you must use arbitrary physical >>> memory. 
>>> > persistent. Minimally, things like page_to_pfn() are no longer a simple
>>> > calculation which is a bad enough hit. Worse, the kernel can no longer
>>> > be backed by huge pages because you would have to defragment at the
>>> > base-page level. The kernel is backed by huge page entries at the
>>> > moment for a good reason, TLB reach is a real problem.
>>>
>>> Yet this is what you _have_ to do if you must use arbitrary physical
>>> memory. And I haven't seen any numbers posted.
>>
>> Numbers require an implementation and that is a non-trivial undertaking.
>> I've cc'd Dave Hansen who I believe tried breaking the 1:1 phys:virtual
>> mapping some time in the past. He might have further comments to make.
>
> I'm sure it wouldn't be trivial :)
>
> TLBs are pretty good, though.

Not everywhere, and there are still userspace workloads that see 10-40%
improvements when using hugepages, so breaking it in userspace should not be
done lightly.

> Virtualised kernels don't seem to take a
> huge hit (I had some vague idea that a lot of their performance problems
> were with IO).

I think it would be very hard to see where it was losing on TLB anyway
because it's so different. Tomorrow, I'll put a patch in place that prevents
the kernel portion of the address space being backed by huge pages on x86_64
and run a few tests to see what it looks like.

>>> > Continuing on, "true defragmentation" would require that the system be
>>> > halted so that the defragmentation can take place with everything
>>> > disabled so that the copy can take place and every process's pagetables
>>> > be updated as pagetables are not always shared. Even if shared, all
>>> > processes would still have to be halted unless the kernel was fully
>>> > pageable and we were willing to handle page faults in kernel outside of
>>> > just the vmalloc area.
>>>
>>> vunmap doesn't need to run with the system halted, so I don't see why
>>> unmapping the source page would need to.
>>
>> vunmap() is freeing an address range where it knows it is the only
>> accessor of any data in that range. It's not the same when there are
>> other processes potentially accessing memory in the same area at the same
>> time and expecting it to exist.
>
> I don't see what the distinction is. We obviously wouldn't have multiple
> processes with different kernel virtual addresses pointing to the same
> page.

No, but let's say the page of interest was holding slab objects and we
wanted to migrate them. There could be multiple readers of the objects with
no real way of locking the page against accesses.

> It would be managed almost exactly like vmalloc space is today, I'd
> imagine.

I think it'll be considerably more complex than that but, like everything
else, it's not impossible.

>>> I don't know why we'd need to handle a full page fault in the kernel if
>>> the critical part of the defrag code runs atomically and replaces the
>>> pte when it is done.
>>
>> And how exactly would one atomically copy a page of data, update the page
>> tables and flush the TLB without stalling all writers? The setup would
>> have to mark the PTE for that area read-only and flush the TLB so that
>> other processes will fault on write and wait until the migration has
>> completed before retrying the fault. That would allow the data to be
>> safely read and copied to somewhere else.
>
> Why is there a requirement to prevent stalling of writers?

On the contrary, this mechanism would require the stalling of writers to
work correctly, so the system is going to run slower while defragmentation
takes place, but that is hardly a surprise.
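To make the ordering concrete, a minimal sketch of that sequence with
hypothetical helpers; this is illustrative pseudo-code only, not the real
migration interfaces:

/*
 * Illustrative only: every helper below is hypothetical. The point is
 * the ordering: write-protect, flush, copy, repoint the mapping, then
 * let the stalled writers retry their faults against the new page.
 */
static int migrate_mapped_kernel_page(struct page *old, struct page *new)
{
	pte_t *pte = lookup_kernel_pte(old);	/* hypothetical */

	set_pte_readonly(pte);			/* hypothetical */
	flush_tlb_for_page(old);		/* writers now fault and wait */

	copy_page_contents(new, old);		/* safe: no writer can race */

	point_pte_at_page(pte, new);		/* hypothetical: swap the mapping */
	flush_tlb_for_page(new);

	wake_up_stalled_writers(old);		/* faults retry, see the new page */
	return 0;
}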
>> It would be at least feasible to back SLAB_RECLAIM_ACCOUNT slabs by a
>> virtual map for the purposes of defragmenting them like this. However, it
>> would work better in conjunction with fragmentation avoidance instead of
>> replacing it because the fragmentation avoidance mechanism could be easily
>> used to group virtually-backed allocations together in the same physical
>> blocks as much as possible to reduce future migration work.
>
> Yeah, maybe. But what I am getting at is that fragmentation avoidance
> isn't _the_ big ticket (as the name implies). Defragmentation is. With
> defragmentation in, I think that avoidance makes much more sense.

This is kind of splitting hairs because the end result remains the same -
both are likely required. I have a statistics patch put together that prints
out information like the following (the columns are free page counts per
order, as in /proc/buddyinfo, broken out by migrate type):

Free pages count per migrate type
Node 0, zone    DMA, type   Unmovable     0   0   0   0   0   0   0   0   0   0   0
Node 0, zone    DMA, type Reclaimable   131  17   0   0   0   0   0   0   0   0   0
Node 0, zone    DMA, type     Movable   202  39   8   0   0   0   0   0   0   0   0
Node 0, zone    DMA, type     Reserve    86   9   1   1   1   1   1   1   1   0   0
Node 0, zone Normal, type   Unmovable    59  12   3   0   0   0   0   0   0   0   0
Node 0, zone Normal, type Reclaimable   598   0   0   0   0   0   0   0   0   0   0
Node 0, zone Normal, type     Movable    90   3   0   0   0   0   0   0   0   0   0
Node 0, zone Normal, type     Reserve    10   6   6   5   2   1   1   1   0   1   0

Number of blocks type   Unmovable  Reclaimable  Movable  Reserve
Node 0, zone    DMA             0            1        2        1
Node 0, zone Normal             3           32       88        1

Number of mixed blocks  Unmovable  Reclaimable  Movable  Reserve
Node 0, zone    DMA             0            1        2        1
Node 0, zone Normal             3           32       41        1

The last piece of information needs PAGE_OWNER to be set but, when it is,
/proc/pageowner also contains information on the PFN, the type of page it
was and some flags like this:

Page allocated via order 0, mask 0x1200d2
PFN 86899 Block 84 type 2 Flags LAD
 [0xc014776e] generic_file_buffered_write+414
 [0xc0147ec0] __generic_file_aio_write_nolock+640
 [0xc01481d6] generic_file_aio_write+102
 [0xc01aa99d] ext3_file_write+45
 [0xc0167b0e] do_sync_write+206
 [0xc0168399] vfs_write+153
 [0xc0168acd] sys_write+61
 [0xc0102ac4] syscall_call+7

This information will help determine to what extent defragmentation is
required and at what times fragmentation avoidance gets into trouble. The
page owner part is only useful in -mm kernels, unfortunately.

> Now I'm still hoping that neither is necessary... my thought process
> on this is to keep hoping that nothing comes up that _requires_ us to
> support higher order allocations in the kernel generally.

I'd be surprised if a feature was introduced that *required* higher order
allocations to be generally available. Currently, things still depend on a
lot of reclaim taking place, so higher orders will not be quickly available
even if they are possible. My main interests are better hugepage support and
memory hot-remove. The large pagecache stuff and SLUB are really interesting
but I expect them not to require large page availability.

> As an aside, it might actually be nice to be able to reduce MAX_ORDER
> significantly after boot in order to reduce page allocator overhead...

As the vast majority of allocations go through the per-cpu allocator, there
may not be much in the way of savings to be made, but it can be checked out.

>>> > This is before even considering the problem of how the kernel copies
>>> > the data between two virtual addresses while it's modifying the page
>>> > tables it's depending on to read the data.
>>>
>>> What's the problem: map the source page into a special area, unmap it
>>> from its normal address, allocate a new page, copy the data, swap the
>>> mapping.
>>>
>>
>> You'd have to do something like I described above to handle synchronous
>> writes to the area during defragmentation.
>
> Yeah, that's what the "unmap the source page" is (which would also block
> reads, and I think would be a better approach to try first, because it
> would reduce TLB flushing. Although moving and flushing could probably
> be batched, so mapping them readonly first might be a good optimisation
> after that).

Ok, making sense.

>>> > Even more horribly, virtual addresses
>>> > in the kernel are no longer physically contiguous which will likely
>>> > cause some problems for drivers and possibly DMA engines.
>>>
>>> Of course it is trivial to _get_ physically contiguous, virtually
>>> contiguous pages, because now you actually have a mechanism to do so.
>>
>> I think that would require that the kernel portion have a split between
>> the vmap() like area and a 1:1 virt:phys area - i.e. similar to today
>> except the vmalloc() region is bigger. It is difficult to predict what
>> the impact of a much expanded use of the vmalloc area would be.
>
> Yeah that would probably be reasonable. So huge TLBs could still be used
> for various large boot time structures.

Yes.

> Predicting the impact of it? Could we look at how something like KVM
> performs when using 4K pages for its memory map?

I'm not sure I have a machine capable of KVM available at the moment.
However, just removing the hugetlb backing of the kernel address space
should be trivial and give the same data.

>>> It isn't performance of your patches I'm so worried about. It is that
>>> they only slow down the rate of fragmentation, so why do we want to add
>>> them and why can't we use something more robust?
>>
>> Because, as I've maintained for quite some time, I see the patches as
>> a pre-requisite for a more complete and robust solution for dealing with
>> external fragmentation. I see the merits of what you are suggesting but
>> feel it can be built up incrementally, starting with the fragmentation
>> avoidance stuff, then compacting MOVABLE pages towards the end of the
>> zone before finally dealing with full defragmentation. But I am reluctant
>> to build large bodies of work on top of a foundation with an uncertain
>> future.
>
> The first thing we need to decide is if there is a big need to support
> higher order allocations generally in the kernel. I'm still a "no" with
> that one :)

And I'll keep on about hugepages for userspace and memory hot-remove, but
we're not likely to finish this argument any time soon :)

> If and when we decide "yes", I don't see how anti-fragmentation does much
> good for that -- all the new wonderful higher order allocations we add in
> will need fallbacks, and things can slowly degrade over time which, I'm
> sorry, but that really sucks.

Ok. I will get onto the next stages of what is required.

> I think that to decide yes, we have to realise that requires real
> defragmentation. At that point, OK, I'm not going to split hairs over
> whether you think anti-frag logically belongs first (I think it
> doesn't :)).

And I think it does, but reckon both are needed. I'm happy enough to work on
defragmentation on top of fragmentation avoidance to see where it brings
things.

>>> hugepages are a good example of where you can use reservations.
>>>
>>
>> Except that it has to be sized at boot-time, can never grow and users
>> find it very inflexible in the real world where requirements change over
>> time and a reboot is required to effectively change these reservations.
>
>>> You could even use reservations for higher order pagecache (rather than
>>> crapping the whole thing up with small-pages fallbacks everywhere).
>>
>> True, although that means that an administrator is then required to size
>> their buffer cache at boot time if they are using high order pagecache. I
>> doubt they'll like that any more than sizing a hugepage pool.
>
>>> I don't think it is. Because the only reason to need more than a couple
>>> of physically contiguous pages is to work around hardware limitations or
>>> inefficiency.
>>
>> A low TLB reach with the base page size is a real problem that some
>> classes of users have to deal with. Sometimes there just is no easy way
>> around having to deal with large amounts of data at the same time.
>
> To the 3 above: yes, I completely know we are not and never will be
> absolutely optimal for everyone. And the end-game for Linux, if there
> is one, I don't think is to be in a state that is perfect for everyone
> either. I don't think any feature can be justified simply because
> "someone" wants it, even if those someones are people running benchmarks
> at big companies.
>
>>> No. My assertion is that we should speed things up in other ways, eg.
>>
>> The principal reason I developed fragmentation avoidance was to relax
>> restrictions on the resizing of the huge page pool, where it's not a
>> question of poor performance, it's a question of simply not working. The
>> large pagecache stuff arrived later as a potential additional beneficiary
>> of lower fragmentation, as did SLUB.
>
> So that's even worse than a purely-for-performance patch, because it
> can now work for a while and then randomly stop working eventually.
>
>>> > what the current stuff does. Not only do we have to deal with
>>> > overlapping non-contiguous zones,
>>>
>>> We have to do that anyway, don't we?
>>
>> Where do we deal with overlapping non-contiguous zones within a node
>> today?
>
> In the buddy allocator and physical memory models, I guess?
>
> http://marc.info/?l=linux-mm&m=114774325131397&w=2
>
> Doesn't that imply overlapping non-contiguous zones?

They are not overlapping in the same node. There is never a situation on a
node where pages belonging to zones A and B look like AAABBBAAA, as would be
the case if ZONE_MOVABLE consisted of arbitrary pages from the highest
available zone, for example.

>>> > but things like the page->flags identifying which
>>> > zone a page belongs to have to be moved out (not enough bits)
>>>
>>> Another 2 bits? I think on most architectures that should be OK,
>>> shouldn't it?
>>
>> page->flags is not exactly flush with space. The last I heard, there
>> were 3 bits free and there was work being done to remove some of them so
>> more could be used.
>
> No, you wouldn't be using that part of the flags, but the other
> part. AFAIK there is a reasonable amount of room on 64-bit, and only
> on huge NUMA 32-bit (ie. dinosaurs) is it a squeeze... but it falls
> back to an out of line thingy anyway.

Using bits where available and moving to the out-of-line bitmap where they
are not available is a possibility. Keeping them out-of-line though would
allow a lazy moving between zones. You say later you don't want to get into
implementation details so I won't either.
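For anyone following along, the part of page->flags in question is the
zone/node/section fields packed above the individual PG_* flag bits.
Roughly, and simplified (the exact layout is config-dependent):

/*
 * page->flags, very roughly:
 *
 *   | SECTION | NODE | ZONE |  ...unused...  | PG_* flag bits |
 *
 * so the zone lookup is something like:
 */
static inline unsigned long page_zonenum(struct page *page)
{
	return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
}

Extra zones would widen the ZONE field at the top rather than consume the
PG_* bits at the bottom.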
>>> > and you get
>>> > an explosion of zones like
>>> >
>>> > ZONE_DMA_UNMOVABLE
>>> > ZONE_DMA_RECLAIMABLE
>>> > ZONE_DMA_MOVABLE
>>> > ZONE_DMA32_UNMOVABLE
>>>
>>> So of course you don't make them visible to the API. Just select them
>>> based on your GFP_ movable flags.
>>
>> Just because they are invisible to the API does not mean they are
>> invisible to the size of pgdat->node_zones[] and the size of the zone
>> fallback lists. Christoph will eventually complain about the number of
>> zones having doubled or tripled.
>
> Well there is already a reasonable amount of duplication, eg pcp lists.

And each new zone will increase that duplication quite considerably.

> And I think it is much better to put up with a couple of complaints from
> Christoph rather than introduce something entirely new if possible. Hey,
> it might even give people an incentive to improve the existing schemes.
>
>>> > etc.
>>>
>>> What is etc? Are those the best reasons why this wasn't made to use
>>> zones?
>>
>> No, I simply thought those problems were bad enough without going into
>> additional ones - here's another one. If a block of pages has to move
>> between zones, page->flags has to be updated, which means a lock on the
>> page has to be acquired to guard against concurrent use before moving the
>> zone.
>
> If you're only moving free pages, then the page allocator lock should be
> fine.

Not for the pcp lists, but they could be drained.

> There may be a couple of other places that would need help (eg
> swsusp)...
>
> ... but anyway, I'll snip the rest because I didn't want to digress into
> implementation details so much (now I'm sorry for bringing it up).

ok

--
Mel Gorman
Part-time Phd Student
Linux Technology Center
University of Limerick
IBM Dublin Software Lab
Thread overview: 16+ messages
2007-04-28  3:46 Antifrag patchset comments Christoph Lameter
2007-04-28 13:21 ` Mel Gorman
2007-04-28 21:44 ` Christoph Lameter
2007-04-30  9:37 ` Mel Gorman
2007-04-30 12:35 ` Peter Zijlstra
2007-04-30 17:30 ` Christoph Lameter
2007-04-30 18:33 ` Mel Gorman
2007-05-01 13:31 ` Hugh Dickins
2007-05-01 11:26 ` Nick Piggin
2007-05-01 12:22 ` Nick Piggin
2007-05-01 16:38 ` Mel Gorman
2007-05-02  2:43 ` Nick Piggin
2007-05-02 12:41 ` Mel Gorman
2007-05-04  6:16 ` Nick Piggin
2007-05-04  6:55 ` Nick Piggin
2007-05-08  9:23 ` Mel Gorman