* Antifrag patchset comments
@ 2007-04-28 3:46 Christoph Lameter
From: Christoph Lameter @ 2007-04-28 3:46 UTC (permalink / raw)
To: mel; +Cc: linux-mm
I just had a look at the patches in mm....
Ok, so we have the unmovable allocations and then 3 special types:
RECLAIMABLE
Memory can be reclaimed? Ahh this is used for buffer heads
and the like. Allocations that can be reclaimed by some
sort of system action that cannot be directly targeted
at an object?
It seems that you also included temporary allocs here?
MOVABLE
Memory can be moved by going to the page and reclaiming it?
So right now this is only a higher form of RECLAIMABLE.
We currently do not move memory.... so why have it?
MIGRATE_RESERVE
Some atomic reserve to preserve contiguity of allocations?
Or just a fallback if other pools are all used? What is this?
So we have 4 categories. Any additional category causes more overhead on
the pcp lists since we will have to find the correct type on the lists.
Why do we have MIGRATE_RESERVE?
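For reference, my reading is that the GFP mobility bits of an allocation
are translated into one of these free-list types, and that MIGRATE_RESERVE
is never selected directly, only used as a last-resort fallback. A minimal
compilable userspace sketch of that mapping (the flag values and the helper
name are placeholders of mine, not lifted from the patches):

#include <stdio.h>

/* Assumed GFP mobility bits -- placeholders, not the real gfp.h values */
#define __GFP_RECLAIMABLE 0x1u
#define __GFP_MOVABLE     0x2u

/* One free list per type on the pcp/buddy lists */
enum migratetype {
        MIGRATE_UNMOVABLE,      /* kernel allocations that cannot move    */
        MIGRATE_RECLAIMABLE,    /* reclaimable via shrinkers, short-lived */
        MIGRATE_MOVABLE,        /* user pages: reclaim or (later) migrate */
        MIGRATE_RESERVE,        /* min_free_kbytes kept contiguous        */
        MIGRATE_TYPES
};

/* Pick the free list an allocation is served from and freed back to */
static enum migratetype gfpflags_to_migratetype(unsigned int gfp_flags)
{
        if (gfp_flags & __GFP_MOVABLE)
                return MIGRATE_MOVABLE;
        if (gfp_flags & __GFP_RECLAIMABLE)
                return MIGRATE_RECLAIMABLE;
        return MIGRATE_UNMOVABLE;
}

int main(void)
{
        printf("no hint           -> type %d\n", gfpflags_to_migratetype(0));
        printf("__GFP_RECLAIMABLE -> type %d\n",
               gfpflags_to_migratetype(__GFP_RECLAIMABLE));
        printf("__GFP_MOVABLE     -> type %d\n",
               gfpflags_to_migratetype(__GFP_MOVABLE));
        return 0;
}

If that reading is right, every extra type multiplies the per-cpu and buddy
free lists, which is where the pcp overhead mentioned above comes from.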
Then we have ZONE_MOVABLE, whose purpose is to guarantee that a large
portion of memory is always reclaimable and movable. It is carved off
the highest available allocation zone. Very similar to memory policies,
and with the same problems: some nodes do not have the highest zone (many
x86_64 NUMA machines are in that strange situation). Memory policies do
not work quite right there and it seems that the antifrag methods will be
switched off for such a node. Trouble ahead. Why do we need it? To crash
when the kernel does too many unmovable allocs?
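To make the per-node concern concrete, here is roughly how I understand the
kernelcore= carving to work, as a simplified userspace model (the even
per-node split and all the numbers are assumptions of mine; the real code
presumably has to cope with holes and uneven nodes):

#include <stdio.h>

#define NR_NODES 4

int main(void)
{
        /* Pages in the highest zone of each node; node 3 lacks the
         * highest zone entirely, the x86_64 NUMA case above. */
        unsigned long highest_zone[NR_NODES] = { 262144, 262144, 262144, 0 };
        unsigned long kernelcore_pages = 524288;   /* kernelcore= request */
        unsigned long per_node = kernelcore_pages / NR_NODES;

        for (int nid = 0; nid < NR_NODES; nid++) {
                unsigned long kern = per_node < highest_zone[nid]
                                        ? per_node : highest_zone[nid];
                unsigned long movable = highest_zone[nid] - kern;

                printf("node %d: %lu pages stay kernel, %lu become ZONE_MOVABLE\n",
                       nid, kern, movable);
        }
        return 0;
}

A node without the highest zone (node 3 here) simply ends up with no
ZONE_MOVABLE at all, which is the situation I am worried about above.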
Other things:
1. alloc_zeroed_user_highpage is no longer used
It's noted in the patches but it was not removed nor marked
as deprecated.
2. submit_bh allocates bios using __GFP_MOVABLE
How can a bio be moved? Or does that indicate that the
bio can be reclaimed?
3. Highmem pages for user space are marked __GFP_MOVABLE
Looks okay to me. So I guess that __GFP_MOVABLE
implies GFP_RECLAIMABLE? Hmmm... It seems that
mlocked pages are therefore also movable and reclaimable
(not true!). So we still have that problem spot?
4. The default inode alloc mode is set to GFP_HIGH_MOVABLE....
Good.
5. Hugepages are set to movable in some cases.
That is because they are large order allocs and do not
cause fragmentation if all other allocs are smaller. But that
assumption may turn out to be problematic. Huge page allocs marked
as movable may make higher-order allocations problematic if
MAX_ORDER becomes much larger than the huge page order. In
particular on IA64 the huge page order is dynamically settable
on bootup. They can be quite small and thus cause fragmentation
in the movable blocks.
I think it may be possible to make huge pages supported by
page migration in some way, which may justify putting them into
the movable section for all cases. But right now this seems to
be more an x86_64/i386'ism.
6. First, in bdget() we set the mapping for a block device up using
GFP_MOVABLE. However, then in grow_dev_page() for an actual
allocation we will use __GFP_RECLAIMABLE for the block device.
We should use one type, I would think, and it's GFP_MOVABLE as
far as I can tell.
7. dentry allocation uses GFP_KERNEL|__GFP_RECLAIMABLE.
Why not set this by default in the slab allocators if
kmem_cache_create sets up a slab with SLAB_RECLAIM_ACCOUNT?
8. Same occurs for inodes. The reclaim flag should not be specified
for individual allocations since reclaim is a slab-wide
activity. It also has no effect if the object is taken off
a queue.
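In other words, the mobility hint could be derived once at cache-creation
time from SLAB_RECLAIM_ACCOUNT and applied to every backing-page allocation,
instead of each caller passing __GFP_RECLAIMABLE by hand. A rough userspace
sketch of the idea (the structure layout and flag values are invented for
illustration, not the actual slab code):

#include <stdio.h>

#define SLAB_RECLAIM_ACCOUNT 0x00020000u   /* assumed value */
#define __GFP_RECLAIMABLE    0x00080000u   /* assumed value */

struct kmem_cache {
        const char  *name;
        unsigned int slab_flags;   /* SLAB_* flags from cache creation    */
        unsigned int allocflags;   /* GFP bits ORed into every page alloc */
};

/* Derive the mobility type once, when the cache is created */
static void cache_init(struct kmem_cache *s, const char *name,
                       unsigned int slab_flags)
{
        s->name = name;
        s->slab_flags = slab_flags;
        s->allocflags = 0;
        if (slab_flags & SLAB_RECLAIM_ACCOUNT)
                s->allocflags |= __GFP_RECLAIMABLE;
}

/* ...and apply it on every backing-page allocation automatically */
static unsigned int cache_page_gfp(struct kmem_cache *s, unsigned int gfp)
{
        return gfp | s->allocflags;
}

int main(void)
{
        struct kmem_cache dentry_cache;

        cache_init(&dentry_cache, "dentry", SLAB_RECLAIM_ACCOUNT);
        printf("dentry page gfp adds RECLAIMABLE: %s\n",
               cache_page_gfp(&dentry_cache, 0) & __GFP_RECLAIMABLE ?
               "yes" : "no");
        return 0;
}

That would also make points 7 and 8 a non-issue since individual
kmem_cache_alloc() callers would not need to know about mobility at all.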
9. proc_loginuid_write(), do_proc_readlink(), proc_pid_attr_write(), etc.
Why are these allocations reclaimable? Shouldn't these be plain GFP_KERNEL allocs?
These are temporary allocs. What is the benefit of
__GFP_RECLAIMABLE?
10. Radix tree as reclaimable? radix_tree_node_alloc()
Ummm... It's reclaimable in a sense if all the pages are removed
but I'd say not in general.
11. shmem_alloc_page() shmem pages are only __GFP_RECLAIMABLE? They can be
swapped out and moved by page migration, so GFP_MOVABLE?
12. skb slab allocs are marked __GFP_RECLAIMABLE.
Ok, the queues are temporary. __GFP_RECLAIMABLE means a temporary
alloc that will go away? This is a slab that is not using
SLAB_RECLAIM_ACCOUNT. Do we need a SLAB_RECLAIMABLE flag?
13. In the patches it was mentioned that it is no longer necessary
to set min_free_kbytes? What is the current state of that?
14. I am a bit concerned about an increase in the alloc types. There are
two whose purpose I am not sure of: MIGRATE_RESERVE and
MIGRATE_HIGHATOMIC. HIGHATOMIC seems to have been removed again.
15. Tuning for particular workloads.
Another concern: are there patches here that indicate that new alloc types
were created to accommodate certain workloads? The exceptions worry me.
16. Both memory policies and antifrag seem to
determine the highest zone. Memory policies call this the policy
zone. Could you consolidate that code?
17. MAX_ORDER issues. At least on IA64 the antifrag measures will
require a reduction in max order. However, we currently have MAX_ORDER
blocks of 1GB because there are applications using huge pages of 1 gigabyte
(TLB pressure issues on IA64). OTOH, machines exist that only have 1GB of
RAM per node, so it may be difficult to create multiple MAX_ORDER blocks
as needed.
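To put rough numbers on this: if mobility grouping is done per MAX_ORDER
block, a 1GB node has exactly one block to group, while a smaller grouping
unit leaves the allocator something to work with. A quick back-of-the-envelope
sketch (the 16K base page, the block orders and the two bits of per-block
metadata are all assumptions for illustration):

#include <stdio.h>

int main(void)
{
        unsigned long page_size  = 16384;        /* assumed 16K base page */
        unsigned long node_bytes = 1UL << 30;    /* 1GB node              */
        unsigned long node_pages = node_bytes / page_size;

        /* Compare grouping per 1GB block with a smaller block size */
        int orders[] = { 16, 10 };               /* assumed block orders  */

        for (int i = 0; i < 2; i++) {
                unsigned long block_pages = 1UL << orders[i];
                unsigned long blocks = node_pages / block_pages;
                unsigned long bits = blocks * 2; /* assumed 2 bits/block  */

                printf("order %2d block (%lu MB): %lu blocks/node, %lu bits of flags\n",
                       orders[i], block_pages * page_size >> 20, blocks, bits);
        }
        return 0;
}

Grouping on a unit smaller than MAX_ORDER would of course need the mobility
metadata to be tracked at that smaller granularity too.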
I have not gotten my head around how the code in page_alloc.c actually
works. This is just from reviewing comments.
^ permalink raw reply [flat|nested] 16+ messages in thread* Re: Antifrag patchset comments 2007-04-28 3:46 Antifrag patchset comments Christoph Lameter @ 2007-04-28 13:21 ` Mel Gorman 2007-04-28 21:44 ` Christoph Lameter 0 siblings, 1 reply; 16+ messages in thread From: Mel Gorman @ 2007-04-28 13:21 UTC (permalink / raw) To: Christoph Lameter; +Cc: Linux Memory Management List On Fri, 27 Apr 2007, Christoph Lameter wrote: > I just had a look at the patches in mm.... > > Ok so we have the unmovable allocations and then 3 special types > > RECLAIMABLE > Memory can be reclaimed? Ahh this is used for buffer heads > and the like. Allocations that can be reclaimed by some > sort of system action that cannot be directly targeted > at an object? > Exactly. Inode caches currently fall into the same category. When shrink_slab() is called the amount of memory in RECLAIMABLE areas will be reduced. > It seems that you also included temporary allocs here? > Temporary and short-lived allocations are also treated as reclaimable to stop more areas than necessary being marked UNMOVABLE. The fewer UNMOVABLE blocks there are, the better. > MOVABLE > Memory can be moved by going to the page and reclaiming it? > Or potentially with page migration although that code does not exist. MOVABLE memory means just that - it can be moved while the data is still preserved. Moving it to swap is still moving. > So right now this is only a higher form of RECLAIMABLE. > The names used to be RCLM_NORCLM, RCLM_EASY and RCLM_KERN which confused more people, hence the current naming. > We currently do not move memory.... so why have it? > Because I wanted to build memory compaction on top of this when movable memory is not just memory that can go to swap but includes mlocked pages as well > MIGRATE_RESERVE > > Some atomic reserve to preserve contiguity of allocations? > Or just a fallback if other pools are all used? What is this? > The standard allocator keeps high-order pages free until memory pressure forces them to be split. In practice, this means that pages for min_free_kbytes are kept as contiguous pages for quite a long time but once split never become contiguous again. This lets short-lived high-order atomic allocations to work for quite a while which is why setting min_free_kbytes to 16384 seems to let jumbo frames work for a long time. Grouping by mobility is more concerned with the type of page so it breaks up the min_free_kbytes pages early removing a desirable property of the standard allocator for high-order atomic allocations. MIGRATE_RESERVE brings that desirable property back. The number of blocks marked MIGRATE_RESERVE depends on min_free_kbytes and the area is only used when the alternative is to fail the allocation. The effect is that pages kept free for min_free_kbytes tend to exist in these MIGRATE_RESERVE areas as contiguous areas. This is an improvement over what the standard allocator does because it makes no effort to keep the minimum number of free pages contiguous. > So have 4 categories. Any additional category causes more overhead on > the pcp lists since we will have to find the correct type on the lists. > Why do we have MIGRATE_RESERVE? > It resolved a problem with order-1 atomic allocations used by a network adapter when it was using bittorrent heavily. They affected user hasn't complained since. > Then we have ZONE_MOVABLE whose purpose is to guarantee that a large > portion of memory is always reclaimable and movable. Which is pawned off > the highest available allocation zone. Right. 
This is a separate issue to grouping pages by mobility. The memory partition does not require grouping pages by mobility to be available and vice-versa. All they share is the marking of allocations __GFP_MOVABLE. > Very similar to memory policies > same problems. Some nodes do not have the highest zone (many x86_64 > NUMA are in that strange situation). yep. Dealing with only the highest zone made the code manageable, particularly where HIGHMEM was involved although the issue between NORMAL and DMA32 isn't much better. > Memory policies do not work quite > right there and it seems that the antifrag methods will be switched off > for such a node. Not quite. If the zone doesn't exist in a node, it will not be in the zonelists and things plod along as normal. Grouping pages by mobility works independent of memory partitioning so it'll still work in these nodes whether the zone is there is not. > Trouble ahead. Why do we need it? To crash when the > kernel does too many unmovable allocs? > It's needed for a few reasons but the two main ones are; a) grouping pages by mobility does not give guaranteed bounds on how much contiguous memory will be movable. While it could, it would be very complex and would replicate the behavior of zones to the extent I'll get a slap in the head for even trying. Partitioning memory gives hard guarantees on memory availability b) Early feedback was that grouping pages by mobility should be done only with zones but that is very restrictive. Different people liked each approach for different reasons so it constantly went in circles. That is why both can sit side-by-side now The zone is also of interest to the memory hot-remove people. Granted, if kernelcore= is given too small a value, it'll cause problems. > > Other things: > > > 1. alloc_zeroed_user_highpage is no longer used > > Its noted in the patches but it was not removed nor marked > as depreciated. > Indeed. Rather than marking it deprecated I was going to wait until it was unused for one cycle and then mark it deprecated and see who complains. > 2. submit_bh allocates bios using __GFP_MOVABLE > > How can a bio be moved? Or does that indicate that the > bio can be reclaimed? > I consider the pages allocated for the buffer to be movable because the buffers can be cleaned and discarded by standard reclaim. When/if page migration is used, this will have to be revisisted but for the moment I believe it's correct. If the RECLAIMABLE areas could be properly targeted, it would make sense to mark these pages RECLAIMABLE instead but that is not the situation today. > 3. Highmem pages for user space are marked __GFP_MOVABLE > > Looks okay to me. So I guess that __GFP_MOVABLE > implies GFP_RECLAIMABLE? Hmmm... It seems that > mlocked pages are therefore also movable and reclaimable > (not true!). So we still have that problem spot? > No, at worst we have a naming ambiguity which has come up before. RECLAIMABLE refers to allocations that are reclaimable via shrink_slab() or short-lived. MOVABLE pages are reclaimable by pageout or movable with page migration. > 4. Default inode alloc mod is set to GFP_HIGH_MOVABLE.... > > Good. > > 5. Hugepages are set to movable in some cases. > Specifically, they are considered movable when they are allowed to be allocated from ZONE_MOVABLE. So for it to really cause fragmentation, there has to be high-order movable allocations in play using ZONE_MOVABLE. This is currently never the case but the large blocksize stuff may change that. 
> That is because they are large order allocs and do not > cause fragmentation if all other allocs are smaller. But that > assumption may turn out to be problematic. Huge pages allocs > as movable may make higher order allocation problematic if > MAX_ORDER becomes much larger than the huge page order. In > particular on IA64 the huge page order is dynamically settable > on bootup. They can be quite small and thus cause fragmentation > in the movable blocks. > You're right here. I have always considered huge page allocations to be the highest order anything in the system will ever care about. I was not aware of any situation except at boot-time where that is different. What sort of situation do you forsee where the huge page size is not the largest high-order allocation used by the system? Even the large blocksize stuff doesn't seem to apply here. > I think it may be possible to make huge pages supported by > page migration in some way which may justify putting it into > the movable section for all cases. That was the long-term aim. I figured there was no reason that hugepages could not be moved just that it was unnecessary to date. > But right now this seems to be more an x86_64/i386'ism. > Depends on whether IA64 really has situations where allocations of a higher-order than hugepage size are common. > 6. First in bdget() we set the mapping for a block device up using > GFP_MOVABLE. However, then in grow_dev_page for an actual > allocation we will use__GFP_RECLAIMABLE for the block device. > We should use one type I would think and its GFP_MOVABLE as > far as I can tell. > I'll revisit this one. I think it should be __GFP_RECLAIMABLE in both cases because I have a vague memory that pages due to grow_dev_page caused problems fragmentation wise because they could not be reclaimed. That might simply have been an unrelated bug at the time. I've put this on the TODO to investigate further. > 7. dentry allocation uses GFP_KERNEL|__GFP_RECLAIMABLE. > Why not set this by default in the slab allocators if > kmem_cache_create sets up a slab with SLAB_RECLAIM_ACCOUNT? > Because .... errr..... it didn't occur to me. /me adds an item to the TODO list This will simplify one of the patches. Are all slabs with SLAB_RECLAIM_ACCOUNT guaranteed to have a shrinker available either directly or indirectly? > 8. Same occurs for inodes. The reclaim flag should not be specified > for individual allocations since reclaim is a slab wide > activity. It also has no effect if the objects is taken off > a queue. > If SLAB_RECLAIM_ACCOUNT always uses __GFP_RECLAIMABLE, this will be caught too, right? > 9. proc_loginuid_write(), do_proc_readlink(), proc_pid_att_write() etc. > > Why are these allocation reclaimable? Should be GFP_KERNEL alloc there? > > These are temporary allocs. What is the benefit of > __GFP_RECLAIMABLE? > Because they are temporary. I didn't want large bursts of proc activity to cause MAX_ORDER_NR_PAGES blocks to be marked unmovable. > > 10. Radix tree as reclaimable? radix_tree_node_alloc() > > Ummm... Its reclaimable in a sense if all the pages are removed > but I'd say not in general. > I considered them to be indirectly reclaimable. Maybe it wasn't the best choice. > 11. shmem_alloc_page() shmem pages are only __GFP_RECLAIMABLE? They can be > swapped out and moved by page migration, so GFP_MOVABLE? > Because they might be ramfs pages which are not movable - http://lkml.org/lkml/2006/11/24/150 > 12. skbs slab allocs marked GFP_RECLAIMABLE. > > Ok the queues are temporary. 
GFP_RECLAIMABLE means temporary > alloc that will go away? This is a slab that is not using > SLAB_ACCOUNT_RECLAIMABLE. Do we need a SLAB_RECLAIMABLE flag? > I'll add it to the TODO to see what it looks like. > 13. In the patches it was mentioned that it is no longer necessary > to set min_free_kbytes? What is the current state of that? > I ran some tests yesterday. If min_free_kbytes is left untouched, the number of hugepages that can be allocated at the end of the test is very variable: +/- 5% of physical memory on x86_64. When it's set to 4*MAX_ORDER_NR_PAGES, it's +/- 1% generally. I think it's safe to leave the min_free_kbytes as-is for the moment and see what happens. If issues are encountered, I'll be asking that min_free_kbytes be increased on that machine to see if it really makes a difference in practice or not. >From the results I have on x86_64 with 1GB of RAM, grouping page by mobility was able to allocate 69% of memory as 2MB hugepages under heavy load. The standard allocator got 2%. At rest at the end of the test when nothing is running, 72% was available as huge pages when grouping pages by mobility in comparison to 30%. On PPC64 with 4GB of RAM when grouping pages by mobility, 11% was available under load and 57% of memory was available as 16MB huge pages at the end of the test in comparison to 0% with the vanilla allocator under load and 8% at rest. With 1GB of RAM, grouping pages by mobility got 35% of memory as huge pages at the end of the test and the vanilla allocator got 0%. I hope to improve this figure more over time. > 14. I am a bit concerned about an increase in the alloc types. There are > two that I am not sure what their purpose is which is > MIGRATION_RESERVE and MIGRATION_HIGHATOMIC. HIGHATOMIC seems to have > been removed again. > HIGHATOMIC has gone out the door for the moment as MIGRATE_RESERVE does the job of having some contiguous blocks available for high-order atomic allocations better. > 15. Tuning for particular workloads. > > Another concern is are patches here that indicate that new alloc types > were created to accomodate certain workloads? The exceptions worry me. > They are not intentionally aimed at certain workloads. The current tests are known to be very hostile for external fragmentation (e.g. 0% success on PPC64 at the end of tests with the standard allocator). Yesterday in preparation for testing large blocksize patches, I added ltp, dbench and fsxlinux into the tool normally used for testing grouping pages by mobility so the workloads will vary more in the future. I hope to get information on other workloads as the patches get more exposure. > 16. Both memory policies and antifrag seem to > determine the highest zone. Memory policies call this the policy > zone. Could you consolidate that code? > Maybe but probably not - I'll look into it. The problem is that at the time kernelcore= is handled the zones are not initialised yet (again, this is indpendent of grouping pages by mobility) bind_zonelist() appears uses z->present_pages for example which isn't even set at the time ZONE_MOVABLE is setup. > 17. MAX_ORDER issues. At least on IA64 the antifrag measures will > require a reduction in max order. However, we currently have MAX_ORDER > of 1G because there are applications using huge pages of 1 Gigabyte size > (TLB pressure issues on IA64). Ok, this explains why MAX_ORDER is so much larger than what appeared to be the huge page size. 
> OTOH, Machines exist that only have 1GB > RAM per node, so it may be difficult to create multiple MAX_ORDER blocks > as needed. > *ponders* This is the trickest feedback from your review so far. However, mobility types are grouping based on MAX_ORDER_NR_PAGES simply because it was the easiest to implement and made sense at the time. Right now, __rmqueue_smallest() searches up to MAX_ORDER-1 and 2 bits are stored per MAX_ORDER_NR_PAGES tracking the mobility of the group. There is nothing to say that it searches up to some other arbitrary order. The pageblock flags would then need 2 bits per ARBITRARY_ORDER_NR_PAGES instead of MAX_ORDER_NR_PAGES. I'll look into how it can be implemented. I have an IA64 box with just 1GB of RAM here that I can use to test the concept. > I have not gotten my head around how the code in page_alloc.c actually > works. This is just from reviewing comments. > Thanks a lot for looking through them. My TODO list so far from this is 1. Check that bdget() is really doing the right thing with respect to __GFP_RECLAIMABLE 2. Use SLAB_ACCOUNT_RECLAIMBLE to set __GFP_RECLAIMABLE instead of setting flags individually 3. Consider adding a SLAB_RECLAIMABLE where sockets make short-lived allocations 4. Group based on blocks smaller than MAX_ORDER_NR_PAGES -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Antifrag patchset comments 2007-04-28 13:21 ` Mel Gorman @ 2007-04-28 21:44 ` Christoph Lameter 2007-04-30 9:37 ` Mel Gorman 2007-05-01 11:26 ` Nick Piggin 0 siblings, 2 replies; 16+ messages in thread From: Christoph Lameter @ 2007-04-28 21:44 UTC (permalink / raw) To: Mel Gorman, Nick Piggin; +Cc: Linux Memory Management List On Sat, 28 Apr 2007, Mel Gorman wrote: > Because I wanted to build memory compaction on top of this when movable memory > is not just memory that can go to swap but includes mlocked pages as well Ahh. Ok. > > MIGRATE_RESERVE > The standard allocator keeps high-order pages free until memory pressure > forces them to be split. In practice, this means that pages for > min_free_kbytes are kept as contiguous pages for quite a long time but once > split never become contiguous again. This lets short-lived high-order atomic > allocations to work for quite a while which is why setting min_free_kbytes to > 16384 seems to let jumbo frames work for a long time. Grouping by mobility is > more concerned with the type of page so it breaks up the min_free_kbytes pages > early removing a desirable property of the standard allocator for high-order > atomic allocations. MIGRATE_RESERVE brings that desirable property back. Hmmmm... A special pool for atomic allocs... > > Trouble ahead. Why do we need it? To crash when the > > kernel does too many unmovable allocs? > It's needed for a few reasons but the two main ones are; > > a) grouping pages by mobility does not give guaranteed bounds on how much > contiguous memory will be movable. While it could, it would be very > complex and would replicate the behavior of zones to the extent I'll > get a slap in the head for even trying. Partitioning memory gives hard > guarantees on memory availability And crashes the kernel if the availability is no longer guaranteed? > b) Early feedback was that grouping pages by mobility should be > done only with zones but that is very restrictive. Different people > liked each approach for different reasons so it constantly went in > circles. That is why both can sit side-by-side now > > The zone is also of interest to the memory hot-remove people. Indeed that is a good thing.... It would be good if a movable area would be a dynamic split of a zone and not be a separate zone that has to be configured on the kernel command line. > Granted, if kernelcore= is given too small a value, it'll cause problems. That is what I thought. > > 1. alloc_zeroed_user_highpage is no longer used > > Its noted in the patches but it was not removed nor marked > > as depreciated. > Indeed. Rather than marking it deprecated I was going to wait until it was > unused for one cycle and then mark it deprecated and see who complains. I'd say remove it immediately. This is confusing. > > 2. submit_bh allocates bios using __GFP_MOVABLE > > > > How can a bio be moved? Or does that indicate that the > > bio can be reclaimed? > > > > I consider the pages allocated for the buffer to be movable because the > buffers can be cleaned and discarded by standard reclaim. When/if page > migration is used, this will have to be revisisted but for the moment I > believe it's correct. This would make it __GFP_RECLAIMABLE. The same is true for the caches that can be reclaimed. They are not marked __GFP_MOVABLE. > If the RECLAIMABLE areas could be properly targeted, it would make sense to > mark these pages RECLAIMABLE instead but that is not the situation today. What is the problem with targeting? 
> > That is because they are large order allocs and do not > > cause fragmentation if all other allocs are smaller. But that > > assumption may turn out to be problematic. Huge pages allocs > > as movable may make higher order allocation problematic if > > MAX_ORDER becomes much larger than the huge page order. In > > particular on IA64 the huge page order is dynamically settable > > on bootup. They can be quite small and thus cause fragmentation > > in the movable blocks. > You're right here. I have always considered huge page allocations to be the > highest order anything in the system will ever care about. I was not aware of > any situation except at boot-time where that is different. What sort of > situation do you forsee where the huge page size is not the largest high-order > allocation used by the system? Even the large blocksize stuff doesn't seem to > apply here. Boot an IA64 box with the parameter hugepagesz=64k for example. That will give you a huge page size of 64k on a system with MAX_ORDER = 1G. The default for the huge page size is 256k which is a quarter of max order. But some people boot with 1G huge pages. > > 6. First in bdget() we set the mapping for a block device up using > > GFP_MOVABLE. However, then in grow_dev_page for an actual > > allocation we will use__GFP_RECLAIMABLE for the block device. > > We should use one type I would think and its GFP_MOVABLE as > > far as I can tell. > > > > I'll revisit this one. I think it should be __GFP_RECLAIMABLE in both cases > because I have a vague memory that pages due to grow_dev_page caused problems > fragmentation wise because they could not be reclaimed. That might simply have > been an unrelated bug at the time. It depends on who allocates these pages. If they are mapped by the user then they are movable. If a filesystem gets them for metadata then they are reclaimable. > This will simplify one of the patches. Are all slabs with SLAB_RECLAIM_ACCOUNT > guaranteed to have a shrinker available either directly or indirectly? I have not checked that recently but historically yes. There is no point in accounting slabs for reclaim if you cannot reclaim them. > > 8. Same occurs for inodes. The reclaim flag should not be specified > > for individual allocations since reclaim is a slab wide > > activity. It also has no effect if the objects is taken off > > a queue. > > > > If SLAB_RECLAIM_ACCOUNT always uses __GFP_RECLAIMABLE, this will be caught > too, right? Correct. > > 10. Radix tree as reclaimable? radix_tree_node_alloc() > > > > Ummm... Its reclaimable in a sense if all the pages are removed > > but I'd say not in general. > > > > I considered them to be indirectly reclaimable. Maybe it wasn't the best > choice. Maybe we need to ask Nick about this one. > > 11. shmem_alloc_page() shmem pages are only __GFP_RECLAIMABLE? They can be > > swapped out and moved by page migration, so GFP_MOVABLE? > > > > Because they might be ramfs pages which are not movable - > http://lkml.org/lkml/2006/11/24/150 URL does not provide any useful information regarding the issue. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Antifrag patchset comments 2007-04-28 21:44 ` Christoph Lameter @ 2007-04-30 9:37 ` Mel Gorman 2007-04-30 12:35 ` Peter Zijlstra ` (2 more replies) 2007-05-01 11:26 ` Nick Piggin 1 sibling, 3 replies; 16+ messages in thread From: Mel Gorman @ 2007-04-30 9:37 UTC (permalink / raw) To: Christoph Lameter; +Cc: Nick Piggin, Linux Memory Management List On Sat, 28 Apr 2007, Christoph Lameter wrote: > On Sat, 28 Apr 2007, Mel Gorman wrote: > >> Because I wanted to build memory compaction on top of this when movable memory >> is not just memory that can go to swap but includes mlocked pages as well > > Ahh. Ok. > >>> MIGRATE_RESERVE >> The standard allocator keeps high-order pages free until memory pressure >> forces them to be split. In practice, this means that pages for >> min_free_kbytes are kept as contiguous pages for quite a long time but once >> split never become contiguous again. This lets short-lived high-order atomic >> allocations to work for quite a while which is why setting min_free_kbytes to >> 16384 seems to let jumbo frames work for a long time. Grouping by mobility is >> more concerned with the type of page so it breaks up the min_free_kbytes pages >> early removing a desirable property of the standard allocator for high-order >> atomic allocations. MIGRATE_RESERVE brings that desirable property back. > > Hmmmm... A special pool for atomic allocs... > That is not it's intention although it doubles up at that. The intention is to preserve free pages kept for min_free_kbytes as contiguous pages because it's a property of the current allocator that atomic allocations depend on today. >>> Trouble ahead. Why do we need it? To crash when the >>> kernel does too many unmovable allocs? >> It's needed for a few reasons but the two main ones are; >> >> a) grouping pages by mobility does not give guaranteed bounds on how much >> contiguous memory will be movable. While it could, it would be very >> complex and would replicate the behavior of zones to the extent I'll >> get a slap in the head for even trying. Partitioning memory gives hard >> guarantees on memory availability > > And crashes the kernel if the availability is no longer guaranteed? > OOM. >> b) Early feedback was that grouping pages by mobility should be >> done only with zones but that is very restrictive. Different people >> liked each approach for different reasons so it constantly went in >> circles. That is why both can sit side-by-side now >> >> The zone is also of interest to the memory hot-remove people. > > Indeed that is a good thing.... It would be good if a movable area > would be a dynamic split of a zone and not be a separate zone that has to > be configured on the kernel command line. > There are problems with doing that. In particular, the zone can only be sized on one direction and can only be sized at the zone boundary because zones do not currently overlap and I believe there will be assumptions made about them not overlapping within a node. It's worth looking into in the future but I'm putting it at the bottom of the TODO list. >> Granted, if kernelcore= is given too small a value, it'll cause problems. > > That is what I thought. > >>> 1. alloc_zeroed_user_highpage is no longer used >>> Its noted in the patches but it was not removed nor marked >>> as depreciated. >> Indeed. Rather than marking it deprecated I was going to wait until it was >> unused for one cycle and then mark it deprecated and see who complains. > > I'd say remove it immediately. This is confusing. > Ok. >>> 2. 
submit_bh allocates bios using __GFP_MOVABLE >>> >>> How can a bio be moved? Or does that indicate that the >>> bio can be reclaimed? >>> >> >> I consider the pages allocated for the buffer to be movable because the >> buffers can be cleaned and discarded by standard reclaim. When/if page >> migration is used, this will have to be revisisted but for the moment I >> believe it's correct. > > This would make it __GFP_RECLAIMABLE. The same is true for the caches that > can be reclaimed. They are not marked __GFP_MOVABLE. > As we are currently depend on reclaim to free contiguous pages, it works out better *at the moment* to have buffers with other pages reclaimed via the LRU. >> If the RECLAIMABLE areas could be properly targeted, it would make sense to >> mark these pages RECLAIMABLE instead but that is not the situation today. > > What is the problem with targeting? > It's currently not possible to target effectively. >>> That is because they are large order allocs and do not >>> cause fragmentation if all other allocs are smaller. But that >>> assumption may turn out to be problematic. Huge pages allocs >>> as movable may make higher order allocation problematic if >>> MAX_ORDER becomes much larger than the huge page order. In >>> particular on IA64 the huge page order is dynamically settable >>> on bootup. They can be quite small and thus cause fragmentation >>> in the movable blocks. >> >> You're right here. I have always considered huge page allocations to be the >> highest order anything in the system will ever care about. I was not aware of >> any situation except at boot-time where that is different. What sort of >> situation do you forsee where the huge page size is not the largest high-order >> allocation used by the system? Even the large blocksize stuff doesn't seem to >> apply here. > > Boot an IA64 box with the parameter hugepagesz=64k for example. That will > give you a huge page size of 64k on a system with MAX_ORDER = 1G. The > default for the huge page size is 256k which is a quarter of max order. > But some people boot with 1G huge pages. > Right, that's fair enough. Now that I recognise the problem, I can start kicking it. >>> 6. First in bdget() we set the mapping for a block device up using >>> GFP_MOVABLE. However, then in grow_dev_page for an actual >>> allocation we will use__GFP_RECLAIMABLE for the block device. >>> We should use one type I would think and its GFP_MOVABLE as >>> far as I can tell. >>> >> >> I'll revisit this one. I think it should be __GFP_RECLAIMABLE in both cases >> because I have a vague memory that pages due to grow_dev_page caused problems >> fragmentation wise because they could not be reclaimed. That might simply have >> been an unrelated bug at the time. > > It depends on who allocates these pages. If they are mapped by the user > then they are movable. If a filesystem gets them for metadata then they > are reclaimable. > >> This will simplify one of the patches. Are all slabs with SLAB_RECLAIM_ACCOUNT >> guaranteed to have a shrinker available either directly or indirectly? > > I have not checked that recently but historically yes. There is no point > in accounting slabs for reclaim if you cannot reclaim them. > Right, I'll go with the assumption that they somehow all get reclaimed via shrink_icache_memory() for the moment. >>> 8. Same occurs for inodes. The reclaim flag should not be specified >>> for individual allocations since reclaim is a slab wide >>> activity. It also has no effect if the objects is taken off >>> a queue. 
>>> >> >> If SLAB_RECLAIM_ACCOUNT always uses __GFP_RECLAIMABLE, this will be caught >> too, right? > > Correct. > >>> 10. Radix tree as reclaimable? radix_tree_node_alloc() >>> >>> Ummm... Its reclaimable in a sense if all the pages are removed >>> but I'd say not in general. >>> >> >> I considered them to be indirectly reclaimable. Maybe it wasn't the best >> choice. > > Maybe we need to ask Nick about this one. Nick, at what point are nodes allocated with radix_tree_node_alloc() freed? My current understanding is that some get freed when pages are removed from the page cache but I haven't looked closely enough to be certain. >>> 11. shmem_alloc_page() shmem pages are only __GFP_RECLAIMABLE? They can be >>> swapped out and moved by page migration, so GFP_MOVABLE? >>> >> >> Because they might be ramfs pages which are not movable - >> http://lkml.org/lkml/2006/11/24/150 > > URL does not provide any useful information regarding the issue. > Not all pages allocated via shmem_alloc_page() are movable because they may pages for ramfs. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Antifrag patchset comments 2007-04-30 9:37 ` Mel Gorman @ 2007-04-30 12:35 ` Peter Zijlstra 2007-04-30 17:30 ` Christoph Lameter 2007-05-01 13:31 ` Hugh Dickins 2 siblings, 0 replies; 16+ messages in thread From: Peter Zijlstra @ 2007-04-30 12:35 UTC (permalink / raw) To: Mel Gorman; +Cc: Christoph Lameter, Nick Piggin, Linux Memory Management List On Mon, 2007-04-30 at 10:37 +0100, Mel Gorman wrote: > >>> 10. Radix tree as reclaimable? radix_tree_node_alloc() > >>> > >>> Ummm... Its reclaimable in a sense if all the pages are removed > >>> but I'd say not in general. > >>> > >> > >> I considered them to be indirectly reclaimable. Maybe it wasn't the best > >> choice. > > > > Maybe we need to ask Nick about this one. > > Nick, at what point are nodes allocated with radix_tree_node_alloc() > freed? > > My current understanding is that some get freed when pages are removed > from the page cache but I haven't looked closely enough to be certain. Indeed, radix tree nodes are freed when the tree loses elements. Both through freeing nodes that have no elements left, and shrinking the tree when the top node has only the first entry in use. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Antifrag patchset comments 2007-04-30 9:37 ` Mel Gorman 2007-04-30 12:35 ` Peter Zijlstra @ 2007-04-30 17:30 ` Christoph Lameter 2007-04-30 18:33 ` Mel Gorman 2007-05-01 13:31 ` Hugh Dickins 2 siblings, 1 reply; 16+ messages in thread From: Christoph Lameter @ 2007-04-30 17:30 UTC (permalink / raw) To: Mel Gorman; +Cc: Nick Piggin, Linux Memory Management List On Mon, 30 Apr 2007, Mel Gorman wrote: > > Indeed that is a good thing.... It would be good if a movable area > > would be a dynamic split of a zone and not be a separate zone that has to > > be configured on the kernel command line. > There are problems with doing that. In particular, the zone can only be sized > on one direction and can only be sized at the zone boundary because zones do > not currently overlap and I believe there will be assumptions made about them > not overlapping within a node. It's worth looking into in the future but I'm > putting it at the bottom of the TODO list. Its is better to have a dynamic limit rather than OOMing. > > > If the RECLAIMABLE areas could be properly targeted, it would make sense > > > to > > > mark these pages RECLAIMABLE instead but that is not the situation today. > > What is the problem with targeting? > It's currently not possible to target effectively. Could you be more specific? > > > Because they might be ramfs pages which are not movable - > > > http://lkml.org/lkml/2006/11/24/150 > > > > URL does not provide any useful information regarding the issue. > > > > Not all pages allocated via shmem_alloc_page() are movable because they may > pages for ramfs. Not familiar with ramfs. There would have to be work on ramfs to make them movable? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Antifrag patchset comments 2007-04-30 17:30 ` Christoph Lameter @ 2007-04-30 18:33 ` Mel Gorman 0 siblings, 0 replies; 16+ messages in thread From: Mel Gorman @ 2007-04-30 18:33 UTC (permalink / raw) To: Christoph Lameter; +Cc: Nick Piggin, Linux Memory Management List On Mon, 30 Apr 2007, Christoph Lameter wrote: > On Mon, 30 Apr 2007, Mel Gorman wrote: > >>> Indeed that is a good thing.... It would be good if a movable area >>> would be a dynamic split of a zone and not be a separate zone that has to >>> be configured on the kernel command line. >> There are problems with doing that. In particular, the zone can only be sized >> on one direction and can only be sized at the zone boundary because zones do >> not currently overlap and I believe there will be assumptions made about them >> not overlapping within a node. It's worth looking into in the future but I'm >> putting it at the bottom of the TODO list. > > Its is better to have a dynamic limit rather than OOMing. > I'll certainly give the problem a kick. I simply have a strong feeling that dynamically resizing zones will not be very straight-forward and as the zone is manually sized by the administrator, I didn't feel strongly about it being possible for an admin to put his machine in an OOM-able situation. >>>> If the RECLAIMABLE areas could be properly targeted, it would make sense >>>> to >>>> mark these pages RECLAIMABLE instead but that is not the situation today. >>> What is the problem with targeting? >> It's currently not possible to target effectively. > > Could you be more specific? > The situation I wanted to end up with was that a percentage of memory could be reclaimed or moved so that contiguous allocations would succeed. When reclaiming __GFP_MOVABLE, we can use lumpy reclaim to find a suitable area of pages to reclaim. Some of the pages there are buffer pages even though they are not movable in the page migration sense of the word. Given a page allocated for an inode slab cache, we can't reclaim the objects in there in the same way as a buffer page can be cleaned and discared. Hence, to increase the amount of memory that can be reclaimed for contiguous allocations, I group the buffer pages with other movable pages instead of putting them in with __GFP_RECLAIMABLE pages like slab where they are not as useful from a future contiguous allocation perspective. In the event that given a page of slab objects I could be sure of reclaiming all the objects in that page and freeing it, then it would make sense to group buffer pages with those. Does that make sense? >>>> Because they might be ramfs pages which are not movable - >>>> http://lkml.org/lkml/2006/11/24/150 >>> >>> URL does not provide any useful information regarding the issue. >>> >> >> Not all pages allocated via shmem_alloc_page() are movable because they may >> pages for ramfs. > > Not familiar with ramfs. There would have to be work on ramfs to make them > movable? Minimally yes. I haven't looked too closely at the issue yet because to start with, it was enough to know that the pages were not always movable or reclaimable in any way other than deleting files. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Antifrag patchset comments 2007-04-30 9:37 ` Mel Gorman 2007-04-30 12:35 ` Peter Zijlstra 2007-04-30 17:30 ` Christoph Lameter @ 2007-05-01 13:31 ` Hugh Dickins 2 siblings, 0 replies; 16+ messages in thread From: Hugh Dickins @ 2007-05-01 13:31 UTC (permalink / raw) To: Mel Gorman; +Cc: Christoph Lameter, Nick Piggin, Linux Memory Management List On Mon, 30 Apr 2007, Mel Gorman wrote: > On Sat, 28 Apr 2007, Christoph Lameter wrote: > > > > > 11. shmem_alloc_page() shmem pages are only __GFP_RECLAIMABLE? > > > > They can be swapped out and moved by page migration, so GFP_MOVABLE? > > > > > > Because they might be ramfs pages which are not movable - > > > http://lkml.org/lkml/2006/11/24/150 > > > > URL does not provide any useful information regarding the issue. > > Not all pages allocated via shmem_alloc_page() are movable because they may > pages for ramfs. We seem to have a miscommunication here. shmem_alloc_page() is static to mm/shmem.c, is used for all shm/tmpfs data pages (unless CONFIG_TINY_SHMEM), and all those data pages may be swapped out (while not locked in use). ramfs pages cannot be swapped out; but shmem_alloc_page() is not used to allocate them. CONFIG_TINY_SHMEM uses mm/tiny-shmem.c instead of mm/shmem.c, redirecting all shm/tmpfs requests to the simpler but unswappable ramfs. Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Antifrag patchset comments 2007-04-28 21:44 ` Christoph Lameter 2007-04-30 9:37 ` Mel Gorman @ 2007-05-01 11:26 ` Nick Piggin 2007-05-01 12:22 ` Nick Piggin 2007-05-01 16:38 ` Mel Gorman 1 sibling, 2 replies; 16+ messages in thread From: Nick Piggin @ 2007-05-01 11:26 UTC (permalink / raw) To: Christoph Lameter; +Cc: Mel Gorman, Linux Memory Management List, Andrew Morton Christoph Lameter wrote: > On Sat, 28 Apr 2007, Mel Gorman wrote: >>>10. Radix tree as reclaimable? radix_tree_node_alloc() >>> >>> Ummm... Its reclaimable in a sense if all the pages are removed >>> but I'd say not in general. >>> >> >>I considered them to be indirectly reclaimable. Maybe it wasn't the best >>choice. > > > Maybe we need to ask Nick about this one. I guess they are as reclaimable as the pagecache they hold is. Of course, they are yet another type of object that makes higher order reclaim inefficient, regardless of lumpy reclaim etc. ... and also there are things besides pagecache that use radix trees.... I guess you are faced with conflicting problems here. If you do not mark things like radix tree nodes and dcache as reclaimable, then your unreclaimable category gets expanded and fragmented more quickly. On the other hand, if you do mark them (not just radix-trees, but also bios, dcache, various other things) as reclaimable, then they make it more difficult to reclaim from the reclaimable memory, and they also make the reclaimable memory less robust, because you could have pinned dentry, or some other radix tree user in there that cannot be reclaimed. I guess making radix tree nodes reclaimable is probably the best of the two options at this stage. But now that I'm asked, I repeat my dislike for the antifrag patches, because of the above -- ie. they're just a heuristic that slows down the fragmentation of memory rather than avoids it. I really oppose any code that _depends_ on higher order allocations. Even if only used for performance reasons, I think it is sad because a system that eventually gets fragmented will end up with worse performance over time, which is just lame. For those systems that really want a big chunk of memory set aside (for hugepages or memory unplugging), I think reservations are reasonable because they work and are robust. If we ever _really_ needed arbitrary contiguous physical memory for some reason, then I think virtual kernel mapping and true defragmentation would be the logical step. AFAIK, nobody has tried to do this yet it seems like the (conceptually) simplest and most logical way to go if you absolutely need contig memory. But firstly, I think we should fight against needing to do that step. I don't care what people say, we are in some position to influence hardware vendors, and it isn't the end of the world if we don't run optimally on some hardware today. I say we try to avoid higher order allocations. It will be hard to ever remove this large amount of machinery once the code is in. So to answer Andrew's request for review, I have looked through the patches at times, and they don't seem to be technically wrong (I would have prefered that it use resizable zones rather than new sub-zone zones, but hey...). However I am against the whole direction they go in, so I haven't really looked at them lately. I think the direction we should take is firstly ask whether we can do a reasonable job with PAGE_SIZE pages, secondly ask whether we can do an acceptable special-case (eg. reserve memory), lastly, _actually_ do defragmentation of kernel memory. 
Anti-frag would come somewhere after that last step, as a possible optimisation. So I haven't been following where we're at WRT the requirements. Why can we not do with PAGE_SIZE pages or memory reserves? If it is a matter of efficiency, then how much does it matter, and to whom? -- SUSE Labs, Novell Inc. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Antifrag patchset comments 2007-05-01 11:26 ` Nick Piggin @ 2007-05-01 12:22 ` Nick Piggin 2007-05-01 16:38 ` Mel Gorman 1 sibling, 0 replies; 16+ messages in thread From: Nick Piggin @ 2007-05-01 12:22 UTC (permalink / raw) To: Nick Piggin Cc: Christoph Lameter, Mel Gorman, Linux Memory Management List, Andrew Morton Nick Piggin wrote: > So I haven't been following where we're at WRT the requirements. Why > can we not do with PAGE_SIZE pages or memory reserves? If it is a > matter of efficiency, then how much does it matter, and to whom? Oh, and: why won't they get upset if memory does eventually end up getting fragmented? -- SUSE Labs, Novell Inc. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Antifrag patchset comments 2007-05-01 11:26 ` Nick Piggin 2007-05-01 12:22 ` Nick Piggin @ 2007-05-01 16:38 ` Mel Gorman 2007-05-02 2:43 ` Nick Piggin 1 sibling, 1 reply; 16+ messages in thread From: Mel Gorman @ 2007-05-01 16:38 UTC (permalink / raw) To: Nick Piggin Cc: Christoph Lameter, Linux Memory Management List, Andrew Morton On Tue, 1 May 2007, Nick Piggin wrote: > Christoph Lameter wrote: >> On Sat, 28 Apr 2007, Mel Gorman wrote: > >>>> 10. Radix tree as reclaimable? radix_tree_node_alloc() >>>> >>>> Ummm... Its reclaimable in a sense if all the pages are removed >>>> but I'd say not in general. >>>> >>> >>> I considered them to be indirectly reclaimable. Maybe it wasn't the best >>> choice. >> >> >> Maybe we need to ask Nick about this one. > > I guess they are as reclaimable as the pagecache they hold is. Of > course, they are yet another type of object that makes higher order > reclaim inefficient, regardless of lumpy reclaim etc. > That can be said of the reclaimable slab caches as well. That is why they are grouped together. Unlike page cache and buffer pages, the pages involved cannot be freed without another subsystem being involved. > ... and also there are things besides pagecache that use radix trees.... > > I guess you are faced with conflicting problems here. If you do not > mark things like radix tree nodes and dcache as reclaimable, then your > unreclaimable category gets expanded and fragmented more quickly. > This is understood. It's why the principal mobility types are UNMOVABLE, RECLAIMABLE and MOVABLE instead of UNMOVABLE and MOVABLE which was suggested to me in the past. > On the other hand, if you do mark them (not just radix-trees, but also > bios, dcache, various other things) as reclaimable, then they make it > more difficult to reclaim from the reclaimable memory, This is why dcache and various other things with similar difficulty are in the RECLAIMABLE areas, not the MOVABLE area. This is deliberate as they all get to be difficult together. Care is taken to group pages appropriately so that only easily movable allocations are in the MOVABLE area. > and they also > make the reclaimable memory less robust, because you could have pinned > dentry, or some other radix tree user in there that cannot be reclaimed. > Which is why the success rates of hugepage allocation under heavy load depends more on the number of MOVABLE blocks than RECLAIMABLE. > I guess making radix tree nodes reclaimable is probably the best of the > two options at this stage. > The ideal would be that some caches would become directly reclaimable over time including the radix tree nodes. i.e. given a page that belongs to an inode cache that it would be possible to reclaim all the objects within that page and free it. If that was the case for all reclaimable caches, then the RECLAIMABLE portion of memory becomes much more useful. Right now, it depends on a certain amount of luck that randomly freeing cache objects will free contiguous blocks in the RECLAIMABLE area. There was a similar problem for the MOVABLE area until lumpy reclaim targetted its reclaim. Similar targetting of slab pages is desirable. > But now that I'm asked, I repeat my dislike for the antifrag patches, > because of the above -- ie. they're just a heuristic that slows down > the fragmentation of memory rather than avoids it. > > I really oppose any code that _depends_ on higher order allocations. 
> Even if only used for performance reasons, I think it is sad because > a system that eventually gets fragmented will end up with worse > performance over time, which is just lame. Although performance could degrade were fragmentation avoidance ineffective, it seems wrong to miss out on that performance improvement through a dislike of it. Any use of high-order pages for performance will be vunerable to fragmentation and would need to handle it. I would be interested to see any proposed uses both to review them and to see how they interact with fragmentation avoidance, as test cases. > For those systems that really want a big chunk of memory set aside (for > hugepages or memory unplugging), I think reservations are reasonable > because they work and are robust. Reservations don't really work for memory unplugging at all. Hugepage reservations have to be done at boot-time which is a difficult requirement to meet and impossible on batch job and shared systems where reboots do not take place. > If we ever _really_ needed arbitrary > contiguous physical memory for some reason, then I think virtual kernel > mapping and true defragmentation would be the logical step. > Breaking the 1:1 phys:virtual mapping incurs a performance hit that is persistent. Minimally, things like page_to_pfn() are no longer a simply calculation which is a bad enough hit. Worse, the kernel can no longer backed by huge pages because you would have to defragment at the base-page level. The kernel is backed by huge page entries at the moment for a good reason, TLB reach is a real problem. Continuing on, "true defragmentation" would require that the system be halted so that the defragmentation can take place with everything disabled so that the copy can take place and every processes pagetables be updated as pagetables are not always shared. Even if shared, all processes would still have to be halted unless the kernel was fully pagable and we were willing to handle page faults in kernel outside of just the vmalloc area. This is before even considering the problem of how the kernel copies the data between two virtual addresses while it's modifing the page tables it's depending on to read the data. Even more horribly, virtual addresses in the kernel are no longer physically contiguous which will likely cause some problems for drivers and possibly DMA engines. The memory compaction mechanism I have in mind operates on MOVABLE pages only using the page migration mechanism with the view to keeping MOVABLE and RECLAIMABLE pages at opposite end of the zone. It doesn't bring the kernel to the halt like it was a Java Virtual Machine or a lisp interpreter doing garbage collection. > AFAIK, nobody has tried to do this yet it seems like the (conceptually) > simplest and most logical way to go if you absolutely need contig > memory. > I believe there was some work at one point to break the 1:1 phys:virt mapping that Dave Hansen was involved it. It was a non-trivial breakage and AFAIK, it made things pretty slow and lost the backing of the kernel address space with large pages. Much time has been spent making sure the fragmentation avoidance patches did not kill performance. As the fragmentation avoidance stuff improves the TLB usage in the kernel portion of the address space, it improves performance in some cases. That alone should be considered a positive. Here are test figures from an x86_64 without min_free_kbytes adjusted comparing fragmentation avoidance on 2.6.21-rc6-mm1. 
Newer figures are being generated but it takes a long time to go through
it all.

KernBench Comparison
--------------------
                   2.6.21-rc6-mm1-clean  2.6.21-rc6-mm1-list-based   %diff
User   CPU time                   85.55                      86.27  -0.84%
System CPU time                   35.85                      33.67   6.08%
Total  CPU time                   121.4                     119.94   1.20%

Complaints about kernbench as a valid benchmark aside, it is dependent on
the page allocator's performance. The figures show a 1.2% overall
improvement in total CPU time.

The AIM9 results look like

              2.6.21-rc6-mm1-clean  2.6.21-rc6-mm1-list-based
 1 creat-clo            154674.22    171921.35   17247.13  11.15%  File Creations and Closes/second
 2 page_test            184050.99    188470.25    4419.26   2.40%  System Allocations & Pages/second
 3 brk_test            1840486.50   2011331.44  170844.94   9.28%  System Memory Allocations/second
 6 exec_test               224.01       234.71      10.70   4.78%  Program Loads/second
 7 fork_test              3892.04      4325.22     433.18  11.13%  Task Creations/second

More improvements here, although I'll admit aim9 can be unreliable on some
machines.

The allocation of hugepages under load and at rest looks like

HighAlloc Under Load Test Results
                       2.6.21-rc6-mm1-clean  2.6.21-rc6-mm1-list-based
Order                                     9                          9
Allocation type                     HighMem                    HighMem
Attempted allocations                   499                        499
Success allocs                           33                        361
Failed allocs                           466                        138
DMA32 zone allocs                        31                        359
DMA zone allocs                           2                          2
Normal zone allocs                        0                          0
HighMem zone allocs                       0                          0
EasyRclm zone allocs                      0                          0
% Success                                 6                         72

HighAlloc Test Results while Rested
                       2.6.21-rc6-mm1-clean  2.6.21-rc6-mm1-list-based
Order                                     9                          9
Allocation type                     HighMem                    HighMem
Attempted allocations                   499                        499
Success allocs                          154                        366
Failed allocs                           345                        133
DMA32 zone allocs                       152                        364
DMA zone allocs                           2                          2
Normal zone allocs                        0                          0
HighMem zone allocs                       0                          0
EasyRclm zone allocs                      0                          0
% Success                                30                         73

On machines with large TLBs that can fit the entire working set no matter
what, the worst performance regression we've seen is 0.2% in total CPU time
in kernbench, which is comparable to what you'd see between kernel versions.
I didn't spot anything out of the way in the performance figures on
test.kernel.org either since fragmentation avoidance was merged.

> But firstly, I think we should fight against needing to do that step.
> I don't care what people say, we are in some position to influence
> hardware vendors, and it isn't the end of the world if we don't run

This is conflating the large page cache discussion with the fragmentation
avoidance patches. If fragmentation avoidance is merged and the page cache
wants to take advantage of it, it will need to:

a) deal with the lack of availability of contiguous pages if fragmentation
   avoidance is ineffective
b) be reviewed to see what its fragmentation behaviour looks like

Similar comments apply to SLUB if it uses order-1 or order-2 contiguous
pages, although SLUB is different because it'll make most reclaimable
allocations the same order. Hence they'll also get freed at the same order,
so it suffers less from external fragmentation problems due to less mixing
of orders than one might initially suspect.

Ideally, any subsystem using larger pages does a better job than a
"reasonable job". At worst, any use of contiguous pages should continue to
work if they are not available and, at *worst*, its performance should be
comparable to base page usage.

Your assertion seems to be that it's better to always run slow than run
quickly in some situations with the possibility it might slow down later.
We have seen some evidence that fragmentation avoidance gives more
consistent results when running kernbench during the lifetime of the system
than without it.
Without it, there are slowdowns, probably due to reduced TLB reach.

> optimally on some hardware today. I say we try to avoid higher order
> allocations. It will be hard to ever remove this large amount of
> machinery once the code is in.
>
> So to answer Andrew's request for review, I have looked through the
> patches at times, and they don't seem to be technically wrong (I would
> have preferred that it use resizable zones rather than new sub-zone
> zones, but hey...).

The resizable zones option was considered as well and it seemed messier than
what the current stuff does. Not only do we have to deal with overlapping
non-contiguous zones, but things like the page->flags identifying which
zone a page belongs to have to be moved out (not enough bits) and you get
an explosion of zones like

ZONE_DMA_UNMOVABLE
ZONE_DMA_RECLAIMABLE
ZONE_DMA_MOVABLE
ZONE_DMA32_UNMOVABLE

etc. Everything else aside, that will interact terribly with reclaim. In the
end, it would also suffer from similar problems with the size of the
RECLAIMABLE areas in comparison to MOVABLE, and resizing zones would be
expensive.

> However I am against the whole direction they go
> in, so I haven't really looked at them lately.
>
> I think the direction we should take is firstly ask whether we can do
> a reasonable job with PAGE_SIZE pages, secondly ask whether we can do
> an acceptable special-case (eg. reserve memory),

Hugepage-wise, memory gets reserved and it's a problem on systems that have
changing requirements for the number of hugepages they need available, i.e.
the current real use cases for the reservation model have runtime and system
management problems. From what I understand, some customers have bought
bigger machines and not used huge pages because the reserve model was too
difficult to deal with. Base pages are unusable for memory hot-remove,
particularly on ppc64 running virtual machines where it wants to move memory
in 16MB chunks between machine partitions.

> lastly, _actually_
> do defragmentation of kernel memory. Anti-frag would come somewhere
> after that last step, as a possible optimisation.
>

This is in the wrong order. Defragmentation of memory makes way more sense
when anti-fragmentation is already in place; there is less memory that will
require moving. Full defragmentation requires breaking the 1:1 phys:virt
mapping or halting the machine to get useful work done. Anti-fragmentation
using memory compaction of MOVABLE pages should handle the situation without
breaking 1:1 mappings.

> So I haven't been following where we're at WRT the requirements. Why
> can we not do with PAGE_SIZE pages or memory reserves?

PAGE_SIZE pages cannot grow the hugepage pool. The size of the hugepage
pool required for the lifetime of the system is not always known. PPC64 is
not able to hot-remove a single page and the balloon driver from Xen has
its own problems. As already stated, reserves come with their own host of
problems that people are not willing to deal with.

> If it is a
> matter of efficiency, then how much does it matter, and to whom?
>

The kernel already uses huge PTE entries in its portion of the address space
because TLB reach is a real problem. Grouping kernel allocations together in
the same hugepages improves overall performance due to reduced TLB pressure.
This is a general improvement, and how much of an effect it has depends on
the workload and the TLB size.

From your other mail:

> Oh, and: why won't they get upset if memory does eventually end up
> getting fragmented?
For hugepages, it's annoying because the application will have to fallback to using small pages which is not always possible and it loses performance. I get bad emails but the system survives. For memory hot-remove (be it virtualisation, power saving or whatever), I get sent another complaining email because the memory can not be removed but the system again lives. So, for those two use cases, if memory gets fragmented there is a non-critical bug report and the problem gets kicked by me with some egg on my face. Going forward, the large page cache stuff will need to deal with a situation where contiguous pages are not available. What I see happening is that an API like buffered_rmqueue() is available that gives back an amount of memory in a list that is as contiguous as possible. This seems feasible and it would be best if stats were maintained on how often contiguous pages were actually used to diagnose bug reports that look like "IO performs really well for a few weeks but then starts slowing up". At worst it should regress to the vanilla kernels performance at which point I get a complaining email but again, the system survives. SLUB using higher orders needs to be tested but as it is using lower orders to begin with, it may not be an issue. If the minimum page size it uses is fixed, then many blocks within the RECLAIMABLE areas will be the same size in the vast majority of cases. As they get freed, they'll be freeing at the same minimum order so it should not hit external fragmentation problems. This hypothesis will need to be tested heavily before merging but the bug reports at least will be really obvious (system went BANG) and I'm in the position to kick this quite heavily using the test.kernel.org system. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
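To make the "as contiguous as possible" interface Mel sketches above a bit
more concrete, here is a small standalone C model of one plausible fallback
policy: try the largest order that still fits, drop to smaller orders when
an allocation fails, and hand back whatever runs were obtained as a list the
caller copes with. The names (struct run, alloc_chunk(), alloc_page_list())
and the policy itself are invented for illustration; this is not the
kernel's buffered_rmqueue().

/*
 * Standalone model of a "give me nr_pages, as physically contiguous as
 * possible" interface.  alloc_chunk() stands in for an order-N page
 * allocation and fails artificially to simulate fragmentation.
 */
#include <stdio.h>
#include <stdlib.h>

struct run {
	void *base;		/* start of one contiguous run */
	unsigned int order;	/* run length is 1 << order pages */
	struct run *next;
};

/* Stand-in for alloc_pages(order); pretend larger orders often fail. */
static void *alloc_chunk(unsigned int order)
{
	if (order > 2 && (rand() % 4))
		return NULL;
	return malloc((size_t)4096 << order);
}

static struct run *alloc_page_list(unsigned long nr_pages, unsigned int max_order)
{
	struct run *head = NULL;
	unsigned int order = max_order;

	while (nr_pages) {
		void *chunk = NULL;
		struct run *r;

		/* Find the largest order that fits and can be allocated. */
		while ((1UL << order) > nr_pages || !(chunk = alloc_chunk(order))) {
			if (order == 0)
				return head;	/* short list; caller handles it */
			order--;
		}

		r = malloc(sizeof(*r));
		if (!r)
			return head;
		r->base = chunk;
		r->order = order;
		r->next = head;
		head = r;
		nr_pages -= 1UL << order;
	}
	return head;
}

int main(void)
{
	struct run *r;

	for (r = alloc_page_list(64, 5); r; r = r->next)
		printf("run of %2u pages at %p\n", 1u << r->order, r->base);
	return 0;
}

The important property, as described in the mail above, is that the caller
may receive shorter runs (or fewer pages) than it asked for and must degrade
gracefully rather than fail outright.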
* Re: Antifrag patchset comments 2007-05-01 16:38 ` Mel Gorman @ 2007-05-02 2:43 ` Nick Piggin 2007-05-02 12:41 ` Mel Gorman 0 siblings, 1 reply; 16+ messages in thread From: Nick Piggin @ 2007-05-02 2:43 UTC (permalink / raw) To: Mel Gorman; +Cc: Christoph Lameter, Linux Memory Management List, Andrew Morton Mel Gorman wrote: > On Tue, 1 May 2007, Nick Piggin wrote: >> But now that I'm asked, I repeat my dislike for the antifrag patches, >> because of the above -- ie. they're just a heuristic that slows down >> the fragmentation of memory rather than avoids it. >> >> I really oppose any code that _depends_ on higher order allocations. >> Even if only used for performance reasons, I think it is sad because >> a system that eventually gets fragmented will end up with worse >> performance over time, which is just lame. > > > Although performance could degrade were fragmentation avoidance > ineffective, > it seems wrong to miss out on that performance improvement through a > dislike > of it. Any use of high-order pages for performance will be vunerable to > fragmentation and would need to handle it. I would be interested to see > any proposed uses both to review them and to see how they interact with > fragmentation avoidance, as test cases. Not miss out, but use something robust, or try to get the performance some other way. >> For those systems that really want a big chunk of memory set aside (for >> hugepages or memory unplugging), I think reservations are reasonable >> because they work and are robust. > > > Reservations don't really work for memory unplugging at all. Hugepage > reservations have to be done at boot-time which is a difficult requirement > to meet and impossible on batch job and shared systems where reboots do > not take place. You just have to make a tradeoff about how much memory you want to set aside. Note that this memory is not wasted, because it is used for user allocations. So I think the downsides of reservations are really overstated. Note that in a batch environment where reboots do not take place, the anti-frag patches can eventually stop working, but the reservations will not. AFAIK, reservations work for hypervisor type memory unplugging. For arbitrary physical memory unplug, I doubt the anti-frag patches work either. You'd need hardware support or virtually mapped kernel for that. >> If we ever _really_ needed arbitrary >> contiguous physical memory for some reason, then I think virtual kernel >> mapping and true defragmentation would be the logical step. >> > > Breaking the 1:1 phys:virtual mapping incurs a performance hit that is > persistent. Minimally, things like page_to_pfn() are no longer a simply > calculation which is a bad enough hit. Worse, the kernel can no longer > backed by > huge pages because you would have to defragment at the base-page level. The > kernel is backed by huge page entries at the moment for a good reason, > TLB reach is a real problem. Yet this is what you _have_ to do if you must use arbitrary physical memory. And I haven't seen any numbers posted. > Continuing on, "true defragmentation" would require that the system be > halted so that the defragmentation can take place with everything disabled > so that the copy can take place and every processes pagetables be updated > as pagetables are not always shared. Even if shared, all processes would > still have to be halted unless the kernel was fully pagable and we were > willing to handle page faults in kernel outside of just the vmalloc area. 
vunmap doesn't need to run with the system halted, so I don't see why unmapping the source page would need to. I don't know why we'd need to handle a full page fault in the kernel if the critical part of the defrag code runs atomically and replaces the pte when it is done. > This is before even considering the problem of how the kernel copies the > data between two virtual addresses while it's modifing the page tables > it's depending on to read the data. What's the problem: map the source page into a special area, unmap it from its normal address, allocate a new page, copy the data, swap the mapping. > Even more horribly, virtual addresses > in the kernel are no longer physically contiguous which will likely cause > some problems for drivers and possibly DMA engines. Of course it is trivial to _get_ physically contiguous, virtually contiguous pages, because now you actually have a mechanism to do so. >> AFAIK, nobody has tried to do this yet it seems like the (conceptually) >> simplest and most logical way to go if you absolutely need contig >> memory. >> > > I believe there was some work at one point to break the 1:1 phys:virt > mapping > that Dave Hansen was involved it. It was a non-trivial breakage and AFAIK, > it made things pretty slow and lost the backing of the kernel address space > with large pages. Much time has been spent making sure the fragmentation > avoidance patches did not kill performance. As the fragmentation avoidance > stuff improves the TLB usage in the kernel portion of the address space, it > improves performance in some cases. That alone should be considered a > positive. > > Here are test figures from an x86_64 without min_free_kbytes adjusted > comparing fragmentation avoidance on 2.6.21-rc6-mm1. Newer figures are > being generated but it takes a long time to go through it all. It isn't performance of your patches I'm so worried about. It is that they only slow down the rate of fragmentation, so why do we want to add them and why can't we use something more robust? hugepages are a good example of where you can use reservations. You could even use reservations for higher order pagecache (rather than crapping the whole thing up with small-pages fallbacks everywhere). >> But firstly, I think we should fight against needing to do that step. >> I don't care what people say, we are in some position to influence >> hardware vendors, and it isn't the end of the world if we don't run > > > This is conflating the large page cache discussion with the fragmentation > avoidance patches. If fragmentation avoidance is merged and the page cache > wants to take advantage of it, it will need to; I don't think it is. Because the only reason to need more than a couple of physically contiguous pages is to work around hardware limitations or inefficiency. > a) deal with the lack of availability of contiguous pages if fragmentation > avoidance is ineffective > b) be reviewed to see what its fragmentation behaviour looks like > > Similar comments apply to SLUB if it uses order-1 or order-2 contiguous > pages although SLUB is different because as it'll make most reclaimable > allocations the same order. Hence they'll also get freed at the same order > so it suffers less from external fragmentation problems due to less mixing > of orders than one might initially suspect. Surely you can still have failure cases where you get fragmentation in your unmovable thingy. > Ideally, any subsystem using larger pages does a better job than a > "reasonable > job". 
At worst, any use of contiguous pages should continue to work if they > are not available and at *worst*, it's performance should comparable to > base > page usage. > > Your assertion seems to be that it's better to always run slow than run > quickly in some situations with the possibility it might slow down > later. We > have seen some evidence that fragmentation avoidance gives more consistent > results when running kernbench during the lifetime of the system than > without > it. Without it, there are slowdowns probably due to reduced TLB reach. No. My assertion is that we should speed things up in other ways, eg. by making the small pages case faster or by using something robust like reservations. On a lot of systems it is actually quite a problem if performance slows down over time, regardless of whether the base performance is about the same as a non-slowing kernel. >> optimally on some hardware today. I say we try to avoid higher order >> allocations. It will be hard to ever remove this large amount of >> machinery once the code is in. >> >> So to answer Andrew's request for review, I have looked through the >> patches at times, and they don't seem to be technically wrong (I would >> have prefered that it use resizable zones rather than new sub-zone >> zones, but hey...). > > > The resizable zones option was considered as well and it seemed messier > than > what the current stuff does. Not only do we have to deal with overlapping > non-contiguous zones, We have to do that anyway, don't we? > but things like the page->flags identifying which > zone a page belongs to have to be moved out (not enough bits) Another 2 bits? I think on most architectures that should be OK, shouldn't it? > and you get > an explosion of zones like > > ZONE_DMA_UNMOVABLE > ZONE_DMA_RECLAIMABLE > ZONE_DMA_MOVABLE > ZONE_DMA32_UNMOVABLE So of course you don't make them visible to the API. Just select them based on your GFP_ movable flags. > etc. What is etc? Are those the best reasons why this wasn't made to use zones? > Everything else aside, that will interact terribly with reclaim. Why? And why does the current scheme not? Doesn't seem like it would have to be a given. You _are_ allowed to change some things. > In the end, it would also suffer from similar problems with the size of > the RECLAIMABLE areas in comparison to MOVABLE and resizing zones would > be expensive. Why is it expensive but resizing your other things is not? And if you already have non-contiguous overlapping zones, you _could_ even just make them all the same size and just move pages between them. > This is in the wrong order. Defragmentation of memory makes way more sense > when anti-fragmentation is already in place. There is less memory that > will require moving. Full defragmentation requires breaking 1:1 phys:virt > mapping or halting the machine to get useful work done. Anti-fragmentation > using memory compaction of MOVABLE pages should handle the situation > without > breaking 1:1 mappings. My arguments are about anti-fragmentation _not_ making sense without defragmentation. >> So I haven't been following where we're at WRT the requirements. Why >> can we not do with PAGE_SIZE pages or memory reserves? > > > PAGE_SIZE pages cannot grow the hugepage pool. The size of the hugepage > pool required for the lifetime of the system is not always known. PPC64 is > not able to hot-remove a single page and the balloon driver from Xen has > it's own problems. 
As already stated, reserves come with their own host of > problems that people are not willing to deal with. I don't understand exactly what you mean? You don't have a hugepage pool, but an always-reclaimable pool. So you can use this for any kind of pagecache and even anonymous and mlocked memory assuming you account for it correctly so it can be moved away if needed. -- SUSE Labs, Novell Inc. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
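The copy-and-swap step Nick describes ("map the source page into a special
area, unmap it from its normal address, allocate a new page, copy the data,
swap the mapping") can be pictured with the toy C model below, where an
array of pointers stands in for kernel PTEs. It is only a sketch of the data
movement; the synchronisation against concurrent writers and the TLB
flushing that the thread argues about are exactly the parts it leaves out.

/*
 * Toy model of the copy-and-remap step: the "page table" is an array of
 * pointers standing in for kernel PTEs, and relocating a virtual page means
 * allocating a new backing page, copying, and swapping the pointer.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE	4096
#define NR_VPAGES	16

static void *pte[NR_VPAGES];	/* vpage index -> backing "physical" page */

static int relocate_vpage(unsigned int vpage)
{
	void *old = pte[vpage];
	void *new = malloc(PAGE_SIZE);	/* destination page */

	if (!new)
		return -1;

	memcpy(new, old, PAGE_SIZE);	/* 1. copy while old mapping is readable */
	pte[vpage] = new;		/* 2. swap the mapping (TLB flush elided) */
	free(old);			/* 3. old physical page is now free */
	return 0;
}

int main(void)
{
	unsigned int i;

	for (i = 0; i < NR_VPAGES; i++) {
		pte[i] = malloc(PAGE_SIZE);
		memset(pte[i], i, PAGE_SIZE);
	}

	relocate_vpage(3);
	printf("vpage 3 still reads back %d after relocation\n",
	       *(unsigned char *)pte[3]);

	for (i = 0; i < NR_VPAGES; i++)
		free(pte[i]);
	return 0;
}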
* Re: Antifrag patchset comments 2007-05-02 2:43 ` Nick Piggin @ 2007-05-02 12:41 ` Mel Gorman 2007-05-04 6:16 ` Nick Piggin 0 siblings, 1 reply; 16+ messages in thread From: Mel Gorman @ 2007-05-02 12:41 UTC (permalink / raw) To: Nick Piggin Cc: Christoph Lameter, Linux Memory Management List, Andrew Morton On Wed, 2 May 2007, Nick Piggin wrote: > Mel Gorman wrote: > > On Tue, 1 May 2007, Nick Piggin wrote: > > > > But now that I'm asked, I repeat my dislike for the antifrag patches, > > > because of the above -- ie. they're just a heuristic that slows down > > > the fragmentation of memory rather than avoids it. > > > > > > I really oppose any code that _depends_ on higher order allocations. > > > Even if only used for performance reasons, I think it is sad because > > > a system that eventually gets fragmented will end up with worse > > > performance over time, which is just lame. > > > > > > Although performance could degrade were fragmentation avoidance > > ineffective, > > it seems wrong to miss out on that performance improvement through a > > dislike > > of it. Any use of high-order pages for performance will be vunerable to > > fragmentation and would need to handle it. I would be interested to see > > any proposed uses both to review them and to see how they interact with > > fragmentation avoidance, as test cases. > > Not miss out, but use something robust, or try to get the performance > some other way. > > > > > For those systems that really want a big chunk of memory set aside > > > (for > > > hugepages or memory unplugging), I think reservations are reasonable > > > because they work and are robust. > > > > > > Reservations don't really work for memory unplugging at all. Hugepage > > reservations have to be done at boot-time which is a difficult > > requirement > > to meet and impossible on batch job and shared systems where reboots do > > not take place. > > You just have to make a tradeoff about how much memory you want to set > aside. This tradeoff in sizing the reservation is something that users of shared systems have real problems with because a hugepool once sized can only be used for hugepage allocations. One compromise lead to the development of ZONE_MOVABLE where a portion of memory could be set aside that was usable for small pages but that the huge page pool could borrow from. > Note that this memory is not wasted, because it is used for user > allocations. So I think the downsides of reservations are really > overstated. > This does sound like you think the first step here would be a zone based reservation system. Would you support inclusion of the ZONE_MOVABLE part of the patch set? > Note that in a batch environment where reboots do not take place, the > anti-frag patches can eventually stop working, but the reservations will > not. > Again, does this imply that you're happy for ZONE_MOVABLE to go through? > AFAIK, reservations work for hypervisor type memory unplugging. For > arbitrary physical memory unplug, I doubt the anti-frag patches work > either. You'd need hardware support or virtually mapped kernel for that. > The grouping pages by mobility pages alone helps unplugging of 16MB memory sections on ppc64 (the minimum hypervisor allocation), where all sections are interchangable so any section will do. I believe that Yasunori Goto is looking at using ZONE_MOVABLE for unplugging larger regions. 
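As a rough picture of the ZONE_MOVABLE arrangement being discussed here,
the sketch below shows how zone selection could be keyed off the allocation
flags so that only allocations explicitly flagged as movable are placed in
the movable zone, while everything else stays in the ordinary zones. The
flag values and the exact rule are simplifying assumptions for illustration
and do not reproduce the patch set's actual gfp_zone() code.

/*
 * Simplified zone selection with a movable zone: only allocations that are
 * both highmem-capable and flagged movable may be satisfied from
 * ZONE_MOVABLE.
 */
#include <stdio.h>

enum zone_type { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM, ZONE_MOVABLE };

#define __GFP_DMA	0x01u
#define __GFP_HIGHMEM	0x02u
#define __GFP_MOVABLE	0x08u	/* allocation can be migrated or reclaimed */

static enum zone_type gfp_zone(unsigned int flags)
{
	if ((flags & (__GFP_HIGHMEM | __GFP_MOVABLE)) ==
	    (__GFP_HIGHMEM | __GFP_MOVABLE))
		return ZONE_MOVABLE;
	if (flags & __GFP_HIGHMEM)
		return ZONE_HIGHMEM;
	if (flags & __GFP_DMA)
		return ZONE_DMA;
	return ZONE_NORMAL;
}

int main(void)
{
	printf("user page cache -> zone %d\n",
	       gfp_zone(__GFP_HIGHMEM | __GFP_MOVABLE));
	printf("kernel slab     -> zone %d\n", gfp_zone(0));
	return 0;
}

The "borrowing" described above falls out of this: the huge page pool can be
grown from ZONE_MOVABLE because hugepages flagged movable select it, while
unflagged kernel allocations can never spill into it.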
> > > If we ever _really_ needed arbitrary > > > contiguous physical memory for some reason, then I think virtual > > > kernel > > > mapping and true defragmentation would be the logical step. > > > > > > > Breaking the 1:1 phys:virtual mapping incurs a performance hit that is > > persistent. Minimally, things like page_to_pfn() are no longer a simply > > calculation which is a bad enough hit. Worse, the kernel can no longer > > backed by > > huge pages because you would have to defragment at the base-page level. > > The > > kernel is backed by huge page entries at the moment for a good reason, > > TLB reach is a real problem. > > Yet this is what you _have_ to do if you must use arbitrary physical > memory. And I haven't seen any numbers posted. > Numbers require an implementation and that is a non-trivial undertaking. I've cc'd Dave Hansen who I believe tried breaking 1:1 phys:virtual mapping some time in the past. He might have further comments to make. > > Continuing on, "true defragmentation" would require that the system be > > halted so that the defragmentation can take place with everything > > disabled > > so that the copy can take place and every processes pagetables be > > updated > > as pagetables are not always shared. Even if shared, all processes > > would > > still have to be halted unless the kernel was fully pagable and we were > > willing to handle page faults in kernel outside of just the vmalloc > > area. > > vunmap doesn't need to run with the system halted, so I don't see why > unmapping the source page would need to. > vunmap() is freeing an address range where it knows it is the only accessor of any data in that range. It's not the same when there are other processes potentially memory in the same area at the same time expecting it to exist. > I don't know why we'd need to handle a full page fault in the kernel if > the critical part of the defrag code runs atomically and replaces the > pte when it is done. > And how exactly would one atomically copy a page of data, update the page tables and flush the TLB without stalling all writers? The setup would have to mark the PTE for that area read-only and flush the TLB so that other processes will fault on write and wait until the migration has completed before retrying the fault. That would allow the data to be safely read and copied to somewhere else. It would be at least feasible to back SLAB_RECLAIM_ACCOUNT slabs by a virtual map for the purposes of defragmenting it like this. However, it would work better in conjunction with fragmentation avoidance instead of replacing it because the fragmentation avoidance mechanism could be easily used to group virtually-backed allocations together in the same physical blocks as much as possible to reduce future migration work. > > This is before even considering the problem of how the kernel copies the > > data between two virtual addresses while it's modifing the page tables > > it's depending on to read the data. > > What's the problem: map the source page into a special area, unmap it > from its normal address, allocate a new page, copy the data, swap the > mapping. > You'd have to do something like I described above to handle synchronous writes to the area during defragmentation. > > > Even more horribly, virtual addresses > > in the kernel are no longer physically contiguous which will likely > > cause > > some problems for drivers and possibly DMA engines. 
> > Of course it is trivial to _get_ physically contiguous, virtually > contiguous pages, because now you actually have a mechanism to do so. > I think that would require that the kernel portion have a split between the vmap() like area and a 1:1 virt:phys area - i.e. similar to today except the vmalloc() region is bigger. It is difficult to predict what the impact of a much expanded use of the vmalloc area would be. > > > AFAIK, nobody has tried to do this yet it seems like the > > > (conceptually) > > > simplest and most logical way to go if you absolutely need contig > > > memory. > > > > > > > I believe there was some work at one point to break the 1:1 phys:virt > > mapping > > that Dave Hansen was involved it. It was a non-trivial breakage and > > AFAIK, > > it made things pretty slow and lost the backing of the kernel address > > space > > with large pages. Much time has been spent making sure the > > fragmentation > > avoidance patches did not kill performance. As the fragmentation > > avoidance > > stuff improves the TLB usage in the kernel portion of the address space, > > it > > improves performance in some cases. That alone should be considered a > > positive. > > > > Here are test figures from an x86_64 without min_free_kbytes adjusted > > comparing fragmentation avoidance on 2.6.21-rc6-mm1. Newer figures are > > being generated but it takes a long time to go through it all. > > It isn't performance of your patches I'm so worried about. It is that > they only slow down the rate of fragmentation, so why do we want to add > them and why can't we use something more robust? > Because as I've maintained for quite some time, I see the patches as a pre-requisite for a more complete and robust solution for dealing with external fragmentation. I see the merits of what you are suggesting but feel it can be built up incrementally starting with the fragmentation avoidance stuff, then compacting MOVABLE pages towards the end of the zone before finally dealing with full defragmentation. But I am reluctant to built large bodies of work on top of a foundation with an uncertain future. > hugepages are a good example of where you can use reservations. > Except that it has to be sized at boot-time, can never grow and users find it very inflexible in the real world where requirements change over time and a reboot is required to effectively change these reservations. > You could even use reservations for higher order pagecache (rather than > crapping the whole thing up with small-pages fallbacks everywhere). > True, although that means that an administrator is then required to size their buffer cache at boot time if they are using high order pagecache. I doubt they'll like that any more than sizing a hugepage pool. > > > But firstly, I think we should fight against needing to do that step. > > > I don't care what people say, we are in some position to influence > > > hardware vendors, and it isn't the end of the world if we don't run > > > > > > This is conflating the large page cache discussion with the > > fragmentation > > avoidance patches. If fragmentation avoidance is merged and the page > > cache > > wants to take advantage of it, it will need to; > > I don't think it is. Because the only reason to need more than a couple > of physically contiguous pages is to work around hardware limitations or > inefficiency. > A low TLB reach with base page size is a real problem that some classes of users have to deal with. 
Sometimes there just is no easy way around having to deal with large amounts of data at the same time. > > > a) deal with the lack of availability of contiguous pages if > > fragmentation > > avoidance is ineffective > > b) be reviewed to see what its fragmentation behaviour looks like > > > > Similar comments apply to SLUB if it uses order-1 or order-2 contiguous > > pages although SLUB is different because as it'll make most reclaimable > > allocations the same order. Hence they'll also get freed at the same > > order > > so it suffers less from external fragmentation problems due to less > > mixing > > of orders than one might initially suspect. > > Surely you can still have failure cases where you get fragmentation in > your > unmovable thingy. > Possibly, it's simply not known what those failure cases are. I'll be writing some statistics gathering code as suggested by Christoph to identify situations where it broke down. In the SLUB case, I would push that only SLAB_RECLAIM_ACCOUNT allocations use higher orders to start with to avoid unmovable allocations spilling out due to them requiring higher orders. > > Ideally, any subsystem using larger pages does a better job than a > > "reasonable > > job". At worst, any use of contiguous pages should continue to work if > > they > > are not available and at *worst*, it's performance should comparable to > > base > > page usage. > > > > Your assertion seems to be that it's better to always run slow than run > > quickly in some situations with the possibility it might slow down > > later. We > > have seen some evidence that fragmentation avoidance gives more > > consistent > > results when running kernbench during the lifetime of the system than > > without > > it. Without it, there are slowdowns probably due to reduced TLB reach. > > No. My assertion is that we should speed things up in other ways, eg. The principal reason I developed fragmentation avoidance was to relax restrictions on the resizing of the huge page pool where it's not a question of poor performance, it's a question of simply not working. The large page cache stuff arrived later as a potential additional benefiticary of lower fragmentation as well as SLUB. > by making the small pages case faster or by using something robust > like reservations. On a lot of systems it is actually quite a problem > if performance slows down over time, regardless of whether the base > performance is about the same as a non-slowing kernel. > > > > > optimally on some hardware today. I say we try to avoid higher order > > > allocations. It will be hard to ever remove this large amount of > > > machinery once the code is in. > > > > > > So to answer Andrew's request for review, I have looked through the > > > patches at times, and they don't seem to be technically wrong (I would > > > have prefered that it use resizable zones rather than new sub-zone > > > zones, but hey...). > > > > > > The resizable zones option was considered as well and it seemed messier > > than > > what the current stuff does. Not only do we have to deal with > > overlapping > > non-contiguous zones, > > We have to do that anyway, don't we? > Where do we deal with overlapping non-contiguous zones within a node today? > > but things like the page->flags identifying which > > zone a page belongs to have to be moved out (not enough bits) > > Another 2 bits? I think on most architectures that should be OK, > shouldn't it? > page->flags is not exactly flush with space. 
The last I heard, there were 3 bits free and there was work being done to remove some of them so more could be used. > > and you get > > an explosion of zones like > > > > ZONE_DMA_UNMOVABLE > > ZONE_DMA_RECLAIMABLE > > ZONE_DMA_MOVABLE > > ZONE_DMA32_UNMOVABLE > > So of course you don't make them visible to the API. Just select them > based on your GFP_ movable flags. > Just because they are invisible to the API does not mean they are invisible to the size of pgdat->node_zones[] and the size of the zone fallback lists. Christoph will eventually complain about the number of zones having doubled or tripled. > > > etc. > > What is etc? Are those the best reasons why this wasn't made to use zones? > No, I simply thought those problems were bad enough without going into additional ones - here's another one. If a block of pages has to move between zones, page->flags has to be updated which means a lock to the page has to be acquired to guard against concurrent use before moving the zone. Grouping pages by mobility updates two bits indicating where all pages in the block should belong to on free and moves the currently free pages leaving the other pages alone. On free, they get placed on the correct list. > > > Everything else aside, that will interact terribly with reclaim. > > Why? Because reclaim is based on zones. Due to zone fallbacks, there will be LRU pages in each of the zones unless strict partitioning is used. That means when reclaiming ZONE_NORMAL pages for example, reclaim may need to be triggered in ZONE_NORMAL_UNMOVABLE, ZONE_NORMAL_MOVABLE and ZONE_NORMAL_RECLAIMABLE. If strict partitioning us used, then the size of the pools has to be carefully balanced or the system goes bang. > And why does the current scheme not? Doesn't seem like it would have > to be a given. You _are_ allowed to change some things. > The current scheme does not impact reclaim because the LRU lists remain exactly as they are. > > > In the end, it would also suffer from similar problems with the size of > > the RECLAIMABLE areas in comparison to MOVABLE and resizing zones would > > be expensive. > > Why is it expensive but resizing your other things is not? Because to resize currently, the bits representing the block are updated and the free pages only are moved. We don't have to deal with the pages already in use, particularly awkward ones like free per-cpu pages which we cannot get a lock on. > And if you > already have non-contiguous overlapping zones, you _could_ even just > make them all the same size and just move pages between them. > The pages in use would have to have their page->flags updated so that page_zone() will resolve correctly and that is not cheap (it might not even be safe in all cases like the per-cpu pages). > > > This is in the wrong order. Defragmentation of memory makes way more > > sense > > when anti-fragmentation is already in place. There is less memory that > > will require moving. Full defragmentation requires breaking 1:1 > > phys:virt > > mapping or halting the machine to get useful work done. > > Anti-fragmentation > > using memory compaction of MOVABLE pages should handle the situation > > without > > breaking 1:1 mappings. > > My arguments are about anti-fragmentation _not_ making sense without > defragmentation. > I have repeatadly asserted that I'm perfectly happy to build defragmentation on top of anti-fragmentation. I consider anti-fragmentation to be a sensible prerequisite to defragmentation. 
I believe defragmentation can be taken a long way before the 1:1 phys:virt mapping is broken when anti-fragmentation is involved. > > > > So I haven't been following where we're at WRT the requirements. Why > > > can we not do with PAGE_SIZE pages or memory reserves? > > > > > > PAGE_SIZE pages cannot grow the hugepage pool. The size of the hugepage > > pool required for the lifetime of the system is not always known. PPC64 > > is > > not able to hot-remove a single page and the balloon driver from Xen has > > it's own problems. As already stated, reserves come with their own host > > of > > problems that people are not willing to deal with. > > I don't understand exactly what you mean? You don't have a hugepage pool, I was referring to /proc/sys/vm/nr_hugepages. I thought you were suggesting that it be refilled with PAGE_SIZE pages. > but an always-reclaimable pool. So you can use this for any kind of > pagecache and even anonymous and mlocked memory assuming you account for > it correctly so it can be moved away if needed. > The always-reclaimable pool is not treated as a static-sized thing although it can be with ZONE_MOVABLE if that is really what the user requires. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
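The "two bits indicating where all pages in the block should belong"
bookkeeping mentioned earlier in this mail can be pictured as a small packed
bitmap indexed by page frame number, as in the sketch below. The block size,
array sizing and helper names are invented for the example; the patches
store the real bits alongside the zone rather than in a global array.

/*
 * Model of per-block mobility bits: every aligned block of pages records
 * which free list (migrate type) its pages return to when freed.
 */
#include <stdio.h>

enum migratetype {
	MIGRATE_UNMOVABLE,
	MIGRATE_RECLAIMABLE,
	MIGRATE_MOVABLE,
	MIGRATE_RESERVE,
};

#define BLOCK_ORDER	10			/* pages per block = 1 << 10 */
#define NR_PAGES	(1u << 20)
#define NR_BLOCKS	(NR_PAGES >> BLOCK_ORDER)

/* Two bits per block, packed 16 blocks per 32-bit word. */
static unsigned int block_bits[NR_BLOCKS / 16];

static void set_block_migratetype(unsigned long pfn, enum migratetype mt)
{
	unsigned long block = pfn >> BLOCK_ORDER;
	unsigned int shift = (block % 16) * 2;

	block_bits[block / 16] &= ~(3u << shift);
	block_bits[block / 16] |= (unsigned int)mt << shift;
}

static enum migratetype get_block_migratetype(unsigned long pfn)
{
	unsigned long block = pfn >> BLOCK_ORDER;
	unsigned int shift = (block % 16) * 2;

	return (block_bits[block / 16] >> shift) & 3u;
}

int main(void)
{
	set_block_migratetype(0x12345, MIGRATE_RECLAIMABLE);
	printf("pfn 0x12345 frees back to migrate type %d\n",
	       get_block_migratetype(0x12345));
	return 0;
}

Because only these bits and the free pages are touched when a block changes
type, resizing is cheap compared with rewriting page->flags for every page
in use, which is the point Mel makes above about moving pages between zones.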
* Re: Antifrag patchset comments 2007-05-02 12:41 ` Mel Gorman @ 2007-05-04 6:16 ` Nick Piggin 2007-05-04 6:55 ` Nick Piggin 2007-05-08 9:23 ` Mel Gorman 0 siblings, 2 replies; 16+ messages in thread From: Nick Piggin @ 2007-05-04 6:16 UTC (permalink / raw) To: Mel Gorman; +Cc: Christoph Lameter, Linux Memory Management List, Andrew Morton Mel Gorman wrote: > On Wed, 2 May 2007, Nick Piggin wrote: > >> Mel Gorman wrote: >> > reservations have to be done at boot-time which is a difficult >> > requirement >> > to meet and impossible on batch job and shared systems where reboots do >> > not take place. >> >> You just have to make a tradeoff about how much memory you want to set >> aside. > > > This tradeoff in sizing the reservation is something that users of shared > systems have real problems with because a hugepool once sized can only be > used for hugepage allocations. One compromise lead to the development of > ZONE_MOVABLE where a portion of memory could be set aside that was usable > for small pages but that the huge page pool could borrow from. What's wrong with that? Do we have any of that stuff upstream yet, and if not, then that probably should be done _first_. From there we can see what is left for the anti-fragmentation patches... >> Note that this memory is not wasted, because it is used for user >> allocations. So I think the downsides of reservations are really >> overstated. >> > > This does sound like you think the first step here would be a zone based > reservation system. Would you support inclusion of the ZONE_MOVABLE part > of the patch set? Ah, that answers my questions. Yes, I don't see why not, if the various people who were interested in that feature are happy with it. Not that I looked at the most recent implementation (which patches are they?) >> > persistent. Minimally, things like page_to_pfn() are no longer a simply >> > calculation which is a bad enough hit. Worse, the kernel can no longer >> > backed by >> > huge pages because you would have to defragment at the base-page level. >> > The >> > kernel is backed by huge page entries at the moment for a good reason, >> > TLB reach is a real problem. >> >> Yet this is what you _have_ to do if you must use arbitrary physical >> memory. And I haven't seen any numbers posted. >> > > Numbers require an implementation and that is a non-trivial undertaking. > I've cc'd Dave Hansen who I believe tried breaking 1:1 phys:virtual mapping > some time in the past. He might have further comments to make. I'm sure it wouldn't be trivial :) TLB's are pretty good, though. Virtualised kernels don't seem to take a huge hit (I had some vague idea that a lot of their performance problems were with IO). >> > Continuing on, "true defragmentation" would require that the system be >> > halted so that the defragmentation can take place with everything >> > disabled >> > so that the copy can take place and every processes pagetables be >> > updated >> > as pagetables are not always shared. Even if shared, all processes >> > would >> > still have to be halted unless the kernel was fully pagable and we were >> > willing to handle page faults in kernel outside of just the vmalloc >> > area. >> >> vunmap doesn't need to run with the system halted, so I don't see why >> unmapping the source page would need to. >> > > vunmap() is freeing an address range where it knows it is the only accessor > of any data in that range. It's not the same when there are other processes > potentially memory in the same area at the same time expecting it to exist. 
I don't see what the distinction is. We obviously wouldn't have multiple processes with different kernel virtual addresses pointing to the same page. It would be managed almost exactly like vmalloc space is today, I'd imagine. >> I don't know why we'd need to handle a full page fault in the kernel if >> the critical part of the defrag code runs atomically and replaces the >> pte when it is done. >> > > And how exactly would one atomically copy a page of data, update the page > tables and flush the TLB without stalling all writers? The setup would > have > to mark the PTE for that area read-only and flush the TLB so that other > processes will fault on write and wait until the migration has completed > before retrying the fault. That would allow the data to be safely read and > copied to somewhere else. Why is there a requirement to prevent stalling of writers? > It would be at least feasible to back SLAB_RECLAIM_ACCOUNT slabs by a > virtual map for the purposes of defragmenting it like this. However, it > would work better in conjunction with fragmentation avoidance instead of > replacing it because the fragmentation avoidance mechanism could be easily > used to group virtually-backed allocations together in the same physical > blocks as much as possible to reduce future migration work. Yeah, maybe. But what I am getting at is that fragmentation avoidance isn't _the_ big ticket (as the name implies). Defragmentation is. With defragmentation in, I think that avoidance makes much more sense. Now I'm still hoping that neither is necessary... my thought process on this is to keep hoping that nothing comes up that _requires_ us to support higher order allocations in the kernel generally. As an aside, it might actually be nice to be able to reduce MAX_ORDER significantly after boot in order to reduce page allocator overhead... >> > This is before even considering the problem of how the kernel copies >> the >> > data between two virtual addresses while it's modifing the page tables >> > it's depending on to read the data. >> >> What's the problem: map the source page into a special area, unmap it >> from its normal address, allocate a new page, copy the data, swap the >> mapping. >> > > You'd have to do something like I described above to handle synchronous > writes to the area during defragmentation. Yeah, that's what the "unmap the source page" is (which would also block reads, and I think would be a better approach to try first, because it would reduce TLB flushing. Although moving and flushing could probably be batched, so mapping them readonly first might be a good optimisation after that). >> > Even more horribly, virtual addresses >> > in the kernel are no longer physically contiguous which will likely >> > cause >> > some problems for drivers and possibly DMA engines. >> >> Of course it is trivial to _get_ physically contiguous, virtually >> contiguous pages, because now you actually have a mechanism to do so. >> > > I think that would require that the kernel portion have a split between the > vmap() like area and a 1:1 virt:phys area - i.e. similar to today except > the > vmalloc() region is bigger. It is difficult to predict what the impact of a > much expanded use of the vmalloc area would be. Yeah that would probably be reasonable. So huge tlbs could still be used for various large boot time structures. Predicting the impact of it? Could we look at how something like KVM performs when using 4K pages for its memory map? >> It isn't performance of your patches I'm so worried about. 
It is that >> they only slow down the rate of fragmentation, so why do we want to add >> them and why can't we use something more robust? >> > > Because as I've maintained for quite some time, I see the patches as > a pre-requisite for a more complete and robust solution for dealing with > external fragmentation. I see the merits of what you are suggesting but > feel > it can be built up incrementally starting with the fragmentation avoidance > stuff, then compacting MOVABLE pages towards the end of the zone before > finally dealing with full defragmentation. But I am reluctant to built > large bodies of work on top of a foundation with an uncertain future. The first thing we need to decide is if there is a big need to support higher order allocations generally in the kernel. I'm still a "no" with that one :) If and when we decide "yes", I don't see how anti-fragmentation does much good for that -- all the new wonderful higher order allocations we add in will need fallbacks, and things can slowly degrade over time which I'm sorry but that really sucks. I think that to decide yes, we have to realise that requires real defragmentation. At that point, OK, I'm not going to split hairs over whether you think anti-frag logically belongs first (I think it doesn't :)). >> hugepages are a good example of where you can use reservations. >> > > Except that it has to be sized at boot-time, can never grow and users find > it very inflexible in the real world where requirements change over time > and a reboot is required to effectively change these reservations. > >> You could even use reservations for higher order pagecache (rather than >> crapping the whole thing up with small-pages fallbacks everywhere). >> > > True, although that means that an administrator is then required to size > their buffer cache at boot time if they are using high order pagecache. I > doubt they'll like that any more than sizing a hugepage pool. > >> I don't think it is. Because the only reason to need more than a couple >> of physically contiguous pages is to work around hardware limitations or >> inefficiency. >> > > A low TLB reach with base page size is a real problem that some classes of > users have to deal with. Sometimes there just is no easy way around having > to deal with large amounts of data at the same time. To the 3 above: yes, I completely know we are not and never will be absolutely optimal for everyone. And the end-game for Linux, if there is one, I don't think is to be in a state that is perfect for everyone either. I don't think any feature can be justified simply because "someone" wants it, even if those someones are people running benchmarks at big companies. >> No. My assertion is that we should speed things up in other ways, eg. > > > The principal reason I developed fragmentation avoidance was to relax > restrictions on the resizing of the huge page pool where it's not a > question > of poor performance, it's a question of simply not working. The large page > cache stuff arrived later as a potential additional benefiticary of lower > fragmentation as well as SLUB. So that's even worse than a purely for performance patch, because it can now work for a while and then randomly stop working eventually. >> > what the current stuff does. Not only do we have to deal with >> > overlapping >> > non-contiguous zones, >> >> We have to do that anyway, don't we? >> > > Where do we deal with overlapping non-contiguous zones within a node today? In the buddy allocator and physical memory models, I guess? 
http://marc.info/?l=linux-mm&m=114774325131397&w=2 Doesn't that imply overlapping non-contiguous zones? >> > but things like the page->flags identifying which >> > zone a page belongs to have to be moved out (not enough bits) >> >> Another 2 bits? I think on most architectures that should be OK, >> shouldn't it? >> > > page->flags is not exactly flush with space. The last I heard, there > were 3 bits free and there was work being done to remove some of them so > more could be used. No, you wouldn't be using that part of the flags, but the other part. AFAIK there is reasonable amount of room on 64-bit, and only on huge NUMA 32-bit (ie. dinosaurs) is it a squeeze... but it falls back to an out of line thingy anyway. >> > and you get >> > an explosion of zones like >> > > ZONE_DMA_UNMOVABLE >> > ZONE_DMA_RECLAIMABLE >> > ZONE_DMA_MOVABLE >> > ZONE_DMA32_UNMOVABLE >> >> So of course you don't make them visible to the API. Just select them >> based on your GFP_ movable flags. >> > > Just because they are invisible to the API does not mean they are invisible > to the size of pgdat->node_zones[] and the size of the zone fallback lists. > Christoph will eventually complain about the number of zones having doubled > or tripled. Well there is already a reasonable amount of duplication, eg pcp lists. And I think it is much better to put up with a couple of complaints from Christoph rather than introduce something entirely new if possible. Hey it might even give people an incentive to improve the existing schemes. >> > etc. >> >> What is etc? Are those the best reasons why this wasn't made to use >> zones? >> > > No, I simply thought those problems were bad enough without going into > additional ones - here's another one. If a block of pages has to move > between zones, page->flags has to be updated which means a lock to the page > has to be acquired to guard against concurrent use before moving the zone. If you're only moving free pages, then the page allocator lock should be fine. There may be a couple of other places that would need help (eg swsusp)... ... but anyway, I'll snip the rest because I didn't want to digress into implementation details so much (now I'm sorry for bringing it up). -- SUSE Labs, Novell Inc. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
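The page->flags exchange above (how many bits a zone-based scheme would
consume, and which part of the word they would come from) rests on the zone
index being encoded in a few high bits of page->flags. A minimal
illustration follows; the bit widths and layout are assumptions for the
example, not the kernel's actual layout, which also has to fit node and, in
some configurations, section numbers into the same word.

/*
 * Minimal illustration of zone information held in page->flags: the zone
 * index lives in a few high bits, so each extra zone (or sub-zone) costs
 * encoding space there.
 */
#include <stdio.h>

#define ZONES_SHIFT	2	/* 2 bits -> up to 4 zones per node */
#define FLAGS_BITS	(sizeof(unsigned long) * 8)
#define ZONES_PGSHIFT	(FLAGS_BITS - ZONES_SHIFT)
#define ZONES_MASK	((1UL << ZONES_SHIFT) - 1)

struct page {
	unsigned long flags;	/* low bits: status flags, high bits: zone id */
};

static void set_page_zone(struct page *page, unsigned long zone)
{
	page->flags &= ~(ZONES_MASK << ZONES_PGSHIFT);
	page->flags |= zone << ZONES_PGSHIFT;
}

static unsigned long page_zonenum(const struct page *page)
{
	return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
}

int main(void)
{
	struct page page = { .flags = 0x29 };	/* unrelated status bits */

	set_page_zone(&page, 3);
	printf("zone id %lu, other flags 0x%lx\n", page_zonenum(&page),
	       page.flags & ~(ZONES_MASK << ZONES_PGSHIFT));
	return 0;
}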
* Re: Antifrag patchset comments 2007-05-04 6:16 ` Nick Piggin @ 2007-05-04 6:55 ` Nick Piggin 2007-05-08 9:23 ` Mel Gorman 1 sibling, 0 replies; 16+ messages in thread From: Nick Piggin @ 2007-05-04 6:55 UTC (permalink / raw) Cc: Mel Gorman, Christoph Lameter, Linux Memory Management List, Andrew Morton Nick Piggin wrote: > ... but anyway, I'll snip the rest because I didn't want to digress into > implementation details so much (now I'm sorry for bringing it up). And in saying this, I'm not implying there _are_ implementation problems; I'm sure you're not just making up difficulties involved with a zone based approach ;) I just wanted to keep the discussion on the higher picture, so I shouldn't have brought up that implementation detail anyway with my initial post :P -- SUSE Labs, Novell Inc. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Antifrag patchset comments 2007-05-04 6:16 ` Nick Piggin 2007-05-04 6:55 ` Nick Piggin @ 2007-05-08 9:23 ` Mel Gorman 1 sibling, 0 replies; 16+ messages in thread From: Mel Gorman @ 2007-05-08 9:23 UTC (permalink / raw) To: Nick Piggin Cc: Christoph Lameter, Linux Memory Management List, Andrew Morton Sorry for the delayed response, I was offline for several days. On Fri, 4 May 2007, Nick Piggin wrote: > Mel Gorman wrote: >> On Wed, 2 May 2007, Nick Piggin wrote: >> >>> Mel Gorman wrote: > >>> > reservations have to be done at boot-time which is a difficult >>> > requirement >>> > to meet and impossible on batch job and shared systems where reboots do >>> > not take place. >>> >>> You just have to make a tradeoff about how much memory you want to set >>> aside. >> >> >> This tradeoff in sizing the reservation is something that users of shared >> systems have real problems with because a hugepool once sized can only be >> used for hugepage allocations. One compromise lead to the development of >> ZONE_MOVABLE where a portion of memory could be set aside that was usable >> for small pages but that the huge page pool could borrow from. > > What's wrong with that? Because it's still something that is configured at boot-time, remains static for the lifetime of the system, has consequences if the admin gets it wrong and the zone does not help stuff like e1000 using jumbo frames. Also, while it's not useless to memory hot-remove, removing 16MB sections on Power is desirable and having a zone was overkill for that purpose. > Do we have any of that stuff upstream yet, and > if not, then that probably should be done _first_. Ok, that can be done. The patch sets complement each other but only actually share one common patch and can be treated separetly. > From there we can see > what is left for the anti-fragmentation patches... > > >>> Note that this memory is not wasted, because it is used for user >>> allocations. So I think the downsides of reservations are really >>> overstated. >>> >> >> This does sound like you think the first step here would be a zone based >> reservation system. Would you support inclusion of the ZONE_MOVABLE part >> of the patch set? > > Ah, that answers my questions. Yes, I don't see why not, if the various > people who were interested in that feature are happy with it. Not that I > looked at the most recent implementation (which patches are they?) > In 2.6.21-mm1, the relevant patches are add-__gfp_movable-for-callers-to-flag-allocations-from-high-memory-that-may-be-migrated.patch create-the-zone_movable-zone.patch allow-huge-page-allocations-to-use-gfp_high_movable.patch handle-kernelcore=-generic.patch The last three patches are ZONE_MOVABLE specific as nicely noted in the series file. The first patch is shared between grouping pages by mobility and ZONE_MOVABLE. A TODO item for this set of patches is to rename GFP_HIGH_MOVABLE to GFP_HIGHUSER_MOVABLE and flag page cache allocations specifically instead of marking them movable which is confusing. A second item is to look at sizing the zone at runtime. >>> > persistent. Minimally, things like page_to_pfn() are no longer a simply >>> > calculation which is a bad enough hit. Worse, the kernel can no longer >>> > backed by >>> > huge pages because you would have to defragment at the base-page level. >>> > The >>> > kernel is backed by huge page entries at the moment for a good reason, >>> > TLB reach is a real problem. >>> >>> Yet this is what you _have_ to do if you must use arbitrary physical >>> memory. 
>>> > persistent. Minimally, things like page_to_pfn() are no longer a simple
>>> > calculation which is a bad enough hit. Worse, the kernel can no longer
>>> > be backed by huge pages because you would have to defragment at the
>>> > base-page level. The kernel is backed by huge page entries at the
>>> > moment for a good reason, TLB reach is a real problem.
>>>
>>> Yet this is what you _have_ to do if you must use arbitrary physical
>>> memory. And I haven't seen any numbers posted.
>>
>> Numbers require an implementation and that is a non-trivial undertaking.
>> I've cc'd Dave Hansen who I believe tried breaking the 1:1 phys:virtual
>> mapping some time in the past. He might have further comments to make.
>
> I'm sure it wouldn't be trivial :)
>
> TLBs are pretty good, though.

Not everywhere, and there are still userspace workloads that see 10-40%
improvements when using hugepages, so breaking it in userspace should not be
done lightly.

> Virtualised kernels don't seem to take a
> huge hit (I had some vague idea that a lot of their performance problems
> were with IO).

I think it would be very hard to see where it was losing on TLB anyway
because it's so different. Tomorrow, I'll put a patch in place that prevents
the kernel portion of the address space being backed by huge pages on x86_64
and run a few tests to see what it looks like.

>>> > Continuing on, "true defragmentation" would require that the system be
>>> > halted so that the defragmentation can take place with everything
>>> > disabled so that the copy can take place and every process's pagetables
>>> > be updated as pagetables are not always shared. Even if shared, all
>>> > processes would still have to be halted unless the kernel was fully
>>> > pageable and we were willing to handle page faults in kernel outside of
>>> > just the vmalloc area.
>>>
>>> vunmap doesn't need to run with the system halted, so I don't see why
>>> unmapping the source page would need to.
>>
>> vunmap() is freeing an address range where it knows it is the only
>> accessor of any data in that range. It's not the same when there are
>> other processes potentially accessing memory in the same area at the same
>> time and expecting it to exist.
>
> I don't see what the distinction is. We obviously wouldn't have multiple
> processes with different kernel virtual addresses pointing to the same
> page.

No, but let's say the page of interest was holding slab objects and we
wanted to migrate them. There could be multiple readers of the objects with
no real way of locking the page against accesses.

> It would be managed almost exactly like vmalloc space is today, I'd
> imagine.

I think it'll be considerably more complex than that but, like everything
else, it's not impossible.

>>> I don't know why we'd need to handle a full page fault in the kernel if
>>> the critical part of the defrag code runs atomically and replaces the
>>> pte when it is done.
>>
>> And how exactly would one atomically copy a page of data, update the page
>> tables and flush the TLB without stalling all writers? The setup would
>> have to mark the PTE for that area read-only and flush the TLB so that
>> other processes will fault on write and wait until the migration has
>> completed before retrying the fault. That would allow the data to be
>> safely read and copied to somewhere else.
>
> Why is there a requirement to prevent stalling of writers?

On the contrary, this mechanism would require the stalling of writers to
work correctly, so the system is going to run slower while defragmentation
takes place, but that is hardly a surprise.
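To make the ordering concrete, a minimal sketch of that sequence with
hypothetical helpers; this is illustrative pseudo-code only, not the real
migration interfaces:

/*
 * Illustrative only: every helper below is hypothetical. The point is
 * the ordering: write-protect, flush, copy, repoint the mapping, then
 * let the stalled writers retry their faults against the new page.
 */
static int migrate_mapped_kernel_page(struct page *old, struct page *new)
{
	pte_t *pte = lookup_kernel_pte(old);	/* hypothetical */

	set_pte_readonly(pte);			/* hypothetical */
	flush_tlb_for_page(old);		/* writers now fault and wait */

	copy_page_contents(new, old);		/* safe: no writer can race */

	point_pte_at_page(pte, new);		/* hypothetical: swap the mapping */
	flush_tlb_for_page(new);

	wake_up_stalled_writers(old);		/* faults retry, see the new page */
	return 0;
}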
>> It would be at least feasible to back SLAB_RECLAIM_ACCOUNT slabs by a
>> virtual map for the purposes of defragmenting them like this. However, it
>> would work better in conjunction with fragmentation avoidance instead of
>> replacing it because the fragmentation avoidance mechanism could be easily
>> used to group virtually-backed allocations together in the same physical
>> blocks as much as possible to reduce future migration work.
>
> Yeah, maybe. But what I am getting at is that fragmentation avoidance
> isn't _the_ big ticket (as the name implies). Defragmentation is. With
> defragmentation in, I think that avoidance makes much more sense.

This is kind of splitting hairs because the end result remains the same -
both are likely required. I have a statistics patch put together that prints
out information like the following (the columns are free page counts per
order, as in /proc/buddyinfo, broken out by migrate type):

Free pages count per migrate type
Node 0, zone    DMA, type   Unmovable     0   0   0   0   0   0   0   0   0   0   0
Node 0, zone    DMA, type Reclaimable   131  17   0   0   0   0   0   0   0   0   0
Node 0, zone    DMA, type     Movable   202  39   8   0   0   0   0   0   0   0   0
Node 0, zone    DMA, type     Reserve    86   9   1   1   1   1   1   1   1   0   0
Node 0, zone Normal, type   Unmovable    59  12   3   0   0   0   0   0   0   0   0
Node 0, zone Normal, type Reclaimable   598   0   0   0   0   0   0   0   0   0   0
Node 0, zone Normal, type     Movable    90   3   0   0   0   0   0   0   0   0   0
Node 0, zone Normal, type     Reserve    10   6   6   5   2   1   1   1   0   1   0

Number of blocks type   Unmovable  Reclaimable  Movable  Reserve
Node 0, zone    DMA             0            1        2        1
Node 0, zone Normal             3           32       88        1

Number of mixed blocks  Unmovable  Reclaimable  Movable  Reserve
Node 0, zone    DMA             0            1        2        1
Node 0, zone Normal             3           32       41        1

The last piece of information needs PAGE_OWNER to be set but, when it is,
/proc/pageowner also contains information on the PFN, the type of page it
was and some flags like this:

Page allocated via order 0, mask 0x1200d2
PFN 86899 Block 84 type 2 Flags LAD
 [0xc014776e] generic_file_buffered_write+414
 [0xc0147ec0] __generic_file_aio_write_nolock+640
 [0xc01481d6] generic_file_aio_write+102
 [0xc01aa99d] ext3_file_write+45
 [0xc0167b0e] do_sync_write+206
 [0xc0168399] vfs_write+153
 [0xc0168acd] sys_write+61
 [0xc0102ac4] syscall_call+7

This information will help determine to what extent defragmentation is
required and at what times fragmentation avoidance gets into trouble. The
page owner part is only useful in -mm kernels, unfortunately.

> Now I'm still hoping that neither is necessary... my thought process
> on this is to keep hoping that nothing comes up that _requires_ us to
> support higher order allocations in the kernel generally.

I'd be surprised if a feature was introduced that *required* higher order
allocations to be generally available. Currently, things still depend on a
lot of reclaim taking place, so higher orders will not be quickly available
even if they are possible. My main interests are better hugepage support and
memory hot-remove. The large pagecache stuff and SLUB are really interesting
but I expect them not to require large page availability.

> As an aside, it might actually be nice to be able to reduce MAX_ORDER
> significantly after boot in order to reduce page allocator overhead...

As the vast majority of allocations go through the per-cpu allocator, there
may not be much in the way of savings to be made, but it can be checked out.

>>> > This is before even considering the problem of how the kernel copies
>>> > the data between two virtual addresses while it's modifying the page
>>> > tables it's depending on to read the data.
>>>
>>> What's the problem: map the source page into a special area, unmap it
>>> from its normal address, allocate a new page, copy the data, swap the
>>> mapping.
>>>
>>
>> You'd have to do something like I described above to handle synchronous
>> writes to the area during defragmentation.
>
> Yeah, that's what the "unmap the source page" is (which would also block
> reads, and I think would be a better approach to try first, because it
> would reduce TLB flushing. Although moving and flushing could probably
> be batched, so mapping them readonly first might be a good optimisation
> after that).

Ok, making sense.

>>> > Even more horribly, virtual addresses
>>> > in the kernel are no longer physically contiguous which will likely
>>> > cause some problems for drivers and possibly DMA engines.
>>>
>>> Of course it is trivial to _get_ physically contiguous, virtually
>>> contiguous pages, because now you actually have a mechanism to do so.
>>
>> I think that would require that the kernel portion have a split between
>> the vmap() like area and a 1:1 virt:phys area - i.e. similar to today
>> except the vmalloc() region is bigger. It is difficult to predict what
>> the impact of a much expanded use of the vmalloc area would be.
>
> Yeah that would probably be reasonable. So huge TLBs could still be used
> for various large boot time structures.

Yes.

> Predicting the impact of it? Could we look at how something like KVM
> performs when using 4K pages for its memory map?

I'm not sure I have a machine capable of KVM available at the moment.
However, just removing the hugetlb backing of the kernel address space
should be trivial and give the same data.

>>> It isn't performance of your patches I'm so worried about. It is that
>>> they only slow down the rate of fragmentation, so why do we want to add
>>> them and why can't we use something more robust?
>>
>> Because, as I've maintained for quite some time, I see the patches as
>> a pre-requisite for a more complete and robust solution for dealing with
>> external fragmentation. I see the merits of what you are suggesting but
>> feel it can be built up incrementally, starting with the fragmentation
>> avoidance stuff, then compacting MOVABLE pages towards the end of the
>> zone before finally dealing with full defragmentation. But I am reluctant
>> to build large bodies of work on top of a foundation with an uncertain
>> future.
>
> The first thing we need to decide is if there is a big need to support
> higher order allocations generally in the kernel. I'm still a "no" with
> that one :)

And I'll keep on about hugepages for userspace and memory hot-remove, but
we're not likely to finish this argument any time soon :)

> If and when we decide "yes", I don't see how anti-fragmentation does much
> good for that -- all the new wonderful higher order allocations we add in
> will need fallbacks, and things can slowly degrade over time which, I'm
> sorry, but that really sucks.

Ok. I will get onto the next stages of what is required.

> I think that to decide yes, we have to realise that requires real
> defragmentation. At that point, OK, I'm not going to split hairs over
> whether you think anti-frag logically belongs first (I think it
> doesn't :)).

And I think it does, but reckon both are needed. I'm happy enough to work on
defragmentation on top of fragmentation avoidance to see where it brings
things.

>>> hugepages are a good example of where you can use reservations.
>>>
>>
>> Except that it has to be sized at boot-time, can never grow and users
>> find it very inflexible in the real world where requirements change over
>> time and a reboot is required to effectively change these reservations.
>
>>> You could even use reservations for higher order pagecache (rather than
>>> crapping the whole thing up with small-pages fallbacks everywhere).
>>
>> True, although that means that an administrator is then required to size
>> their buffer cache at boot time if they are using high order pagecache. I
>> doubt they'll like that any more than sizing a hugepage pool.
>
>>> I don't think it is. Because the only reason to need more than a couple
>>> of physically contiguous pages is to work around hardware limitations or
>>> inefficiency.
>>
>> A low TLB reach with the base page size is a real problem that some
>> classes of users have to deal with. Sometimes there just is no easy way
>> around having to deal with large amounts of data at the same time.
>
> To the 3 above: yes, I completely know we are not and never will be
> absolutely optimal for everyone. And the end-game for Linux, if there
> is one, I don't think is to be in a state that is perfect for everyone
> either. I don't think any feature can be justified simply because
> "someone" wants it, even if those someones are people running benchmarks
> at big companies.
>
>>> No. My assertion is that we should speed things up in other ways, eg.
>>
>> The principal reason I developed fragmentation avoidance was to relax
>> restrictions on the resizing of the huge page pool, where it's not a
>> question of poor performance, it's a question of simply not working. The
>> large pagecache stuff arrived later as a potential additional beneficiary
>> of lower fragmentation, as did SLUB.
>
> So that's even worse than a purely-for-performance patch, because it
> can now work for a while and then randomly stop working eventually.
>
>>> > what the current stuff does. Not only do we have to deal with
>>> > overlapping non-contiguous zones,
>>>
>>> We have to do that anyway, don't we?
>>
>> Where do we deal with overlapping non-contiguous zones within a node
>> today?
>
> In the buddy allocator and physical memory models, I guess?
>
> http://marc.info/?l=linux-mm&m=114774325131397&w=2
>
> Doesn't that imply overlapping non-contiguous zones?

They are not overlapping in the same node. There is never a situation on a
node where pages belonging to zones A and B look like AAABBBAAA, as would be
the case if ZONE_MOVABLE consisted of arbitrary pages from the highest
available zone, for example.

>>> > but things like the page->flags identifying which
>>> > zone a page belongs to have to be moved out (not enough bits)
>>>
>>> Another 2 bits? I think on most architectures that should be OK,
>>> shouldn't it?
>>
>> page->flags is not exactly flush with space. The last I heard, there
>> were 3 bits free and there was work being done to remove some of them so
>> more could be used.
>
> No, you wouldn't be using that part of the flags, but the other
> part. AFAIK there is a reasonable amount of room on 64-bit, and only
> on huge NUMA 32-bit (ie. dinosaurs) is it a squeeze... but it falls
> back to an out of line thingy anyway.

Using bits where available and moving to the out-of-line bitmap where they
are not available is a possibility. Keeping them out-of-line though would
allow a lazy moving between zones. You say later you don't want to get into
implementation details so I won't either.
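For anyone following along, the part of page->flags in question is the
zone/node/section fields packed above the individual PG_* flag bits.
Roughly, and simplified (the exact layout is config-dependent):

/*
 * page->flags, very roughly:
 *
 *   | SECTION | NODE | ZONE |  ...unused...  | PG_* flag bits |
 *
 * so the zone lookup is something like:
 */
static inline unsigned long page_zonenum(struct page *page)
{
	return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
}

Extra zones would widen the ZONE field at the top rather than consume the
PG_* bits at the bottom.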
>>> > and you get
>>> > an explosion of zones like
>>> >
>>> > ZONE_DMA_UNMOVABLE
>>> > ZONE_DMA_RECLAIMABLE
>>> > ZONE_DMA_MOVABLE
>>> > ZONE_DMA32_UNMOVABLE
>>>
>>> So of course you don't make them visible to the API. Just select them
>>> based on your GFP_ movable flags.
>>
>> Just because they are invisible to the API does not mean they are
>> invisible to the size of pgdat->node_zones[] and the size of the zone
>> fallback lists. Christoph will eventually complain about the number of
>> zones having doubled or tripled.
>
> Well there is already a reasonable amount of duplication, eg pcp lists.

And each new zone will increase that duplication quite considerably.

> And I think it is much better to put up with a couple of complaints from
> Christoph rather than introduce something entirely new if possible. Hey,
> it might even give people an incentive to improve the existing schemes.
>
>>> > etc.
>>>
>>> What is etc? Are those the best reasons why this wasn't made to use
>>> zones?
>>
>> No, I simply thought those problems were bad enough without going into
>> additional ones - here's another one. If a block of pages has to move
>> between zones, page->flags has to be updated, which means a lock on the
>> page has to be acquired to guard against concurrent use before moving the
>> zone.
>
> If you're only moving free pages, then the page allocator lock should be
> fine.

Not for the pcp lists, but they could be drained.

> There may be a couple of other places that would need help (eg
> swsusp)...
>
> ... but anyway, I'll snip the rest because I didn't want to digress into
> implementation details so much (now I'm sorry for bringing it up).

ok

--
Mel Gorman
Part-time Phd Student
Linux Technology Center
University of Limerick
IBM Dublin Software Lab
Thread overview: 16+ messages
2007-04-28  3:46 Antifrag patchset comments Christoph Lameter
2007-04-28 13:21 ` Mel Gorman
2007-04-28 21:44 ` Christoph Lameter
2007-04-30  9:37 ` Mel Gorman
2007-04-30 12:35 ` Peter Zijlstra
2007-04-30 17:30 ` Christoph Lameter
2007-04-30 18:33 ` Mel Gorman
2007-05-01 13:31 ` Hugh Dickins
2007-05-01 11:26 ` Nick Piggin
2007-05-01 12:22 ` Nick Piggin
2007-05-01 16:38 ` Mel Gorman
2007-05-02  2:43 ` Nick Piggin
2007-05-02 12:41 ` Mel Gorman
2007-05-04  6:16 ` Nick Piggin
2007-05-04  6:55 ` Nick Piggin
2007-05-08  9:23 ` Mel Gorman