On Thu, 13 Apr 2006, Andi Kleen wrote:

> On Thursday 13 April 2006 02:22, Mel Gorman wrote:
>
>> I experimented with the idea of all architectures sharing the struct
>> node_active_region rather than storing the information twice. It got very
>> messy, particularly for x86 because it needs to store more than nid,
>> start_pfn and end_pfn for a range of page frames (see node_memory_chunk_s
>> in arch/i386/kernel/srat.c). Worse, some architecture-specific code
>> remembers the ranges of active memory as addresses and others as pfn's. In
>> the end, I was not too worried about having the information in two places,
>> because the active ranges are kept in __initdata and gets freed.
>
> The problem is not memory consumption but complexity of code/data structures.

The architecture-independent code is simpler than i386's SRAT messing, 
about the same complexity as ppc's dealings with LMB (in fact, much of the 
code is lifted from ppc) and comparable in complexity to what IA64 does. 
For x86_64, there is less architecture-specific code that has to be 
understood.

> Keeping information in two places is usually a good cue that something
> is wrong. This code is also fragile and hard to test.
>

At minimum, it requires a boot test, which is not a massive burden. Beyond 
that, compare the zone sizes reported before and after the patches.

To test architectures that I don't have a test machine for and that 
register PFNs in unexpected ways (like IA64), I wrote the attached test 
program. It was simply a case of:

1. A few #defines to pretend it's compiled in-kernel
2. Cut and paste from the architecture-independent code in mem_init.c to
    the driver program
3. Pass in sample input from main() and see what pops out
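
A minimal sketch of that kind of harness, to show the shape of the three 
steps. This is not the attached driver program: the names sketch_range, 
sketch_ranges, sketch_add_range() and MAX_SKETCH_RANGES are hypothetical, 
and the merge logic is a simplified stand-in for the real code.

```c
#include <assert.h>
#include <stdio.h>

/* Step 1: stub out kernel-only annotations so pasted code builds in userspace. */
#define __init
#define __initdata
#define printk printf

#define MAX_SKETCH_RANGES 16	/* hypothetical limit for this sketch */

/* Hypothetical stand-in for the kernel's record of an active range. */
struct sketch_range {
	int nid;
	unsigned long start_pfn;
	unsigned long end_pfn;	/* exclusive */
};

static struct sketch_range __initdata sketch_ranges[MAX_SKETCH_RANGES];
static int __initdata nr_sketch_ranges;

/*
 * Step 2: the code under test. Record [start_pfn, end_pfn) on node nid.
 * If it touches or overlaps a range already recorded for that node, grow
 * that range instead of adding a new entry.
 * Simplification: two existing ranges that become adjacent after a merge
 * are not coalesced with each other.
 */
int __init sketch_add_range(int nid, unsigned long start_pfn,
			    unsigned long end_pfn)
{
	int i;

	assert(start_pfn < end_pfn);

	for (i = 0; i < nr_sketch_ranges; i++) {
		struct sketch_range *r = &sketch_ranges[i];

		if (r->nid != nid)
			continue;
		if (start_pfn > r->end_pfn || end_pfn < r->start_pfn)
			continue;	/* disjoint and not adjacent */

		if (start_pfn < r->start_pfn)
			r->start_pfn = start_pfn;
		if (end_pfn > r->end_pfn)
			r->end_pfn = end_pfn;
		printk("range (%d, %lu, %lu): merged\n", nid, start_pfn, end_pfn);
		return 0;
	}

	if (nr_sketch_ranges == MAX_SKETCH_RANGES)
		return -1;	/* table full */

	sketch_ranges[nr_sketch_ranges].nid = nid;
	sketch_ranges[nr_sketch_ranges].start_pfn = start_pfn;
	sketch_ranges[nr_sketch_ranges].end_pfn = end_pfn;
	nr_sketch_ranges++;
	printk("range (%d, %lu, %lu): new\n", nid, start_pfn, end_pfn);
	return 0;
}
```

Step 3 is then a main() that calls sketch_add_range() with the PFN ranges 
an architecture would report and prints the resulting table.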

It caught a number of simple bugs (including one this morning) without 
having to even boot a machine. The same type of testing is hard with the 
architecture-specific code. This is sample output of the driver program 
handling PFN ranges supplied by IA64:

mel@joshua:~/tmp$ gcc driver_test.c -o driver_test && ./driver_test | grep -v "active with no"
Stage 1: Registering active ranges
add_active_range(0, 0, 4096): New
add_active_range(0, 0, 131072): Merging forward
add_active_range(0, 0, 131072): Merging backwards
add_active_range(0, 393216, 523264): New
add_active_range(0, 393216, 523264): Merging backwards
add_active_range(0, 393216, 524288): Merging forward
add_active_range(0, 393216, 524288): Merging backwards

Stage 2: Calculating zone sizes and holes
Hole found zone 1 index 1: 131072 -> 393216

Stage 3: Dumping zone sizes and holes
zone_size[0][0] =   131020 zone_holes[0][0] =        0
zone_size[0][1] =   393268 zone_holes[0][1] =   262144

Stage 4: Printing present pages
On node 0, 262144 pages
  zone 0 present_pages = 131020
  zone 1 present_pages = 131124

So, testing is not that hard.
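
The figures in Stage 3 and Stage 4 above can be cross-checked by hand: 
present pages for a zone are its size minus its holes, and the node total 
is the sum over zones. A small sketch of that check, where the function 
names sketch_present_pages() and sketch_node_present_pages() are 
hypothetical and not part of the patches:

```c
#include <assert.h>

/* Present pages in one zone: spanned size minus the holes inside it. */
unsigned long sketch_present_pages(unsigned long zone_size,
				   unsigned long zone_holes)
{
	assert(zone_holes <= zone_size);
	return zone_size - zone_holes;
}

/* Present pages on a node: sum of present pages over its zones. */
unsigned long sketch_node_present_pages(const unsigned long *zone_size,
					const unsigned long *zone_holes,
					int nr_zones)
{
	unsigned long total = 0;
	int i;

	for (i = 0; i < nr_zones; i++)
		total += sketch_present_pages(zone_size[i], zone_holes[i]);
	return total;
}
```

With the values from the output, zone 1 gives 393268 - 262144 = 131124 and 
the two zones sum to 131020 + 131124 = 262144, matching the node total.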

How is the code fragile? Even *if* it is fragile, it only has to be fixed 
once to benefit any architecture using the code path.

>> I'll admit that for x86_64, the entire code path for initialisation (i.e.
>> architecture specific and architecture independent paths) is now more
>> complex. The architecture independent code needed to be able to handle
>> every variety of node layout which is overkill for x86_64. Nevertheless,
>> without size_zones(), I thought the architecture-specific code for x86_64
>> memory initialisation was a bit easier to read. With
>> architecture-independent zone size and hole calculation, you only have to
>> understand the relevant code once, not once for each architecture.
>
>
> I think i386 SRAT NUMA should be just removed at some point - it never
> worked all that well and is quite complicated.

Assuming you mean the code, are these patches not a readable replacement?

> That leaves IA64, x86-64
> and ppc64.  I suspect keeping the code there near their low level
> data structures is better.
>

For PPC64, the architecture-independent representation *is* the only copy 
(which is why 128 arch-specific LOC were deleted for ppc, including a 
struct init_node_data array of nids, start_pfns and end_pfns). IA64's 
low-level representation uses addresses, not pfns, so having only one copy 
would be a very invasive patch, which is not a good idea without a test 
box.

In many cases, the low-level representation is similar between 
architectures. The representation I use consists of the common elements 
all the architectures need - nid, start_pfn, end_pfn.

>>> I have my doubts that is really a improvement over the old state.
>>>
>>
>> For x86_64 in isolation or the entire set of patches?
>
> For x86-64/i386. I haven't read the other architectures.
>

ok.

>>> I think it would be better if you just defined some simple "library functions"
>>> that can be called from the architecture specific code instead of adding
>>> all this new high level code.
>>>
>>
>> What sort of library functions would you recommend? x86_64 uses
>> add_active_range() and free_area_init_nodes() from this patchset which
>> seemed fairly straight-forward.
>
> e.g. a generic size_zones(). Possibly some others.
>

For a generic size_zones(), one would need an architecture independent way 
to pass in active page frame ranges and what node they are on. So we end 
up with code very similar to what I've posted.
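
To illustrate the point, here is a rough sketch of what the core of such a 
helper would need as input. It is not the code from the patchset: struct 
sketch_pfn_range and sketch_pages_in_zone() are hypothetical names, the 
zone boundaries are passed in by the caller, and the ranges are assumed 
not to overlap each other.

```c
#include <assert.h>

/* Hypothetical arch-independent description of one active range. */
struct sketch_pfn_range {
	int nid;
	unsigned long start_pfn;
	unsigned long end_pfn;	/* exclusive */
};

/*
 * Count the pages of node nid that fall inside the zone
 * [zone_start, zone_end). Assumes the ranges do not overlap.
 */
unsigned long sketch_pages_in_zone(const struct sketch_pfn_range *ranges,
				   int nr_ranges, int nid,
				   unsigned long zone_start,
				   unsigned long zone_end)
{
	unsigned long present = 0;
	int i;

	assert(zone_start <= zone_end);

	for (i = 0; i < nr_ranges; i++) {
		unsigned long lo, hi;

		if (ranges[i].nid != nid)
			continue;

		/* Clamp the range to the zone boundaries. */
		lo = ranges[i].start_pfn > zone_start ?
			ranges[i].start_pfn : zone_start;
		hi = ranges[i].end_pfn < zone_end ?
			ranges[i].end_pfn : zone_end;
		if (lo < hi)
			present += hi - lo;
	}
	return present;
}
```

The holes in a zone are then its spanned size minus this count. Whatever 
the function is called, the architecture still has to hand it a list of 
(nid, start_pfn, end_pfn) entries first.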

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab