On Thu, 13 Apr 2006, Andi Kleen wrote:

> On Thursday 13 April 2006 02:22, Mel Gorman wrote:
>
>> I experimented with the idea of all architectures sharing the struct
>> node_active_region rather than storing the information twice. It got very
>> messy, particularly for x86 because it needs to store more than nid,
>> start_pfn and end_pfn for a range of page frames (see node_memory_chunk_s
>> in arch/i386/kernel/srat.c). Worse, some architecture-specific code
>> remembers the ranges of active memory as addresses and others as pfn's. In
>> the end, I was not too worried about having the information in two places,
>> because the active ranges are kept in __initdata and gets freed.
>
> The problem is not memory consumption but complexity of code/data structures.

The architecture-independent code is simpler than i386's SRAT messing,
about the same complexity as ppc's dealings with LMB (in fact, much of the
code is lifted from ppc) and comparable in complexity to what IA64 does.
For x86_64, there is less architecture-specific code that has to be
understood.

> Keeping information in two places is usually a good cue that something
> is wrong. This code is also fragile and hard to test.
>

At minimum, it requires a boot test - not that massive a burden. For the
active, a look at the value of the zones before and after the patches.

To test architectures that register PFNs in unexpected ways that I don't
have a test machine for (like IA64), I wrote the attached test program. It
was a simple case of

1. Few #defines to pretend it's compiled in-kernel
2. Cut and paste from the architecture-independent code in mem_init.c to
   the driver program
3. Pass in sample input from main() and see what pops out

It caught a number of simple bugs (including one this morning) without
having to even boot a machine. The same type of testing is hard with the
architecture-specific code.
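To give a rough idea of why this sort of user-space testing is cheap, here
is a minimal sketch of the idea. It is not the attached driver and not the
patch code: the names sketch_add_range() and sketch_present_pages(), the
table size and the half-open [start_pfn, end_pfn) convention are all
assumptions made up for illustration. It only records (nid, start_pfn,
end_pfn) tuples, grows an existing entry when a new range overlaps or
touches it, and counts the pages inside a PFN window.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical user-space stand-in for a table of active PFN ranges.
 * All names and sizes here are assumptions for illustration only. */
#define MAX_ACTIVE_RANGES 16

struct active_range {
	int nid;
	unsigned long start_pfn;
	unsigned long end_pfn;	/* exclusive */
};

static struct active_range ranges[MAX_ACTIVE_RANGES];
static int nr_ranges;

/* Record [start_pfn, end_pfn) on node nid. If it overlaps or touches an
 * existing entry on the same node, that entry is grown instead.
 * Limitation of this sketch: it does not coalesce two existing entries
 * that a new range happens to bridge.
 * Returns 0 on success, -1 if the table is full. */
static int sketch_add_range(int nid, unsigned long start_pfn,
			    unsigned long end_pfn)
{
	int i;

	assert(start_pfn <= end_pfn);
	for (i = 0; i < nr_ranges; i++) {
		struct active_range *r = &ranges[i];

		if (r->nid != nid || start_pfn > r->end_pfn ||
		    end_pfn < r->start_pfn)
			continue;
		if (start_pfn < r->start_pfn)
			r->start_pfn = start_pfn;	/* grow downwards */
		if (end_pfn > r->end_pfn)
			r->end_pfn = end_pfn;		/* grow upwards */
		return 0;
	}
	if (nr_ranges == MAX_ACTIVE_RANGES)
		return -1;
	ranges[nr_ranges].nid = nid;
	ranges[nr_ranges].start_pfn = start_pfn;
	ranges[nr_ranges].end_pfn = end_pfn;
	nr_ranges++;
	return 0;
}

/* Count pages of node nid that fall inside [win_start, win_end). */
static unsigned long sketch_present_pages(int nid, unsigned long win_start,
					  unsigned long win_end)
{
	unsigned long pages = 0;
	int i;

	for (i = 0; i < nr_ranges; i++) {
		unsigned long lo = ranges[i].start_pfn;
		unsigned long hi = ranges[i].end_pfn;

		if (ranges[i].nid != nid)
			continue;
		if (lo < win_start)
			lo = win_start;
		if (hi > win_end)
			hi = win_end;
		if (lo < hi)
			pages += hi - lo;
	}
	return pages;
}

#ifdef SKETCH_DEMO
int main(void)
{
	unsigned long present;

	/* Sample input in the style of main() feeding the driver. */
	sketch_add_range(0, 0, 4096);
	sketch_add_range(0, 0, 131072);
	sketch_add_range(0, 393216, 523264);
	sketch_add_range(0, 393216, 524288);

	present = sketch_present_pages(0, 0, 524288);
	printf("ranges=%d present=%lu holes=%lu\n",
	       nr_ranges, present, 524288 - present);
	return 0;
}
#endif
```

Built with something like gcc -DSKETCH_DEMO, this should print two merged
ranges with 262144 present pages and a 262144 page hole over PFNs 0 to
524288. The real driver does more than this sketch; in particular it splits
the ranges by zone, which the sketch does not attempt.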
This is sample output of the driver program handling PFN ranges supplied by
IA64:

mel@joshua:~/tmp$ gcc driver_test.c -o driver_test && ./driver_test | grep -v "active with no"
Stage 1: Registering active ranges
add_active_range(0, 0, 4096): New
add_active_range(0, 0, 131072): Merging forward
add_active_range(0, 0, 131072): Merging backwards
add_active_range(0, 393216, 523264): New
add_active_range(0, 393216, 523264): Merging backwards
add_active_range(0, 393216, 524288): Merging forward
add_active_range(0, 393216, 524288): Merging backwards

Stage 2: Calculating zone sizes and holes
Hole found zone 1 index 1: 131072 -> 393216

Stage 3: Dumping zone sizes and holes
zone_size[0][0] = 131020
zone_holes[0][0] = 0
zone_size[0][1] = 393268
zone_holes[0][1] = 262144

Stage 4: Printing present pages
On node 0, 262144 pages
zone 0 present_pages = 131020
zone 1 present_pages = 131124

So, testing is not that hard. How is the code fragile? Even *if* it is
fragile, it only has to be fixed once to benefit any architecture using the
code path.

>> I'll admit that for x86_64, the entire code path for initialisation (i.e.
>> architecture specific and architecture independent paths) is now more
>> complex. The architecture independent code needed to be able to handle
>> every variety of node layout which is overkill for x86_64. Nevertheless,
>> without size_zones(), I thought the architecture-specific code for x86_64
>> memory initialisation was a bit easier to read. With
>> architecture-independent zone size and hole calculation, you only have to
>> understand the relevant code once, not once for each architecture.
>
>
> I think i386 SRAT NUMA should be just removed at some point - it never
> worked all that well and is quite complicated.

Assuming you mean the code, are these patches not a readable replacement?

> That leaves IA64, x86-64
> and ppc64. I suspect keeping the code there near their low level
> data structures is better.
>

For PPC64, the architecture-independent representation *is* the only copy
(which is why 128 arch-specific LOC were deleted for ppc including a struct
init_node_data array of nids, start_pfns and end_pfns). IA64's low level
representation uses addresses, not pfns, so having only one copy would be a
very invasive patch which is not a good idea without a test box.

In many cases, the low-level representation between architectures is
similar. The representation I use is the common elements all the
architectures need - nid, start_pfn, end_pfn.

>>> I have my doubts that is really a improvement over the old state.
>>>
>>
>> For x86_64 in isolation or the entire set of patches?
>
> For x86-64/i386. I haven't read the other architectures.
>

ok.

>>> I think it would be better if you just defined some simple "library functions"
>>> that can be called from the architecture specific code instead of adding
>>> all this new high level code.
>>>
>>
>> What sort of library functions would you recommend? x86_64 uses
>> add_active_range() and free_area_init_nodes() from this patchset which
>> seemed fairly straight-forward.
>
> e.g. a generic size_zones(). Possibly some others.
>

For a generic size_zones(), one would need an architecture independent way
to pass in active page frame ranges and what node they are on. So we end up
with code very similar to what I've posted.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab