On 2024/5/8 1:17, Yang Shi wrote:
> On Tue, May 7, 2024 at 8:53 AM Ryan Roberts wrote:
>>
>> On 07/05/2024 14:53, Kefeng Wang wrote:
>>>
>>>
>>> On 2024/5/7 19:13, David Hildenbrand wrote:
>>>>
>>>>> https://github.com/intel/lmbench/blob/master/src/lat_mem_rd.c#L95
>>>>>
>>>>>> suggest. If you want to try something semi-randomly; it might be useful to rule
>>>>>> out the arm64 contpte feature. I don't see how that would be interacting
>>>>>> here if
>>>>>> mTHP is disabled (is it?). But its new for 6.9 and arm64 only. Disable with
>>>>>> ARM64_CONTPTE (needs EXPERT) at compile time.
>>>>> I don't enabled mTHP, so it should be not related about ARM64_CONTPTE,
>>>>> but will have a try.
>>>
>>> After ARM64_CONTPTE disabled, memory read latency is similar with ARM64_CONTPTE
>>> enabled(default 6.9-rc7), still larger than align anon reverted.
>>
>> OK thanks for trying.
>>
>> Looking at the source for lmbench, its malloc'ing (512M + 8K) up front and using
>> that for all sizes. That will presumably be considered "large" by malloc and
>> will be allocated using mmap. So with the patch, it will be 2M aligned. Without
>> it, it probably won't. I'm still struggling to understand why not aligning it in
>> virtual space would make it more performant though...
>
> Yeah, I'm confused too.

Me too. I collected smaps[_rollup] for the 0.09375M size; the biggest
difference in the anon mappings is shown below, and the full output is
attached.

1) with efa7df3e3bb5

smaps

ffff68e00000-ffff88e03000 rw-p 00000000 00:00 0
Size:             524300 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                2048 kB
Pss:                2048 kB
Pss_Dirty:          2048 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:      2048 kB
Referenced:         2048 kB
Anonymous:          2048 kB // we have 1 anon thp
KSM:                   0 kB
LazyFree:              0 kB
AnonHugePages:      2048 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:           1
VmFlags: rd wr mr mw me ac
ffff88eff000-ffff89000000 rw-p 00000000 00:00 0
Size:               1028 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                1028 kB
Pss:                1028 kB
Pss_Dirty:          1028 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:      1028 kB
Referenced:         1028 kB
Anonymous:          1028 kB // another large anon
KSM:                   0 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:           0
VmFlags: rd wr mr mw me ac

and the smaps_rollup

00400000-fffff56bd000 ---p 00000000 00:00 0 [rollup]
Rss:                4724 kB
Pss:                3408 kB
Pss_Dirty:          3338 kB
Pss_Anon:           3338 kB
Pss_File:             70 kB
Pss_Shmem:             0 kB
Shared_Clean:       1176 kB
Shared_Dirty:        420 kB
Private_Clean:         0 kB
Private_Dirty:      3128 kB
Referenced:         4344 kB
Anonymous:          3548 kB
KSM:                   0 kB
LazyFree:              0 kB
AnonHugePages:      2048 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB

2) without efa7df3e3bb5

smaps

ffff9845b000-ffffb855f000 rw-p 00000000 00:00 0
Size:             525328 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                1128 kB
Pss:                1128 kB
Pss_Dirty:          1128 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:      1128 kB
Referenced:         1128 kB
Anonymous:          1128 kB // only large anon
KSM:                   0 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:           1
VmFlags: rd wr mr mw me ac

and the smaps_rollup

00400000-ffffca5dc000 ---p 00000000 00:00 0 [rollup]
Rss:                2600 kB
Pss:                1472 kB
Pss_Dirty:          1388 kB
Pss_Anon:           1388 kB
Pss_File:             84 kB
Pss_Shmem:             0 kB
Shared_Clean:       1000 kB
Shared_Dirty:        424 kB
Private_Clean:         0 kB
Private_Dirty:      1176 kB
Referenced:         2220 kB
Anonymous:          1600 kB
KSM:                   0 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB

>
> I just ran the same command on 6.6.13 (w/o the thp alignment patch and
> mTHP stuff) and 6.9-rc4 (w/ the thp alignment patch and all mTHP
> stuff) on my arm64 machine, but I didn't see such a pattern.
>
> The result has a little bit fluctuation, for example, 6.6.13 has
> better result with 4M/6M/8M, but 6.9-rc4 has better result for
> 12M/16M/32M/48M/64M, and the difference may be quite noticeable. But
> anyway I didn't see such a regression pattern.

On our arm64 machine it is not just fluctuation; the difference is very
noticeable.

>
> The benchmark is supposed to measure cache and memory latency, its
> result strongly relies on the cache and memory subsystem, for example,
> hw prefetcher, etc.

Yes, I will try another type of arm64 machine if possible; none is
available right now.

>
>>
>> Is it possible to provide the smaps output for at least that 512M+8K block for
>> both cases? It might give a bit of a clue.

Will collect more smaps.

>>
>> Do you have traditional (PMD-sized) THP enabled? If its enabled and unaligned
>> then the front of the buffer wouldn't be mapped with THP, but if it is aligned,
>> it will. That could affect it.

Yes, PMD-sized THP is enabled. At least in the smaps above, without
efa7df3e3bb5, the anon memory is not mapped with THP.
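
For reference, the alignment of the two ~512M VMAs in the smaps above can be checked with a few lines of Python. This is only a sketch and not part of any tooling used in this thread: `describe_vma` and `PMD_SIZE` are hypothetical names, and it assumes 4K base pages (as KernelPageSize reports) with a 2M PMD size.

```python
# Assumption: 4K base pages, so a PMD-sized THP covers 2M.
PMD_SIZE = 2 * 1024 * 1024


def describe_vma(vma_range: str) -> dict:
    """Summarise PMD alignment for a 'start-end' hex range as printed in smaps."""
    start, end = (int(part, 16) for part in vma_range.split("-"))
    # First and last 2M boundaries that fall inside the VMA.
    first_boundary = (start + PMD_SIZE - 1) // PMD_SIZE * PMD_SIZE
    last_boundary = end // PMD_SIZE * PMD_SIZE
    return {
        "size_kb": (end - start) // 1024,
        "start_aligned": start % PMD_SIZE == 0,
        # Bytes at the front of the VMA that lie before the first 2M boundary.
        "head_kb": (first_boundary - start) // 1024,
        # Number of fully contained, 2M-aligned 2M windows.
        "pmd_slots": max(0, (last_boundary - first_boundary) // PMD_SIZE),
    }


if __name__ == "__main__":
    for label, rng in (
        ("with efa7df3e3bb5", "ffff68e00000-ffff88e03000"),
        ("without efa7df3e3bb5", "ffff9845b000-ffffb855f000"),
    ):
        print(label, describe_vma(rng))
```

With the patch the VMA starts on a 2M boundary; without it, the first 2M boundary is 1684 kB into the VMA. That would be consistent with the front of the buffer not being PMD-mapped in the unaligned case, although it does not by itself explain the latency difference.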