Re: [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page()

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Kefeng Wang <wangkefeng.wang@huawei.com>
To: "Huang, Ying" <ying.huang@intel.com>,
	David Hildenbrand <david@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Matthew Wilcox <willy@infradead.org>,
	Muchun Song <muchun.song@linux.dev>, <linux-mm@kvack.org>,
	Zi Yan <ziy@nvidia.com>
Subject: Re: [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page()
Date: Wed, 30 Oct 2024 11:04:46 +0800	[thread overview]
Message-ID: <b1dce36e-325e-45cf-b6e9-9e20d4b32550@huawei.com> (raw)
In-Reply-To: <878qu6wgcm.fsf@yhuang6-desk2.ccr.corp.intel.com>



On 2024/10/30 9:04, Huang, Ying wrote:
> David Hildenbrand <david@redhat.com> writes:
> 
>> On 29.10.24 14:04, Kefeng Wang wrote:
>>>>>>>>>>
>>>>>>>>>> That should all be cleaned up ... process_huge_page() likely
>>>>>>>>>> shouldn't
>>>>>>>>>
>>>>>>>>> Yes, let's fix the bug firstly,
>>>>>>>>>
>>>>>>>>>> be even consuming "nr_pages".
>>>>>>>>>
>>>>>>>>> No sure about this part, it uses nr_pages as the end and calculate
>>>>>>>>> the
>>>>>>>>> 'base'.
>>>>>>>>
>>>>>>>> It should be using folio_nr_pages().
>>>>>>>
>>>>>>> But process_huge_page() without an explicit folio argument, I'd like to
>>>>>>> move the aligned address calculate into the folio_zero_user and
>>>>>>> copy_user_large_folio(will rename it to folio_copy_user()) in the
>>>>>>> following cleanup patches, or do it in the fix patches?
>>>>>>
>>>>>> First, why does folio_zero_user() call process_huge_page() for *a small
>>>>>> folio*? Because we like or code to be extra complicated to understand?
>>>>>> Or am I missing something important?
>>>>>
>>>>> The folio_zero_user() used for PMD-sized THP and HugeTLB before, and
>>>>> after anon mTHP supported, it is used for order-2~order-PMD-order THP
>>>>> and HugeTLB, so it won't process a small folio if I understand correctly.
>>>>
>>>> And unfortunately neither the documentation nor the function name
>>>> expresses that :(
>>>>
>>>> I'm happy to review any patches that improve the situation here :)
>>>>
>>> Actually, could we drop the process_huge_page() totally, from my
>>> testcase[1], process_huge_page() is not better than clear/copy page
>>> from start to last, and sequential clearing/copying maybe more
>>> beneficial to the hardware prefetching, and is there a way to let lkp
>>> to test to check the performance, since the process_huge_page()
>>> was submitted by Ying, what's your opinion?
> 
> I don't think that it's a good idea to revert the commit without
> studying and root causing the issues.  I can work together with you on
> that.  If we have solid and well explained data to prove
> process_huge_page() isn't benefitial, we can revert the commit.


Take 'fallocate 20G' as an example， before

Performance counter stats for 'taskset -c 10 fallocate -l 20G 
/mnt/hugetlbfs/test':

           3,118.94 msec task-clock                #    0.999 CPUs 
utilized
                 30      context-switches          #    0.010 K/sec 

                  1      cpu-migrations            #    0.000 K/sec 

                136      page-faults               #    0.044 K/sec 

      8,092,075,873      cycles                    #    2.594 GHz 
               (92.82%)
      1,624,587,663      instructions              #    0.20  insn per 
cycle           (92.83%)
        395,341,850      branches                  #  126.755 M/sec 
               (92.82%)
          3,872,302      branch-misses             #    0.98% of all 
branches          (92.83%)
      1,398,066,701      L1-dcache-loads           #  448.251 M/sec 
               (92.82%)
         58,124,626      L1-dcache-load-misses     #    4.16% of all 
L1-dcache accesses  (92.82%)
          1,032,527      LLC-loads                 #    0.331 M/sec 
               (92.82%)
            498,684      LLC-load-misses           #   48.30% of all 
LL-cache accesses  (92.84%)
        473,689,004      L1-icache-loads           #  151.875 M/sec 
               (92.82%)
            356,721      L1-icache-load-misses     #    0.08% of all 
L1-icache accesses  (92.85%)
      1,947,644,987      dTLB-loads                #  624.458 M/sec 
               (92.95%)
             10,185      dTLB-load-misses          #    0.00% of all 
dTLB cache accesses  (92.96%)
        474,622,896      iTLB-loads                #  152.174 M/sec 
               (92.95%)
                 94      iTLB-load-misses          #    0.00% of all 
iTLB cache accesses  (85.69%)

        3.122844830 seconds time elapsed

        0.000000000 seconds user
        3.107259000 seconds sys

and after（clear from start to end）

Performance counter stats for 'taskset -c 10 fallocate -l 20G 
/mnt/hugetlbfs/test':

           1,135.53 msec task-clock                #    0.999 CPUs 
utilized
                 10      context-switches          #    0.009 K/sec 

                  1      cpu-migrations            #    0.001 K/sec 

                137      page-faults               #    0.121 K/sec 

      2,946,673,587      cycles                    #    2.595 GHz 
               (92.67%)
      1,620,704,205      instructions              #    0.55  insn per 
cycle           (92.61%)
        394,595,772      branches                  #  347.499 M/sec 
               (92.60%)
            130,756      branch-misses             #    0.03% of all 
branches          (92.84%)
      1,396,726,689      L1-dcache-loads           # 1230.022 M/sec 
               (92.96%)
            338,344      L1-dcache-load-misses     #    0.02% of all 
L1-dcache accesses  (92.95%)
            111,737      LLC-loads                 #    0.098 M/sec 
               (92.96%)
             67,486      LLC-load-misses           #   60.40% of all 
LL-cache accesses  (92.96%)
        418,198,663      L1-icache-loads           #  368.285 M/sec 
               (92.96%)
            173,764      L1-icache-load-misses     #    0.04% of all 
L1-icache accesses  (92.96%)
      2,203,364,632      dTLB-loads                # 1940.385 M/sec 
               (92.96%)
             17,195      dTLB-load-misses          #    0.00% of all 
dTLB cache accesses  (92.95%)
        418,198,365      iTLB-loads                #  368.285 M/sec 
               (92.96%)
                 79      iTLB-load-misses          #    0.00% of all 
iTLB cache accesses  (85.34%)

        1.137015760 seconds time elapsed

        0.000000000 seconds user
        1.131266000 seconds sys

The IPC improved a lot，less LLC-loads and more L1-dcache-loads， but
this depends on the implementation of the microarchitecture.

1) Will test some rand test to check the different of performance as 
David suggested.

2) Hope the LKP to run more tests since it is very useful(more test set 
and different machines)


> 
>> I questioned that just recently [1], and Ying assumed that it still
>> applies [2].
>>
>> c79b57e462b5 ("mm: hugetlb: clear target
>> sub-page last when clearing huge page”) documents the scenario where
>> this matters -- anon-w-seq which you also run below.
>>
>> If there is no performance benefit anymore, we should rip that
>> out. But likely we should check on multiple micro-architectures with
>> multiple #CPU configs that are relevant. c79b57e462b5 used a Xeon E5
>> v3 2699 with 72 processes on 2 NUMA nodes, maybe your test environment
>> cannot replicate that?>>
>>
>> [1]
>> https://lore.kernel.org/linux-mm/b8272cb4-aee8-45ad-8dff-353444b3fa74@redhat.com/
>> [2]
>> https://lore.kernel.org/linux-mm/878quv9lhf.fsf@yhuang6-desk2.ccr.corp.intel.com/
>>
>>> [1]https://lore.kernel.org/linux-mm/2524689c-08f5-446c-8cb9-924f9db0ee3a@huawei.com/
>>> case-anon-w-seq-mt (tried 2M PMD THP/ 64K mTHP)
>>> case-anon-w-seq-hugetlb (2M PMD HugeTLB)
>>
>> But these are sequential, not random. I'd have thought access +
>> zeroing would be sequentially either way. Did you run with random
>> access as well>

Will do.
> > --
> Best Regards,
> Huang, Ying
>

next prev parent reply	other threads:[~2024-10-30  3:04 UTC|newest]

Thread overview: 33+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-10-26  5:43 Kefeng Wang
2024-10-26  5:43 ` [PATCH v2 2/2] mm: use aligned address in copy_user_gigantic_page() Kefeng Wang
2024-10-28 10:01   ` David Hildenbrand
2024-10-28  6:17 ` [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page() Huang, Ying
2024-10-28  6:35   ` Kefeng Wang
2024-10-28  7:03     ` Huang, Ying
2024-10-28  8:35       ` Kefeng Wang
2024-10-28 10:00 ` David Hildenbrand
2024-10-28 12:52   ` Kefeng Wang
2024-10-28 13:14     ` David Hildenbrand
2024-10-28 13:33       ` Kefeng Wang
2024-10-28 13:46         ` David Hildenbrand
2024-10-28 14:22           ` Kefeng Wang
2024-10-28 14:24             ` David Hildenbrand
2024-10-29 13:04               ` Kefeng Wang
2024-10-29 14:04                 ` David Hildenbrand
2024-10-30  1:04                   ` Huang, Ying
2024-10-30  3:04                     ` Kefeng Wang [this message]
2024-10-30  3:21                       ` Huang, Ying
2024-10-30  5:05                         ` Kefeng Wang
2024-10-31  8:39                           ` Huang, Ying
2024-11-01  7:43                             ` Kefeng Wang
2024-11-01  8:16                               ` Huang, Ying
2024-11-01  9:45                                 ` Kefeng Wang
2024-11-04  2:35                                   ` Huang, Ying
2024-11-05  2:06                                     ` Kefeng Wang
2024-12-01  2:15                             ` Andrew Morton
2024-12-01  5:37                               ` Huang, Ying
2024-12-02  1:03                                 ` Kefeng Wang
2024-12-06  1:47                                   ` Andrew Morton
2024-12-06  2:08                                     ` Kefeng Wang
2024-11-01  6:18                           ` Huang, Ying
2024-11-01  7:51                             ` Kefeng Wang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=b1dce36e-325e-45cf-b6e9-9e20d4b32550@huawei.com \
    --to=wangkefeng.wang@huawei.com \
    --cc=akpm@linux-foundation.org \
    --cc=david@redhat.com \
    --cc=linux-mm@kvack.org \
    --cc=muchun.song@linux.dev \
    --cc=willy@infradead.org \
    --cc=ying.huang@intel.com \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox