From: David Hildenbrand <david@redhat.com>
To: "Yin, Fengwei" <fengwei.yin@intel.com>,
"Huang, Ying" <ying.huang@intel.com>,
Ryan Roberts <ryan.roberts@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Matthew Wilcox <willy@infradead.org>, Yu Zhao <yuzhao@google.com>,
Catalin Marinas <catalin.marinas@arm.com>,
Anshuman Khandual <anshuman.khandual@arm.com>,
Yang Shi <shy828301@gmail.com>, Zi Yan <ziy@nvidia.com>,
Luis Chamberlain <mcgrof@kernel.org>,
Itaru Kitayama <itaru.kitayama@gmail.com>,
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
Subject: Re: [PATCH v5 3/5] mm: LARGE_ANON_FOLIO for improved performance
Date: Thu, 31 Aug 2023 10:09:18 +0200
Message-ID: <d71d9b4d-5ea7-2387-d27a-8fcd9384da52@redhat.com>
In-Reply-To: <747deb43-68c8-449f-b41a-91864820a699@intel.com>
On 31.08.23 10:02, Yin, Fengwei wrote:
>
>
> On 8/31/2023 3:57 PM, David Hildenbrand wrote:
>> On 31.08.23 03:40, Huang, Ying wrote:
>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>
>>>> On 15/08/2023 22:32, Huang, Ying wrote:
>>>>> Hi, Ryan,
>>>>>
>>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>>>
>>>>>> Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be
>>>>>> allocated in large folios of a determined order. All pages of the large
>>>>>> folio are pte-mapped during the same page fault, significantly reducing
>>>>>> the number of page faults. The number of per-page operations (e.g. ref
>>>>>> counting, rmap management, lru list management) is also significantly
>>>>>> reduced since those ops now become per-folio.
>>>>>>
>>>>>> The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig,
>>>>>> which defaults to disabled for now; the long term aim is for this to
>>>>>> default to enabled, but there are some risks around internal
>>>>>> fragmentation that need to be better understood first.
>>>>>>
>>>>>> Large anonymous folio (LAF) allocation is integrated with the existing
>>>>>> (PMD-order) THP and single (S) page allocation according to this policy,
>>>>>> where fallback (>) is performed for various reasons, such as the
>>>>>> proposed folio order not fitting within the bounds of the VMA, etc:
>>>>>>
>>>>>>                 | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
>>>>>>                 | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
>>>>>> ----------------|-----------|-------------|---------------|-------------
>>>>>> no hint         | S         | LAF>S       | LAF>S         | THP>LAF>S
>>>>>> MADV_HUGEPAGE   | S         | LAF>S       | THP>LAF>S     | THP>LAF>S
>>>>>> MADV_NOHUGEPAGE | S         | S           | S             | S
>>>>>
>>>>> IMHO, we should use the following semantics as you have suggested
>>>>> before.
>>>>>
>>>>>                 | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
>>>>>                 | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
>>>>> ----------------|-----------|-------------|---------------|-------------
>>>>> no hint         | S         | S           | LAF>S         | THP>LAF>S
>>>>> MADV_HUGEPAGE   | S         | S           | THP>LAF>S     | THP>LAF>S
>>>>> MADV_NOHUGEPAGE | S         | S           | S             | S
>>>>>
>>>>> Or even,
>>>>>
>>>>>                 | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
>>>>>                 | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
>>>>> ----------------|-----------|-------------|---------------|-------------
>>>>> no hint         | S         | S           | S             | THP>LAF>S
>>>>> MADV_HUGEPAGE   | S         | S           | THP>LAF>S     | THP>LAF>S
>>>>> MADV_NOHUGEPAGE | S         | S           | S             | S
>>>>>
>>>>> From the implementation point of view, PTE mapped PMD-sized THP has
>>>>> almost no difference with LAF (just some small sized THP). It will be
>>>>> confusing to distinguish them from the interface point of view.
>>>>>
>>>>> So, IMHO, the real difference is the policy. For example, prefer
>>>>> PMD-sized THP, prefer small sized THP, or fully auto. The sysfs
>>>>> interface is used to specify system global policy. In the long term, it
>>>>> can be something like below,
>>>>>
>>>>> never:   S          # disable all THP
>>>>> madvise:            # never by default, control via madvise()
>>>>> always:  THP>LAF>S  # prefer PMD-sized THP in fact
>>>>> small:   LAF>S      # prefer small sized THP
>>>>> auto:               # use in-kernel heuristics for THP size
>>>>>
>>>>> But it may be not ready to add new policies now. So, before the new
>>>>> policies are ready, we can add a debugfs interface to override the
>>>>> original policy in /sys/kernel/mm/transparent_hugepage/enabled. After
>>>>> we have tuned enough workloads, collected enough data, we can add new
>>>>> policies to the sysfs interface.
>>>>
>>>> I think we can all imagine many policy options. But we don't really have much
>>>> evidence yet for what is best. The policy I'm currently using is intended to
>>>> give some flexibility for testing (use LAF without THP by setting sysfs=never,
>>>> use THP without LAF by compiling without LAF) without adding any new knobs at
>>>> all. Given that, surely we can defer these decisions until we have more data?
>>>>
>>>> In the absence of data, your proposed solution sounds very sensible to me. But
>>>> for the purposes of scaling up perf testing, I don't think it's essential given
>>>> the current policy will also produce the same options.
>>>>
>>>> If we were going to add a debugfs knob, I think the higher priority would be a
>>>> knob to specify the folio order. (but again, I would rather avoid if possible).
>>>
>>> I totally understand we need some way to control PMD-sized THP and LAF
>>> to tune the workload, and nobody likes debugfs knob.
>>>
>>> My concern about interface is that we have no way to disable LAF
>>> system-wise without rebuilding the kernel. In the future, should we add
>>> a new policy to /sys/kernel/mm/transparent_hugepage/enabled to be
>>> stricter than "never"? "really_never"?
>>
>> Let's talk about that in a bi-weekly MM session. (I proposed it as a topic for next week).
>
> The time slot of the meeting is not friendly to our timezone. Like
> it's 1 or 2 AM. Yes. I know it's very hard to find a good time slot
> for US, EU and Asia. :(.
:/
Yeah, even for me in Germany it's usually already around 6-7pm.
>
> So maybe we still need to discuss it through mail?
I don't think we'll be done discussing that in one session. One of the
main goals is to get some input from the wider MM community.
--
Cheers,
David / dhildenb
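
For readers following the policy discussion, the allocation table quoted from the patch description above can be modelled as a plain lookup. This is only an illustrative sketch: the names `POLICY` and `fallback_chain` are hypothetical and do not correspond to any kernel identifier, and the model covers only the table from the patch description, not the alternatives proposed in the replies.

```python
# Hypothetical model of the policy table from the quoted patch description.
# Nothing here is a kernel identifier; the names are for illustration only.

# Fallback chains, tried left to right:
#   THP = PMD-order THP, LAF = large anonymous folio, S = single page.
S = ("S",)
LAF_S = ("LAF", "S")
THP_LAF_S = ("THP", "LAF", "S")

# Rows are the madvise() hint on the VMA, columns are the sysfs "enabled"
# setting. These entries apply only when THP is not disabled via prctl.
POLICY = {
    "none": {"never": LAF_S, "madvise": LAF_S, "always": THP_LAF_S},
    "MADV_HUGEPAGE": {"never": LAF_S, "madvise": THP_LAF_S, "always": THP_LAF_S},
    "MADV_NOHUGEPAGE": {"never": S, "madvise": S, "always": S},
}


def fallback_chain(prctl_enabled, sysfs, hint="none"):
    """Return the allocation attempts, in order, for one table cell."""
    if not prctl_enabled:
        # prctl=dis column: single pages regardless of sysfs or hint.
        return S
    return POLICY[hint][sysfs]
```

Under this model, `fallback_chain(True, "never")` yields the LAF-then-single-page chain, which is the cell behind the concern raised in the thread that LAF cannot be disabled system-wide through sysfs alone.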
Thread overview: 33+ messages
2023-08-10 14:29 [PATCH v5 0/5] variable-order, large folios for anonymous memory Ryan Roberts
2023-08-10 14:29 ` [PATCH v5 1/5] mm: Allow deferred splitting of arbitrary large anon folios Ryan Roberts
2023-08-10 14:29 ` [PATCH v5 2/5] mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap() Ryan Roberts
2023-08-10 14:29 ` [PATCH v5 3/5] mm: LARGE_ANON_FOLIO for improved performance Ryan Roberts
2023-08-10 17:01 ` Yu Zhao
2023-08-10 19:12 ` Ryan Roberts
2023-08-10 19:46 ` Zi Yan
2023-08-11 0:36 ` Yin, Fengwei
2023-08-11 1:04 ` Zi Yan
2023-08-11 5:34 ` Yin, Fengwei
2023-08-11 14:33 ` Zi Yan
2023-08-12 0:23 ` Yin, Fengwei
2023-08-30 11:41 ` Ryan Roberts
2023-08-31 0:14 ` Yin, Fengwei
2023-08-11 0:27 ` Yin, Fengwei
2023-08-15 21:32 ` Huang, Ying
2023-08-30 12:07 ` Ryan Roberts
2023-08-31 1:40 ` Huang, Ying
2023-08-31 7:57 ` David Hildenbrand
2023-08-31 8:02 ` Yin, Fengwei
2023-08-31 8:09 ` David Hildenbrand [this message]
2023-08-31 12:29 ` Matthew Wilcox
2023-09-01 14:40 ` David Hildenbrand
2023-08-31 17:15 ` Yang Shi
2023-09-01 16:13 ` Matthew Wilcox
2023-09-01 17:18 ` Yang Shi
2023-09-04 10:05 ` Ryan Roberts
2023-08-10 14:29 ` [PATCH v5 4/5] selftests/mm/cow: Generalize do_run_with_thp() helper Ryan Roberts
2023-08-10 14:29 ` [PATCH v5 5/5] selftests/mm/cow: Add large anon folio tests Ryan Roberts
2023-08-10 15:13 ` [PATCH v5 0/5] variable-order, large folios for anonymous memory Ryan Roberts
2023-08-16 8:11 ` Itaru Kitayama
2023-08-16 9:25 ` Yin, Fengwei
2023-08-16 11:57 ` Itaru Kitayama