Re: [PATCH v5 3/5] mm: LARGE_ANON_FOLIO for improved performance


From: "Yin, Fengwei" <fengwei.yin@intel.com>
To: David Hildenbrand <david@redhat.com>,
	"Huang, Ying" <ying.huang@intel.com>,
	Ryan Roberts <ryan.roberts@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Matthew Wilcox <willy@infradead.org>, Yu Zhao <yuzhao@google.com>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Anshuman Khandual <anshuman.khandual@arm.com>,
	Yang Shi <shy828301@gmail.com>, Zi Yan <ziy@nvidia.com>,
	Luis Chamberlain <mcgrof@kernel.org>,
	Itaru Kitayama <itaru.kitayama@gmail.com>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	<linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>,
	<linux-arm-kernel@lists.infradead.org>
Subject: Re: [PATCH v5 3/5] mm: LARGE_ANON_FOLIO for improved performance
Date: Thu, 31 Aug 2023 16:02:59 +0800
Message-ID: <747deb43-68c8-449f-b41a-91864820a699@intel.com>
In-Reply-To: <4e14730b-4e4c-de30-04bb-9f3ec4a93754@redhat.com>



On 8/31/2023 3:57 PM, David Hildenbrand wrote:
> On 31.08.23 03:40, Huang, Ying wrote:
>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>
>>> On 15/08/2023 22:32, Huang, Ying wrote:
>>>> Hi, Ryan,
>>>>
>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>>
>>>>> Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be
>>>>> allocated in large folios of a determined order. All pages of the large
>>>>> folio are pte-mapped during the same page fault, significantly reducing
>>>>> the number of page faults. The number of per-page operations (e.g. ref
>>>>> counting, rmap management, lru list management) is also significantly
>>>>> reduced since those ops now become per-folio.
>>>>>
>>>>> The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig,
>>>>> which defaults to disabled for now; the long-term aim is for this to
>>>>> default to enabled, but there are some risks around internal
>>>>> fragmentation that need to be better understood first.
>>>>>
>>>>> Large anonymous folio (LAF) allocation is integrated with the existing
>>>>> (PMD-order) THP and single (S) page allocation according to this policy,
>>>>> where fallback (>) is performed for various reasons, such as the
>>>>> proposed folio order not fitting within the bounds of the VMA, etc:
>>>>>
>>>>>                  | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
>>>>>                  | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
>>>>> ----------------|-----------|-------------|---------------|-------------
>>>>> no hint         | S         | LAF>S       | LAF>S         | THP>LAF>S
>>>>> MADV_HUGEPAGE   | S         | LAF>S       | THP>LAF>S     | THP>LAF>S
>>>>> MADV_NOHUGEPAGE | S         | S           | S             | S
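The policy table above can be read as a lookup from (prctl state, sysfs setting, madvise hint) to an ordered fallback chain. The following is a minimal sketch of that reading only, not the patch's implementation; the names `PATCH_POLICY` and `fallback_chain` and the string keys are hypothetical and chosen purely for illustration.

```python
# Illustrative encoding of the policy table quoted above.
# S = single page, LAF = large anonymous folio, THP = PMD-order THP.
# All names here are hypothetical; this is not kernel code.
PATCH_POLICY = {
    "never": {
        "none": ["LAF", "S"],
        "MADV_HUGEPAGE": ["LAF", "S"],
        "MADV_NOHUGEPAGE": ["S"],
    },
    "madvise": {
        "none": ["LAF", "S"],
        "MADV_HUGEPAGE": ["THP", "LAF", "S"],
        "MADV_NOHUGEPAGE": ["S"],
    },
    "always": {
        "none": ["THP", "LAF", "S"],
        "MADV_HUGEPAGE": ["THP", "LAF", "S"],
        "MADV_NOHUGEPAGE": ["S"],
    },
}


def fallback_chain(prctl_enabled, sysfs, hint="none"):
    """Return the allocation attempts, in order, for one table cell."""
    if not prctl_enabled:
        # The prctl=dis column: single pages whatever sysfs says.
        return ["S"]
    return list(PATCH_POLICY[sysfs][hint])


if __name__ == "__main__":
    print(fallback_chain(True, "never"))
    print(fallback_chain(True, "madvise", "MADV_HUGEPAGE"))
```

Written this way, the alternative tables proposed later in the thread differ only in the "never" and "madvise" rows, which makes it easier to see exactly which cells are under discussion.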
>>>>
>>>> IMHO, we should use the following semantics as you have suggested
>>>> before.
>>>>
>>>>                  | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
>>>>                  | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
>>>> ----------------|-----------|-------------|---------------|-------------
>>>> no hint         | S         | S           | LAF>S         | THP>LAF>S
>>>> MADV_HUGEPAGE   | S         | S           | THP>LAF>S     | THP>LAF>S
>>>> MADV_NOHUGEPAGE | S         | S           | S             | S
>>>>
>>>> Or even,
>>>>
>>>>                  | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
>>>>                  | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
>>>> ----------------|-----------|-------------|---------------|-------------
>>>> no hint         | S         | S           | S             | THP>LAF>S
>>>> MADV_HUGEPAGE   | S         | S           | THP>LAF>S     | THP>LAF>S
>>>> MADV_NOHUGEPAGE | S         | S           | S             | S
>>>>
>>>> From the implementation point of view, PTE mapped PMD-sized THP has
>>>> almost no difference with LAF (just some small sized THP).  It will be
>>>> confusing to distinguish them from the interface point of view.
>>>>
>>>> So, IMHO, the real difference is the policy.  For example, prefer
>>>> PMD-sized THP, prefer small sized THP, or fully auto.  The sysfs
>>>> interface is used to specify system global policy.  In the long term, it
>>>> can be something like below,
>>>>
>>>> never:      S               # disable all THP
>>>> madvise:                    # never by default, control via madvise()
>>>> always:     THP>LAF>S       # prefer PMD-sized THP in fact
>>>> small:      LAF>S           # prefer small sized THP
>>>> auto:                       # use in-kernel heuristics for THP size
>>>>
>>>> But we may not be ready to add new policies now.  So, before the new
>>>> policies are ready, we can add a debugfs interface to override the
>>>> original policy in /sys/kernel/mm/transparent_hugepage/enabled.  After
>>>> we have tuned enough workloads, collected enough data, we can add new
>>>> policies to the sysfs interface.
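For reference, on current kernels /sys/kernel/mm/transparent_hugepage/enabled lists the available policies on one line and marks the active one with square brackets, e.g. "always [madvise] never". A small sketch of reading that value from userspace follows; the helper names `active_thp_policy` and `read_thp_policy` are hypothetical, and the values proposed above ("small", "auto") do not exist in that file today.

```python
import re

# Path named in the message above.
THP_ENABLED_PATH = "/sys/kernel/mm/transparent_hugepage/enabled"


def active_thp_policy(text):
    """Return the bracketed token from a line like 'always [madvise] never'."""
    match = re.search(r"\[(\w+)\]", text)
    if match is None:
        raise ValueError(f"no active policy marked in {text!r}")
    return match.group(1)


def read_thp_policy(path=THP_ENABLED_PATH):
    """Read the sysfs file and return the currently selected policy."""
    with open(path) as f:
        return active_thp_policy(f.read())


if __name__ == "__main__":
    # Parsing a literal string keeps the example runnable on systems
    # without the sysfs file.
    print(active_thp_policy("always [madvise] never\n"))
```

Because the parser simply returns whichever token is bracketed, it would keep working unchanged if further policy names were added to the file later.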
>>>
>>> I think we can all imagine many policy options. But we don't really have much
>>> evidence yet for what is best. The policy I'm currently using is intended to
>>> give some flexibility for testing (use LAF without THP by setting sysfs=never,
>>> use THP without LAF by compiling without LAF) without adding any new knobs at
>>> all. Given that, surely we can defer these decisions until we have more data?
>>>
>>> In the absence of data, your proposed solution sounds very sensible to me. But
>>> for the purposes of scaling up perf testing, I don't think it's essential given
>>> the current policy will also produce the same options.
>>>
>>> If we were going to add a debugfs knob, I think the higher priority would be a
>>> knob to specify the folio order. (but again, I would rather avoid if possible).
>>
>> I totally understand we need some way to control PMD-sized THP and LAF
>> to tune the workload, and nobody likes debugfs knobs.
>>
>> My concern about the interface is that we have no way to disable LAF
>> system-wide without rebuilding the kernel.  In the future, should we add
>> a new policy to /sys/kernel/mm/transparent_hugepage/enabled to be
>> stricter than "never"?  "really_never"?
> 
> Let's talk about that in a bi-weekly MM session. (I proposed it as a topic for next week).

The meeting's time slot is not friendly to our timezone: it falls at
1 or 2 AM here. I know it's very hard to find a slot that works for
US, EU and Asia. :(

So maybe we still need to discuss it through mail?


Regards
Yin, Fengwei

> 
> As raised in another mail, we can then discuss
> * how we want to call this feature (transparent large pages? there is
>   the concern that "THP" might confuse users. Maybe we can consider
>   "large" the more generic version and "huge" only PMD-size, TBD)
> * how to expose it in stats towards the user (e.g., /proc/meminfo)
> * which minimal toggles we want
> 
> I think there *really* has to be a way to disable it for a running system, otherwise no distro will dare pulling it in, even after we figured out the other stuff.
> 
> Note that for the pagecache, large folios can be disabled and distributions are actively making use of that.
>
