From: David Hildenbrand <david@redhat.com>
To: Zi Yan <ziy@nvidia.com>
Cc: linux-mm@kvack.org, Matthew Wilcox <willy@infradead.org>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	Roman Gushchin <guro@fb.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Yang Shi <shy828301@gmail.com>, Michal Hocko <mhocko@suse.com>,
	John Hubbard <jhubbard@nvidia.com>,
	Ralph Campbell <rcampbell@nvidia.com>,
	David Nellans <dnellans@nvidia.com>,
	Jason Gunthorpe <jgg@ziepe.ca>,
	David Rientjes <rientjes@google.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Song Liu <songliubraving@fb.com>
Subject: Re: [RFC PATCH v3 00/49] 1GB PUD THP support on x86_64
Date: Tue, 2 Mar 2021 09:55:47 +0100
Message-ID: <483b9681-497f-d86f-1f0b-14edb9d1c388@redhat.com>
In-Reply-To: <67B2C538-45DB-4678-A64D-295A9703EDE1@nvidia.com>

>>
>> However, I don't follow how this is actually feasible at scale. You could only ever collapse into a 1GB THP if you happen to have 1GB of consecutive 2MB THPs / 4k pages already. Sounds to me like this only happens when the stars align.
> 
> Both the process_madvise() approach and my proposal require page migration to bring back THPs, since, like you said, having consecutive pages ready is extremely rare. IIUC, the process_madvise() approach reuses the khugepaged code to collapse huge pages,
> namely first allocating a 2MB THP, then copying data over, and finally freeing the old base pages. My proposal would migrate pages within
> a virtual address range (>1GB and 1GB-aligned) to make all physical pages contiguous, then promote the resulting 1GB of consecutive
> pages to a 1GB THP. No new page allocation is needed.
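
For concreteness, the first route would be driven from user space
roughly like the sketch below. This is a sketch only: the
MADV_COLLAPSE advice did not exist when this thread was written (it
was merged much later, in Linux 6.1), collapsing in another process
needs CAP_SYS_ADMIN, and the pid and range here are made up.

/*
 * Sketch: a user-space daemon asking the kernel to collapse a range
 * in another process via process_madvise(2). MADV_COLLAPSE and its
 * value come from much newer kernels than this thread discusses.
 */
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef __NR_pidfd_open
#define __NR_pidfd_open 434		/* x86_64 */
#endif
#ifndef __NR_process_madvise
#define __NR_process_madvise 440	/* x86_64 */
#endif
#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25		/* <asm-generic/mman-common.h> */
#endif

static int collapse_range(pid_t pid, void *addr, size_t len)
{
	int pidfd = syscall(__NR_pidfd_open, pid, 0);
	struct iovec iov = { .iov_base = addr, .iov_len = len };
	ssize_t ret;

	if (pidfd < 0)
		return -1;

	/* One advice call for the whole range; the kernel walks it. */
	ret = syscall(__NR_process_madvise, pidfd, &iov, 1,
		      MADV_COLLAPSE, 0);
	close(pidfd);
	return ret < 0 ? -1 : 0;
}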

I am missing how we can ever reliably form 1GB pages (esp. after the 
system ran for a while) without any kind of fragmentation avoidance / 
defragmentation mechanism that is aware of gigantic pages. For THP, 
pageblocks+compaction serve that purpose.
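
You can watch the problem build up on any long-running machine: the
high-order free lists in /proc/buddyinfo drain over time, and the
listing stops at MAX_ORDER-1 (order 10, i.e. 4MB, on x86_64) anyway -
a 1GB page would be order 18, which the buddy allocator cannot even
represent. A trivial observation sketch, nothing more:

/*
 * Sketch: print, per zone, the highest order that still has free
 * blocks according to /proc/buddyinfo.
 */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/buddyinfo", "r");
	char line[512];

	if (!f)
		return 1;

	/* Lines look like: "Node 0, zone   Normal  n0 n1 ... n10" */
	while (fgets(line, sizeof(line), f)) {
		char zone[32];
		int node, idx, highest = -1;
		long nfree;
		char *p = line;

		if (sscanf(p, "Node %d, zone %31s%n", &node, zone, &idx) != 2)
			continue;
		p += idx;

		for (int order = 0; sscanf(p, "%ld%n", &nfree, &idx) == 1;
		     order++) {
			if (nfree > 0)
				highest = order;
			p += idx;
		}
		printf("node %d zone %-8s highest free order: %d\n",
		       node, zone, highest);
	}
	fclose(f);
	return 0;
}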

> 
> Both approaches need user-space invocation, assuming either that the application itself wants THPs for a specific region or that a user-space daemon does this for a group of applications, instead of waiting for khugepaged to slowly (4096 pages every 10s) scan and collapse huge pages. The user pays the cost of getting THPs. This also means THPs are not completely transparent to the user, but I think that is fine when users explicitly invoke these two methods to get THPs for better performance.

Here is the problem: this *advice* is not persistent. Assume your 
system has to swap and therefore splits the THP and writes it to the 
swap backend. The gigantic page is lost for that part of the 
application. When the individual 4k pages are loaded back from swap, 
there is no guarantee that we can form a 1GB page again - and how 
should we know that the application wanted a 1GB page at that 
position?

How would the application even know that the advice was dropped, i.e., that
a) there is no 1GB page anymore, and
b) it would have to re-issue the advice?
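
Today, the best it could do is poll for itself - e.g., re-read
AnonHugePages for the range from /proc/self/smaps and re-issue the
advice when the backing is gone. A sketch of that workaround (note it
cannot even tell a 2MB THP from a 1GB one, and MADV_HUGEPAGE is only
a hint to khugepaged, not a promotion request):

/*
 * Sketch: an application that wants to keep a region huge has to poll
 * for itself. This checks whether the VMA covering 'addr' still
 * reports AnonHugePages in /proc/self/smaps and, if not, re-issues
 * the advice. There is no "promote to 1GB" advice to re-issue in the
 * first place.
 */
#include <inttypes.h>
#include <stdio.h>
#include <sys/mman.h>

static long anon_huge_kb(uintptr_t addr)
{
	FILE *f = fopen("/proc/self/smaps", "r");
	char line[512];
	int inside = 0;
	long kb = -1;

	if (!f)
		return -1;

	while (fgets(line, sizeof(line), f)) {
		uintptr_t start, end;

		/* VMA header lines are "start-end perms ...". */
		if (sscanf(line, "%" SCNxPTR "-%" SCNxPTR, &start, &end) == 2)
			inside = (addr >= start && addr < end);
		else if (inside &&
			 sscanf(line, "AnonHugePages: %ld kB", &kb) == 1)
			break;
	}
	fclose(f);
	return kb;
}

static void keep_huge(void *addr, size_t len)
{
	/* Advice may have been lost, e.g. across swap-out; ask again. */
	if (anon_huge_kb((uintptr_t)addr) <= 0)
		madvise(addr, len, MADV_HUGEPAGE);
}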

Similarly, I am not convinced that the future of khugepaged is in user 
space.

> 
> The difference with my proposal is that it does not need a 1GB THP allocation, so there are no special requirements like using CMA
> or increasing MAX_ORDER in the buddy allocator to allow 1GB page allocations. It makes creating THPs with orders > MAX_ORDER possible
> without other intrusive changes.

Anything that relies on large allocations succeeding purely because 
"ZONE_NORMAL memory is usually not fragmented after boot" is broken by 
design. That's why we have CMA: it can give guarantees (well, once we 
fix all remaining issues :) ).
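
For comparison, the guaranteed path that exists today is not
transparent at all: boot with hugetlb_cma=2G, allocate gigantic pages
via /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages, and
map them explicitly - a sketch, assuming that reservation was done:

/*
 * Sketch: the non-transparent route with guarantees. Assumes the
 * admin booted with hugetlb_cma=2G and did
 *   echo 1 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
 * so a gigantic page can be carved out of CMA.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif
#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)	/* log2(1GB) = 30 */
#endif

int main(void)
{
	size_t len = 1UL << 30;
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
		       MAP_HUGE_1GB, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap(1GB hugetlb)");
		return 1;
	}
	((char *)p)[0] = 1;	/* fault the gigantic page in */
	munmap(p, len);
	return 0;
}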

-- 
Thanks,

David / dhildenb


