From: Matthew Wilcox <willy@infradead.org>
To: Roman Gushchin <guro@fb.com>
Cc: Zi Yan <ziy@nvidia.com>,
linux-mm@kvack.org,
"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
Andrew Morton <akpm@linux-foundation.org>,
Yang Shi <shy828301@gmail.com>, Michal Hocko <mhocko@suse.com>,
John Hubbard <jhubbard@nvidia.com>,
Ralph Campbell <rcampbell@nvidia.com>,
David Nellans <dnellans@nvidia.com>,
Jason Gunthorpe <jgg@ziepe.ca>,
David Rientjes <rientjes@google.com>,
Vlastimil Babka <vbabka@suse.cz>,
David Hildenbrand <david@redhat.com>,
Mike Kravetz <mike.kravetz@oracle.com>,
Song Liu <songliubraving@fb.com>
Subject: Re: [RFC PATCH v3 00/49] 1GB PUD THP support on x86_64
Date: Wed, 31 Mar 2021 04:09:35 +0100 [thread overview]
Message-ID: <20210331030935.GT351017@casper.infradead.org> (raw)
In-Reply-To: <YGNnnzwDIfdy2B/G@carbon.dhcp.thefacebook.com>
On Tue, Mar 30, 2021 at 11:02:07AM -0700, Roman Gushchin wrote:
> On Tue, Mar 30, 2021 at 01:24:14PM -0400, Zi Yan wrote:
> > On 4 Mar 2021, at 11:45, Roman Gushchin wrote:
> > > I actually ran a large-scale experiment (on tens of thousands of machines) over the last
> > > several months. It was about hugetlbfs 1GB pages, but the allocation mechanism is the same.
> >
> > Thanks for the information. I finally have time to come back to this. Do you mind sharing
> > the total memory of these machines? I want to get some idea of the scale of this issue to
> > make sure I reproduce it on a suitable machine. Are you trying to get <20% of tens of GBs,
> > hundreds of GBs, or TBs of memory?
>
> There are different configurations, but in general they are in the hundreds of GBs or smaller.
Are you using ZONE_MOVABLE? Seeing /proc/buddyinfo from one of these
machines might be illuminating.
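To make such a check repeatable, a small parser for /proc/buddyinfo is handy. A minimal sketch; the sample line below is made up, and on a real machine you would iterate over open("/proc/buddyinfo") instead:

```python
def parse_buddyinfo_line(line):
    """Split one /proc/buddyinfo line into (node, zone, per-order free counts)."""
    node_part, rest = line.split(",", 1)
    node = int(node_part.split()[1])        # "Node 0" -> 0
    fields = rest.split()                   # ["zone", "Normal", "3", "5", ...]
    zone = fields[1]
    counts = [int(n) for n in fields[2:]]   # free-list lengths, order 0 upward
    return node, zone, counts

# Hypothetical sample line, not real data.
sample = ("Node 0, zone   Normal      3      5      2      1"
          "      0      0      0      0      0      0      0")
node, zone, counts = parse_buddyinfo_line(sample)

# Zeros in the high orders mean large allocations cannot be satisfied
# without compaction or migration.
high_order_free = sum(counts[9:])
```

A machine that has been up for a while typically shows the free pages concentrated in the low orders, which is exactly the fragmentation being discussed.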
> >
> > >
> > > My goal was to allocate a relatively small number of 1GB pages (<20% of the total memory).
> > > Without CMA, the chances of success reach 0% very fast after reboot, and even manual manipulations
> > > like shutting down all workloads, dropping caches, calling sync, compaction, etc. do not
> > > help much. Sometimes you can allocate maybe 1-2 pages, but that's about it.
> >
> > Is there a way of replicating such an environment with publicly available software?
> > I really want to understand the root cause and am willing to find a possible solution.
> > It would be much easier if I could reproduce this locally.
>
> There is nothing fb-specific: once memory is filled with anon/pagecache, any subsequent
> allocations of non-movable memory (slabs, percpu, etc.) will fragment it. The pageblock
> mechanism prevents fragmentation at the 2MB scale, but nothing prevents fragmentation
> at the 1GB scale. It's just a matter of runtime (and the number of mm operations).
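The runtime argument is easy to see with a toy simulation: treat memory as 2MB pageblocks and pin a small, random fraction of them as unmovable; almost no 1GB-aligned region stays clean. The numbers here are illustrative, not measurements from a real machine:

```python
import random

PAGEBLOCK_MB = 2
BLOCKS_PER_GB = 1024 // PAGEBLOCK_MB    # 512 pageblocks per 1GB region
TOTAL_GB = 100                          # a 100GB machine, as in the example

random.seed(42)
total_blocks = TOTAL_GB * BLOCKS_PER_GB
# Pin just 1% of pageblocks as unmovable (slabs, percpu, ...), scattered
# uniformly at random over time.
unmovable = set(random.sample(range(total_blocks), total_blocks // 100))

# Count 1GB-aligned regions containing no unmovable pageblock at all.
clean_gb_regions = sum(
    1 for gb in range(TOTAL_GB)
    if not any(b in unmovable
               for b in range(gb * BLOCKS_PER_GB, (gb + 1) * BLOCKS_PER_GB))
)
# Expected clean fraction is 0.99 ** 512, roughly 0.6%: with only 1% of
# memory pinned, nearly every 1GB region is spoiled.
```

The pageblock migratetype machinery keeps each 2MB unit homogeneous, but nothing steers those scattered unmovable pageblocks away from each other at a coarser granularity, which is exactly the 1GB problem.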
I think this is somewhere the buddy allocator could be improved.
Of course, it knows nothing of larger page orders (which needs to be
fixed), but in general, I would like it to do a better job of segregating
movable and unmovable allocations.
Let's take a machine with 100GB of memory as an example. Ideally,
unmovable allocations would start at 4GB (assuming below 4GB is
ZONE_DMA32). Movable allocations can allocate anywhere in memory, but
should avoid being "near" unmovable allocations. Perhaps they start
at 5GB. When unmovable allocations get up to 5GB, we should first exert
a bit of pressure to shrink the unmovable allocations (looking at you,
dcache), but eventually we'll need to grow the unmovable allocations
above 5GB and we should move, say, all the pages between 5GB and 5GB+1MB.
If this unmovable allocation was just temporary, we get a reassembled
1MB page. If it was permanent, we now have 1MB of memory to soak up
the next few allocations.
The model I'm thinking of here is that we have a "line" in memory that
divides movable and unmovable allocations. It can move up, but there
has to be significant memory pressure to do so.
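One way to picture the model is a toy zone where unmovable allocations fill from the bottom and the line only ratchets upward, one 1MB step at a time, after migrating the movable pages in its way. All names here are invented for illustration, and the "first apply shrinker pressure" step is deliberately omitted:

```python
class ZoneLine:
    """Toy model of the movable/unmovable 'line' described above."""

    def __init__(self, line_mb, step_mb=1):
        self.line_mb = line_mb        # boundary: unmovable below, movable above
        self.step_mb = step_mb        # line moves up in 1MB increments
        self.unmovable_mb = 0
        self.migrated_mb = 0          # movable memory evicted to make room

    def alloc_unmovable(self, mb):
        # If the request would cross the line, migrate the movable pages
        # just above it and raise the line, one step at a time.
        while self.unmovable_mb + mb > self.line_mb:
            self.migrated_mb += self.step_mb
            self.line_mb += self.step_mb
        self.unmovable_mb += mb

zone = ZoneLine(line_mb=5 * 1024)   # the line starts at 5GB
zone.alloc_unmovable(5 * 1024)      # fills right up to the line: no migration
zone.alloc_unmovable(3)             # 3MB more: line moves up three steps
```

If that last unmovable allocation is temporary, freeing it leaves a reassembled high-order region just below the line; if it is permanent, the freshly cleared megabytes above it absorb the next few unmovable requests without moving the line again.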