Re: [PATCH v7 1/3] mm: Shuffle initial free memory to improve memory-side-cache utilization

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Dan Williams <dan.j.williams@intel.com>
To: Mel Gorman <mgorman@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Michal Hocko <mhocko@suse.com>, Kees Cook <keescook@chromium.org>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Mike Rapoport <rppt@linux.ibm.com>,
	Keith Busch <keith.busch@intel.com>,
	Linux MM <linux-mm@kvack.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v7 1/3] mm: Shuffle initial free memory to improve memory-side-cache utilization
Date: Thu, 10 Jan 2019 13:29:27 -0800	[thread overview]
Message-ID: <CAPcyv4gkSBW5Te0RZLrkxzufyVq56-7pHu__YfffBiWhoqg7Yw@mail.gmail.com> (raw)
In-Reply-To: <20190110105638.GJ28934@suse.de>

On Thu, Jan 10, 2019 at 2:57 AM Mel Gorman <mgorman@suse.de> wrote:
>
> On Mon, Jan 07, 2019 at 03:21:10PM -0800, Dan Williams wrote:
> > Randomization of the page allocator improves the average utilization of
> > a direct-mapped memory-side-cache. Memory side caching is a platform
> > capability that Linux has been previously exposed to in HPC
> > (high-performance computing) environments on specialty platforms. In
> > that instance it was a smaller pool of high-bandwidth-memory relative to
> > higher-capacity / lower-bandwidth DRAM. Now, this capability is going to
> > be found on general purpose server platforms where DRAM is a cache in
> > front of higher latency persistent memory [1].
> >
>
> So I glanced through the series and while I won't nak it, I'm not a
> major fan either so I won't ack it either.

Thanks for taking a look, some more comments / advocacy below...
because I'm not sure what Andrew will do with a "meh" response
compared to an ack.

> While there are merits to
> randomisation in terms of cache coloring, it may not be robust. IIRC, the
> main strength of randomisation vs being smart was "it's simple and usually
> doesn't fall apart completely". In particular I'd worry that compaction
> will undo all the randomisation work by moving related pages into the same
> direct-mapped lines. Furthermore, the runtime list management of "randomly
> place and head or tail of list" will have variable and non-deterministic
> outcomes and may also be undone by either high-order merging or compaction.

It's a fair point. To date we have not been able to measure the
average performance degrading over time (pages becoming more ordered)
but that said I think it would take more resources and time than I
have available for that trend to present. If it did present that would
only speak to a need to be more aggressive on the runtime
re-randomization. I think there's a case to be made to start simple
and only get more aggressive with evidence.

Note that higher order merging is not a current concern since the
implementation is already randomizing on MAX_ORDER sized pages. Since
memory side caches are so large there's no worry about a 4MB
randomization boundary.

However, for the (unproven) security use case where folks want to
experiment with randomizing on smaller granularity, they should be
wary of this (/me nudges Kees).

> As bad as it is, an ideal world would have a proper cache-coloring
> allocation algorithm but they previously failed as the runtime overhead
> exceeded the actual benefit, particularly as fully associative caches
> became more popular and there was no universal "one solution fits all". One
> hatchet job around it may be to have per-task free-lists that put free
> pages into buckets with the obvious caveat that those lists would need
> draining and secondary locking. A caveat of that is that there may need
> to be arch and/or driver hooks to detect how the colors are managed which
> could also turn into a mess.

We (Dave, I and others that took a look at this) started here, and the
"mess" looked daunting compared to randomization. Also a mess without
much more incremental benefit.

We also settled on a numa_emulation based approach for the cases where
an administrator knows they have a workload that can fit in the
cache... more on that below:

> The big plus of the series is that it's relatively simple and appears to
> be isolated enough that it only has an impact when the necessary hardware
> in place. It will deal with some cases but I'm not sure it'll survive
> long-term, particularly if HPC continues to report in the field that
> reboots are necessary to reshufffle the lists (taken from your linked
> documents). That workaround of running STREAM before a job starts and
> rebooting the machine if the performance SLAs are not met is horrid.

That workaround is horrid, and we have a separate solution for it
merged in commit cc9aec03e58f "x86/numa_emulation: Introduce uniform
split capability". When an administrator knows in advance that a
workload will fit in cache they can use this capability to run the
workload in a numa node that is guaranteed to not have cache conflicts
with itself.

Whereas randomization benefits the general cache-overcommit case. The
uniform numa split case addresses those niche users that can manually
time schedule jobs with different working set sizes... without needing
to reboot.

next prev parent reply	other threads:[~2019-01-10 21:29 UTC|newest]

Thread overview: 27+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-01-07 23:21 [PATCH v7 0/3] mm: Randomize free memory Dan Williams
2019-01-07 23:21 ` Dan Williams
2019-01-07 23:21 ` [PATCH v7 1/3] mm: Shuffle initial free memory to improve memory-side-cache utilization Dan Williams
2019-01-07 23:21   ` Dan Williams
2019-01-08  0:18   ` Kees Cook
2019-01-08  0:18     ` Kees Cook
2019-01-08  1:48     ` Dan Williams
2019-01-08  1:48       ` Dan Williams
2019-01-08 23:24       ` Kees Cook
2019-01-08 23:24         ` Kees Cook
2019-01-10 10:56   ` Mel Gorman
2019-01-10 21:29     ` Dan Williams [this message]
2019-01-10 21:29       ` Dan Williams
2019-01-10 22:52       ` Kees Cook
2019-01-10 22:52         ` Kees Cook
2019-01-25 14:20   ` Michal Hocko
2019-01-29 19:26     ` Dan Williams
2019-01-29 20:04     ` Dan Williams
2019-01-07 23:21 ` [PATCH v7 2/3] mm: Move buddy list manipulations into helpers Dan Williams
2019-01-07 23:21   ` Dan Williams
2019-01-25 14:30   ` Michal Hocko
2019-01-29 19:27     ` Dan Williams
2019-01-07 23:21 ` [PATCH v7 3/3] mm: Maintain randomization of page free lists Dan Williams
2019-01-07 23:21   ` Dan Williams
2019-01-08  0:19   ` Kees Cook
2019-01-08  0:19     ` Kees Cook
2019-01-25 14:32   ` Michal Hocko

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAPcyv4gkSBW5Te0RZLrkxzufyVq56-7pHu__YfffBiWhoqg7Yw@mail.gmail.com \
    --to=dan.j.williams@intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=dave.hansen@linux.intel.com \
    --cc=keescook@chromium.org \
    --cc=keith.busch@intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@suse.de \
    --cc=mhocko@suse.com \
    --cc=rppt@linux.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox