From: David Hildenbrand <david@redhat.com>
To: Matthew Wilcox <willy@infradead.org>, Shakeel Butt <shakeelb@google.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
Yang Shi <shy828301@gmail.com>, Zi Yan <ziy@nvidia.com>,
Andrew Morton <akpm@linux-foundation.org>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] mm: split thp synchronously on MADV_DONTNEED
Date: Mon, 22 Nov 2021 10:19:09 +0100
Message-ID: <861f98b5-9211-98c7-b4f7-fd71146aa64c@redhat.com>
In-Reply-To: <YZsi+RFed3hX9T8w@casper.infradead.org>

On 22.11.21 05:56, Matthew Wilcox wrote:
> On Sat, Nov 20, 2021 at 12:12:30PM -0800, Shakeel Butt wrote:
>> Many applications do sophisticated management of their heap memory for
>> better performance but with low cost. We have a bunch of such
>> applications running on our production and examples include caching and
>> data storage services. These applications keep their hot data on the
>> THPs for better performance and release the cold data through
>> MADV_DONTNEED to keep the memory cost low.
>>
>> The kernel defers the split and release of THPs until there is memory
>> pressure. This complicates the memory management of these
>> sophisticated applications which then need to look into low level
>> kernel handling of THPs to better gauge their headroom for expansion. In
>> addition these applications are very latency sensitive and would prefer
>> to not face memory reclaim due to the non-deterministic nature of reclaim.
>>
>> This patch lets such applications not worry about the low level handling
>> of THPs in the kernel and splits the THPs synchronously on
>> MADV_DONTNEED.
>
> I've been wondering about whether this is really the right strategy
> (and this goes wider than just this one, new case)
>
> We chose to use a 2MB page here, based on whatever heuristics are
> currently in play. Now userspace is telling us we were wrong and should
> have used smaller pages.

IIUC, not necessarily, unfortunately.

User space might be discarding the whole 2MB via a single call (a
MADV_DONTNEED on a 2MB range, as done by virtio-balloon with "free page
reporting" or by virtio-mem in QEMU). In that case, there is nothing to
migrate and we were not doing anything wrong.

But, more extreme, user space might be discarding the whole THP in small
pieces over a short period of time. This happens, for example, when a VM
inflates the memory balloon via virtio-balloon. All inflation requests
are 4k, resulting in 4k MADV_DONTNEED calls. If we end up inflating a
THP range inside the VM that maps to a THP range inside the hypervisor,
we'll essentially free a THP in the hypervisor piece by piece using
individual MADV_DONTNEED calls -- this happens frequently. Something
similar can happen when de-fragmentation inside the VM "moves around"
inflated 4k pages piece by piece to essentially form a huge inflated
range -- this happens less frequently as of now. In both cases,
migration is counter-productive, as we're just about to free the whole
range either way.
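
To make the two discard patterns concrete, here is a rough user-space
sketch (Python 3.8+ on Linux, not hypervisor code). It maps a private
anonymous 2MB region and discards it once with a single MADV_DONTNEED
and once page by page. THP_SIZE, new_region(), discard_at_once() and
discard_piecewise() are names made up for this illustration, and the
region is not guaranteed to be 2MB-aligned or actually backed by a THP.
Either way the whole range ends up discarded; the kernel just cannot
know that from the first small call.

```python
import mmap

THP_SIZE = 2 * 1024 * 1024    # assumption: 2MB THPs (x86-64 default)
PAGE_SIZE = mmap.PAGESIZE     # base page size of the running system


def new_region():
    """Map a private anonymous 2MB region and touch every base page."""
    region = mmap.mmap(-1, THP_SIZE,
                       flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
                       prot=mmap.PROT_READ | mmap.PROT_WRITE)
    for off in range(0, THP_SIZE, PAGE_SIZE):
        region[off] = 0xAB
    return region


def discard_at_once(region):
    """Discard the whole range with one call; returns the call count."""
    region.madvise(mmap.MADV_DONTNEED, 0, THP_SIZE)
    return 1


def discard_piecewise(region):
    """Discard the range one base page at a time, balloon-inflation style."""
    calls = 0
    for off in range(0, THP_SIZE, PAGE_SIZE):
        region.madvise(mmap.MADV_DONTNEED, off, PAGE_SIZE)
        calls += 1
    return calls


if __name__ == "__main__":
    print("single call:", discard_at_once(new_region()))
    print("piecewise calls:", discard_piecewise(new_region()))
```

With 4k base pages the piecewise variant issues 512 calls for the same
2MB. After either variant, reads from the private anonymous range
return zero-filled pages.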

(yes, there are ways to optimize, for example using hugepage ballooning
or merging MADV_DONTNEED calls in the hypervisor, but what I described
is what we currently implement in hypervisors like QEMU, because there
are corner cases for everything)
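
As a sketch of what "merging MADV_DONTNEED calls" could look like, the
hypothetical helper below coalesces page-aligned offsets into
contiguous (start, length) ranges, each of which could then be passed
to one madvise() call. This is only an illustration under the
assumption of 4k base pages; coalesce() is an invented name, not
something QEMU implements, and it ignores the hard part of deciding
how long to batch requests before acting on them.

```python
PAGE_SIZE = 4096  # assumption: 4k base pages


def coalesce(offsets, page_size=PAGE_SIZE):
    """Merge page-aligned offsets into sorted (start, length) ranges."""
    ranges = []
    for off in sorted(set(offsets)):
        if ranges and ranges[-1][0] + ranges[-1][1] == off:
            # Offset directly follows the previous range: extend it.
            start, length = ranges[-1]
            ranges[-1] = (start, length + page_size)
        else:
            # Gap before this offset: start a new range.
            ranges.append((off, page_size))
    return ranges
```

If all 512 4k requests covering a 2MB range were collected before
acting, they would collapse into a single (0, 2MB) range.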

Long story short: it's hard to tell what will happen next based on a
single MADV_DONTNEED call. Page compaction, in comparison, doesn't care
and optimizes the layout as it observes it.
--
Thanks,
David / dhildenb