From: Yang Shi
Date: Thu, 27 Jun 2024 13:54:00 -0700
Subject: Re: [LSF/MM/BPF TOPIC] Multi-sized THP performance benchmarks and analysis on ARM64
To: Ryan Roberts
Cc: Jonathan Cameron, lsf-pc@lists.linux-foundation.org,
 olivier.singla@amperecomputing.com, Linux MM, Michal Hocko, Dan Williams,
 Christoph Lameter, Matthew Wilcox, Zi Yan
In-Reply-To: <7a8bcd48-47b4-4bc7-a38f-45cef9adc221@arm.com>
References: <20240401191614.00007c83@Huawei.com>
 <145031ae-1d4d-4b43-b2c9-aed0d10e86ca@arm.com>
 <7a8bcd48-47b4-4bc7-a38f-45cef9adc221@arm.com>
On Tue, Jun 25, 2024 at 4:12 AM Ryan Roberts wrote:
>
> On 09/04/2024 11:47, Ryan Roberts wrote:
> > Thanks for the CC, Zi! I must admit I'm not great at following the list...
> >
> >
> > On 08/04/2024 19:56, Zi Yan wrote:
> >> On 8 Apr 2024, at 12:30, Matthew Wilcox wrote:
> >>
> >>> On Thu, Apr 04, 2024 at 11:57:03AM -0700, Christoph Lameter (Ampere) wrote:
> >>>> On Mon, 1 Apr 2024, Jonathan Cameron wrote:
> >>>>
> >>>>> Sounds like useful data, but is it a suitable topic for LSF-MM?
> >>>>> What open questions etc is it raising?
> >
> > I'm happy to see others looking at mTHP, and would be very keen to be
> > involved in any discussion. Unfortunately I won't be able to make it to
> > LSFMM this year - my wife is expecting a baby the same week. I'll
> > register for online, but even joining that is looking unlikely.
>
> [...]
>
> Hi Yang Shi,
>
> I finally got around to watching the video of your presentation; Thanks
> for doing the work to benchmark this on your system.
>
> I just wanted to raise a couple of points, first on your results and
> secondly on your conclusions...

Thanks for following up. Sorry for the late reply, I just came back from
a two-week vacation and am still suffering from jet lag...

>
> Results
> =======
>
> As I'm sure you have seen, I've done some benchmarking with mTHP and
> contpte, also on an Ampere Altra system. Although my system has 2 NUMA
> nodes (80 CPUs per node), I've deliberately disabled one of the nodes to
> avoid noise from cross socket IO. So the HW should look and behave
> approximately the same as yours.

I used a 1-socket system, but with 128 cores per node. I used taskset to
bind the kernel build tasks to cores 10 - 89.

>
> We have one overlapping benchmark - kernel compilation - and our results
> are not a million miles apart. You can see my results for 4KPS at [1]
> (and you can take 16KPS and 64KPS results for reference from [2]).
>
> page size   | Ryan   | Yang Shi
> ------------|--------|---------
> 16K (4KPS)  | -6.1%  | -5%
> 16K (16KPS) | -9.2%  | -15%
> 64K (64KPS) | -11.4% | -16%
>
> For 4KPS, my "mTHP + contpte" line is equivalent to what you have tested.
> I'm seeing -6% vs your -5%. But the 16KPS and 64KPS results diverge more.
> I'm not sure why these results diverge so much, perhaps you have an idea?
> From my side, I've run these benchmarks many many times with successive
> kernels and revised patches etc, and the numbers are always similar for
> me. I repeat multiple times across multiple reboots and also disable
> kaslr and (user) aslr to avoid any unwanted noise/skew.
>
> The actual test is essentially:
>
> $ make defconfig && time make -s -j80 Image

I'm not sure whether the config makes a difference. I used the default
Fedora config, and I'm running my tests on Fedora 39 with gcc (GCC)
13.2.1 20230918. I saw you were using Ubuntu 22.04; I'm not sure whether
that is correlated. And in the discussion Matthew said he didn't see any
number close to our numbers (I can't remember exactly what he said, but
that is what I took him to mean). I'm not sure which number Matthew
meant - perhaps he meant yours?

>
> I'd also be interested in how you are measuring memory. I've measured
> both peak and mean memory (by putting the workload in a cgroup) and see
> almost double the memory increase that you report for 16KPS. Our
> measurements for other configs match.

I also used memory.peak to measure memory consumption. I didn't try
different configs, but I did notice that more cores may incur more memory
consumption; it is more noticeable with 64KPS.
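For concreteness, the kind of run and measurement I'm describing is
roughly the following - a minimal sketch, assuming cgroup v2 is mounted
at /sys/fs/cgroup with the memory controller enabled; the cgroup name is
just illustrative, and the core range and job count are the ones
mentioned above:

    # run as root: create a scratch cgroup and move this shell into it
    mkdir /sys/fs/cgroup/kbuild
    echo $$ > /sys/fs/cgroup/kbuild/cgroup.procs

    # pin the build to cores 10-89 and time it
    make defconfig
    time taskset -c 10-89 make -s -j80 Image

    # peak memory charged to the cgroup over the run, in bytes
    cat /sys/fs/cgroup/kbuild/memory.peak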
>
> But I also want to raise a more general point; We are not done with the
> optimizations yet. contpte can also improve performance for iTLB, but
> this requires a change to the page cache to store text in (at least) 64K
> folios. Typically the iTLB is under a lot of pressure and this can help
> reduce it. This change is not in mainline yet (and I still need to figure
> out how to make the patch acceptable), but is worth another ~1.5% for the
> 4KPS case. I suspect this will also move the needle on the other
> benchmarks you ran. See [3] - I'd appreciate any thoughts you have on how
> to get something like this accepted.

AFAIK, the improvement from reduced iTLB pressure really depends on the
workload. IIRC, MySQL is more sensitive to it. We did some tests with
CONFIG_READ_ONLY_THP_FOR_FS enabled for MySQL and saw a decent
improvement, but I really don't remember the exact number.

>
> [1] https://lore.kernel.org/all/20240215103205.2607016-1-ryan.roberts@arm.com/
> [2] https://lore.kernel.org/linux-mm/20230929114421.3761121-1-ryan.roberts@arm.com/
> [3] https://lore.kernel.org/lkml/20240111154106.3692206-1-ryan.roberts@arm.com/
>
> Conclusions
> ===========
>
> I think people in the room already said most of what I want to say;
> Unfortunately there is a trade-off between performance and memory
> consumption. And it is not always practical to dole out the biggest THP
> we can allocate; lots of partially used 2M chunks would lead to a lot of
> wasted memory. So we need a way to let user space configure the kernel
> for their desired mTHP sizes.
>
> In the long term, it would be great to support an "auto" mode, and the
> current interfaces leave the door open to that. Perhaps your suggestion
> to start out with 64K and collapse to higher orders is one tool that
> could take us in that direction. But 64K is arm64-specific. AMD wants
> 32K. So you still need some mechanism to determine that (and the
> community wasn't keen on having the arch tell us that).
>
> It may actually turn out that we need a more complex interface to allow
> a (set of) mTHP order(s) to be enabled for a specific VMA. We previously
> concluded that if/when the time comes, then madvise_process() should give
> us what we need. That would allow better integration with user space.

Internal fragmentation (memory waste) with 2M THP is a chronic problem.
Medium-sized THP can help tackle it, but the performance may not be as
good as with 2M THP. So after the discussion I was actually thinking we
may need two policies, chosen per workload, since there seems to be no
single policy that works for everyone: one that maximizes TLB
utilization, the other that is memory-conservative. For example, a
workload that doesn't care too much about memory waste can choose to
allocate THP from the biggest suitable order, e.g. 2M for some VM
workloads. On the other side of the spectrum, we can start allocating
from a smaller order and then collapse to larger orders. The system can
have a default policy, and users can change it through some interface,
for example madvise(). Anyway, this is just off the top of my head; I
haven't invested much time in this aspect yet.

I don't think 64K vs 32K is a problem. The two 32K chunks in the same 64K
chunk are properly aligned. 64K is not a very high order, so starting
from 64K for everyone should not be a problem. I don't see why we have to
care about this.

With all the means mentioned above, we may be able to achieve a full
"auto" mode in the future.

Actually, another problem with the current interface is that we may end
up with the same behavior from different settings. For example, setting
"inherit" for all orders while the top-level knob is "always" may behave
the same as setting all orders and the top-level knob to "always". This
may cause some confusion and violates the usual rules for sysfs
interfaces.
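To make that ambiguity concrete, here is a rough sketch against the
current per-size sysfs knobs (run as root; the shell loop is just
shorthand for walking every hugepages-<size>kB entry):

    # Setting A: per-size knobs follow the top-level knob
    echo always > /sys/kernel/mm/transparent_hugepage/enabled
    for f in /sys/kernel/mm/transparent_hugepage/hugepages-*kB/enabled; do
        echo inherit > "$f"    # every size inherits "always" from the top level
    done

    # Setting B: everything set to "always" explicitly
    echo always > /sys/kernel/mm/transparent_hugepage/enabled
    for f in /sys/kernel/mm/transparent_hugepage/hugepages-*kB/enabled; do
        echo always > "$f"
    done

Both settings should end up with the same effective policy for every
size, which is the redundancy I mean.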
>
> Your suggestion about splitting higher orders to 64K at swap out is
> interesting; that might help with some swap fragmentation issues we are
> currently grappling with. But ultimately splitting a folio is expensive
> and we want to avoid that cost as much as possible. I'd prefer to
> continue down the route that Chris Li is taking us so that we can do a
> better job of allocating swap in the first place.

I think I meant splitting to 64K when we have to split; I don't mean we
split to 64K all the time. If we run into swap fragmentation, splitting
to a smaller order may help avoid premature OOM, and the cost of
splitting may be worth it - just like what we do on other paths, for
example page demotion, migration, etc., where we split the large folio if
there is not enough memory. I may not have articulated this well in the
slides and the discussion; sorry for the confusion. If we have a better
way to tackle swap fragmentation without splitting, that is definitely
preferable.

>
> Thanks,
> Ryan
>