From: Yang Shi
Date: Thu, 27 Jun 2024 13:54:00 -0700
Subject: Re: [LSF/MM/BPF TOPIC] Multi-sized THP performance benchmarks and analysis on ARM64
To: Ryan Roberts
Cc: Jonathan Cameron, lsf-pc@lists.linux-foundation.org,
 olivier.singla@amperecomputing.com, Linux MM, Michal Hocko, Dan Williams,
 Christoph Lameter, Matthew Wilcox, Zi Yan
In-Reply-To: <7a8bcd48-47b4-4bc7-a38f-45cef9adc221@arm.com>
References: <20240401191614.00007c83@Huawei.com>
 <145031ae-1d4d-4b43-b2c9-aed0d10e86ca@arm.com>
 <7a8bcd48-47b4-4bc7-a38f-45cef9adc221@arm.com>
On Tue, Jun 25, 2024 at 4:12 AM Ryan Roberts wrote:
>
> On 09/04/2024 11:47, Ryan Roberts wrote:
> > Thanks for the CC, Zi! I must admit I'm not great at following the list...
> >
> >
> > On 08/04/2024 19:56, Zi Yan wrote:
> >> On 8 Apr 2024, at 12:30, Matthew Wilcox wrote:
> >>
> >>> On Thu, Apr 04, 2024 at 11:57:03AM -0700, Christoph Lameter (Ampere) wrote:
> >>>> On Mon, 1 Apr 2024, Jonathan Cameron wrote:
> >>>>
> >>>>> Sounds like useful data, but is it a suitable topic for LSF-MM?
> >>>>> What open questions etc is it raising?
> >
> > I'm happy to see others looking at mTHP, and would be very keen to be
> > involved in any discussion. Unfortunately I won't be able to make it to
> > LSFMM this year - my wife is expecting a baby the same week. I'll
> > register for online, but even joining that is looking unlikely.
>
> [...]
>
> Hi Yang Shi,
>
> I finally got around to watching the video of your presentation; Thanks
> for doing the work to benchmark this on your system.
>
> I just wanted to raise a couple of points, first on your results and
> secondly on your conclusions...

Thanks for following up. Sorry for the late reply, I just came back from
a two-week vacation and am still suffering from jet lag...

>
> Results
> =======
>
> As I'm sure you have seen, I've done some benchmarking with mTHP and
> contpte, also on an Ampere Altra system. Although my system has 2 NUMA
> nodes (80 CPUs per node), I've deliberately disabled one of the nodes to
> avoid noise from cross socket IO. So the HW should look and behave
> approximately the same as yours.

I used a 1-socket system, but with 128 cores per node. I used taskset to
bind the kernel build tasks to cores 10 - 89.

>
> We have one overlapping benchmark - kernel compilation - and our results
> are not a million miles apart. You can see my results for 4KPS at [1]
> (and you can take 16KPS and 64KPS results for reference from [2]).
>
> page size   | Ryan   | Yang Shi
> ------------|--------|---------
> 16K (4KPS)  | -6.1%  | -5%
> 16K (16KPS) | -9.2%  | -15%
> 64K (64KPS) | -11.4% | -16%
>
> For 4KPS, my "mTHP + contpte" line is equivalent to what you have tested.
> I'm seeing -6% vs your -5%. But the 16KPS and 64KPS results diverge more.
> I'm not sure why these results diverge so much, perhaps you have an idea?
> From my side, I've run these benchmarks many many times with successive
> kernels and revised patches etc, and the numbers are always similar for
> me. I repeat multiple times across multiple reboots and also disable
> kaslr and (user) aslr to avoid any unwanted noise/skew.
>
> The actual test is essentially:
>
> $ make defconfig && time make -s -j80 Image

I'm not sure whether the config makes a difference. I used the default
Fedora config, and I'm running my tests on Fedora 39 with gcc (GCC)
13.2.1 20230918. I saw you were using Ubuntu 22.04; I'm not sure whether
that is correlated. And in the discussion Matthew said he didn't see any
number close to our numbers (I can't remember exactly what he said, but
that is what I took him to mean). I'm not sure which number Matthew
meant - perhaps he meant yours?

>
> I'd also be interested in how you are measuring memory. I've measured
> both peak and mean memory (by putting the workload in a cgroup) and see
> almost double the memory increase that you report for 16KPS. Our
> measurements for other configs match.

I also used memory.peak to measure memory consumption. I didn't try
different configs, but I did notice that more cores may incur more memory
consumption; it is more noticeable with 64KPS.
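For concreteness, the kind of run and measurement I'm describing is
roughly the following - a minimal sketch, assuming cgroup v2 is mounted
at /sys/fs/cgroup with the memory controller enabled; the cgroup name is
just illustrative, and the core range and job count are the ones
mentioned above:

    # run as root: create a scratch cgroup and move this shell into it
    mkdir /sys/fs/cgroup/kbuild
    echo $$ > /sys/fs/cgroup/kbuild/cgroup.procs

    # pin the build to cores 10-89 and time it
    make defconfig
    time taskset -c 10-89 make -s -j80 Image

    # peak memory charged to the cgroup over the run, in bytes
    cat /sys/fs/cgroup/kbuild/memory.peak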
>
> But I also want to raise a more general point; We are not done with the
> optimizations yet. contpte can also improve performance for iTLB, but
> this requires a change to the page cache to store text in (at least) 64K
> folios. Typically the iTLB is under a lot of pressure and this can help
> reduce it. This change is not in mainline yet (and I still need to figure
> out how to make the patch acceptable), but is worth another ~1.5% for the
> 4KPS case. I suspect this will also move the needle on the other
> benchmarks you ran. See [3] - I'd appreciate any thoughts you have on how
> to get something like this accepted.

AFAIK, the improvement from reduced iTLB pressure really depends on the
workload. IIRC, MySQL is more sensitive to it. We did some tests with
CONFIG_READ_ONLY_THP_FOR_FS enabled for MySQL and saw a decent
improvement, but I really don't remember the exact number.

>
> [1] https://lore.kernel.org/all/20240215103205.2607016-1-ryan.roberts@arm.com/
> [2] https://lore.kernel.org/linux-mm/20230929114421.3761121-1-ryan.roberts@arm.com/
> [3] https://lore.kernel.org/lkml/20240111154106.3692206-1-ryan.roberts@arm.com/
>
> Conclusions
> ===========
>
> I think people in the room already said most of what I want to say;
> Unfortunately there is a trade-off between performance and memory
> consumption. And it is not always practical to dole out the biggest THP
> we can allocate; lots of partially used 2M chunks would lead to a lot of
> wasted memory. So we need a way to let user space configure the kernel
> for their desired mTHP sizes.
>
> In the long term, it would be great to support an "auto" mode, and the
> current interfaces leave the door open to that. Perhaps your suggestion
> to start out with 64K and collapse to higher orders is one tool that
> could take us in that direction. But 64K is arm64-specific. AMD wants
> 32K. So you still need some mechanism to determine that (and the
> community wasn't keen on having the arch tell us that).
>
> It may actually turn out that we need a more complex interface to allow
> a (set of) mTHP order(s) to be enabled for a specific VMA. We previously
> concluded that if/when the time comes, then madvise_process() should give
> us what we need. That would allow better integration with user space.

Internal fragmentation (memory waste) with 2M THP is a chronic problem.
Medium-sized THP can help tackle it, but the performance may not be as
good as with 2M THP. So after the discussion I was actually thinking we
may need two policies, chosen per workload, since there seems to be no
single policy that works for everyone: one that maximizes TLB
utilization, the other that is memory-conservative. For example, a
workload that doesn't care too much about memory waste can choose to
allocate THP from the biggest suitable order, e.g. 2M for some VM
workloads. On the other side of the spectrum, we can start allocating
from a smaller order and then collapse to larger orders. The system can
have a default policy, and users can change it through some interface,
for example madvise(). Anyway, this is just off the top of my head; I
haven't invested much time in this aspect yet.

I don't think 64K vs 32K is a problem. The two 32K chunks in the same 64K
chunk are properly aligned. 64K is not a very high order, so starting
from 64K for everyone should not be a problem. I don't see why we have to
care about this.

With all the means mentioned above, we may be able to achieve a full
"auto" mode in the future.

Actually, another problem with the current interface is that we may end
up with the same behavior from different settings. For example, setting
"inherit" for all orders while the top-level knob is "always" may behave
the same as setting all orders and the top-level knob to "always". This
may cause some confusion and violates the usual rules for sysfs
interfaces.
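To make that ambiguity concrete, here is a rough sketch against the
current per-size sysfs knobs (run as root; the shell loop is just
shorthand for walking every hugepages-<size>kB entry):

    # Setting A: per-size knobs follow the top-level knob
    echo always > /sys/kernel/mm/transparent_hugepage/enabled
    for f in /sys/kernel/mm/transparent_hugepage/hugepages-*kB/enabled; do
        echo inherit > "$f"    # every size inherits "always" from the top level
    done

    # Setting B: everything set to "always" explicitly
    echo always > /sys/kernel/mm/transparent_hugepage/enabled
    for f in /sys/kernel/mm/transparent_hugepage/hugepages-*kB/enabled; do
        echo always > "$f"
    done

Both settings should end up with the same effective policy for every
size, which is the redundancy I mean.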
>
> Your suggestion about splitting higher orders to 64K at swap out is
> interesting; that might help with some swap fragmentation issues we are
> currently grappling with. But ultimately splitting a folio is expensive
> and we want to avoid that cost as much as possible. I'd prefer to
> continue down the route that Chris Li is taking us so that we can do a
> better job of allocating swap in the first place.

I think I meant splitting to 64K when we have to split; I don't mean we
split to 64K all the time. If we run into swap fragmentation, splitting
to a smaller order may help avoid premature OOM, and the cost of
splitting may be worth it - just like what we do on other paths, for
example page demotion, migration, etc., where we split the large folio if
there is not enough memory. I may not have articulated this well in the
slides and the discussion; sorry for the confusion. If we have a better
way to tackle swap fragmentation without splitting, that is definitely
preferable.

>
> Thanks,
> Ryan
>