From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Wed, 6 Mar 2024 10:51:10 -0500
From: Johannes Weiner <hannes@cmpxchg.org>
To: Yu Zhao <yuzhao@google.com>
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org,
	Jonathan Corbet <corbet@lwn.net>,
	Kaiyang Zhao <kaiyang2@cs.cmu.edu>
Subject: Re: [LSF/MM/BPF TOPIC] TAO: THP Allocator Optimizations
Message-ID: <20240306155110.GB891917@cmpxchg.org>
References: <20240229183436.4110845-1-yuzhao@google.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20240229183436.4110845-1-yuzhao@google.com>
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

On Thu, Feb 29, 2024 at 11:34:32AM -0700, Yu Zhao wrote:
> TAO is an umbrella project aiming at a better economy of physical
> contiguity viewed as a valuable resource. A few examples are:
> 1. A multi-tenant system can have guaranteed THP coverage while
>    hosting abusers/misusers of the resource.
> 2. Abusers/misusers, e.g., workloads excessively requesting and then
>    splitting THPs, should be punished if necessary.
> 3. Good citizens should be awarded with, e.g., lower allocation
>    latency and less cost of metadata (struct page).
> 4. Better interoperability with userspace memory allocators when
>    transacting the resource.
> 
> This project puts the same emphasis on the established use case for
> servers and the emerging use case for clients so that client workloads
> like Android and ChromeOS can leverage the recent multi-sized THPs
> [1][2].
> 
> Chapter One introduces the cornerstone of TAO: an abstraction called
> policy (virtual) zones, which are overlayed on the physical zones.
> This is in line with item 1 above.

This is a very interesting topic to me. Meta has collaborated with CMU
to research this as well, the results of which are typed up here:
https://dl.acm.org/doi/pdf/10.1145/3579371.3589079

We had used a dynamic CMA region, but unless I'm missing something
about the policy zones this is just another way to skin the cat.

The other difference is that we made the policy about migratetypes
rather than order. The advantage of doing it by order is of course
that you can forego a lot of compaction work to begin with. The
downside is that you have to be more precise and proactive about
sizing the THP vs non-THP regions correctly, as it's more restrictive
than saying "this region just has to remain compactable, but is good
for small and large pages" - most workloads will have a mix of those.

For region sizing, I see that for now you have boot parameters. But
the exact composition of orders that a system needs is going to vary
by workload, and likely within workloads over time. IMO some form of
auto-sizing inside the kernel will make the difference between this
being a general-purpose OS feature and "this is useful to hyperscalers
that control their whole stack, have resources to profile their
applications in-depth, and can tailor-make kernel policies around the
results" - not unlike hugetlb itself.

What we had experimented with is a feedback system between the
regions. It tracks the amount of memory pressure that exists for the
pages in each section - i.e. how much reclaim and compaction is needed
to satisfy allocations from a given region, and how many refaults and
swapins are occurring in them - and then moves the boundaries
accordingly if there is an imbalance.
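
To make the feedback idea above concrete, here is a toy userspace
sketch. It is not the code we experimented with: the Region type, the
single scalar "pressure" value, the threshold and the step size are all
hypothetical stand-ins for whatever reclaim, compaction, refault and
swapin signals are actually aggregated per region.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    size_blocks: int   # region size in pageblocks (hypothetical unit)
    pressure: float    # aggregated pressure signal in [0, 1] (hypothetical)


def rebalance(thp: Region, base: Region, threshold: float = 0.1,
              step: int = 1, min_blocks: int = 1) -> int:
    """Move the boundary toward the region under less pressure.

    Returns the number of blocks moved into `thp`; a negative value
    means blocks were moved into `base`, and 0 means no change.
    """
    delta = thp.pressure - base.pressure
    if abs(delta) <= threshold:
        return 0  # regions are close enough to balanced; leave them alone

    if delta > 0:
        donor, receiver, sign = base, thp, 1
    else:
        donor, receiver, sign = thp, base, -1

    # Never shrink the donor below a minimum size.
    moved = min(step, max(donor.size_blocks - min_blocks, 0))
    donor.size_blocks -= moved
    receiver.size_blocks += moved
    return sign * moved


if __name__ == "__main__":
    thp = Region("thp", size_blocks=10, pressure=0.5)
    base = Region("base", size_blocks=10, pressure=0.1)
    print(rebalance(thp, base), thp.size_blocks, base.size_blocks)
```

The threshold acts as a dead band so the boundary does not oscillate on
small differences. A real implementation would also have to deal with
the donor region failing to vacate the blocks, which this sketch
ignores.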

The first draft of this was an extension to psi to track pressure by
allocation context. This worked quite well, but was a little fat on
the scheduler cacheline footprint. Kaiyang (CC'd) has been working on
tracking these input metrics in a leaner fashion.

You mentioned a pageblock-oriented solution also in Chapter One. I had
proposed one before, so I'm obviously biased, but my gut feeling is
that we likely need both - one for 2MB and smaller, and one for
1GB. My thinking is this:

1. Contiguous zones are more difficult and less reliable to resize at
   runtime, and the huge page size you're trying to grow and shrink
   the regions for matters. Assuming 4k pages (wild, I know) there are
   512 pages in a 2MB folio, but a quarter million pages in a 1GB
   folio. It's much easier for a single die-hard kernel allocation to
   get in the way of expanding the THP region by another 1GB page than
   in the way of finding 512 disjoint 2MB pageblocks somewhere.

   Basically, dynamic adaptiveness of the pool seems necessary for a
   general-purpose THP[tm] feature, but I also think adaptiveness for 1G
   huge pages is going to be difficult to pull off reliably, simply
   because we have no control over the lifetime of kernel allocations.

2. I think there also remains a difference in audience. Reliable
   coverage of up to 2MB would be a huge boon for most workloads,
   especially the majority of those that are not optimized much for
   contiguity. IIRC Willy mentioned before somewhere that nowadays the
   optimal average page size is still in the multi-k range.

   1G huge pages are immensely useful for specific loads - we
   certainly have our share of those as well. But the step size to 1GB
   is so large that:

   1) it's fewer applications that can benefit in the first place

   2) it requires applications to participate more proactively in the
      contiguity efforts to keep internal fragmentation reasonable

   3) the 1G huge pages are more expensive and less reliable when it
      comes to growing the THP region by another page at runtime,
      which remains a forcing function for static, boot-time configs

   4) the performance impact of falling back from 1G to 2MB or 4k
      would be quite large compared to falling back from 2M. Setups
      that invest to overcome all of the above difficulties in order
      to tickle more cycles out of their systems are going to be less
      tolerant of just falling back to smaller pages

   As you can see, points 2-4 take a lot of the "transparent" out of
   "transparent huge pages".
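
The page-count argument in point 1 can be checked with a few lines of
Python. The second helper is only a back-of-the-envelope model: it
assumes every 4k page is independently pinned by an unmovable kernel
allocation with some probability p, which is my simplification for
illustration and not how unmovable allocations are actually
distributed. The value of p used below is likewise made up.

```python
import math

BASE_PAGE = 4 * 1024            # 4k base pages, as assumed in point 1
SIZE_2M = 2 * 1024 * 1024
SIZE_1G = 1024 * 1024 * 1024


def base_pages(folio_bytes: int, page_bytes: int = BASE_PAGE) -> int:
    """Number of base pages that make up one folio of the given size."""
    return folio_bytes // page_bytes


def p_range_unpinned(npages: int, p_pinned: float) -> float:
    """Probability that none of npages is pinned.

    Assumes each page is pinned independently with probability
    p_pinned (a hypothetical model, see the text above).
    """
    if not 0.0 <= p_pinned < 1.0:
        raise ValueError("p_pinned must be in [0, 1)")
    # (1 - p)^n, computed via log1p for accuracy with tiny p.
    return math.exp(npages * math.log1p(-p_pinned))


if __name__ == "__main__":
    print(base_pages(SIZE_2M))    # 512
    print(base_pages(SIZE_1G))    # 262144, i.e. "a quarter million"
    print(SIZE_1G // SIZE_2M)     # 512 pageblocks of 2MB per 1GB

    p = 1e-5                      # made-up pinning probability
    print(p_range_unpinned(base_pages(SIZE_2M), p))
    print(p_range_unpinned(base_pages(SIZE_1G), p))
```

Under this model a single 2MB pageblock is almost always free of
pinned pages, while a contiguous 1GB range rarely is. Collecting 512
disjoint 2MB pageblocks only requires each of them to be clean
individually, wherever they happen to sit.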

So it might be best to do both, and have each one do their thing well.

Anyway, I think this would be a super interesting and important
discussion to have at LSFMM. Thanks for proposing this.

I would like to be part of it, and would also suggest having Kaiyang
(CC'd) in the room, who is the primary researcher on the Contiguitas
paper referenced above.