From: Hyeonggon Yoo <42.hyeyoo@gmail.com>
To: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org
Cc: linux-cxl@vger.kernel.org, Byungchul Park, Honggyu Kim
Subject: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
Date: Sat, 1 Feb 2025 22:29:23 +0900
Hi,

Byungchul and I would like to suggest a topic about the performance impact
of kernel allocations on CXL memory.

As CXL-enabled servers and memory devices are being developed, CXL-supported
hardware is expected to continue emerging in the coming years. The Linux
kernel supports hot-plugging CXL memory via the dax/kmem functionality.
Depending on the hot-plug policy, the hot-plugged memory either allows
unmovable kernel allocations (ZONE_NORMAL) or is restricted to movable
allocations (ZONE_MOVABLE).

Recently, Byungchul and I observed a measurable performance degradation with
memhp_default_state=online compared to memhp_default_state=online_movable on
a server with a 1:2 ratio of DRAM to CXL memory capacity, running the
llama.cpp workload with the default mempolicy. The workload performs LLM
inference and pressures the memory subsystem due to its large working set
size.

Obviously, allowing kernel allocations from CXL memory degrades performance,
because kernel memory such as page tables, kernel stacks, and slab
allocations is accessed frequently and may reside in physical memory with
significantly higher access latency.

However, as far as I can tell there are at least two reasons why we need to
support ZONE_NORMAL for CXL memory (please add if there are more):

1. When hot-plugging a huge amount of CXL memory, the struct page array
   might not fit into DRAM -> this could be relaxed with memmap_on_memory.

2. To hot-unplug CXL memory, pages in CXL memory should be migrated to DRAM,
   which means that some portion of CXL memory sometimes needs to be
   ZONE_NORMAL.

So there are certain cases where we want CXL memory to include ZONE_NORMAL,
but this also degrades performance if we allow _all_ kinds of kernel
allocations to be served from CXL memory. For ideal performance, it would be
beneficial to either:

1) Restrict allocating certain types of kernel memory (e.g. page tables,
   kernel stacks, slabs) from slow tier, or

2) Allow migrating certain types of kernel memory from slow tier to fast
   tier.

At LSF/MM/BPF, I would like to discuss potential directions for addressing
this problem, enabling CXL memory while minimizing its performance
degradation.

Restricting certain types of kernel allocations from slow tier
===============================================================

We could restrict some kernel allocations to fast tier by passing a nodemask
to __alloc_pages() (with only the fast-tier nodes set) or by using a GFP
flag like __GFP_FAST_TIER that does the same thing. This prevents kernel
allocations from slow tier and thus avoids the performance degradation
caused by the high access latency of CXL. However, binding all leaf page
tables to fast tier might not be ideal due to 1) increased latency from
premature reclamation and 2) premature OOM kills [1].
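To make option 1) a bit more concrete, below is a rough sketch of what a
fast-tier-only allocation helper could look like. This is only an
illustration: the helper name is made up, __GFP_FAST_TIER does not exist
today, and node_is_toptier() is assumed to be a reasonable proxy for
"fast tier".

/*
 * Illustrative sketch only: allocate kernel memory from fast-tier
 * (toptier) nodes only. Not an existing interface.
 */
static struct page *alloc_pages_fast_tier(gfp_t gfp, unsigned int order)
{
        nodemask_t fast_tier_nodes;
        int nid;

        /* Build a nodemask containing only the fast-tier nodes. */
        nodes_clear(fast_tier_nodes);
        for_each_online_node(nid) {
                if (node_is_toptier(nid))
                        node_set(nid, fast_tier_nodes);
        }

        /*
         * Restrict the allocation to that nodemask. A __GFP_FAST_TIER
         * flag could instead build this mask inside the page allocator
         * rather than at every call site.
         */
        return __alloc_pages(gfp, order, numa_node_id(), &fast_tier_nodes);
}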
Migrating certain types of kernel allocations from slow to fast tier
=====================================================================

Rather than binding kernel allocations to fast tier and causing premature
reclamation and OOM kills, policies for migrating kernel pages may be more
effective, such as:

- Migrating page tables to fast tier, triggered by data-page promotion [1]

- Migrating to fast tier when there is low memory pressure:
  - Migrating slab movable objects [2]
  - Migrating kernel stacks (if that's feasible)

although this sounds more intrusive, and we need to think about robust
policies that do not degrade existing traditional memory systems. (A rough
sketch of the callback shape this would need is appended after the
references.)

Any opinions will be appreciated. Thanks!

[1] https://dl.acm.org/doi/10.1145/3459898.3463907
[2] https://lore.kernel.org/linux-mm/20190411013441.5415-1-tobin@kernel.org
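P.S. The callback shape mentioned above, loosely in the spirit of the slab
movable objects RFC in [2]. The struct and the names below are placeholders
for illustration, not the actual API from that series:

/*
 * Illustrative sketch only: the two-step split that migrating kernel
 * objects (slab objects, page tables, stacks) to a fast-tier node
 * would need. Names and signatures are hypothetical.
 */
struct movable_kmem_ops {
        /* Pin the objects so they cannot be freed or touched while copying. */
        void *(*isolate)(struct kmem_cache *s, void **objs, int nr);

        /* Re-allocate the objects on @node and fix up all references. */
        void (*migrate)(struct kmem_cache *s, void **objs, int nr,
                        int node, void *isolate_cookie);
};

The hard part, and what makes this intrusive, is the second step: every
subsystem that owns such objects has to be able to find and rewrite the
pointers into them, which is the approach the RFC in [2] explored for slab
objects.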