From: "Ho-Ren (Jack) Chuang"
To: "Huang, Ying", "Gregory Price", aneesh.kumar@linux.ibm.com,
    mhocko@suse.com, tj@kernel.org, john@jagalactic.com,
    "Eishan Mirakhur", "Vinicius Tavares Petrucci", "Ravis OpenSrc",
    "Alistair Popple", "Srinivasulu Thanneeru", "SeongJae Park",
    Dan Williams, Vishal Verma, Dave Jiang, Andrew Morton,
    nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: "Ho-Ren (Jack) Chuang", qemu-devel@nongnu.org
Subject: [PATCH v10 0/2] Improved Memory Tier Creation for CPUless NUMA Nodes
Date: Tue, 2 Apr 2024 00:17:36 +0000
Message-Id: <20240402001739.2521623-1-horenchuang@bytedance.com>

When a memory device, such as CXL1.1 type3 memory, is emulated as
normal memory (E820_TYPE_RAM), the memory device is indistinguishable
from normal DRAM in terms of memory tiering with the current
implementation. The current memory tiering assigns all detected
normal memory nodes to the same DRAM tier. As a result, normal memory
devices with different attributes cannot be assigned to the correct
memory tier, so pages cannot be migrated between different types of
memory.
https://lore.kernel.org/linux-mm/PH0PR08MB7955E9F08CCB64F23963B5C3A860A@PH0PR08MB7955.namprd08.prod.outlook.com/T/

This patchset resolves these issues automatically. It delays the
initialization of memory tiers for CPUless NUMA nodes until they have
obtained HMAT information and all devices have been initialized at
boot time, eliminating the need for user intervention. If no HMAT is
specified, it falls back to using `default_dram_type`.

Example use case:
We have CXL memory on the host, and we create VMs with a new system
memory device backed by host CXL memory. We inject CXL memory
performance attributes through QEMU, and the guest now sees memory
nodes with performance attributes in HMAT. With this change, we enable
the guest kernel to construct the correct memory tiering for the
memory nodes.

-v10: Thanks to Andrew's and SeongJae's comments,
 * Address kunit compilation errors
 * Resolve the bug of not returning the correct error code in
   `mt_perf_to_adistance`

-v9:
 * Address corner cases in `memory_tier_late_init`. Thanks to Ying's
   comments.
 * https://lore.kernel.org/lkml/20240329053353.309557-1-horenchuang@bytedance.com/T/#u

-v8:
 * Fix email format
 * https://lore.kernel.org/lkml/20240329004815.195476-1-horenchuang@bytedance.com/T/#u

-v7:
 * Add Reviewed-by: "Huang, Ying"

-v6: Thanks to Ying's comments,
 * Move `default_dram_perf_lock` to the function's beginning for clarity
 * Fix the double unlock introduced in v5
 * https://lore.kernel.org/lkml/20240327072729.3381685-1-horenchuang@bytedance.com/T/#u

-v5: Thanks to Ying's comments,
 * Add comments about what is protected by `default_dram_perf_lock`
 * Fix an uninitialized pointer mtype
 * Slightly shorten the time holding `default_dram_perf_lock`
 * Fix a deadlock bug in `mt_perf_to_adistance`
 * https://lore.kernel.org/lkml/20240327041646.3258110-1-horenchuang@bytedance.com/T/#u

-v4: Thanks to Ying's comments,
 * Remove redundant code
 * Reorganize patches accordingly
 * https://lore.kernel.org/lkml/20240322070356.315922-1-horenchuang@bytedance.com/T/#u

-v3: Thanks to Ying's comments,
 * Make the newly added code independent of HMAT
 * Upgrade set_node_memory_tier to support more cases
 * Put all non-driver-initialized memory types into default_memory_types
   instead of using hmat_memory_types
 * find_alloc_memory_type -> mt_find_alloc_memory_type
 * https://lore.kernel.org/lkml/20240320061041.3246828-1-horenchuang@bytedance.com/T/#u

-v2: Thanks to Ying's comments,
 * Rewrite cover letter & patch description
 * Rename functions, don't use _hmat
 * Abstract common functions into find_alloc_memory_type()
 * Use the expected way to use set_node_memory_tier instead of modifying it
 * https://lore.kernel.org/lkml/20240312061729.1997111-1-horenchuang@bytedance.com/T/#u

-v1:
 * https://lore.kernel.org/lkml/20240301082248.3456086-1-horenchuang@bytedance.com/T/#u

Ho-Ren (Jack) Chuang (2):
  memory tier: dax/kmem: introduce an abstract layer for finding,
    allocating, and putting memory types
  memory tier: create CPUless memory tiers after obtaining HMAT info

 drivers/dax/kmem.c           |  20 +-----
 include/linux/memory-tiers.h |  14 ++++
 mm/memory-tiers.c            | 127 ++++++++++++++++++++++++++++++-----
 3 files changed, 126 insertions(+), 35 deletions(-)

-- 
Ho-Ren (Jack) Chuang
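
For readers who want the deferral and fallback described in the cover
letter in concrete form, here is a minimal userspace sketch. It is not
the patch's code. `Node`, `early_init`, `late_init`, and
`DEFAULT_DRAM_ADISTANCE` are hypothetical names, the latency figures
are invented, and the latency-ratio scaling is a toy stand-in for
whatever `mt_perf_to_adistance` actually computes.

```python
# Toy model only: names and numbers below are assumptions for illustration,
# not kernel symbols or the kernel's actual abstract-distance formula.
DEFAULT_DRAM_ADISTANCE = 1000  # stand-in for the default DRAM tier


class Node:
    def __init__(self, node_id, has_cpu, hmat_latency_ns=None):
        self.node_id = node_id
        self.has_cpu = has_cpu
        # None models "no HMAT performance data for this node".
        self.hmat_latency_ns = hmat_latency_ns
        self.adistance = None  # set once the node is placed in a tier


def early_init(nodes):
    """Early boot: place nodes that have CPUs, defer CPUless ones."""
    deferred = []
    for node in nodes:
        if node.has_cpu:
            node.adistance = DEFAULT_DRAM_ADISTANCE
        else:
            deferred.append(node)
    return deferred


def late_init(deferred, dram_latency_ns):
    """After device init: place deferred nodes, using HMAT data if present."""
    for node in deferred:
        if node.hmat_latency_ns is None:
            # No HMAT information: fall back to the default DRAM tier.
            node.adistance = DEFAULT_DRAM_ADISTANCE
        else:
            # Scale the distance by latency relative to DRAM (toy formula).
            node.adistance = (
                DEFAULT_DRAM_ADISTANCE * node.hmat_latency_ns // dram_latency_ns
            )


nodes = [Node(0, True), Node(1, False, hmat_latency_ns=240), Node(2, False)]
late_init(early_init(nodes), dram_latency_ns=80)
for node in nodes:
    print(node.node_id, node.adistance)
```

In this toy run, node 1 (CPUless, with HMAT data) ends up farther away
than DRAM, while node 2 (CPUless, no HMAT data) lands in the default
DRAM tier, mirroring the fallback to `default_dram_type`.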