Date: Fri, 21 Mar 2025 15:52:44 +0000
From: Jonathan Cameron
To: Raghavendra K T
Subject: Re: [RFC PATCH V1 00/13] mm: slowtier page promotion based on PTE A bit
Message-ID: <20250321155244.00006338@huawei.com>
In-Reply-To: <20250319193028.29514-1-raghavendra.kt@amd.com>
References: <20250319193028.29514-1-raghavendra.kt@amd.com>

On Wed, 19 Mar 2025 19:30:15 +0000
Raghavendra K T wrote:

> Introduction:
> =============
> In the current hot page promotion, all the activities including the
> process address space scanning, NUMA hint fault handling and page
> migration is performed in the process context. i.e., scanning overhead is
> borne by applications.
>
> This is RFC V1 patch series to do (slow tier) CXL page promotion.
> The approach in this patchset assists/addresses the issue by adding PTE
> Accessed bit scanning.
>
> Scanning is done by a global kernel thread which routinely scans all
> the processes' address spaces and checks for accesses by reading the
> PTE A bit.
>
> A separate migration thread migrates/promotes the pages to the toptier
> node based on a simple heuristic that uses toptier scan/access information
> of the mm.
>
> Additionally based on the feedback for RFC V0 [4], a prctl knob with
> a scalar value is provided to control per task scanning.
>
> Initial results show promising number on a microbenchmark. Soon
> will get numbers with real benchmarks and findings (tunings).
>
> Experiment:
> ============
> Abench microbenchmark,
> - Allocates 8GB/16GB/32GB/64GB of memory on CXL node
> - 64 threads created, and each thread randomly accesses pages in 4K
>   granularity.

So if I'm reading this right, this is a flat distribution and any
estimate of what is hot is noise?

That will put a positive spin on costs of migration as we will
be moving something that isn't really all that hot and so is moderately
unlikely to be accessed whilst migration is going on. Or is the point that
the rest of the memory is also mapped but not being accessed?

I'm not entirely sure I follow what this is bound by. Is it bandwidth
bound?

> - 512 iterations with a delay of 1 us between two successive iterations.
>
> SUT: 512 CPU, 2 node 256GB, AMD EPYC.
>
> 3 runs, command: abench -m 2 -d 1 -i 512 -s
>
> Calculates how much time is taken to complete the task, lower is better.
> Expectation is CXL node memory is expected to be migrated as fast as
> possible.
>
> Base case: 6.14-rc6 w/ numab mode = 2 (hot page promotion is enabled).
> patched case: 6.14-rc6 w/ numab mode = 1 (numa balancing is enabled).
> we expect daemon to do page promotion.
>
> Result:
> ========
>          base NUMAB2            patched NUMAB1
>          time in sec (%stdev)   time in sec (%stdev)   %gain
>  8GB      134.33 ( 0.19 )        120.52 ( 0.21 )       10.28
> 16GB      292.24 ( 0.60 )        275.97 ( 0.18 )        5.56
> 32GB      585.06 ( 0.24 )        546.49 ( 0.35 )        6.59
> 64GB     1278.98 ( 0.27 )       1205.20 ( 2.29 )        5.76
>
> Base case: 6.14-rc6 w/ numab mode = 1 (numa balancing is enabled).
> patched case: 6.14-rc6 w/ numab mode = 1 (numa balancing is enabled).
>          base NUMAB1            patched NUMAB1
>          time in sec (%stdev)   time in sec (%stdev)   %gain
>  8GB      186.71 ( 0.99 )        120.52 ( 0.21 )       35.45
> 16GB      376.09 ( 0.46 )        275.97 ( 0.18 )       26.62
> 32GB      744.37 ( 0.71 )        546.49 ( 0.35 )       26.58
> 64GB     1534.49 ( 0.09 )       1205.20 ( 2.29 )       21.45

Nice numbers, but maybe some more details on what they are showing?
At what point in the workload has all the memory migrated to the fast
node or does that never happen?

I'm confused :(

Jonathan
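
For anyone checking the quoted tables, here is a minimal sketch of how the
%gain column appears to be derived. The formula (base - patched) / base * 100
is an assumption, since the original posting does not state it; the function
and variable names are hypothetical. The timings are copied from the tables
above.

```python
# Timings in seconds, copied from the quoted result tables (lower is better).
# Each row: size -> (base NUMAB2, base NUMAB1, patched NUMAB1, quoted %gain
# versus NUMAB2, quoted %gain versus NUMAB1).
RESULTS = {
    "8GB": (134.33, 186.71, 120.52, 10.28, 35.45),
    "16GB": (292.24, 376.09, 275.97, 5.56, 26.62),
    "32GB": (585.06, 744.37, 546.49, 6.59, 26.58),
    "64GB": (1278.98, 1534.49, 1205.20, 5.76, 21.45),
}


def percent_gain(base: float, patched: float) -> float:
    """Relative reduction in run time, in percent (assumed formula)."""
    return (base - patched) / base * 100.0


def check_table(results, tolerance=0.02):
    """Return sizes whose recomputed gain differs from the quoted value."""
    mismatches = []
    for size, (numab2, numab1, patched, gain2, gain1) in results.items():
        if abs(percent_gain(numab2, patched) - gain2) > tolerance:
            mismatches.append((size, "NUMAB2"))
        if abs(percent_gain(numab1, patched) - gain1) > tolerance:
            mismatches.append((size, "NUMAB1"))
    return mismatches


if __name__ == "__main__":
    for size, (numab2, numab1, patched, _, _) in RESULTS.items():
        print(size,
              f"vs NUMAB2: {percent_gain(numab2, patched):.2f}%",
              f"vs NUMAB1: {percent_gain(numab1, patched):.2f}%")
```

Under that assumption the recomputed values agree with the quoted column to
within rounding. The calculation says nothing about when, or whether, all of
the memory ends up on the fast node, which is the open question above.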