From: "Huang, Ying"
To: Bharata B Rao
Subject: Re: [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing
References: <20230208073533.715-1-bharata@amd.com> <878rh2b5zt.fsf@yhuang6-desk2.ccr.corp.intel.com> <72b6ec8b-f141-3807-d7f2-f853b0f0b76c@amd.com>
Date: Mon, 13 Feb 2023 14:30:53 +0800
In-Reply-To: <72b6ec8b-f141-3807-d7f2-f853b0f0b76c@amd.com> (Bharata B. Rao's message of "Mon, 13 Feb 2023 11:22:12 +0530")
Message-ID: <87zg9i9iw2.fsf@yhuang6-desk2.ccr.corp.intel.com>
Sender: owner-linux-mm@kvack.org

Bharata B Rao writes:

> On 2/13/2023 8:56 AM, Huang, Ying wrote:
>> Bharata B Rao writes:
>>
>>> Hi,
>>>
>>> Some hardware platforms can provide information about memory accesses
>>> that can be used to do optimal page and task placement on NUMA
>>> systems. AMD processors have a hardware facility called
>>> Instruction-Based Sampling (IBS) that can be used to gather specific
>>> metrics related to instruction fetch and execution activity. This
>>> facility can be used to perform memory access profiling based on
>>> statistical sampling.
>>>
>>> This RFC is a proof-of-concept implementation where the access
>>> information obtained from the hardware is used to drive NUMA balancing.
>>> With this it is no longer necessary to scan the address space and
>>> introduce NUMA hint faults to build task-to-page association. Hence
>>> the approach taken here is to replace the address space scanning plus
>>> hint faults with the access information provided by the hardware.
>>
>> Your method can avoid the address space scanning, but in fact it cannot
>> avoid the memory access faults. The PMU will raise an NMI, and then
>> task_work processes the sampled memory accesses. The overhead depends
>> on the frequency of the memory access sampling. Please measure the
>> overhead of your method in detail.
>
> Yes, the address space scanning is avoided. I will measure the overhead
> of hint fault vs NMI handling path. The actual processing of the access
> from task_work context is pretty much similar to the stats processing
> from hint faults. As you note the overhead depends on the frequency of
> sampling. In this current approach, the sampling period is per-task
> and it varies based on the same logic that NUMA balancing uses to
> vary the scan period.
>
>>
>>> The access samples obtained from hardware are fed to NUMA balancing
>>> as fault-equivalents.
>>> The rest of the NUMA balancing logic that collects/aggregates the
>>> shared/private/local/remote faults and does pages/task migrations
>>> based on the faults is retained except that accesses replace faults.
>>>
>>> This early implementation is an attempt to get a working solution
>>> only and as such a lot of TODOs exist:
>>>
>>> - Perf uses IBS and we are using the same IBS for access profiling here.
>>>   There needs to be a proper way to make the use mutually exclusive.
>>> - Is tying this up with NUMA balancing a reasonable approach or
>>>   should we look at a completely new approach?
>>> - When accesses replace faults in NUMA balancing, a few things have
>>>   to be tuned differently. All such decision points need to be
>>>   identified and appropriate tuning needs to be done.
>>> - Hardware provided access information could be very useful for driving
>>>   hot page promotion in tiered memory systems. Need to check if this
>>>   requires different tuning/heuristics apart from what NUMA balancing
>>>   already does.
>>> - Some of the values used to program the IBS counters like the sampling
>>>   period etc may not be the optimal or ideal values. The sample period
>>>   adjustment follows the same logic as scan period modification which
>>>   may not be ideal. More experimentation is required to fine-tune all
>>>   these aspects.
>>> - Currently I am acting (i.e., attempt to migrate a page) on each sampled
>>>   access. Need to check if it makes sense to delay it and do batched page
>>>   migration.
>>
>> Your current implementation is tied to AMD IBS. You will need an
>> architecture/vendor independent framework for upstreaming.
>
> I have tried to keep it vendor and arch neutral as far
> as possible, will re-look into this of course to make the
> interfaces more robust and useful.
>
> I have defined a static key (hw_access_hints=false) which will be
> set only by the platform driver when it detects the hardware
> capability to provide memory access information.
> NUMA balancing code skips the address space scanning when it sees
> this capability. The platform driver (access fault handler) will call
> into the NUMA balancing API with linear and physical address
> information of the accessed sample. Hence any equivalent hardware
> functionality could plug into this scheme in its current form. There
> are checks for this static key in the NUMA balancing logic at a few
> points to decide if it should work based on access faults or hint
> faults.
>
>>
>> BTW: can IBS sample memory writes too? Or just memory reads?
>
> IBS can tag both store and load operations.

Thanks for your information!

>>
>>> This RFC is mainly about showing how hardware provided access
>>> information could be used for NUMA balancing but I have run a
>>> few basic benchmarks from mmtests to check if there is any severe
>>> regression/overhead to any of those. Some benchmarks show some
>>> improvement, some show no significant change and a few regress.
>>> I am hopeful that with more appropriate tuning there is scope for
>>> further improvement here especially for workloads for which NUMA
>>> matters.
>>
>> What's your expected improvement of the PMU based NUMA balancing? It
>> should come from reduced overhead? higher accuracy? Quicker response?
>> I think that it may be better to prove that with appropriate statistics
>> for at least one workload.
>
> Just to clarify, unlike PEBS, IBS works independently of PMU.

Good to know this, thanks!

> I believe the improvement will come from reduced overhead due to
> sampling of relevant accesses only.
>
> I have a microbenchmark where two sets of threads bound to two
> NUMA nodes access the two different halves of memory which is
> initially allocated on the 1st node.
>
> On a two node Zen4 system, with 64 threads in each set accessing
> 8G of memory each from the initial allocation of 16G, I see that
> IBS driven NUMA balancing (i.e., this patchset) takes 50% less time
> to complete a fixed number of memory accesses. This could well
> be the best case and real workloads/benchmarks may not get this much
> uplift, but it does show the potential gain to be had.

Can you find a way to show the overhead of the original implementation
and of your method, so that we can compare them? I ask because you
think the improvement comes from the reduced overhead.

I am also interested in the page migration throughput per second during
the test, because I suspect your method can migrate pages faster.

Best Regards,
Huang, Ying
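
For the migration throughput question above, here is a minimal sketch of
one way to get a pages-migrated-per-second figure while the
microbenchmark runs. It assumes a kernel that exports the
numa_pages_migrated, pgmigrate_success and numa_hint_faults counters in
/proc/vmstat (CONFIG_NUMA_BALANCING and CONFIG_MIGRATION). The helper
names parse_vmstat, per_second and sample are hypothetical and are not
part of the patchset. Whether the RFC updates these counters for
IBS-driven migrations is not stated in this thread, so that should be
checked first.

```python
import time

# Counters of interest; assumed to be present in /proc/vmstat.
DEFAULT_KEYS = ("numa_pages_migrated", "pgmigrate_success", "numa_hint_faults")


def parse_vmstat(text):
    """Parse 'name value' lines (the /proc/vmstat layout) into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def per_second(before, after, seconds, keys=DEFAULT_KEYS):
    """Return the per-second increase of each counter over the interval."""
    return {k: (after.get(k, 0) - before.get(k, 0)) / seconds for k in keys}


def sample(interval=1.0, path="/proc/vmstat"):
    """Take two snapshots 'interval' seconds apart and return the rates."""
    with open(path) as f:
        before = parse_vmstat(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_vmstat(f.read())
    return per_second(before, after, interval)


if __name__ == "__main__":
    # Run alongside the benchmark; prints system-wide rates, not per-task ones.
    print(sample())
```

Running this once with the hint-fault based balancing and once with the
IBS based one, over the same benchmark phase, would give comparable
system-wide migration rates. It does not measure the NMI or hint-fault
handling overhead itself, which would need a separate measurement.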