From: "Huang, Ying" <ying.huang@intel.com>
To: Gregory Price
Cc: Gregory Price, Aneesh Kumar K.V, Wei Xu, Alistair Popple, Dan Williams, Dave Hansen, Johannes Weiner, Jonathan Cameron, Michal Hocko, Tim Chen, Yang Shi
Subject: Re: [RFC PATCH v2 0/3] mm: mempolicy: Multi-tier weighted interleaving
In-Reply-To: (Gregory Price's message of "Tue, 17 Oct 2023 22:47:36 -0400")
References: <20231009204259.875232-1-gregory.price@memverge.com> <87o7gzm22n.fsf@yhuang6-desk2.ccr.corp.intel.com> <87pm1cwcz5.fsf@yhuang6-desk2.ccr.corp.intel.com> <87edhrunvp.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Fri, 20 Oct 2023 14:11:40 +0800
Message-ID: <87fs25g6w3.fsf@yhuang6-desk2.ccr.corp.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13)
MIME-Version: 1.0
Content-Type: text/plain; charset=ascii
Gregory Price writes:

[snip]

> Example 1: A single-socket system with multiple CXL memory devices
> ===
> CPU Node: node0
> CXL Nodes: node1, node2
>
> Bandwidth attributes (in theory):
> node0 - 8 channels - ~307GB/s
> node1 - x16 link - 64GB/s
> node2 - x8 link - 32GB/s
>
> In a system like this, the optimal distribution of memory on an
> interleave for maximizing bandwidth is about 76%/16%/8%.
>
> For the sake of simplicity: --weighted-interleave=0:76,1:16,2:8
> but realistically we could make the weights sysfs values in the node.
>
> Regardless of the mechanism used to engage this, the most effective way
> to capture this in the system is by applying weights to nodes, not
> tiers. If done in tiers, each node would have to be assigned to its own
> tier, making the mechanism equivalent. So you might as well simplify
> the whole thing and chop the memtier component out.
>
> Is this configuration realistic? *shrug* - technically possible. And in
> fact most hardware- or driver-based interleaving mechanisms would not
> really be able to manage an interleave region across these nodes, at
> least not without placing the x16 link in x8 mode, or just having the
> wrong distribution %'s.
>
>
> Example 2: A dual-socket system with 1 CXL device per socket
> ===
> CPU Nodes: node0, node1
> CXL Nodes: node2, node3 (on sockets 0 and 1 respectively)
>
> Bandwidth attributes (in theory):
> nodes 0 & 1 - 8 channels - ~307GB/s ea.
> nodes 2 & 3 - x16 link - 64GB/s ea.
>
> This is similar to example #1, but with one difference: a task running
> on node 0 should not treat nodes 0 and 1 the same, nor nodes 2 and 3.
> This is because on accesses to nodes 1 and 3, the cross-socket link
> (UPI, or whatever AMD calls it) becomes a bandwidth chokepoint.
>
> So from the perspective of node 0, the "real total" available bandwidth
> is about 307GB/s + 64GB/s + (41.6GB/s * UPI links) in the case of
> Intel, so the best result you could get is around 307+64+164 = 535GB/s
> if you have the full 4 links.
>
> You'd want to distribute the cross-socket traffic proportional to UPI
> bandwidth, not the remote nodes' total.
>
> This leaves us with weights of:
>
> node0 - 57%
> node1 - 26%
> node2 - 12%
> node3 - 5%
>
> Again, nodes are naturally the place to carry the weights here. In
> this scenario, placing them in memory-tiers would require one tier per
> node.

Does the workload run on the CPUs of node 0 only? That appears
unreasonable. If the memory bandwidth requirement of the workload is so
large that CXL is used to expand bandwidth, why not also run the
workload on the CPUs of node 1 and use the full memory bandwidth of
node 1?

If the workload runs on the CPUs of both node 0 and node 1, then the
cross-socket traffic should be minimized if possible. That is,
threads/processes on node 0 should interleave memory of node 0 and
node 2, while those on node 1 should interleave memory of node 1 and
node 3.

But TBH, I lack knowledge about real-life workloads, so my
understanding may be wrong. Please correct me if I have made any
mistakes.

--
Best Regards,
Huang, Ying

[snip]
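For concreteness, the weight percentages quoted in both examples above can be reproduced with a short sketch. The bandwidth figures are taken from the examples; the way the UPI-capped cross-socket budget is split between node1 and node3 (proportional to their native bandwidth) is my assumption, chosen because it reproduces the 26%/5% figures:

```python
def interleave_weights(bw):
    """Weight each node proportional to its bandwidth (percent, rounded)."""
    total = sum(bw.values())
    return {node: round(100 * b / total) for node, b in bw.items()}

# Example 1: single socket, node0/node1/node2 = 307/64/32 GB/s.
ex1 = interleave_weights({"node0": 307, "node1": 64, "node2": 32})

# Example 2, from node 0's perspective: cross-socket traffic is capped
# by UPI (4 links x 41.6 GB/s ~= 166 GB/s); split that budget between
# node1 and node3 in proportion to their native bandwidth (assumption).
upi = 4 * 41.6
ex2 = interleave_weights({
    "node0": 307,
    "node1": upi * 307 / (307 + 64),
    "node2": 64,
    "node3": upi * 64 / (307 + 64),
})
```

Running this yields 76/16/8 for example 1 and 57/26/12/5 for example 2, matching the figures in the thread.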