Date: Tue, 31 Oct 2023 16:56:27 +0100
From: Michal Hocko
To: Johannes Weiner
Cc: Gregory Price, linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org,
 linux-mm@kvack.org, ying.huang@intel.com, akpm@linux-foundation.org,
 aneesh.kumar@linux.ibm.com, weixugc@google.com, apopple@nvidia.com,
 tim.c.chen@intel.com, dave.hansen@intel.com, shy828301@gmail.com,
 gregkh@linuxfoundation.org, rafael@kernel.org, Gregory Price
Subject: Re: [RFC PATCH v3 0/4] Node Weights and Weighted Interleave
References: <20231031003810.4532-1-gregory.price@memverge.com>
 <20231031152142.GA3029315@cmpxchg.org>
In-Reply-To: <20231031152142.GA3029315@cmpxchg.org>

On Tue 31-10-23 11:21:42, Johannes Weiner wrote:
> On Tue, Oct 31, 2023 at 10:53:41AM +0100, Michal Hocko wrote:
> > On Mon 30-10-23 20:38:06, Gregory Price wrote:
> > > This patchset implements weighted interleave and adds a new sysfs
> > > entry: /sys/devices/system/node/nodeN/accessM/il_weight.
> > >
> > > The il_weight of a node is used by mempolicy to implement weighted
> > > interleave when `numactl --interleave=...` is invoked. By default
> > > il_weight for a node is always 1, which preserves the default round
> > > robin interleave behavior.
> > >
> > > Interleave weights may be set from 0-100, and denote the number of
> > > pages that should be allocated from the node when interleaving
> > > occurs.
> > >
> > > For example, if a node's interleave weight is set to 5, 5 pages
> > > will be allocated from that node before the next node is scheduled
> > > for allocations.
> >
> > I find this semantic rather weird TBH. First of all, why do you think it
> > makes sense to have those weights global for all users? What if
> > different applications have a different view on how to spread their
> > interleaved memory?
> >
> > I do get that you might have different tiers with largely different
> > runtime characteristics but why would you want to interleave them into a
> > single mapping and have hard to predict runtime behavior?
> >
> > [...]
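
For readers following the thread, the semantics quoted above ("5 pages will be allocated from that node before the next node is scheduled") can be sketched as a weighted round robin. This is only an illustration, not code from the patchset: `il_state`, `IL_STATE_INIT`, `next_node` and `NR_NODES` are hypothetical names, and treating a weight of 0 as "skip this node" is an assumption, since the cover letter does not say what 0 means.

```c
#define NR_NODES 4

/* Hypothetical per-task interleave cursor (not the patchset's structure). */
struct il_state {
	unsigned int cur_node;  /* node currently being allocated from */
	unsigned int remaining; /* pages left on cur_node before moving on */
};

/* Start "before" node 0 so that the first allocation lands on node 0. */
#define IL_STATE_INIT { NR_NODES - 1, 0 }

/*
 * Pick the node for the next page. weights[n] is the number of
 * consecutive pages taken from node n per turn. Assumption: a weight
 * of 0 skips the node. Returns -1 if every weight is 0.
 */
static int next_node(struct il_state *st, const unsigned int weights[NR_NODES])
{
	for (unsigned int tries = 0; st->remaining == 0; tries++) {
		if (tries == NR_NODES)
			return -1; /* no node has a non-zero weight */
		st->cur_node = (st->cur_node + 1) % NR_NODES;
		st->remaining = weights[st->cur_node];
	}
	st->remaining--;
	return (int)st->cur_node;
}
```

With weights {5, 1, 1, 1}, successive calls yield node 0 five times, then nodes 1, 2 and 3 once each, then node 0 again. With all weights set to 1 this degenerates to the plain round robin interleave.
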
> > > In this way it becomes possible to set an interleaving strategy
> > > that fits the available bandwidth for the devices available on
> > > the system. An example system:
> > >
> > > Node 0 - CPU+DRAM, 400GB/s BW (200 cross socket)
> > > Node 1 - CPU+DRAM, 400GB/s BW (200 cross socket)
> > > Node 2 - CXL Memory. 64GB/s BW, on Node 0 root complex
> > > Node 3 - CXL Memory. 64GB/s BW, on Node 1 root complex
> > >
> > > In this setup, the effective weights for nodes 0-3 for a task
> > > running on Node 0 may be [60, 20, 10, 10].
> > >
> > > This spreads memory out across devices which all have different
> > > latency and bandwidth attributes in a way that can maximize the
> > > available resources.
> >
> > OK, so why is this any better than not using any memory policy and
> > relying on demotion to push cold memory down the tier hierarchy?
> >
> > What is the actual real life usecase and what kind of benefits can you
> > present?
>
> There are two things CXL gives you: additional capacity and additional
> bus bandwidth.
>
> The promotion/demotion mechanism is good for the capacity usecase,
> where you have a nice hot/cold gradient in the workingset and want
> placement accordingly across faster and slower memory.
>
> The interleaving is useful when you have a flatter workingset
> distribution and poorer access locality. In that case, the CPU caches
> are less effective and the workload can be bus-bound. The workload
> might fit entirely into DRAM, but concentrating it there is
> suboptimal. Fanning it out in proportion to the relative performance
> of each memory tier gives better results.
>
> We experimented with datacenter workloads on such machines last year
> and found significant performance benefits:
>
> https://lore.kernel.org/linux-mm/YqD0%2FtzFwXvJ1gK6@cmpxchg.org/T/

Thanks, this is a useful insight.

> This hopefully also explains why it's a global setting. The usecase is
> different from conventional NUMA interleaving, which is used as a
> locality measure: spread shared data evenly between compute
> nodes. This one isn't about locality - the CXL tier doesn't have local
> compute. Instead, the optimal spread is based on hardware parameters,
> which is a global property rather than a per-workload one.

Well, I am not convinced about that TBH. Sure, it is probably a good fit
for this specific CXL usecase but it just doesn't fit into many others I
can think of - e.g. proportional use of those tiers based on the
workload - you get what you pay for.

Is there any specific reason for not having a new interleave interface
which defines weights for the nodemask? Is this because the policy
itself is very dynamic or is this more driven by simplicity of use?

-- 
Michal Hocko
SUSE Labs