Date: Tue, 19 Sep 2023 09:15:10 +0200
From: Ingo Molnar
To: Raghavendra K T
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Ingo Molnar,
    Peter Zijlstra, Mel Gorman, Andrew Morton, David Hildenbrand,
    rppt@kernel.org, Juri Lelli, Vincent Guittot, Bharata B Rao,
    Aithal Srikanth, kernel test robot, Sapkal Swapnil, K Prateek Nayak
Subject: Re: [RFC PATCH V1 0/6] sched/numa: Enhance disjoint VMA scanning
In-Reply-To: <719f0729-d28f-d12f-cff4-ab8115861d30@amd.com>

* Raghavendra K T wrote:

> On 8/29/2023 11:36 AM, Raghavendra K T wrote:
> > Since commit fc137c0ddab2 ("sched/numa: enhance vma scanning logic") [1]
> > VMA scanning is allowed if:
> > 1) The task had accessed the VMA.
> >    Rationale: Reduce overhead for the tasks that had not
> >    touched VMA. Also filter out unnecessary scanning.
> >
> > 2) Early phase of the VMA scan where mm->numa_scan_seq is less than 2.
> >    Rationale: Understanding initial characteristics of VMAs and also
> >    prevent VMA scanning unfairness.
> >
> > While that works most of the time to reduce scanning overhead,
> > there are some corner cases associated with it.
> >
> > This was found in an internal LKP run and also reported by [2]. There was
> > an attempt to fix.
> >
> > Link: https://lore.kernel.org/linux-mm/cover.1685506205.git.raghavendra.kt@amd.com/T/
> >
> > This is a fully different series after Mel's feedback to address the issue
> > and also a continuation of enhancing VMA scanning for NUMA balancing.
> >
> > Problem statement (Disjoint VMA set):
> > ======================================
> > Let's look at some of the corner cases with the below example of tasks and
> > their access pattern.
> >
> > Consider N tasks (threads) of a process.
> > Set1 tasks accessing vma_x (group of VMAs)
> > Set2 tasks accessing vma_y (group of VMAs)
> >
> >              Set1                        Set2
> >      --------------------        --------------------
> >      | task_1..task_n/2 |        | task_n/2+1..task_n |
> >      --------------------        --------------------
> >               |                            |
> >               V                            V
> >      --------------------        --------------------
> >      |      vma_x       |        |      vma_y       |
> >      --------------------        --------------------
> >
> > Corner cases:
> > (a) Out of N tasks, not all of them get a fair opportunity to scan. (PeterZ).
> > Suppose Set1 tasks get more opportunity to scan (maybe because of the
> > activity pattern of tasks or other reasons in current design) in the above
> > example, then vma_x gets scanned more times than vma_y.
> >
> > Some experiments were also done here which illustrate this unfairness:
> > Link: https://lore.kernel.org/lkml/c730dee0-a711-8a8e-3eb1-1bfdd21e6add@amd.com/
> >
> > (b) Sizes of vmas can differ.
> > Suppose size of vma_y is far greater than the size of vma_x, then a bigger
> > portion of vma_y can potentially be left unscanned since scanning is bounded
> > by scan_size of 256MB (default) for each iteration.
> >
> > (c) Highly active threads trap a few VMAs frequently, and some of the VMAs not
> > accessed for a long time can potentially get starved of scanning indefinitely
> > (Mel). There is a possibility of lack of enough hints/details about VMAs if it
> > is needed later for migration.
> >
> > (d) Allocation of memory in some specific manner (Mel).
> > One example could be: suppose a main thread allocates memory and it is not
> > active. When other threads try to act upon it, they may not have many
> > hints about it, if the corresponding VMA was not scanned.
> >
> > (e) VMAs that are created after two full scans of mm (mm->numa_scan_seq > 2)
> > will never get scanned. (Observed rarely but very much possible depending on
> > workload behaviour).
> >
> > On top of this, a combination of some of the above (e.g., (a) and (b)) can
> > potentially amplify/worsen the side effect.
> >
> > This patchset tries to address the above issues by enhancing unconditional
> > VMA scanning logic.
> >
> > High level ideas:
> > =================
> > Idea-1) Depending on vma_size, populate a per-VMA vma_scan_select value,
> > decrement it and when it hits zero do a force scan (Mel).
> > vma_scan_select value is again repopulated when it hits zero.
> >
> > This is how the VMA scanning phases look after implementation:
> >
> > |<---p1--->|<-----p2----->|<-----p2----->|...
> >
> > Algorithm:
> > p1: New VMA, initial phase do not scan till scan_delay.
> >
> > p2: Allow scanning if the task has accessed VMA or vma_scan_select hit zero.
> >
> > Reinitialize vma_scan_select and repeat p2.
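
The Idea-1 countdown quoted above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the code from the series: `initial_scan_select()`, the `Vma` class, the constant 8 and the choice to decrement on every scan opportunity are all hypothetical, since the cover letter does not say how vma_scan_select is derived from vma_size. Only the 256MB default scan_size and the p1/p2 rules come from the quoted text.

```python
from dataclasses import dataclass, field

SCAN_SIZE_MB = 256  # default scan_size quoted in the cover letter


def initial_scan_select(vma_size_mb: int) -> int:
    """Hypothetical mapping from VMA size to a countdown value.

    Assumption (not from the series): a larger VMA needs more passes of
    SCAN_SIZE_MB to be covered, so it gets a shorter countdown.
    """
    chunks = max(1, -(-vma_size_mb // SCAN_SIZE_MB))  # ceiling division
    return max(1, 8 // chunks)


@dataclass
class Vma:
    size_mb: int
    age: int = 0                           # scan opportunities seen so far
    scan_select: int = field(init=False)   # countdown to the next forced scan

    def __post_init__(self) -> None:
        self.scan_select = initial_scan_select(self.size_mb)


def should_scan(vma: Vma, task_accessed: bool, scan_delay: int) -> bool:
    """Decide whether this scan opportunity may scan the VMA."""
    vma.age += 1
    # p1: a new VMA is not scanned until scan_delay has elapsed.
    if vma.age <= scan_delay:
        return False
    # p2: count down; force a scan when the counter hits zero.
    vma.scan_select -= 1
    if vma.scan_select <= 0:
        # Repopulate from the current size, so a size change is only
        # picked up after a forced scan (the "-" noted in the pros/cons).
        vma.scan_select = initial_scan_select(vma.size_mb)
        return True
    # Otherwise scan only if the task has accessed the VMA.
    return task_accessed
```

In this model a VMA that no task touches is still scanned once per countdown period, which is the built-in rate limiting that the pros/cons list refers to.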
> >
> > pros/cons:
> > +  : Ratelimiting is inbuilt to the approach
> > +  : vma_size is taken into account for scanning
> > +/-: Scanning continues forever
> > -  : Changes in vma size are taken care of only after a force scan, i.e.,
> >      vma_scan_select is repopulated only after vma_scan_select hits zero.
> >
> > Idea-1 can potentially cover all the issues mentioned above.
> >
> > Idea-2) Take bitmask_weight of latest access_pids value (suggested by Bharata).
> > If number of tasks accessing vma is >= 1, unconditionally allow scanning.
> >
> > Idea-3) Take bitmask_weight of access_pid history of VMA. If number of tasks
> > accessing VMA is > THRESHOLD (=3), unconditionally allow scanning.
> >
> > Rationale (Idea-2,3): Do not miss out scanning of critical VMAs.
> >
> > Idea-4) Have a per-VMA vma_scan_seq. Allow the unconditional scan till
> > vma_scan_seq reaches a value proportional (or equal) to vma_size/scan_size.
> > This is complementary to Idea-1.
> >
> > This is how the VMA scanning phases look after implementation:
> >
> > |<--p1--->|<-----p2----->|<-----p3----->|<-----p4----->...||<-----p2----->|<-----p3----->|<-----p4-----> ...||
> >                                                         RESET                                            RESET
> > Algorithm:
> > p1: New VMA, initial phase do not scan till scan_delay.
> >
> > p2: Allow scanning if task has accessed VMA or vma_scan_seq has reached till
> > f(vma_size)/scan_size, e.g., f = 1/2 * vma_size/scan_size.
> >
> > p3: Allow scanning if task has accessed VMA or vma_scan_seq has reached till
> > f(vma_size)/scan_size in a rate limited manner. This is an optional phase.
> >
> > p4: Allow scanning iff task has accessed VMA.
> >
> > Reset after p4 (optional).
> >
> > Repeat p2, p3, p4.
> >
> > Motivation: Allow aggressive scanning in the beginning followed by a rate
> > limited scanning. And then completely disallow scanning to avoid unnecessary
> > scanning. Reset time could be a function of scan_delay and chosen long enough
> > to aid long running task to forget history and start afresh.
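
The Idea-4 phases quoted above can be sketched the same way. Again this is a hedged user-space model, not kernel code: the function names, the upper bound used for p3 (one full pass of vma_size/scan_size) and the `ratelimit` parameter are assumptions made for illustration. The p2 bound follows the example f = 1/2 * vma_size/scan_size from the quoted text. p1 (scan_delay) and the optional RESET are left out.

```python
import math

SCAN_SIZE_MB = 256  # default scan_size quoted in the cover letter


def scan_phase(vma_scan_seq: int, vma_size_mb: int,
               scan_size_mb: int = SCAN_SIZE_MB) -> str:
    """Map a per-VMA scan sequence number to phase p2, p3 or p4."""
    # Number of scan_size passes needed to cover the whole VMA.
    full_passes = max(1, math.ceil(vma_size_mb / scan_size_mb))
    # p2 bound from the example f = 1/2 * vma_size/scan_size.
    p2_limit = max(1, full_passes // 2)
    if vma_scan_seq < p2_limit:
        return "p2"
    # Assumption: p3 lasts until one full pass has been completed.
    if vma_scan_seq < full_passes:
        return "p3"
    return "p4"


def allows_scan(vma_scan_seq: int, vma_size_mb: int,
                task_accessed: bool, ratelimit: int = 2) -> bool:
    """Return True if the VMA may be scanned at this sequence number."""
    if task_accessed:
        return True          # allowed in every phase
    phase = scan_phase(vma_scan_seq, vma_size_mb)
    if phase == "p2":
        return True          # unconditional scanning early on
    if phase == "p3":
        # Hypothetical rate limit: every ratelimit-th sequence number.
        return vma_scan_seq % ratelimit == 0
    return False             # p4: only accessed VMAs are scanned
```

For a 2048MB VMA this gives four unconditional scans, then every second scan until sequence number 8, then nothing unless a task accesses the VMA.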
> >
> > +  : Ratelimiting need to be taken care separately if needed.
> > +/-: Scanning continues only if RESET of vma_scan_seq is implemented.
> > +  : changes in vma size is taken care in every scan.
> >
> > Current patch series implements Ideas 1, 2, 3 + extension of access PID history
> > idea from PeterZ.
> >
> > Results:
> > ======
> > Base: 6.5.0-rc6+ (4853c74bd7ab)
> > SUT: Milan w/ 2 numa nodes 256 cpus
> >
> > mmtest numa01_THREAD_ALLOC manual run:
> >
> >                         base            patched
> > real                    1m22.758s       1m9.200s
> > user                    249m49.540s     229m30.039s
> > sys                     0m25.040s       3m10.451s
> >
> > numa_pte_updates        6985            1573363
> > numa_hint_faults        2705            1022623
> > numa_hint_faults_local  2279            389633
> > numa_pages_migrated     426             632990
> >
> > kernbench
> >                         base                   patched
> > Amean user-256          21989.09 (  0.00%)     21677.36 *  1.42%*
> > Amean syst-256          10171.34 (  0.00%)     10818.28 * -6.36%*
> > Amean elsp-256            166.81 (  0.00%)       168.40 * -0.95%*
> >
> > Duration User           65973.18               65038.00
> > Duration System         30538.92               32478.59
> > Duration Elapsed          529.52                 533.09
> >
> > Ops NUMA PTE updates    976844.00              962680.00
> > Ops NUMA hint faults    226763.00              245620.00
> > Ops NUMA pages migrated 220146.00              207025.00
> > Ops AutoNUMA cost         1144.84                1238.77
> >
> > Improvements in other benchmarks I have tested.
> > Time based:
> > Hashjoin        4.21%
> > Btree           2.04%
> > XSbench         0.36%
> >
> > Throughput based:
> > Graph500        -3.62%
> > Nas.bt          3.69%
> > Nas.ft          21.91%
> >
> > Note: VMA scanning improvements [1] has refined scanning so much that
> > system overhead we re-introduce with additional scan look glaringly
> > high. But If we consider the difference between before [1] and current
> > series, overall scanning overhead is considerably reduced.
> >
> > 1. Link: https://lore.kernel.org/lkml/cover.1677672277.git.raghavendra.kt@amd.com/T/#t
> > 2. Link: https://lore.kernel.org/lkml/cover.1683033105.git.raghavendra.kt@amd.com/
> >
> > Note: Patch description is again repeated in some patches to avoid any
> > need to copy from cover letter again.
> >
> > Peter Zijlstra (1):
> >   sched/numa: Increase tasks' access history
> >
> > Raghavendra K T (5):
> >   sched/numa: Move up the access pid reset logic
> >   sched/numa: Add disjoint vma unconditional scan logic
> >   sched/numa: Remove unconditional scan logic using mm numa_scan_seq
> >   sched/numa: Allow recently accessed VMAs to be scanned
> >   sched/numa: Allow scanning of shared VMAs
> >
> >  include/linux/mm.h       |  12 +++--
> >  include/linux/mm_types.h |   5 +-
> >  kernel/sched/fair.c      | 109 ++++++++++++++++++++++++++++++++-------
> >  3 files changed, 102 insertions(+), 24 deletions(-)
> >
>
> Hello Andrew,
>
> I am Resending patch rebasing to mm-unstable, adding results from Oliver
> and Swapnil.

Just for the record, a final version of this series should be submitted via
the scheduler tree, not -mm.

Thanks,

	Ingo