Date: Fri, 3 Feb 2023 12:15:48 +0100
From: Peter Zijlstra
To: Raghavendra K T
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Ingo Molnar, Mel Gorman, Andrew Morton, David Hildenbrand, rppt@kernel.org, Bharata B Rao, Disha Talreja
Subject: Re: [PATCH V2 2/3] sched/numa: Enhance vma scanning logic
References: <5f0872657ddb164aa047a2231f8dc1086fe6adf6.1675159422.git.raghavendra.kt@amd.com>
In-Reply-To: <5f0872657ddb164aa047a2231f8dc1086fe6adf6.1675159422.git.raghavendra.kt@amd.com>

On Wed, Feb 01, 2023 at 01:32:21PM +0530, Raghavendra K T wrote:
> During the Numa scanning make sure only relevant vmas of the
> tasks are scanned.
>
> Before:
> All the tasks of a process participate in scanning the vma
> even if they do not access vma in it's lifespan.
>
> Now:
> Except cases of first few unconditional scans, if a process do
> not touch vma (exluding false positive cases of PID collisions)
> tasks no longer scan all vma.
>
> Logic used:
> 1) 6 bits of PID used to mark active bit in vma numab status during
> fault to remember PIDs accessing vma. (Thanks Mel)
>
> 2) Subsequently in scan path, vma scanning is skipped if current PID
> had not accessed vma.
>
> 3) First two times we do allow unconditional scan to preserve earlier
> behaviour of scanning.
>
> Acknowledgement to Bharata B Rao for initial patch
> to store pid information.
>
> Suggested-by: Mel Gorman
> Signed-off-by: Raghavendra K T
> ---
>  include/linux/mm.h       | 14 ++++++++++++++
>  include/linux/mm_types.h |  1 +
>  kernel/sched/fair.c      | 15 +++++++++++++++
>  mm/huge_memory.c         |  1 +
>  mm/memory.c              |  1 +
>  5 files changed, 32 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 74d9df1d8982..489422942482 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1381,6 +1381,16 @@ static inline int xchg_page_access_time(struct page *page, int time)
>  	last_time = page_cpupid_xchg_last(page, time >> PAGE_ACCESS_TIME_BUCKETS);
>  	return last_time << PAGE_ACCESS_TIME_BUCKETS;
>  }
> +
> +static inline void vma_set_active_pid_bit(struct vm_area_struct *vma)
> +{
> +	unsigned int active_pid_bit;
> +
> +	if (vma->numab) {
> +		active_pid_bit = current->pid % BITS_PER_LONG;
> +		vma->numab->accessing_pids |= 1UL << active_pid_bit;
> +	}
> +}

Perhaps:

	if (vma->numab)
		__set_bit(current->pid % BITS_PER_LONG, &vma->numab->pids);

?

Or maybe even:

	bit = current->pid % BITS_PER_LONG;
	if (vma->numab && !__test_bit(bit, &vma->numab->pids))
		__set_bit(bit, &vma->numab->pids);

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 060b241ce3c5..3505ae57c07c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2916,6 +2916,18 @@ static void reset_ptenuma_scan(struct task_struct *p)
>  	p->mm->numa_scan_offset = 0;
>  }
>
> +static bool vma_is_accessed(struct vm_area_struct *vma)
> +{
> +	unsigned int active_pid_bit;
> +

	/*
	 * Tell us why 2....
	 */

> +	if (READ_ONCE(current->mm->numa_scan_seq) < 2)
> +		return true;
> +
> +	active_pid_bit = current->pid % BITS_PER_LONG;
> +
> +	return vma->numab->accessing_pids & (1UL << active_pid_bit);

	return __test_bit(current->pid % BITS_PER_LONG, &vma->numab->pids)

> +}
> +
>  /*
>   * The expensive part of numa migration is done from task_work context.
>   * Triggered from task_tick_numa().
> @@ -3032,6 +3044,9 @@ static void task_numa_work(struct callback_head *work)
>  		if (mm->numa_scan_seq && time_before(jiffies, vma->numab->next_scan))
>  			continue;
>

		/*
		 * tell us more...
		 */

> +		if (!vma_is_accessed(vma))
> +			continue;
> +
>  		do {
>  			start = max(start, vma->vm_start);
>  			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);

This feels wrong. Specifically, we track numa_scan_offset per mm; now, if
we divide the threads into two disjoint groups, each only using their own
set of vmas (in fact quite common for workloads with proper data
partitioning), it is possible to consistently sample one set of threads
and thus not scan the other set of vmas.

It seems somewhat unlikely, but not impossible, to create significant
unfairness.

> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 811d19b5c4f6..d908aa95f3c3 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1485,6 +1485,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>  	bool was_writable = pmd_savedwrite(oldpmd);
>  	int flags = 0;
>
> +	vma_set_active_pid_bit(vma);
>  	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
>  	if (unlikely(!pmd_same(oldpmd, *vmf->pmd))) {
>  		spin_unlock(vmf->ptl);
> diff --git a/mm/memory.c b/mm/memory.c
> index 8c8420934d60..2ec3045cb8b3 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4718,6 +4718,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>  	bool was_writable = pte_savedwrite(vmf->orig_pte);
>  	int flags = 0;
>
> +	vma_set_active_pid_bit(vma);
>  	/*
>  	 * The "pte" at this point cannot be used safely without
>  	 * validation through pte_unmap_same(). It's of NUMA type but

Urghh... do_*numa_page() are two nearly identical functions... is there
really no sane way to de-duplicate at least some of that?

Also, is this placement right? You're marking the thread even before we
know there's even a page there. I would expect this somewhere around
where we track lastpid.

Maybe numa_migrate_prep()?
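
For reference, the set-bit-on-fault / test-bit-on-scan idea discussed
above can be modelled in plain userspace C. This is only a rough sketch
and not kernel code: struct numab_model, model_set_active_pid_bit() and
model_is_accessed() are hypothetical names made up for the illustration,
BITS_PER_LONG is redefined locally, and ordinary shifts stand in for the
kernel bitops. It also shows the PID-collision false positive that the
changelog mentions.

```c
#include <limits.h>
#include <stdbool.h>

/* Local stand-in for the kernel constant (assumption: userspace model). */
#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical stand-in for the per-vma NUMA balancing state. */
struct numab_model {
	unsigned long pids;	/* one bit per (pid % BITS_PER_LONG) */
};

/* Fault path: remember that this pid touched the vma. */
static void model_set_active_pid_bit(struct numab_model *nb, int pid)
{
	nb->pids |= 1UL << ((unsigned int)pid % BITS_PER_LONG);
}

/*
 * Scan path: true if this pid, or any pid that maps to the same bit,
 * has faulted on the vma. Collisions give false positives only, so a
 * task may scan a vma it never touched, but never skips one it did.
 */
static bool model_is_accessed(const struct numab_model *nb, int pid)
{
	return (nb->pids & (1UL << ((unsigned int)pid % BITS_PER_LONG))) != 0;
}
```

With a zero-initialised struct numab_model, calling
model_set_active_pid_bit(&nb, 1000) makes model_is_accessed(&nb, 1000)
true while pid 1001 still reads false; pid 1000 + BITS_PER_LONG lands on
the same bit and is therefore reported as having accessed the vma too.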