Message-ID: <790f44b4-2cad-469f-adcb-aa1cb31ff802@bytedance.com>
Date: Wed, 24 Jul 2024 20:11:52 +0800
Subject: Re: [PATCH v1] mm/numa_balancing: Fix the memory thrashing problem in the single-threaded process
From: Abel Wu <wuyun.abel@bytedance.com>
To: Zhongkun He
Cc: peterz@infradead.org, mgorman@suse.de, ying.huang@intel.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20240723053250.3263125-1-hezhongkun.hzk@bytedance.com>

On 7/24/24 11:55 AM, Zhongkun He wrote:
> On Tue, Jul 23, 2024 at 9:39 PM Abel Wu wrote:
>>
>> Hi Zhongkun,
>>
>> On 7/23/24 1:32 PM, Zhongkun He wrote:
>>> I found a problem on my test machine: the memory of a process is
>>> repeatedly migrated between two nodes and never stops.
>>>
>>> 1. Test steps and the machine.
>>> ------------
>>> VM machine: 4 NUMA nodes and 10GB per node.
>>>
>>> stress --vm 1 --vm-bytes 12g --vm-keep
>>>
>>> The info of numa_stat:
>>> while :; do cat memory.numa_stat | grep -w anon; sleep 5; done
>>> anon N0=98304 N1=0 N2=10250747904 N3=2634334208
>>
>> I am curious what exactly made the worker migrate
>> to N3? And later...
>
> The maximum capacity of each node is 10GB, but the workload requires 12GB,
> so there is always 2GB on another node. With the patch below we only
> get page faults on other nodes, not locally, so we keep migrating pages
> to other nodes because p->numa_preferred_nid is always the other node.

Ahh sorry, I didn't notice the size of the node...

>
>>
>>> anon N0=98304 N1=0 N2=10250747904 N3=2634334208
>>> anon N0=98304 N1=0 N2=9937256448 N3=2947825664
>>> anon N0=98304 N1=0 N2=8863514624 N3=4021567488
>>> anon N0=98304 N1=0 N2=7789772800 N3=5095309312
>>> anon N0=98304 N1=0 N2=6716030976 N3=6169051136
>>> anon N0=98304 N1=0 N2=5642289152 N3=7242792960
>>> anon N0=98304 N1=0 N2=5105442816 N3=7779639296
>>> anon N0=98304 N1=0 N2=5105442816 N3=7779639296
>>> anon N0=98304 N1=0 N2=4837007360 N3=8048074752
>>> anon N0=98304 N1=0 N2=3763265536 N3=9121816576
>>> anon N0=98304 N1=0 N2=2689523712 N3=10195558400
>>> anon N0=98304 N1=0 N2=2515148800 N3=10369933312
>>> anon N0=98304 N1=0 N2=2515148800 N3=10369933312
>>> anon N0=98304 N1=0 N2=2515148800 N3=10369933312
>>
>> ... why was it moved back to N2?
>
> The private page faults on N2 are higher than those on N3.
>
>>
>>> anon N0=98304 N1=0 N2=3320455168 N3=9564626944
>>> anon N0=98304 N1=0 N2=4394196992 N3=8490885120
>>> anon N0=98304 N1=0 N2=5105442816 N3=7779639296
>>> anon N0=98304 N1=0 N2=6174195712 N3=6710886400
>>> anon N0=98304 N1=0 N2=7247937536 N3=5637144576
>>> anon N0=98304 N1=0 N2=8321679360 N3=4563402752
>>> anon N0=98304 N1=0 N2=9395421184 N3=3489660928
>>> anon N0=98304 N1=0 N2=10247872512 N3=2637209600
>>> anon N0=98304 N1=0 N2=10247872512 N3=2637209600
>>>
>>> 2. Root cause:
>>> Since commit 3e32158767b0 ("mm/mprotect.c: don't touch single threaded
>>> PTEs which are on the right node"), the PTEs of local pages are not
>>> changed in change_pte_range() for a single-threaded process, so no
>>> page-fault information is generated for them in do_numa_page(). If a
>>> single-threaded process has memory on another node, it will
>>> unconditionally migrate all of its local memory to that node,
>>> even if the remote node holds only one page.
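For context, the skip described above lives in change_pte_range() in
mm/mprotect.c. A paraphrased sketch of the logic added by commit
3e32158767b0 (simplified here for illustration, not the exact upstream
code) looks roughly like this:

	/* Get a target node only for single-threaded private mappings */
	if (prot_numa && !(vma->vm_flags & VM_SHARED) &&
	    atomic_read(&vma->vm_mm->mm_users) == 1)
		target_node = numa_node_id();	/* node the task runs on */

	/* later, for each present PTE in the range */
	if (prot_numa) {
		struct page *page = vm_normal_page(vma, addr, oldpte);

		/* already PROT_NONE: skip and avoid a needless TLB flush */
		if (!page || pte_protnone(oldpte))
			continue;

		/*
		 * Leave the PTE alone if the page already sits on the node
		 * the single-threaded task runs on. Such pages never fault
		 * through do_numa_page(), so no "local" NUMA hint faults
		 * are ever recorded for them.
		 */
		if (target_node == page_to_nid(page))
			continue;

		/* otherwise fall through and make the PTE PROT_NONE */
	}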
>>
>> IIUC the remote pages will be moved to the node where the worker
>> is running since local (private) PTEs are not set to protnone and
>> won't be faulted on.
>>
>
> Yes.
>
>>>
>>> So, let's fix it. The memory of a single-threaded process should follow
>>> the CPU, not the NUMA faults info, in order to avoid memory thrashing.
>>
>> Don't forget the 'Fixes' tag for bugfix patches :)
>
> OK, thanks.
>
>>
>>>
>>> ...>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 24dda708b699..d7cbbda568fb 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -2898,6 +2898,12 @@ static void task_numa_placement(struct task_struct *p)
>>>  		numa_group_count_active_nodes(ng);
>>>  		spin_unlock_irq(group_lock);
>>>  		max_nid = preferred_group_nid(p, max_nid);
>>> +	} else if (atomic_read(&p->mm->mm_users) == 1) {
>>> +		/*
>>> +		 * The memory of a single-threaded process should
>>> +		 * follow the CPU in order to avoid memory thrashing.
>>> +		 */
>>> +		max_nid = numa_node_id();
>>>  	}
>>>
>>>  	if (max_faults) {
>>
>> Since you don't want to respect the faults info, can we simply
>> skip task placement?
>
> This is a good suggestion. It would be even better if there were some
> feedback from others.

Although it is still a pity that, if a remote node holds more hot pages
than p->numa_preferred_nid, we have to migrate them to the local node
rather than simply migrating the task to the hot node.
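Just to make that alternative concrete, here is a rough and untested
sketch of what "skip task placement" could look like (an assumption for
illustration only, not a proposed patch; the exact condition and where
it should live are open questions):

static void task_numa_placement(struct task_struct *p)
{
	/*
	 * Untested sketch: a single-threaded process only records remote
	 * NUMA hint faults (local PTEs are skipped in change_pte_range()),
	 * so the fault statistics cannot be trusted to pick a preferred
	 * node. Let the memory follow the CPU instead of letting placement
	 * flip p->numa_preferred_nid back and forth.
	 */
	if (p->mm && atomic_read(&p->mm->mm_users) == 1)
		return;

	/* ... existing fault-statistics scan and preferred-node update ... */
}

Whether the fault statistics should still be decayed in that case, and
how this interacts with numa_group handling, would need to be sorted out
before anything like this is sent as a patch.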