Re: [PATCH v12 12/14] mm: multi-gen LRU: debugfs interface

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Qi Zheng <zhengqi.arch@bytedance.com>
To: Yu Zhao <yuzhao@google.com>, Andrew Morton <akpm@linux-foundation.org>
Cc: "Andi Kleen" <ak@linux.intel.com>,
	"Aneesh Kumar" <aneesh.kumar@linux.ibm.com>,
	"Catalin Marinas" <catalin.marinas@arm.com>,
	"Dave Hansen" <dave.hansen@linux.intel.com>,
	"Hillf Danton" <hdanton@sina.com>, "Jens Axboe" <axboe@kernel.dk>,
	"Johannes Weiner" <hannes@cmpxchg.org>,
	"Jonathan Corbet" <corbet@lwn.net>,
	"Linus Torvalds" <torvalds@linux-foundation.org>,
	"Matthew Wilcox" <willy@infradead.org>,
	"Mel Gorman" <mgorman@suse.de>,
	"Michael Larabel" <Michael@michaellarabel.com>,
	"Michal Hocko" <mhocko@kernel.org>,
	"Mike Rapoport" <rppt@kernel.org>,
	"Peter Zijlstra" <peterz@infradead.org>,
	"Tejun Heo" <tj@kernel.org>, "Vlastimil Babka" <vbabka@suse.cz>,
	"Will Deacon" <will@kernel.org>,
	linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org,
	page-reclaim@google.com, "Brian Geffon" <bgeffon@google.com>,
	"Jan Alexander Steffens" <heftig@archlinux.org>,
	"Oleksandr Natalenko" <oleksandr@natalenko.name>,
	"Steven Barrett" <steven@liquorix.net>,
	"Suleiman Souhlal" <suleiman@google.com>,
	"Daniel Byrne" <djbyrne@mtu.edu>,
	"Donald Carr" <d@chaos-reins.com>,
	"Holger Hoffstätte" <holger@applied-asynchrony.com>,
	"Konstantin Kharlamov" <Hi-Angel@yandex.ru>,
	"Shuang Zhai" <szhai2@cs.rochester.edu>,
	"Sofia Trinh" <sofia.trinh@edi.works>,
	"Vaibhav Jain" <vaibhav@linux.ibm.com>
Subject: Re: [PATCH v12 12/14] mm: multi-gen LRU: debugfs interface
Date: Wed, 22 Jun 2022 17:16:29 +0800	[thread overview]
Message-ID: <214db251-827c-715c-54cf-9c0e9bb5fe30@bytedance.com> (raw)
In-Reply-To: <20220614071650.206064-13-yuzhao@google.com>



On 2022/6/14 15:16, Yu Zhao wrote:
> Add /sys/kernel/debug/lru_gen for working set estimation and proactive
> reclaim. These techniques are commonly used to optimize job scheduling
> (bin packing) in data centers [1][2].
> 
> Compared with the page table-based approach and the PFN-based
> approach, this lruvec-based approach has the following advantages:
> 1. It offers better choices because it is aware of memcgs, NUMA nodes,
>     shared mappings and unmapped page cache.
> 2. It is more scalable because it is O(nr_hot_pages), whereas the
>     PFN-based approach is O(nr_total_pages).
> 
> Add /sys/kernel/debug/lru_gen_full for debugging.
> 
> [1] https://dl.acm.org/doi/10.1145/3297858.3304053
> [2] https://dl.acm.org/doi/10.1145/3503222.3507731
> 
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> Acked-by: Brian Geffon <bgeffon@google.com>
> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
> Acked-by: Steven Barrett <steven@liquorix.net>
> Acked-by: Suleiman Souhlal <suleiman@google.com>
> Tested-by: Daniel Byrne <djbyrne@mtu.edu>
> Tested-by: Donald Carr <d@chaos-reins.com>
> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
> Tested-by: Sofia Trinh <sofia.trinh@edi.works>
> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
> ---
>   include/linux/nodemask.h |   1 +
>   mm/vmscan.c              | 412 ++++++++++++++++++++++++++++++++++++++-
>   2 files changed, 403 insertions(+), 10 deletions(-)
> 

Hi Yu,

> +static ssize_t lru_gen_seq_write(struct file *file, const char __user *src,
> +				 size_t len, loff_t *pos)
> +{
> +	void *buf;
> +	char *cur, *next;
> +	unsigned int flags;
> +	struct blk_plug plug;
> +	int err = -EINVAL;
> +	struct scan_control sc = {
> +		.may_writepage = true,
> +		.may_unmap = true,
> +		.may_swap = true,
> +		.reclaim_idx = MAX_NR_ZONES - 1,
> +		.gfp_mask = GFP_KERNEL,
> +	};
> +
> +	buf = kvmalloc(len + 1, GFP_KERNEL);
> +	if (!buf)
> +		return -ENOMEM;
> +
> +	if (copy_from_user(buf, src, len)) {
> +		kvfree(buf);
> +		return -EFAULT;
> +	}
> +
> +	if (!set_mm_walk(NULL)) {

The current->reclaim_state will be dereferenced in set_mm_walk(), so
calling set_mm_walk() before set_task_reclaim_state(current,
&sc.reclaim_state) will cause panic:

[ 1861.154916] BUG: kernel NULL pointer dereference, address: 
0000000000000008
[ 1861.155720] #PF: supervisor read access in kernel mode
[ 1861.156263] #PF: error_code(0x0000) - not-present page
[ 1861.156805] PGD 0 P4D 0
[ 1861.157107] Oops: 0000 [#1] PREEMPT SMP PTI
[ 1861.157560] CPU: 5 PID: 1017 Comm: bash Not tainted 5.19.0-rc2+ #244
[ 1861.158227] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
BIOS rel-1.14.0-0-g14
[ 1861.159419] RIP: 0010:set_mm_walk+0x15/0x60
[ 1861.159878] Code: e8 30 5f 01 00 48 c7 43 70 00 00 00 00 5b c3 31 f6 
eb e2 66 90 0f 1f f
[ 1861.161806] RSP: 0018:ffffc90006dd3d58 EFLAGS: 00010246
[ 1861.162356] RAX: 0000000000000000 RBX: 00005582747a70b0 RCX: 
0000000000000000
[ 1861.163109] RDX: ffff88810a198000 RSI: 00005582747a70c1 RDI: 
0000000000000000
[ 1861.163855] RBP: ffff888104f4e400 R08: 0000000000000000 R09: 
ffff888100042400
[ 1861.164597] R10: 0000000000000000 R11: 0000000000000000 R12: 
ffff888685896fc0
[ 1861.165334] R13: 00005582747a70b0 R14: ffff888103ef2e40 R15: 
0000000000000011
[ 1861.166083] FS:  00007f843df57740(0000) GS:ffff888666b40000(0000) 
knlGS:0000000000000000
[ 1861.166921] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1861.167527] CR2: 0000000000000008 CR3: 0000000684e7e000 CR4: 
00000000000006e0
[ 1861.168272] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
[ 1861.169020] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
0000000000000400
[ 1861.169867] Call Trace:
[ 1861.170159]  <TASK>
[ 1861.170396]  lru_gen_seq_write+0xbf/0x600
[ 1861.170837]  ? _raw_spin_unlock+0x15/0x30
[ 1861.171272]  ? wp_page_reuse+0x5f/0x70
[ 1861.171678]  ? do_wp_page+0xda/0x3e0
[ 1861.172063]  ? __handle_mm_fault+0x92f/0xeb0
[ 1861.172529]  full_proxy_write+0x4d/0x70
[ 1861.172941]  vfs_write+0xb8/0x2a0
[ 1861.173302]  ksys_write+0x59/0xd0
[ 1861.173667]  do_syscall_64+0x34/0x80
[ 1861.174055]  entry_SYSCALL_64_after_hwframe+0x46/0xb0

> +		kvfree(buf);
> +		return -ENOMEM;
> +	}
> +
> +	set_task_reclaim_state(current, &sc.reclaim_state);
> +	flags = memalloc_noreclaim_save();
> +	blk_start_plug(&plug);
> +
> +	next = buf;
> +	next[len] = '\0';
> +
> +	while ((cur = strsep(&next, ",;\n"))) {
> +		int n;
> +		int end;
> +		char cmd;
> +		unsigned int memcg_id;
> +		unsigned int nid;
> +		unsigned long seq;
> +		unsigned int swappiness = -1;
> +		unsigned long opt = -1;
> +
> +		cur = skip_spaces(cur);
> +		if (!*cur)
> +			continue;
> +
> +		n = sscanf(cur, "%c %u %u %lu %n %u %n %lu %n", &cmd, &memcg_id, &nid,
> +			   &seq, &end, &swappiness, &end, &opt, &end);
> +		if (n < 4 || cur[end]) {
> +			err = -EINVAL;
> +			break;
> +		}
> +
> +		err = run_cmd(cmd, memcg_id, nid, seq, &sc, swappiness, opt);
> +		if (err)
> +			break;
> +	}
> +
> +	blk_finish_plug(&plug);
> +	memalloc_noreclaim_restore(flags);
> +	set_task_reclaim_state(current, NULL);
> +
> +	clear_mm_walk();

Ditto, we can't call clear_mm_walk() after 
set_task_reclaim_state(current, NULL).

Maybe it can be modified as follows:

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2422edc786eb..552e6ae5243e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -5569,12 +5569,12 @@ static ssize_t lru_gen_seq_write(struct file 
*file, const char __user *src,
                 return -EFAULT;
         }

+       set_task_reclaim_state(current, &sc.reclaim_state);
         if (!set_mm_walk(NULL)) {
                 kvfree(buf);
                 return -ENOMEM;
         }

-       set_task_reclaim_state(current, &sc.reclaim_state);
         flags = memalloc_noreclaim_save();
         blk_start_plug(&plug);

@@ -5609,9 +5609,9 @@ static ssize_t lru_gen_seq_write(struct file 
*file, const char __user *src,

         blk_finish_plug(&plug);
         memalloc_noreclaim_restore(flags);
+       clear_mm_walk();
         set_task_reclaim_state(current, NULL);

-       clear_mm_walk();
         kvfree(buf);

         return err ? : len;

Thanks,
Qi

> +	kvfree(buf);
> +
> +	return err ? : len;
> +}
> +


-- 
Thanks,
Qi

next prev parent reply	other threads:[~2022-06-22  9:16 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-06-14  7:16 [PATCH v12 00/14] Multi-Gen LRU Framework Yu Zhao
2022-06-14  7:16 ` [PATCH v12 01/14] mm: x86, arm64: add arch_has_hw_pte_young() Yu Zhao
2022-06-14  7:16 ` [PATCH v12 02/14] mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG Yu Zhao
2022-06-14  7:16 ` [PATCH v12 03/14] mm/vmscan.c: refactor shrink_node() Yu Zhao
2022-06-14  7:16 ` [PATCH v12 04/14] Revert "include/linux/mm_inline.h: fold __update_lru_size() into its sole caller" Yu Zhao
2022-06-14  7:16 ` [PATCH v12 05/14] mm: multi-gen LRU: groundwork Yu Zhao
2022-06-14  7:16 ` [PATCH v12 06/14] mm: multi-gen LRU: minimal implementation Yu Zhao
2022-06-14  7:16 ` [PATCH v12 07/14] mm: multi-gen LRU: exploit locality in rmap Yu Zhao
2022-06-29  4:44   ` Barry Song
2022-06-14  7:16 ` [PATCH v12 08/14] mm: multi-gen LRU: support page table walks Yu Zhao
2022-06-14  7:23   ` Yu Zhao
2022-06-14  7:16 ` [PATCH v12 09/14] mm: multi-gen LRU: optimize multiple memcgs Yu Zhao
2022-06-14  7:16 ` [PATCH v12 10/14] mm: multi-gen LRU: kill switch Yu Zhao
2022-06-14  7:16 ` [PATCH v12 11/14] mm: multi-gen LRU: thrashing prevention Yu Zhao
2022-06-14  7:16 ` [PATCH v12 12/14] mm: multi-gen LRU: debugfs interface Yu Zhao
2022-06-22  9:16   ` Qi Zheng [this message]
2022-06-22 19:13     ` Yu Zhao
2022-06-23  2:36       ` Qi Zheng
2022-06-14  7:16 ` [PATCH v12 13/14] mm: multi-gen LRU: admin guide Yu Zhao
2022-06-14  7:16 ` [PATCH v12 14/14] mm: multi-gen LRU: design doc Yu Zhao
2022-06-22  6:21 ` [PATCH v12 00/14] Multi-Gen LRU Framework Yu Zhao
2022-06-22 18:45   ` Andrew Morton
2022-06-22 18:58     ` Yu Zhao

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=214db251-827c-715c-54cf-9c0e9bb5fe30@bytedance.com \
    --to=zhengqi.arch@bytedance.com \
    --cc=Hi-Angel@yandex.ru \
    --cc=Michael@michaellarabel.com \
    --cc=ak@linux.intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=aneesh.kumar@linux.ibm.com \
    --cc=axboe@kernel.dk \
    --cc=bgeffon@google.com \
    --cc=catalin.marinas@arm.com \
    --cc=corbet@lwn.net \
    --cc=d@chaos-reins.com \
    --cc=dave.hansen@linux.intel.com \
    --cc=djbyrne@mtu.edu \
    --cc=hannes@cmpxchg.org \
    --cc=hdanton@sina.com \
    --cc=heftig@archlinux.org \
    --cc=holger@applied-asynchrony.com \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@suse.de \
    --cc=mhocko@kernel.org \
    --cc=oleksandr@natalenko.name \
    --cc=page-reclaim@google.com \
    --cc=peterz@infradead.org \
    --cc=rppt@kernel.org \
    --cc=sofia.trinh@edi.works \
    --cc=steven@liquorix.net \
    --cc=suleiman@google.com \
    --cc=szhai2@cs.rochester.edu \
    --cc=tj@kernel.org \
    --cc=torvalds@linux-foundation.org \
    --cc=vaibhav@linux.ibm.com \
    --cc=vbabka@suse.cz \
    --cc=will@kernel.org \
    --cc=willy@infradead.org \
    --cc=x86@kernel.org \
    --cc=yuzhao@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox