From: Qi Zheng
To: Yu Zhao, Andrew Morton
Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen, Hillf Danton,
    Jens Axboe, Johannes Weiner, Jonathan Corbet, Linus Torvalds,
    Matthew Wilcox, Mel Gorman, Michael Larabel, Michal Hocko,
    Mike Rapoport, Peter Zijlstra, Tejun Heo, Vlastimil Babka, Will Deacon,
    linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org,
    page-reclaim@google.com, Brian Geffon, Jan Alexander Steffens,
    Oleksandr Natalenko, Steven Barrett, Suleiman Souhlal, Daniel Byrne,
    Donald Carr, Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
    Sofia Trinh, Vaibhav Jain
Subject: Re: [PATCH v12 12/14] mm: multi-gen LRU: debugfs interface
Date: Wed, 22 Jun 2022 17:16:29 +0800
Message-ID: <214db251-827c-715c-54cf-9c0e9bb5fe30@bytedance.com>
In-Reply-To: <20220614071650.206064-13-yuzhao@google.com>
References: <20220614071650.206064-1-yuzhao@google.com>
    <20220614071650.206064-13-yuzhao@google.com>

On 2022/6/14 15:16, Yu Zhao wrote:
> Add /sys/kernel/debug/lru_gen for working set estimation and proactive
> reclaim. These techniques are commonly used to optimize job scheduling
> (bin packing) in data centers [1][2].
>
> Compared with the page table-based approach and the PFN-based
> approach, this lruvec-based approach has the following advantages:
> 1. It offers better choices because it is aware of memcgs, NUMA nodes,
>    shared mappings and unmapped page cache.
> 2. It is more scalable because it is O(nr_hot_pages), whereas the
>    PFN-based approach is O(nr_total_pages).
>
> Add /sys/kernel/debug/lru_gen_full for debugging.
>
> [1] https://dl.acm.org/doi/10.1145/3297858.3304053
> [2] https://dl.acm.org/doi/10.1145/3503222.3507731
>
> Signed-off-by: Yu Zhao
> Acked-by: Brian Geffon
> Acked-by: Jan Alexander Steffens (heftig)
> Acked-by: Oleksandr Natalenko
> Acked-by: Steven Barrett
> Acked-by: Suleiman Souhlal
> Tested-by: Daniel Byrne
> Tested-by: Donald Carr
> Tested-by: Holger Hoffstätte
> Tested-by: Konstantin Kharlamov
> Tested-by: Shuang Zhai
> Tested-by: Sofia Trinh
> Tested-by: Vaibhav Jain
> ---
>  include/linux/nodemask.h |   1 +
>  mm/vmscan.c              | 412 ++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 403 insertions(+), 10 deletions(-)
>

Hi Yu,

> +static ssize_t lru_gen_seq_write(struct file *file, const char __user *src,
> +				 size_t len, loff_t *pos)
> +{
> +	void *buf;
> +	char *cur, *next;
> +	unsigned int flags;
> +	struct blk_plug plug;
> +	int err = -EINVAL;
> +	struct scan_control sc = {
> +		.may_writepage = true,
> +		.may_unmap = true,
> +		.may_swap = true,
> +		.reclaim_idx = MAX_NR_ZONES - 1,
> +		.gfp_mask = GFP_KERNEL,
> +	};
> +
> +	buf = kvmalloc(len + 1, GFP_KERNEL);
> +	if (!buf)
> +		return -ENOMEM;
> +
> +	if (copy_from_user(buf, src, len)) {
> +		kvfree(buf);
> +		return -EFAULT;
> +	}
> +
> +	if (!set_mm_walk(NULL)) {

set_mm_walk() dereferences current->reclaim_state, so calling it
before set_task_reclaim_state(current, &sc.reclaim_state) triggers a
NULL pointer dereference:

[ 1861.154916] BUG: kernel NULL pointer dereference, address: 0000000000000008
[ 1861.155720] #PF: supervisor read access in kernel mode
[ 1861.156263] #PF: error_code(0x0000) - not-present page
[ 1861.156805] PGD 0 P4D 0
[ 1861.157107] Oops: 0000 [#1] PREEMPT SMP PTI
[ 1861.157560] CPU: 5 PID: 1017 Comm: bash Not tainted 5.19.0-rc2+ #244
[ 1861.158227] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g14
[ 1861.159419] RIP: 0010:set_mm_walk+0x15/0x60
[ 1861.159878] Code: e8 30 5f 01 00 48 c7 43 70 00 00 00 00 5b c3 31 f6 eb e2 66 90 0f 1f f
[ 1861.161806] RSP: 0018:ffffc90006dd3d58 EFLAGS: 00010246
[ 1861.162356] RAX: 0000000000000000 RBX: 00005582747a70b0 RCX: 0000000000000000
[ 1861.163109] RDX: ffff88810a198000 RSI: 00005582747a70c1 RDI: 0000000000000000
[ 1861.163855] RBP: ffff888104f4e400 R08: 0000000000000000 R09: ffff888100042400
[ 1861.164597] R10: 0000000000000000 R11: 0000000000000000 R12: ffff888685896fc0
[ 1861.165334] R13: 00005582747a70b0 R14: ffff888103ef2e40 R15: 0000000000000011
[ 1861.166083] FS: 00007f843df57740(0000) GS:ffff888666b40000(0000) knlGS:0000000000000000
[ 1861.166921] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1861.167527] CR2: 0000000000000008 CR3: 0000000684e7e000 CR4: 00000000000006e0
[ 1861.168272] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1861.169020] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1861.169867] Call Trace:
[ 1861.170159]
[ 1861.170396] lru_gen_seq_write+0xbf/0x600
[ 1861.170837] ? _raw_spin_unlock+0x15/0x30
[ 1861.171272] ? wp_page_reuse+0x5f/0x70
[ 1861.171678] ? do_wp_page+0xda/0x3e0
[ 1861.172063] ? __handle_mm_fault+0x92f/0xeb0
[ 1861.172529] full_proxy_write+0x4d/0x70
[ 1861.172941] vfs_write+0xb8/0x2a0
[ 1861.173302] ksys_write+0x59/0xd0
[ 1861.173667] do_syscall_64+0x34/0x80
[ 1861.174055] entry_SYSCALL_64_after_hwframe+0x46/0xb0

> +		kvfree(buf);
> +		return -ENOMEM;
> +	}
> +
> +	set_task_reclaim_state(current, &sc.reclaim_state);
> +	flags = memalloc_noreclaim_save();
> +	blk_start_plug(&plug);
> +
> +	next = buf;
> +	next[len] = '\0';
> +
> +	while ((cur = strsep(&next, ",;\n"))) {
> +		int n;
> +		int end;
> +		char cmd;
> +		unsigned int memcg_id;
> +		unsigned int nid;
> +		unsigned long seq;
> +		unsigned int swappiness = -1;
> +		unsigned long opt = -1;
> +
> +		cur = skip_spaces(cur);
> +		if (!*cur)
> +			continue;
> +
> +		n = sscanf(cur, "%c %u %u %lu %n %u %n %lu %n", &cmd, &memcg_id, &nid,
> +			   &seq, &end, &swappiness, &end, &opt, &end);
> +		if (n < 4 || cur[end]) {
> +			err = -EINVAL;
> +			break;
> +		}
> +
> +		err = run_cmd(cmd, memcg_id, nid, seq, &sc, swappiness, opt);
> +		if (err)
> +			break;
> +	}
> +
> +	blk_finish_plug(&plug);
> +	memalloc_noreclaim_restore(flags);
> +	set_task_reclaim_state(current, NULL);
> +
> +	clear_mm_walk();

Ditto: clear_mm_walk() also dereferences current->reclaim_state, so it
cannot be called after set_task_reclaim_state(current, NULL).

Maybe it can be modified as follows:

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2422edc786eb..552e6ae5243e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -5569,12 +5569,12 @@ static ssize_t lru_gen_seq_write(struct file *file, const char __user *src,
 		return -EFAULT;
 	}
 
+	set_task_reclaim_state(current, &sc.reclaim_state);
 	if (!set_mm_walk(NULL)) {
 		kvfree(buf);
 		return -ENOMEM;
 	}
 
-	set_task_reclaim_state(current, &sc.reclaim_state);
 	flags = memalloc_noreclaim_save();
 	blk_start_plug(&plug);
 
@@ -5609,9 +5609,9 @@ static ssize_t lru_gen_seq_write(struct file *file, const char __user *src,
 	blk_finish_plug(&plug);
 	memalloc_noreclaim_restore(flags);
 
+	clear_mm_walk();
 	set_task_reclaim_state(current, NULL);
 
-	clear_mm_walk();
 	kvfree(buf);
 
 	return err ? : len;

Thanks,
Qi

> +	kvfree(buf);
> +
> +	return err ? : len;
> +}
> +

-- 
Thanks,
Qi
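
For illustration only, the ordering constraint discussed above can be
modelled in userspace C. This is a sketch, not kernel code: the *_model
types and model_* helpers are hypothetical stand-ins for struct
reclaim_state, current, set_mm_walk(), clear_mm_walk() and
set_task_reclaim_state(). Where the real helpers would dereference a NULL
current->reclaim_state and oops, the model returns -EFAULT. The model also
resets the pointer on its error path, which is an extra assumption that
goes beyond the diff above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these are kernel types. */
struct mm_walk_model { int active; };
struct reclaim_state_model { struct mm_walk_model *mm_walk; };
struct task_model { struct reclaim_state_model *reclaim_state; };

static struct mm_walk_model walk_storage;

/* Stand-in for set_task_reclaim_state(): publish or drop the state pointer. */
static void model_set_task_reclaim_state(struct task_model *t,
					 struct reclaim_state_model *rs)
{
	t->reclaim_state = rs;
}

/*
 * Stand-in for set_mm_walk(): it needs task->reclaim_state to be set.
 * The model reports -EFAULT instead of crashing on a NULL pointer.
 */
static int model_set_mm_walk(struct task_model *t)
{
	if (!t->reclaim_state)
		return -EFAULT;
	walk_storage.active = 1;
	t->reclaim_state->mm_walk = &walk_storage;
	return 0;
}

/* Stand-in for clear_mm_walk(): same dependency on task->reclaim_state. */
static int model_clear_mm_walk(struct task_model *t)
{
	if (!t->reclaim_state)
		return -EFAULT;
	t->reclaim_state->mm_walk = NULL;
	walk_storage.active = 0;
	return 0;
}

/* Ordering as posted: the walk is set up before the state is published. */
int write_path_as_posted(void)
{
	struct task_model task = { NULL };
	struct reclaim_state_model rs = { NULL };
	int err;

	err = model_set_mm_walk(&task);	/* reclaim_state is still NULL here */
	if (err)
		return err;

	model_set_task_reclaim_state(&task, &rs);
	model_set_task_reclaim_state(&task, NULL);
	return model_clear_mm_walk(&task);
}

/* Reordered: publish the state first, and drop it only after the walk. */
int write_path_reordered(void)
{
	struct task_model task = { NULL };
	struct reclaim_state_model rs = { NULL };
	int err;

	model_set_task_reclaim_state(&task, &rs);
	err = model_set_mm_walk(&task);
	if (err) {
		/* Assumption: do not leave a pointer to a dead stack object. */
		model_set_task_reclaim_state(&task, NULL);
		return err;
	}

	/* ... the parsed commands would run here ... */

	err = model_clear_mm_walk(&task);
	assert(rs.mm_walk == NULL);	/* walk detached before state is dropped */
	model_set_task_reclaim_state(&task, NULL);
	return err;
}
```

In this model, write_path_as_posted() returns -EFAULT and
write_path_reordered() returns 0, which mirrors the oops reported above
and the effect of the suggested reordering.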