From: Hillf Danton <hdanton@sina.com>
To: Dave Chinner
Cc: Roman Gushchin, Andrew Morton, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, Yang Shi, Kent Overstreet
Subject: Re: [PATCH v2 0/7] mm: introduce shrinker debugfs interface
Date: Tue, 26 Apr 2022 16:45:17 +0800
Message-Id: <20220426084517.3458-1-hdanton@sina.com>
References: <20220422202644.799732-1-roman.gushchin@linux.dev>

On Tue, 26 Apr 2022 16:02:19 +1000 Dave Chinner wrote:
> On Fri, Apr 22, 2022 at 01:26:37PM -0700, Roman Gushchin wrote:
> > There are 50+ different shrinkers in the kernel, many with their own bells and
> > whistles. Under memory pressure, the kernel applies some pressure on each of
> > them in the order in which they were created/registered in the system. Some
> > of them can contain only a few objects, some can be quite large. Some can be
> > effective at reclaiming memory, some not.
> >
> > The only existing debugging mechanism is a couple of tracepoints in
> > do_shrink_slab(): mm_shrink_slab_start and mm_shrink_slab_end. They don't
> > cover everything though: shrinkers which report 0 objects will never show up,
> > and there is no support for memcg-aware shrinkers. Shrinkers are identified by
> > their scan function, which is not always enough (e.g. it is hard to guess which
> > super block's shrinker it is when all you have is "super_cache_scan").
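
For reference, those two tracepoints live in the vmscan trace event group and
are usually consumed straight from tracefs. A minimal sketch, assuming tracefs
is mounted at one of its usual locations:

  $ cd /sys/kernel/tracing 2>/dev/null || cd /sys/kernel/debug/tracing
  $ echo 1 > events/vmscan/mm_shrink_slab_start/enable
  $ echo 1 > events/vmscan/mm_shrink_slab_end/enable
  $ grep super_cache_scan trace_pipe     # filter events by scan function name

The grep on "super_cache_scan" is exactly the identification problem described
above: every superblock shrinker reports the same scan function symbol.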
>
> In general, I've had no trouble identifying individual shrinker
> instances because I'm always looking at individual subsystem
> shrinker tracepoints, too. Hence I've almost always got the
> identification information in the traces I need: trace just the
> individual shrinker tracepoints, apply a bit of sed/grep/awk, and I've
> got something I can feed to gnuplot or a python script to graph...
>
> > They are a passive
> > mechanism: there is no way to call into counting and scanning of an individual
> > shrinker and profile it.
>
> IDGI. Profiling shrinkers under ideal conditions, when there isn't
> memory pressure, is largely a useless exercise because execution
> patterns under memory pressure are vastly different.

Well, how many minutes, two or ten, does it take for kswapd to reclaim
100 xfs objects at DEF_PRIORITY-3?

> All the problems with shrinkers show up when progress cannot be made
> as fast as memory reclaim wants memory to be reclaimed. How do you
> trigger priority windup causing large amounts of deferred processing
> because shrinkers are running in GFP_NOFS/GFP_NOIO context? How do
> you simulate objects getting dirtied in memory so they can't be
> immediately reclaimed, so the shrinker can't make any progress at all
> until IO completes? How do you simulate the unbound concurrency that
> direct reclaim can drive into the shrinkers, which causes massive lock
> contention on shared structures and locks that need to be accessed
> to free objects?
>
> IOWs, if all you want to do is profile shrinkers running in the
> absence of memory pressure, then you can do that perfectly well with
> the existing 'echo 2 > /proc/sys/vm/drop_caches' mechanism. We don't
> need some complex debugfs API just to profile the shrinker
> behaviour.

Hm ... given ext4, what sense does xfs make? Or vice versa? Or given
wine, why Coke? I want to see how many minutes it takes to recycle ten
ext4 objects, with xfs left intact, before waking kswapd up.

Hillf

> So why do we need any of the complexity and potential for abuse that
> comes from exposing control of shrinkers directly to userspace like
> these patches do?
>
> > To provide better visibility and debug options for memory shrinkers
> > this patchset introduces a /sys/kernel/debug/shrinker interface, to some extent
> > similar to /sys/kernel/slab.
>
> /sys/kernel/slab contains read-only usage information - it is
> analogous for visibility arguments, but it is not equivalent for
> the rest of the "active" functionality you want to add here....
>
> > For each shrinker registered in the system a directory is created. The directory
> > contains "count" and "scan" files, which allow triggering the count_objects()
> > and scan_objects() callbacks. For memcg-aware and numa-aware shrinkers
> > count_memcg, scan_memcg, count_node, scan_node, count_memcg_node
> > and scan_memcg_node are additionally provided. They allow getting per-memcg
> > and/or per-node object counts and shrinking only a specific memcg/node.
>
> Great, but why does the shrinker introspection interface need active
> scan control functions like these?
>
> > To make debugging more pleasant, the patchset also names all shrinkers,
> > so that debugfs entries can have more meaningful names.
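
To put a rough number on the DEF_PRIORITY-3 question raised above: in kernels
of this era, do_shrink_slab() in mm/vmscan.c sizes each scan pass roughly as
(freeable >> priority) * 4 / seeks, before the batch-size and nr_deferred
carry-over logic is applied. A back-of-the-envelope sketch in bash arithmetic,
assuming DEF_PRIORITY = 12 and DEFAULT_SEEKS = 2:

  freeable=100   # objects reported by ->count_objects()
  priority=9     # DEF_PRIORITY (12) - 3
  seeks=2        # DEFAULT_SEEKS
  echo $(( (freeable >> priority) * 4 / seeks ))   # prints 0

With only 100 freeable objects at this priority each pass asks for zero
objects, so any actual reclaim is governed by the deferral and batching
heuristics rather than by the formula itself.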
> >
> > Usage examples:
> >
> > 1) List registered shrinkers:
> >   $ cd /sys/kernel/debug/shrinker/
> >   $ ls
> >   dqcache-16          sb-cgroup2-30    sb-hugetlbfs-33  sb-proc-41       sb-selinuxfs-22  sb-tmpfs-40    sb-zsmalloc-19
> >   kfree_rcu-0         sb-configfs-23   sb-iomem-12      sb-proc-44       sb-sockfs-8      sb-tmpfs-42    shadow-18
> >   sb-aio-20           sb-dax-11        sb-mqueue-21     sb-proc-45       sb-sysfs-26      sb-tmpfs-43    thp_deferred_split-10
> >   sb-anon_inodefs-15  sb-debugfs-7     sb-nsfs-4        sb-proc-47       sb-tmpfs-1       sb-tmpfs-46    thp_zero-9
> >   sb-bdev-3           sb-devpts-28     sb-pipefs-14     sb-pstore-31     sb-tmpfs-27      sb-tmpfs-49    xfs_buf-37
> >   sb-bpf-32           sb-devtmpfs-5    sb-proc-25       sb-rootfs-2      sb-tmpfs-29      sb-tracefs-13  xfs_inodegc-38
> >   sb-btrfs-24         sb-hugetlbfs-17  sb-proc-39       sb-securityfs-6  sb-tmpfs-35      sb-xfs-36      zspool-34
>
> Ouch. That's not going to be useful for humans debugging a system as
> there's no way to cross-reference a "superblock" with an actual
> filesystem mount point. Nor is there any way to really know that
> all the shrinkers in one filesystem are related.
>
> We normally solve this by ensuring that the fs-related object has
> the short bdev name appended to it. e.g.:
>
> $ pgrep xfs
> 1 I root    36    2  0  60 -20 -  0 -  Apr19 ?  00:00:10 [kworker/0:1H-xfs-log/dm-3]
> 1 I root   679    2  0  60 -20 -  0 -  Apr19 ?  00:00:00 [xfsalloc]
> 1 I root   680    2  0  60 -20 -  0 -  Apr19 ?  00:00:00 [xfs_mru_cache]
> 1 I root   681    2  0  60 -20 -  0 -  Apr19 ?  00:00:00 [xfs-buf/dm-1]
> .....
>
> Here we have a kworker process running log IO completion work on
> dm-3, two global workqueue rescuer tasks (alloc, mru) and a rescuer
> task for the xfs-buf workqueue on dm-1.
>
> We need the same name discrimination for shrinker information here,
> too - just saying "this is an XFS superblock shrinker" is not
> sufficient when there are hundreds of XFS mount points with a
> handful of shrinkers each.
>
> > 2) Get information about a specific shrinker:
> >   $ cd sb-btrfs-24/
> >   $ ls
> >   count  count_memcg  count_memcg_node  count_node  scan  scan_memcg  scan_memcg_node  scan_node
> >
> > 3) Count objects on the system/root cgroup level
> >   $ cat count
> >   212
> >
> > 4) Count objects on the system/root cgroup level per numa node (on a 2-node machine)
> >   $ cat count_node
> >   209 3
>
> So a single space-separated line with a number per node?
>
> When you have a few hundred nodes and hundreds of thousands of objects per
> node, we overrun the 4kB page size with a single line. What then?
>
> > 5) Count objects for each memcg (output format: cgroup inode, count)
> >   $ cat count_memcg
> >   1 212
> >   20 96
> >   53 817
> >   2297 2
> >   218 13
> >   581 30
> >   911 124
>
> What does "" mean?
>
> Also, this now iterates a separate memcg per line. A parser now needs
> to know the difference between count/count_node and
> count_memcg/count_memcg_node because they are subtly different file
> formats. These files should have the same format, otherwise it just
> creates needless complexity.
>
> Indeed, why do we even need count/count_node? They are just the
> "index 1" memcg output, so are totally redundant.
>
> > 6) Same but with a per-node output
> >   $ cat count_memcg_node
> >   1 209 3
> >   20 96 0
> >   53 810 7
> >   2297 2 0
> >   218 13 0
> >   581 30 0
> >   911 124 0
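
A side note on reading that output: the first column is the cgroup's inode
number, so it can be mapped back to a cgroup path with an ordinary filesystem
lookup. A minimal sketch, assuming cgroup2 is mounted at /sys/fs/cgroup and
reusing inode 911 from the example above:

  $ find /sys/fs/cgroup -xdev -inum 911   # prints the cgroup directory owning those 124 objects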
>
> So now we have a hundred nodes in the machine and thousands of
> memcgs. And the information we want is in the numerically largest
> memcg that is last in the list. And we want to graph its behaviour
> over time at high resolution (say 1Hz). Now we burn huge amounts
> of CPU counting memcgs that we don't care about and then throwing
> away most of the information. That's highly inefficient and really
> doesn't scale.
>
> [snip active scan interface]
>
> This just seems like a solution looking for a problem to solve.
> Can you please describe the problem this infrastructure is going
> to solve?
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> dchinner@redhat.com
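
For the kind of 1Hz graphing mentioned above, the proposed per-memcg file
would most likely be sampled with a small loop along these lines (a sketch
only; the cgroup inode 911 and the file layout are taken from the examples
earlier in the thread):

  while sleep 1; do
      printf '%s %s\n' "$(date +%s)" \
             "$(awk '$1 == 911 {print $2}' count_memcg)"
  done >> memcg-911.dat     # timestamped samples of one memcg, ready for gnuplot

which is exactly the "count everything, throw most of it away" pattern the
scaling objection is about.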