Date: Wed, 2 Apr 2025 22:32:14 +1100
From: Dave Chinner <david@fromorbit.com>
To: Yafang Shao
Cc: Harry Yoo, Kees Cook, joel.granados@kernel.org, linux-fsdevel@vger.kernel.org,
 linux-kernel@vger.kernel.org, Josef Bacik, linux-mm@kvack.org, Vlastimil Babka
Subject: Re: [PATCH] proc: Avoid costly high-order page allocations when reading proc files
References: <20250401073046.51121-1-laoar.shao@gmail.com>
 <3315D21B-0772-4312-BCFB-402F408B0EF6@kernel.org>

On Wed, Apr 02, 2025 at 04:42:06PM +0800, Yafang Shao wrote:
> On Wed, Apr 2, 2025 at 12:15 PM Harry Yoo wrote:
> >
> > On Tue, Apr 01, 2025 at 07:01:04AM -0700, Kees Cook wrote:
> > >
> > > On April 1, 2025 12:30:46 AM PDT, Yafang Shao wrote:
> > > > While investigating a kcompactd 100% CPU utilization issue in production, I
> > > > observed frequent costly high-order (order-6) page allocations triggered by
> > > > proc file reads from monitoring tools.
> > > > This can be reproduced with a simple test case:
> > > >
> > > >   fd = open(PROC_FILE, O_RDONLY);
> > > >   size = read(fd, buff, 256KB);
> > > >   close(fd);
> > > >
> > > > Although we should modify the monitoring tools to use smaller buffer sizes,
> > > > we should also enhance the kernel to prevent these expensive high-order
> > > > allocations.
> > > >
> > > > Signed-off-by: Yafang Shao
> > > > Cc: Josef Bacik
> > > > ---
> > > >  fs/proc/proc_sysctl.c | 10 +++++++++-
> > > >  1 file changed, 9 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
> > > > index cc9d74a06ff0..c53ba733bda5 100644
> > > > --- a/fs/proc/proc_sysctl.c
> > > > +++ b/fs/proc/proc_sysctl.c
> > > > @@ -581,7 +581,15 @@ static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
> > > >  	error = -ENOMEM;
> > > >  	if (count >= KMALLOC_MAX_SIZE)
> > > >  		goto out;
> > > > -	kbuf = kvzalloc(count + 1, GFP_KERNEL);
> > > > +
> > > > +	/*
> > > > +	 * Use vmalloc if the count is too large to avoid costly high-order page
> > > > +	 * allocations.
> > > > +	 */
> > > > +	if (count < (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
> > > > +		kbuf = kvzalloc(count + 1, GFP_KERNEL);
> > >
> > > Why not move this check into the kvmalloc family?
> >
> > Hmm, should this check really be in the kvmalloc family?
>
> Modifying the existing kvmalloc functions risks performance regressions.
> Could we instead introduce a new variant like vkmalloc() (favoring
> vmalloc over kmalloc) or kvmalloc_costless()?

We should fix kvmalloc() instead of continuing to force subsystems to
work around the limitations of kvmalloc().

Have a look at xlog_kvmalloc() in XFS. It implements a basic fail-fast,
no-retry, high-order kmalloc() before it falls back to vmalloc(), by
turning off direct reclaim for the kmalloc() call. Hence if there isn't
a high-order page on the free lists ready to allocate, it falls back to
vmalloc() immediately. (A minimal sketch of this pattern follows the
quoted commit message below.)

For XFS, using xlog_kvmalloc() reduced the high-order per-allocation
overhead by around 80% when compared to a standard kvmalloc() call.
Numbers and profiles were documented in the commit message (reproduced
in whole below)...

> > I don't think users would expect kvmalloc() to implicitly decide on using
> > vmalloc() without trying kmalloc() first, just because it's a high-order
> > allocation.

Right, but users expect kvmalloc() to use the most efficient allocation
paths available to it. In this case, vmalloc() is faster and more
reliable than direct reclaim with compaction. Hence vmalloc() should
really be the primary fallback path when high-order pages are not
immediately available to kmalloc() when called from kvmalloc()...

-Dave.
-- 
Dave Chinner
david@fromorbit.com

commit 8dc9384b7d75012856b02ff44c37566a55fc2abf
Author: Dave Chinner <david@fromorbit.com>
Date:   Tue Jan 4 17:22:18 2022 -0800

xfs: reduce kvmalloc overhead for CIL shadow buffers

Oh, let me count the ways that the kvmalloc API sucks dog eggs.
The problem is that when we are logging lots of large objects, we hit
kvmalloc really damn hard with costly-order allocations, and behaviour
utterly sucks:

  - 49.73% xlog_cil_commit
     - 31.62% kvmalloc_node
        - 29.96% __kmalloc_node
           - 29.38% kmalloc_large_node
              - 29.33% __alloc_pages
                 - 24.33% __alloc_pages_slowpath.constprop.0
                    - 18.35% __alloc_pages_direct_compact
                       - 17.39% try_to_compact_pages
                          - compact_zone_order
                             - 15.26% compact_zone
                                  5.29% __pageblock_pfn_to_page
                                  3.71% PageHuge
                                - 1.44% isolate_migratepages_block
                                     0.71% set_pfnblock_flags_mask
                                  1.11% get_pfnblock_flags_mask
                       - 0.81% get_page_from_freelist
                          - 0.59% _raw_spin_lock_irqsave
                             - do_raw_spin_lock
                                  __pv_queued_spin_lock_slowpath
                    - 3.24% try_to_free_pages
                       - 3.14% shrink_node
                          - 2.94% shrink_slab.constprop.0
                             - 0.89% super_cache_count
                                - 0.66% xfs_fs_nr_cached_objects
                                   - 0.65% xfs_reclaim_inodes_count
                                        0.55% xfs_perag_get_tag
                               0.58% kfree_rcu_shrink_count
                    - 2.09% get_page_from_freelist
                       - 1.03% _raw_spin_lock_irqsave
                          - do_raw_spin_lock
                               __pv_queued_spin_lock_slowpath
                 - 4.88% get_page_from_freelist
                    - 3.66% _raw_spin_lock_irqsave
                       - do_raw_spin_lock
                            __pv_queued_spin_lock_slowpath
     - 1.63% __vmalloc_node
        - __vmalloc_node_range
           - 1.10% __alloc_pages_bulk
              - 0.93% __alloc_pages
                 - 0.92% get_page_from_freelist
                    - 0.89% rmqueue_bulk
                       - 0.69% _raw_spin_lock
                          - do_raw_spin_lock
                               __pv_queued_spin_lock_slowpath
       13.73% memcpy_erms
     - 2.22% kvfree

On this workload, that's almost a dozen CPUs all trying to compact and
reclaim memory inside kvmalloc_node at the same time. Yet it is
regularly falling back to vmalloc despite all that compaction, page and
shrinker reclaim that direct reclaim is doing. Copying all the metadata
is taking far less CPU time than allocating the storage!

Direct reclaim should be considered extremely harmful.

This is a high frequency, high throughput, CPU usage and latency
sensitive allocation. We've got memory there, and we're using kvmalloc
to allow memory allocation to avoid doing lots of work to try to do
contiguous allocations. Except it still does *lots of costly work* that
is unnecessary.

Worse: the only way to avoid the slowpath page allocation trying to do
compaction on costly allocations is to turn off direct reclaim (i.e.
remove __GFP_DIRECT_RECLAIM from the gfp flags).

Unfortunately, the stupid kvmalloc API then says "oh, this isn't a
GFP_KERNEL allocation context, so you only get kmalloc!". This cuts off
the vmalloc fallback, and this leads to almost instant OOM problems
which end up in filesystem deadlocks, shutdowns and/or kernel crashes.

I want some basic kvmalloc behaviour:

- kmalloc for a contiguous range with fail-fast semantics - no
  compaction direct reclaim if the allocation enters the slow path.
- run normal vmalloc (i.e. GFP_KERNEL) if kmalloc fails

The really, really stupid part about this is these kvmalloc() calls are
run under memalloc_nofs task context, so all the allocations are always
reduced to GFP_NOFS regardless of the fact that kvmalloc requires
GFP_KERNEL to be passed in. IOWs, we're already telling kvmalloc to
behave differently to the gfp flags we pass in, but it still won't
allow vmalloc to be run with anything other than GFP_KERNEL.

So, this patch open codes the kvmalloc() in the commit path to have the
above described behaviour. The result is we more than halve the CPU
time spent doing kvmalloc() in this path, and transaction commits with
64kB objects in them more than double. i.e.
we get ~5x reduction in CPU usage per costly-sized kvmalloc()
invocation and the profile looks like this:

  - 37.60% xlog_cil_commit
       16.01% memcpy_erms
     - 8.45% __kmalloc
        - 8.04% kmalloc_order_trace
           - 8.03% kmalloc_order
              - 7.93% alloc_pages
                 - 7.90% __alloc_pages
                    - 4.05% __alloc_pages_slowpath.constprop.0
                       - 2.18% get_page_from_freelist
                       - 1.77% wake_all_kswapds
                            ....
                          - __wake_up_common_lock
                             - 0.94% _raw_spin_lock_irqsave
                    - 3.72% get_page_from_freelist
                       - 2.43% _raw_spin_lock_irqsave
     - 5.72% vmalloc
        - 5.72% __vmalloc_node_range
           - 4.81% __get_vm_area_node.constprop.0
              - 3.26% alloc_vmap_area
                 - 2.52% _raw_spin_lock
              - 1.46% _raw_spin_lock
             0.56% __alloc_pages_bulk
     - 4.66% kvfree
        - 3.25% vfree
           - __vfree
              - 3.23% __vunmap
                 - 1.95% remove_vm_area
                    - 1.06% free_vmap_area_noflush
                       - 0.82% _raw_spin_lock
                    - 0.68% _raw_spin_lock
                 - 0.92% _raw_spin_lock
        - 1.40% kfree
           - 1.36% __free_pages
              - 1.35% __free_pages_ok
                 - 1.02% _raw_spin_lock_irqsave

It's worth noting that over 50% of the CPU time spent allocating these
shadow buffers is now spent on spinlocks. So the shadow buffer
allocation overhead is greatly reduced by getting rid of direct reclaim
from kmalloc, and could probably be made even less costly if vmalloc()
didn't use global spinlocks to protect its structures.

Signed-off-by: Dave Chinner
Reviewed-by: Allison Henderson
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong
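
For reference, below is a minimal sketch of the fail-fast allocation
pattern described above: strip direct reclaim from the gfp flags so the
kmalloc() cannot enter compaction or reclaim, and fall back to
vmalloc() immediately if a contiguous allocation isn't readily
available. The helper name fastfail_kvmalloc() is illustrative only; it
is not the actual xlog_kvmalloc() source, just the behaviour the email
describes.

#include <linux/slab.h>
#include <linux/vmalloc.h>

/*
 * Illustrative sketch (not the XFS code): fail-fast contiguous
 * allocation with an immediate vmalloc() fallback.
 */
static inline void *fastfail_kvmalloc(size_t size)
{
	/* Strip direct reclaim so kmalloc() cannot compact or reclaim. */
	gfp_t flags = (GFP_KERNEL & ~__GFP_DIRECT_RECLAIM) | __GFP_NOWARN;
	void *p;

	p = kmalloc(size, flags);
	if (p)
		return p;

	/* No high-order page immediately available: use vmalloc(). */
	return vmalloc(size);
}

A caller would use this exactly like kvmalloc(), releasing the buffer
with kvfree(), which handles both kmalloc and vmalloc memory.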