From: Chris Li
Subject: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order
Date: Fri, 24 May 2024 10:17:17 -0700
Message-Id: <20240524-swap-allocator-v1-0-47861b423b26@kernel.org>
To: Andrew Morton
Cc: Kairui Song, Ryan Roberts, "Huang, Ying", linux-kernel@vger.kernel.org, linux-mm@kvack.org, Chris Li, Barry Song

This is the short-term solution, "swap cluster order", listed in
my "Swap Abstraction" discussion, slide 8, at the recent LSF/MM conference.

When commit 845982eb264bc ("mm: swap: allow storage of all mTHP orders")
was introduced, it only allocated mTHP swap entries from the new empty
cluster list. That works well for PMD-size THP, but it has a serious
fragmentation issue reported by Barry:

https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/

The mTHP allocation failure rate rises to almost 100% after a few
hours in Barry's test run.

The reason is that all the empty clusters have been exhausted, while
there are plenty of free swap entries in clusters that are not 100%
free.

Address this by remembering the swap allocation order in the cluster.
Keep track of a per-order nonfull cluster list for later allocation.

This greatly improves the success rate of mTHP swap allocation.
While I am still waiting for Barry's test results, I am pasting
Kairui's test results here:

I'm able to reproduce such an issue with a simple script (enabling all
orders of mTHP):

modprobe brd rd_nr=1 rd_size=$(( 10 * 1024 * 1024))
swapoff -a
mkswap /dev/ram0
swapon /dev/ram0

rmdir /sys/fs/cgroup/benchmark
mkdir -p /sys/fs/cgroup/benchmark
cd /sys/fs/cgroup/benchmark
echo 8G > memory.max
echo $$ > cgroup.procs

memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 -t 32 -B binary &

/usr/local/bin/memtier_benchmark -S /tmp/memcached.socket \
  -P memcache_binary -n allkeys --key-minimum=1 \
  --key-maximum=18000000 --key-pattern=P:P -c 1 -t 32 \
  --ratio 1:0 --pipeline 8 -d 1024

Before:
Totals 48805.63 0.00 0.00 5.26045 1.19100 38.91100 59.64700 51063.98
After:
Totals 71098.84 0.00 0.00 3.60585 0.71100 26.36700 39.16700 74388.74

And the fallback ratio dropped by a lot:

Before:
hugepages-32kB/stats/anon_swpout_fallback:15997
hugepages-32kB/stats/anon_swpout:18712
hugepages-512kB/stats/anon_swpout_fallback:192
hugepages-512kB/stats/anon_swpout:0
hugepages-2048kB/stats/anon_swpout_fallback:2
hugepages-2048kB/stats/anon_swpout:0
hugepages-1024kB/stats/anon_swpout_fallback:0
hugepages-1024kB/stats/anon_swpout:0
hugepages-64kB/stats/anon_swpout_fallback:18246
hugepages-64kB/stats/anon_swpout:17644
hugepages-16kB/stats/anon_swpout_fallback:13701
hugepages-16kB/stats/anon_swpout:18234
hugepages-256kB/stats/anon_swpout_fallback:8642
hugepages-256kB/stats/anon_swpout:93
hugepages-128kB/stats/anon_swpout_fallback:21497
hugepages-128kB/stats/anon_swpout:7596

(Still collecting more data. The successful swpouts were mostly done
early, then the fallback count began to increase, reaching a nearly
100% failure rate.)

After:
hugepages-32kB/stats/swpout:34445
hugepages-32kB/stats/swpout_fallback:0
hugepages-512kB/stats/swpout:1
hugepages-512kB/stats/swpout_fallback:134
hugepages-2048kB/stats/swpout:1
hugepages-2048kB/stats/swpout_fallback:1
hugepages-1024kB/stats/swpout:6
hugepages-1024kB/stats/swpout_fallback:0
hugepages-64kB/stats/swpout:35495
hugepages-64kB/stats/swpout_fallback:0
hugepages-16kB/stats/swpout:32441
hugepages-16kB/stats/swpout_fallback:0
hugepages-256kB/stats/swpout:2223
hugepages-256kB/stats/swpout_fallback:6278
hugepages-128kB/stats/swpout:29136
hugepages-128kB/stats/swpout_fallback:52

Reported-by: Barry Song <21cnbao@gmail.com>
Tested-by: Kairui Song
Signed-off-by: Chris Li
---
Chris Li (2):
      mm: swap: swap cluster switch to double link list
      mm: swap: mTHP allocate swap entries from nonfull list

 include/linux/swap.h |  18 ++--
 mm/swapfile.c        | 252 +++++++++++++++++---------------------------------
 2 files changed, 93 insertions(+), 177 deletions(-)
---
base-commit: c65920c76a977c2b73c3a8b03b4c0c00cc1285ed
change-id: 20240523-swap-allocator-1534c480ece4

Best regards,
-- 
Chris Li