From: Chris Li <chrisl@kernel.org>
Date: Fri, 7 Jun 2024 11:48:26 -0700
Subject: Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order
To: Ryan Roberts
Cc: Andrew Morton, Kairui Song, "Huang, Ying",
 linux-kernel@vger.kernel.org, linux-mm@kvack.org, Barry Song
In-Reply-To: <968fec1a-9a54-4b2d-a54c-653d84393c82@arm.com>
References: <20240524-swap-allocator-v1-0-47861b423b26@kernel.org>
 <968fec1a-9a54-4b2d-a54c-653d84393c82@arm.com>

On Fri, Jun 7, 2024 at 2:43 AM Ryan Roberts wrote:
>
> Sorry I'm late to the discussion - I've been out for the last 3.5 weeks
> and just getting through my mail now...

No problem at all, please take it easy.
> On 24/05/2024 18:17, Chris Li wrote:
> > This is the short term solution "swap cluster order" listed
> > in my "Swap Abstraction" discussion, slide 8, at the recent
> > LSF/MM conference.
>
> I've read the article on lwn and look forward to watching the video once
> available. The longer term plans look interesting.
>
> >
> > When commit 845982eb264bc "mm: swap: allow storage of all mTHP
> > orders" was introduced, it only allocated mTHP swap entries
> > from the empty cluster list. That works well for PMD size THP,
> > but it has a serious fragmentation issue reported by Barry.
>
> Yes, that was a deliberate initial approach to be conservative, just like
> the original PMD-size THP support. I'm glad to see work to improve the
> situation!
>
> >
> > https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
> >
> > The mTHP allocation failure rate rises to almost 100% after a few
> > hours in Barry's test run.
> >
> > The reason is that all the empty clusters have been exhausted while
> > there are plenty of free swap entries in the clusters that are
> > not 100% free.
> >
> > Address this by remembering the swap allocation order in the cluster,
> > and keeping a per-order list of non-full clusters for later allocation.
>
> I don't immediately see how this helps because memory is swapped back in
> per-page (currently), so just because a given cluster was initially filled with

That is not the case for Barry's setup; he has another patch series
to swap in mTHP as a whole, especially for mTHP stored in zsmalloc
as objects bigger than 4K:

https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/

> entries of a given order, doesn't mean that those entries are freed in atomic
> units; only specific pages could have been swapped back in, meaning the holes
> are not of the required order. Additionally, scanning could lead to order-0
> pages being populated in random places.

Yes, that is a problem we need to address. The proposed short term
solution is an isolation scheme that prevents high order swap entries
from mixing with lower order ones inside one cluster. That is easy to
do, and we have some test results confirming the reservation/isolation
effect. (A rough sketch of the idea is at the bottom of this mail.)

>
> My naive assumption was that the obvious way to solve this problem in the
> short term would be to extend the scanning logic to be able to scan for an
> arbitrary order. That way you could find an allocation of the required
> order in any of the clusters, even a cluster that was not originally
> allocated for the required order.
>
> I guess I should read your patches to understand exactly what you are
> doing rather than making assumptions...

Scanning is not enough. We need some way to prevent the fragmentation
from happening in the first place; once it has happened, it cannot
easily be reversed. Scanning does not help with the fragmentation
aspect.

Chris

>
> Thanks,
> Ryan
>
> >
> > This greatly improves the success rate of the mTHP swap allocation.
> > While I am still waiting for Barry's test results, I am pasting
> > Kairui's test results here:
> >
> > I'm able to reproduce such an issue with a simple script (enabling all
> > orders of mTHP):
> >
> > modprobe brd rd_nr=1 rd_size=$(( 10 * 1024 * 1024 ))
> > swapoff -a
> > mkswap /dev/ram0
> > swapon /dev/ram0
> >
> > rmdir /sys/fs/cgroup/benchmark
> > mkdir -p /sys/fs/cgroup/benchmark
> > cd /sys/fs/cgroup/benchmark
> > echo 8G > memory.max
> > echo $$ > cgroup.procs
> >
> > memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 -t 32 -B binary &
> >
> > /usr/local/bin/memtier_benchmark -S /tmp/memcached.socket \
> >     -P memcache_binary -n allkeys --key-minimum=1 \
> >     --key-maximum=18000000 --key-pattern=P:P -c 1 -t 32 \
> >     --ratio 1:0 --pipeline 8 -d 1024
> >
> > Before:
> > Totals  48805.63  0.00  0.00  5.26045  1.19100  38.91100  59.64700  51063.98
> > After:
> > Totals  71098.84  0.00  0.00  3.60585  0.71100  26.36700  39.16700  74388.74
> >
> > And the fallback ratio dropped by a lot:
> > Before:
> > hugepages-32kB/stats/anon_swpout_fallback:15997
> > hugepages-32kB/stats/anon_swpout:18712
> > hugepages-512kB/stats/anon_swpout_fallback:192
> > hugepages-512kB/stats/anon_swpout:0
> > hugepages-2048kB/stats/anon_swpout_fallback:2
> > hugepages-2048kB/stats/anon_swpout:0
> > hugepages-1024kB/stats/anon_swpout_fallback:0
> > hugepages-1024kB/stats/anon_swpout:0
> > hugepages-64kB/stats/anon_swpout_fallback:18246
> > hugepages-64kB/stats/anon_swpout:17644
> > hugepages-16kB/stats/anon_swpout_fallback:13701
> > hugepages-16kB/stats/anon_swpout:18234
> > hugepages-256kB/stats/anon_swpout_fallback:8642
> > hugepages-256kB/stats/anon_swpout:93
> > hugepages-128kB/stats/anon_swpout_fallback:21497
> > hugepages-128kB/stats/anon_swpout:7596
> >
> > (Still collecting more data; the successful swpouts mostly happened
> > early, then the fallbacks began to increase, approaching a 100%
> > failure rate.)
> >
> > After:
> > hugepages-32kB/stats/swpout:34445
> > hugepages-32kB/stats/swpout_fallback:0
> > hugepages-512kB/stats/swpout:1
> > hugepages-512kB/stats/swpout_fallback:134
> > hugepages-2048kB/stats/swpout:1
> > hugepages-2048kB/stats/swpout_fallback:1
> > hugepages-1024kB/stats/swpout:6
> > hugepages-1024kB/stats/swpout_fallback:0
> > hugepages-64kB/stats/swpout:35495
> > hugepages-64kB/stats/swpout_fallback:0
> > hugepages-16kB/stats/swpout:32441
> > hugepages-16kB/stats/swpout_fallback:0
> > hugepages-256kB/stats/swpout:2223
> > hugepages-256kB/stats/swpout_fallback:6278
> > hugepages-128kB/stats/swpout:29136
> > hugepages-128kB/stats/swpout_fallback:52
> >
> > Reported-by: Barry Song <21cnbao@gmail.com>
> > Tested-by: Kairui Song
> > Signed-off-by: Chris Li <chrisl@kernel.org>
> > ---
> > Chris Li (2):
> >       mm: swap: swap cluster switch to double link list
> >       mm: swap: mTHP allocate swap entries from nonfull list
> >
> >  include/linux/swap.h |  18 ++--
> >  mm/swapfile.c        | 252 +++++++++++++++++---------------------------------
> >  2 files changed, 93 insertions(+), 177 deletions(-)
> > ---
> > base-commit: c65920c76a977c2b73c3a8b03b4c0c00cc1285ed
> > change-id: 20240523-swap-allocator-1534c480ece4
> >
> > Best regards,
>
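To make the isolation idea above concrete, here is a rough,
self-contained toy model of the two moving parts: clusters on a double
linked list, and one non-full list per order, so that one order never
shares a cluster with another. This is NOT the patch code; every name
in it (struct cluster, alloc_swap_cluster, CLUSTER_SLOTS, ...) is made
up for illustration, and the real code in mm/swapfile.c differs
(locking, per-CPU clusters and the scanning fallback are all omitted):

/*
 * Toy model of the per-order non-full cluster lists -- NOT the actual
 * patch code; all names here are illustrative only.
 */
#include <stdio.h>
#include <stdlib.h>

#define CLUSTER_SLOTS   512     /* swap entries per cluster (model)     */
#define MAX_ORDER       9       /* orders 0..9, i.e. 4K..2M (model)     */

struct cluster {
        struct cluster *prev, *next;    /* double linked list links     */
        int order;                      /* order this cluster serves    */
        int used;                       /* allocated slots so far       */
};

/* One non-full list per order, plus a shared empty-cluster list. */
static struct cluster *nonfull[MAX_ORDER + 1];
static struct cluster *empty;

static void list_push(struct cluster **head, struct cluster *c)
{
        c->prev = NULL;
        c->next = *head;
        if (*head)
                (*head)->prev = c;
        *head = c;
}

static struct cluster *list_pop(struct cluster **head)
{
        struct cluster *c = *head;

        if (c) {
                *head = c->next;
                if (*head)
                        (*head)->prev = NULL;
        }
        return c;
}

/*
 * Find room for one entry of @order.  Prefer a non-full cluster already
 * reserved for this order; only then break a fresh empty cluster.  A
 * cluster never serves two orders at once -- that is the isolation that
 * keeps order-0 entries from fragmenting the high order clusters.
 */
static struct cluster *alloc_swap_cluster(int order)
{
        struct cluster *c = list_pop(&nonfull[order]);

        if (!c) {
                c = list_pop(&empty);
                if (!c)
                        return NULL;    /* caller falls back to order 0 */
                c->order = order;
        }

        c->used += 1 << order;
        if (c->used < CLUSTER_SLOTS)
                list_push(&nonfull[order], c);  /* still has room */
        return c;
}

int main(void)
{
        int i;

        /* Seed the model with a few empty clusters, then allocate. */
        for (i = 0; i < 4; i++)
                list_push(&empty, calloc(1, sizeof(struct cluster)));

        for (i = 0; i < 6; i++) {
                struct cluster *c = alloc_swap_cluster(4);  /* 64K mTHP */

                printf("order-4 alloc -> cluster %p, used %d\n",
                       (void *)c, c ? c->used : -1);
        }
        return 0;
}

The point to notice is the preference order: a partially used cluster
of the matching order is always tried before breaking a fresh empty
cluster, which is what preserves the empty clusters for the large
orders instead of letting every order compete for them.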