From: Chris Li
Date: Wed, 29 May 2024 18:13:33 -0700
Subject: Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order
To: "Huang, Ying"
Cc: Andrew Morton, Kairui Song, Ryan Roberts, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org, Barry Song
In-Reply-To: <87cyp5575y.fsf@yhuang6-desk2.ccr.corp.intel.com>
References: <20240524-swap-allocator-v1-0-47861b423b26@kernel.org>
 <87cyp5575y.fsf@yhuang6-desk2.ccr.corp.intel.com>

Hi Ying,

On Wed, May 29, 2024 at 1:57 AM Huang, Ying wrote:
>
> Chris Li writes:
>
> > I am spinning a new version of this series to address two issues
> > found in it:
> >
> > 1) Oppo discovered a bug in the following line:
> >      + ci = si->cluster_info + tmp;
> >    It should be "tmp / SWAPFILE_CLUSTER" instead of "tmp".
> >    That is a serious bug, but trivial to fix.
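
A minimal sketch of the one-line fix point 1 describes, assuming "si"
points at the swap_info_struct and "tmp" is a swap slot offset as in
mm/swapfile.c; the surrounding declaration is illustrative only:

	/*
	 * si->cluster_info[] has one entry per cluster, while "tmp" is a
	 * swap slot offset, so the offset must be converted to a cluster
	 * index before indexing the array.
	 */
	struct swap_cluster_info *ci;

	/* buggy: treats the slot offset as a cluster index */
	ci = si->cluster_info + tmp;

	/* fixed: index by the cluster that contains slot "tmp" */
	ci = si->cluster_info + tmp / SWAPFILE_CLUSTER;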
> >
> > 2) order 0 allocation currently blindly scans swap_map disregarding
> > the cluster->order.
>
> IIUC, now, we only scan swap_map[] if
> !list_empty(&si->free_clusters) && !list_empty(&si->nonfull_clusters[order]).
> That is, if you don't run low on swap free space, you will not do that.

You can still have free swap space in order 0 clusters while order 4 runs
out of free_clusters or nonfull_clusters[order]. For Android that is a
common case.

>
> > Given enough order 0 swap allocations (close to the
> > swap file size) the order 0 allocation head will eventually sweep
> > across the whole swapfile and destroy other cluster order allocations.
> >
> > The short term fix is just skipping clusters that are already assigned
> > to higher orders.
>
> Better to do any further optimization on top of the simpler one. Need
> to evaluate whether it's necessary to add more complexity.

I agree this needs more careful planning and discussion. In Android's use
case, the swapfile is always nearly full. It will run into this situation
after a long enough time swapping. Once order 0 swap entries start to
pollute a higher order cluster, there is no going back (until the whole
cluster is 100% free).

> > In the long term, I want to unify the non-SSD path to use clusters for
> > locking and allocations as well, and just try to follow the last
> > allocation (less seeking) as much as possible.
>
> I have thought about that too. Personally, I think that it's good to
> remove swap_map[] scanning. The implementation can be simplified too.

Agree. I looked at the commit that introduced the SSD clusters. The commit
message indicates a lot of CPU time was spent in swap_map scanning,
especially when the swapfile is almost full. The main motivation to
introduce clusters for HDD is to simplify and unify the code.

> I don't know whether we need to consider the performance of HDD swap
> now.

I am not sure about that either. We can make a best effort to reduce the
seeking.

Chris

> --
> Best Regards,
> Huang, Ying
>
> > On Fri, May 24, 2024 at 10:17 AM Chris Li wrote:
> >>
> >> This is the short term solution "swap cluster order" listed
> >> in my "Swap Abstraction" discussion, slide 8, at the recent
> >> LSF/MM conference.
> >>
> >> When commit 845982eb264bc "mm: swap: allow storage of all mTHP
> >> orders" was introduced, it only allocated the mTHP swap entries
> >> from the empty cluster list. That works well for PMD size THP,
> >> but it has a serious fragmentation issue reported by Barry.
> >>
> >> https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
> >>
> >> The mTHP allocation failure rate rises to almost 100% after a few
> >> hours in Barry's test run.
> >>
> >> The reason is that all the empty clusters have been exhausted while
> >> there are plenty of free swap entries in the clusters that are
> >> not 100% free.
> >>
> >> Address this by remembering the swap allocation order in the cluster.
> >> Keep track of a per-order nonfull cluster list for later allocation.
> >>
> >> This greatly improves the success rate of the mTHP swap allocation.
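
A minimal sketch of the bookkeeping the cover letter describes: each
cluster sits on a double-linked list and remembers the order it serves,
and the swap device keeps a per-order list of nonfull clusters. The names
below (count, order, nonfull_clusters, SWAP_NR_ORDERS) and the standalone
list_head are illustrative assumptions, not the exact definitions in the
patches:

	/* Simplified, standalone types; not the kernel's actual definitions. */
	#define SWAP_NR_ORDERS 10        /* assumed: number of supported allocation orders */

	struct list_head {
		struct list_head *prev, *next;
	};

	struct swap_cluster_info {
		struct list_head list;   /* links the cluster on a free or nonfull list */
		unsigned int count;      /* entries currently allocated in this cluster */
		unsigned int order;      /* allocation order this cluster is serving */
	};

	struct swap_info_sketch {
		struct list_head free_clusters;                    /* completely empty clusters */
		struct list_head nonfull_clusters[SWAP_NR_ORDERS]; /* partially used, one list per order */
	};

An order N allocation can then take a cluster from nonfull_clusters[N]
instead of requiring a completely empty cluster from free_clusters, which
is what keeps partially used clusters usable for mTHP.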
> >> While I am still waiting for Barry's test result, I paste Kairui's test
> >> result here:
> >>
> >> I'm able to reproduce such an issue with a simple script (enabling all
> >> orders of mTHP):
> >>
> >> modprobe brd rd_nr=1 rd_size=$(( 10 * 1024 * 1024))
> >> swapoff -a
> >> mkswap /dev/ram0
> >> swapon /dev/ram0
> >>
> >> rmdir /sys/fs/cgroup/benchmark
> >> mkdir -p /sys/fs/cgroup/benchmark
> >> cd /sys/fs/cgroup/benchmark
> >> echo 8G > memory.max
> >> echo $$ > cgroup.procs
> >>
> >> memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 -t 32 -B binary &
> >>
> >> /usr/local/bin/memtier_benchmark -S /tmp/memcached.socket \
> >>     -P memcache_binary -n allkeys --key-minimum=1 \
> >>     --key-maximum=18000000 --key-pattern=P:P -c 1 -t 32 \
> >>     --ratio 1:0 --pipeline 8 -d 1024
> >>
> >> Before:
> >> Totals    48805.63    0.00    0.00    5.26045    1.19100    38.91100    59.64700    51063.98
> >> After:
> >> Totals    71098.84    0.00    0.00    3.60585    0.71100    26.36700    39.16700    74388.74
> >>
> >> And the fallback ratio dropped by a lot:
> >> Before:
> >> hugepages-32kB/stats/anon_swpout_fallback:15997
> >> hugepages-32kB/stats/anon_swpout:18712
> >> hugepages-512kB/stats/anon_swpout_fallback:192
> >> hugepages-512kB/stats/anon_swpout:0
> >> hugepages-2048kB/stats/anon_swpout_fallback:2
> >> hugepages-2048kB/stats/anon_swpout:0
> >> hugepages-1024kB/stats/anon_swpout_fallback:0
> >> hugepages-1024kB/stats/anon_swpout:0
> >> hugepages-64kB/stats/anon_swpout_fallback:18246
> >> hugepages-64kB/stats/anon_swpout:17644
> >> hugepages-16kB/stats/anon_swpout_fallback:13701
> >> hugepages-16kB/stats/anon_swpout:18234
> >> hugepages-256kB/stats/anon_swpout_fallback:8642
> >> hugepages-256kB/stats/anon_swpout:93
> >> hugepages-128kB/stats/anon_swpout_fallback:21497
> >> hugepages-128kB/stats/anon_swpout:7596
> >>
> >> (Still collecting more data; the successful swpouts were mostly done
> >> early, then the fallback began to increase, nearly 100% failure rate)
> >>
> >> After:
> >> hugepages-32kB/stats/swpout:34445
> >> hugepages-32kB/stats/swpout_fallback:0
> >> hugepages-512kB/stats/swpout:1
> >> hugepages-512kB/stats/swpout_fallback:134
> >> hugepages-2048kB/stats/swpout:1
> >> hugepages-2048kB/stats/swpout_fallback:1
> >> hugepages-1024kB/stats/swpout:6
> >> hugepages-1024kB/stats/swpout_fallback:0
> >> hugepages-64kB/stats/swpout:35495
> >> hugepages-64kB/stats/swpout_fallback:0
> >> hugepages-16kB/stats/swpout:32441
> >> hugepages-16kB/stats/swpout_fallback:0
> >> hugepages-256kB/stats/swpout:2223
> >> hugepages-256kB/stats/swpout_fallback:6278
> >> hugepages-128kB/stats/swpout:29136
> >> hugepages-128kB/stats/swpout_fallback:52
> >>
> >> Reported-by: Barry Song <21cnbao@gmail.com>
> >> Tested-by: Kairui Song
> >> Signed-off-by: Chris Li
> >> ---
> >> Chris Li (2):
> >>       mm: swap: swap cluster switch to double link list
> >>       mm: swap: mTHP allocate swap entries from nonfull list
> >>
> >>  include/linux/swap.h |  18 ++--
> >>  mm/swapfile.c        | 252 +++++++++++++++++----------------------------
> >>  2 files changed, 93 insertions(+), 177 deletions(-)
> >> ---
> >> base-commit: c65920c76a977c2b73c3a8b03b4c0c00cc1285ed
> >> change-id: 20240523-swap-allocator-1534c480ece4
> >>
> >> Best regards,
> >> --
> >> Chris Li
> >>
>
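
For reference, a minimal sketch of the short term order 0 fix discussed
above (skipping clusters already assigned to higher orders during the
order 0 scan); the types and names are simplified assumptions rather than
the patch's actual code:

	/* Minimal per-cluster state needed for the illustration. */
	struct cluster_sketch {
		unsigned int count;      /* entries currently allocated in the cluster */
		unsigned int order;      /* allocation order the cluster is serving */
	};

	/*
	 * Return the index of a cluster an order 0 allocation may use, or -1.
	 * Clusters already serving a higher order are skipped so that order 0
	 * entries do not fragment them.
	 */
	static long find_order0_cluster(const struct cluster_sketch *clusters,
					unsigned long nr_clusters,
					unsigned long cluster_size)
	{
		for (unsigned long i = 0; i < nr_clusters; i++) {
			const struct cluster_sketch *ci = &clusters[i];

			if (ci->count && ci->order != 0)
				continue;        /* in use by a higher order: skip */
			if (ci->count < cluster_size)
				return (long)i;  /* empty or order 0 cluster with free entries */
		}
		return -1;                       /* nothing usable for order 0 */
	}

Without such a skip, the order 0 scan eventually sweeps the whole device,
and as noted in the discussion above, a higher order cluster that has
absorbed order 0 entries cannot serve mTHP allocations again until the
whole cluster is freed.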