From: "Huang, Ying" <ying.huang@intel.com>
To: Chris Li
Cc: Hugh Dickins, Andrew Morton, Kairui Song, Ryan Roberts,
 Kalesh Singh, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 Barry Song
Subject: Re: [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order
In-Reply-To: (Chris Li's message of "Fri, 16 Aug 2024 00:47:37 -0700")
References: <20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org>
 <87h6bw3gxl.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87sevfza3w.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Mon, 19 Aug 2024 16:39:10 +0800
Message-ID: <87plq4hpox.fsf@yhuang6-desk2.ccr.corp.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Chris Li writes:

> On Thu, Aug 8, 2024 at 1:38 AM Huang, Ying wrote:
>>
>> Chris Li writes:
>>
>> > On Wed, Aug 7, 2024 at 12:59 AM Huang, Ying wrote:
>> >>
>> >> Hi, Chris,
>> >>
>> >> Chris Li writes:
>> >>
>> >> > This is the short-term solution "swap cluster order" listed
>> >> > in my "Swap Abstraction" discussion, slide 8, at the recent
>> >> > LSF/MM conference.
>> >> >
>> >> > When commit 845982eb264bc "mm: swap: allow storage of all mTHP
>> >> > orders" was introduced, it only allocated mTHP swap entries
>> >> > from the new empty cluster list. It has a fragmentation issue
>> >> > reported by Barry.
>> >> >
>> >> > https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
>> >> >
>> >> > The reason is that all the empty clusters have been exhausted while
>> >> > there are plenty of free swap entries in clusters that are
>> >> > not 100% free.
>> >> >
>> >> > Remember the swap allocation order in the cluster.
>> >> > Keep track of the per-order nonfull cluster lists for later allocation.
>> >> >
>> >> > This series gives the swap SSD allocation a new code path, separate
>> >> > from the HDD allocation. The new allocator uses cluster lists only
>> >> > and no longer does a global scan of swap_map[] without a lock.
>> >>
>> >> This sounds good. Can we use the SSD allocation method for HDD too?
>> >> We may not need a swap entry allocator optimized for HDD.
>> >
>> > Yes, that is the plan as well. That way we can completely get rid of
>> > the old scan_swap_map_slots() code.
>>
>> Good!
>>
>> > However, considering the size of the series, let's focus on the
>> > cluster allocation path first and get it tested and reviewed.
>>
>> OK.
>>
>> > For the HDD optimization, mostly just the new block allocation portion
>> > needs a code path separate from the new cluster allocator, to avoid
>> > the per-CPU allocation. Allocating from the nonfull list doesn't
>> > need to change too much.
>>
>> I suggest not considering HDD optimization at all. Just use the SSD
>> algorithm to simplify.
>
> Adding a global next-allocating CI rather than the per-CPU next CI
> pointer is pretty trivial as well. It is just a different way to fetch
> the next cluster pointer.

By HDD optimization, I mean the original scheme with no struct
swap_cluster_info, etc.

>>
>> >> Hi, Hugh,
>> >>
>> >> What do you think about this?
>> >>
>> >> > This streamlines the swap allocation for SSD. The code matches the
>> >> > execution flow much better.
>> >> >
>> >> > User impact: for users that allocate and free mixed-order mTHP swapping,
>> >> > it greatly improves the success rate of the mTHP swap allocation after
>> >> > the initial phase.
>> >> >
>> >> > It also performs faster when the swapfile is close to full, because the
>> >> > allocator can get a nonfull cluster from a list rather than scanning
>> >> > a lot of swap_map entries.
>> >>
>> >> Do you have some test results to prove this? Or which test below can
>> >> prove this?
>> >
>> > The two zram tests are already proving this. The system time
>> > improvement is about 2% on my low-CPU-count machine.
>> > Kairui has a machine with a higher core count and the difference is
>> > larger there. The theory is that a higher CPU count means more
>> > contention.
>>
>> I will interpret this as the performance being better in theory. But
>> there are almost no measurable results so far.
>
> I am trying to understand why you don't see the performance improvement
> in the zram setup in my cover letter as a measurable result?

IIUC, there's no benchmark score difference, just system time. And the
number is low too.

For Kairui's test, does all the performance improvement come from
"swapfile is close to full"?

>>
>> > The 2% system time number does not sound like much. But consider these
>> > two factors:
>> > 1) The swap allocator only takes a small percentage of the overall workload.
>> > 2) The new allocator does more work.
>> > The old allocator has a time tick budget. It will abort and fail to
>> > find an entry when it runs out of time budget, even though there are
>> > still some free entries in the swapfile.
>>
>> What is the time tick budget you mentioned?
>
> I was under the impression that the previous swap entry allocation
> code will not scan 100% of the swapfile if there is only one entry
> left.
> Please let me know if my understanding is not correct.
>
> /* time to take a break? */
> if (unlikely(--latency_ration < 0)) {
>         if (n_ret)
>                 goto done;
>         spin_unlock(&si->lock);
>         cond_resched();
>         spin_lock(&si->lock);
>         latency_ration = LATENCY_LIMIT;
> }

IIUC, this is to reduce latency via cond_resched(). If n_ret != 0, we
have already allocated some swap entries successfully, so it's OK to
return early to reduce allocation latency.

>
>>
>> > The new allocator can get to the last few free swap entries if they
>> > are available. If not, the new swap allocator will work harder on
>> > swap cache reclaim.
>> >
>> > From the swap cache reclaim aspect, it is very hard to optimize the
>> > swap cache reclaim in the old allocation path because the scan
>> > position is randomized.
>> > The full list and the frag list are both designed to help reduce
>> > repeated reclaim attempts on the swap cache.
>>
>> [snip]

--
Best Regards,
Huang, Ying