From: "Huang, Ying"
To: Chris Li
Cc: Andrew Morton, Kairui Song, Ryan Roberts, Kalesh Singh, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Barry Song
Subject: Re: [PATCH v3 0/2] mm: swap: mTHP swap allocator base on swap cluster order
In-Reply-To: <20240619-swap-allocator-v3-0-e973a3102444@kernel.org> (Chris Li's message of "Wed, 19 Jun 2024 02:20:28 -0700")
References: <20240619-swap-allocator-v3-0-e973a3102444@kernel.org>
Date: Thu, 20 Jun 2024 10:30:27 +0800
Message-ID: <87v8242vng.fsf@yhuang6-desk2.ccr.corp.intel.com>

Chris Li writes:

> This is the short term solution "swap cluster order" listed
> in my "Swap Abstraction" discussion slide 8 at the recent
> LSF/MM conference.
>
> When commit 845982eb264bc ("mm: swap: allow storage of all mTHP
> orders") was introduced, it only allocated mTHP swap entries
> from the empty cluster list.  This has a fragmentation issue
> reported by Barry.
>
> https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
>
> The reason is that all the empty clusters have been exhausted while
> there are plenty of free swap entries in clusters that are not
> 100% free.
>
> Remember the swap allocation order in the cluster.
> Keep track of the per-order non-full cluster list for later allocation.
>
> User impact: For users that allocate and free mixed-order mTHP swap
> entries, this greatly improves the success rate of mTHP swap
> allocation after the initial phase.
>
> Barry provides a test program to show the effect:
> https://lore.kernel.org/linux-mm/20240615084714.37499-1-21cnbao@gmail.com/
>
> Without:
> $ mthp-swapout
> Iteration 1: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 2: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 3: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 4: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 5: swpout inc: 110, swpout fallback inc: 117, Fallback percentage: 51.54%
> Iteration 6: swpout inc: 0, swpout fallback inc: 230, Fallback percentage: 100.00%
> Iteration 7: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00%
> Iteration 8: swpout inc: 0, swpout fallback inc: 223, Fallback percentage: 100.00%
> Iteration 9: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
> Iteration 10: swpout inc: 0, swpout fallback inc: 216, Fallback percentage: 100.00%
> Iteration 11: swpout inc: 0, swpout fallback inc: 212, Fallback percentage: 100.00%
> Iteration 12: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
> Iteration 13: swpout inc: 0, swpout fallback inc: 214, Fallback percentage: 100.00%
>
> $ mthp-swapout -s
> Iteration 1: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 2: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 3: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 4: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 5: swpout inc: 33, swpout fallback inc: 197, Fallback percentage: 85.65%
> Iteration 6: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00%
> Iteration 7: swpout inc: 0, swpout fallback inc: 223, Fallback percentage: 100.00%
> Iteration 8: swpout inc: 0, swpout fallback inc: 219, Fallback percentage: 100.00%
> Iteration 9: swpout inc: 0, swpout fallback inc: 212, Fallback percentage: 100.00%
>
> With:
> $ mthp-swapout
> Iteration 1: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 2: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 3: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 4: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 5: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 6: swpout inc: 230, swpout fallback inc: 0, Fallback percentage: 0.00%
> ...
> Iteration 94: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 95: swpout inc: 221, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 96: swpout inc: 229, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 97: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 98: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 99: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 100: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
>
> $ mthp-swapout -s
> Iteration 1: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 2: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 3: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 4: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 5: swpout inc: 230, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 6: swpout inc: 229, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 7: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 8: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00%
> ...
> Iteration 94: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 95: swpout inc: 212, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 96: swpout inc: 220, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 97: swpout inc: 220, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 98: swpout inc: 216, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 99: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 100: swpout inc: 225, swpout fallback inc: 0, Fallback percentage: 0.00%

Unfortunately, the data was obtained with a specially designed test
program that always swaps pages back in at the size they were swapped
out.  I don't know whether such workloads exist in reality.  Otherwise,
you would need to wait for mTHP swap-in to be merged first, and for
people to reach consensus that we should always swap pages in at their
swapped-out size.

Alternatively, we could make some design adjustments so that the
patchset works in the current situation (mTHP swap-out, normal-page
swap-in):

- One non-full cluster list for each order (same as the current
  design).

- When a swap entry is freed, check whether an "order+1" swap entry
  becomes free in the cluster; if so, move the cluster to the
  "order+1" non-full cluster list.

- When allocating a swap entry of "order", get a cluster from the free
  list, then from the "order", "order+1", ... non-full cluster lists.
  If all of them are empty, fall back to order 0.

Do you think that this works?  (A rough user-space sketch of this
scheme is appended after my signature.)

> Reported-by: Barry Song <21cnbao@gmail.com>
> Signed-off-by: Chris Li
> ---
> Changes in v3:
> - Using V1 as the base.
> - Rename "next" to "list" for the list field, suggested by Ying.
> - Update the comment on the locking rules for cluster fields and the
>   list, suggested by Ying.
> - Allocate from the nonfull list before attempting the free list,
>   suggested by Kairui.

I haven't looked into this yet.  It appears to break the original
discard behavior, which helps the performance of some SSDs; please
refer to commit 2a8f94493432 ("swap: change block allocation algorithm
for SSD").  And, as pointed out by Ryan, it may also reduce the
opportunity for sequential block-device writes during swap-out, which
may hurt SSD performance too.

[snip]

--
Best Regards,
Huang, Ying
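To make the proposed adjustment above concrete, here is a minimal
user-space C model of the per-order non-full cluster lists.  It is
illustrative only: the structures and names (cluster, nonfull,
free_head, alloc_cluster, free_entry) are stand-ins, not the kernel's
actual swap data structures, and the "does an order+1 hole now exist"
check is approximated by a plain free-slot count instead of scanning
the cluster's swap map for aligned contiguous free entries.

/*
 * Minimal user-space model of per-order non-full cluster lists.
 * All names and structures are illustrative, NOT the kernel's.
 */
#include <stdio.h>

#define NR_ORDERS 4                 /* model orders 0..3 only */

struct cluster {
    struct cluster *prev, *next;    /* doubly linked for O(1) removal */
    int list_order;                 /* non-full list holding us, -1 if none */
    int free_slots;                 /* free swap entries in this cluster */
};

/* Sentinel heads: one non-full list per order, plus a free-cluster list. */
static struct cluster nonfull[NR_ORDERS];
static struct cluster free_head;

static void list_init(struct cluster *h) { h->prev = h->next = h; }
static int list_empty(struct cluster *h) { return h->next == h; }

static void list_del(struct cluster *c)
{
    c->prev->next = c->next;
    c->next->prev = c->prev;
}

static void list_add(struct cluster *h, struct cluster *c)
{
    c->next = h->next;
    c->prev = h;
    h->next->prev = c;
    h->next = c;
}

/*
 * Allocate a cluster for an order-"order" entry: free list first, then
 * the non-full lists of this order and every higher order.  Returning
 * NULL means the caller falls back to order-0 allocation.
 */
static struct cluster *alloc_cluster(int order)
{
    struct cluster *c = NULL;

    if (!list_empty(&free_head)) {
        c = free_head.next;
    } else {
        for (int o = order; o < NR_ORDERS && !c; o++) {
            if (!list_empty(&nonfull[o]))
                c = nonfull[o].next;
        }
    }
    if (c) {
        list_del(c);
        c->list_order = -1;         /* in use, on no list for now */
    }
    return c;
}

/*
 * Free one entry.  If the cluster can now satisfy the next higher
 * order (approximated here by the free-slot count reaching 1 << order),
 * move the cluster up one non-full list.
 */
static void free_entry(struct cluster *c)
{
    int next = c->list_order + 1;

    c->free_slots++;
    if (next < NR_ORDERS && c->free_slots >= (1 << next)) {
        if (c->list_order >= 0)
            list_del(c);            /* leave the old order's list */
        c->list_order = next;
        list_add(&nonfull[next], c);
    }
}

int main(void)
{
    struct cluster c = { .list_order = -1, .free_slots = 0 };

    list_init(&free_head);
    for (int o = 0; o < NR_ORDERS; o++)
        list_init(&nonfull[o]);

    for (int i = 0; i < 4; i++)
        free_entry(&c);             /* frees promote c up the lists */
    printf("cluster is on the order-%d non-full list\n", c.list_order);

    /* An order-2 request now finds it even with no empty clusters. */
    printf("order-2 allocation %s\n",
           alloc_cluster(2) ? "succeeded" : "fell back to order 0");
    return 0;
}

Running it shows one cluster being promoted up the non-full lists as
entries are freed, so a later order-2 allocation succeeds even though
no empty cluster is left, which is the behavior the adjustment aims
for under mTHP swap-out with normal-page swap-in.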