From: "Huang, Ying" <ying.huang@intel.com>
To: Chris Li
Cc: Andrew Morton, Kairui Song, Hugh Dickins, Ryan Roberts, Kalesh Singh, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Barry Song
Subject: Re: [PATCH v5 2/9] mm: swap: mTHP allocate swap entries from nonfull list
In-Reply-To: (Chris Li's message of "Mon, 26 Aug 2024 14:26:19 -0700")
References: <20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org> <20240730-swap-allocator-v5-2-cb9c148b9297@kernel.org> <87bk23250r.fsf@yhuang6-desk2.ccr.corp.intel.com> <871q2lhr4s.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Mon, 09 Sep 2024 15:19:11 +0800
Message-ID: <874j6p1ehc.fsf@yhuang6-desk2.ccr.corp.intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Chris Li writes:

> On Mon, Aug 19, 2024 at 1:11 AM Huang, Ying wrote:
>> > BTW, what is your take on my previous analysis that the current SSD
>> > policy of preferring to write new clusters can wear out the SSD faster?
>>
>> No. I don't agree with you on that. However, my knowledge of SSD
>> wear-out algorithms is quite limited.
>
> Hi Ying,
>
> Can you please clarify? You said you have limited knowledge of SSD
> wear internals. Does that mean you have low confidence in your
> verdict?

Yes.

> I would like to understand your reasoning for the disagreement.
> Starting from which part of my analysis do you disagree?
>
> At the same time, we can consult someone who works in the SSD space
> and understands SSD internal wear better.

I think that is a good idea.

> I see this as a serious issue for using SSDs as swap in data center
> use cases. In your laptop use case, you are not running LLM training
> 24/7, right? So it still fits the usage model of an occasional user
> of the swap file, and it might not be as big a deal. In a data center
> workload, e.g. Google's, swap is written 24/7, and the amount of data
> swapped out is much higher than in typical laptop usage as well.
> There the SSD wear-out problem is much worse, because the SSD is
> under constant write with much larger swap usage.
>
> I am claiming that *some* SSDs have a higher internal write
> amplification factor when doing random 4K writes over the whole
> drive than when doing random 4K writes to a small area of the drive.
>
> I do believe that a swap-out policy controlling the preference for
> old vs. new clusters is beneficial to the data center SSD swap use
> case.
>
> It comes down to:
>
> 1) SSDs are slow to erase, so most SSDs erase at a large erase-block
> size.
>
> 2) The SSD remaps logical block addresses to internal erase blocks.
> Newly written data, regardless of its logical block address on the
> drive, is grouped together and written out to an erase block.
>
> 3) When new data overwrites an old logical address, the SSD firmware
> marks the overwritten data as obsolete. The discard command has a
> similar effect without introducing new data.
>
> 4) When the SSD runs out of fresh erase blocks, it needs to GC the
> old fragmented erase blocks, partially rewriting the old data to make
> room for new erase blocks. This is where the discard command can be
> beneficial: it tells the SSD firmware which parts of the old data the
> GC process can simply ignore and skip rewriting.
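[The four steps above can be sketched as a toy flash-translation-layer model. Everything here — the block counts, the greedy victim selection, the `simulate_wa` helper — is a hypothetical illustration of the argument, not any real firmware; it compares write amplification for random 4K overwrites spread over the whole drive vs. confined to a small hot region.]

```python
import random

def simulate_wa(logical_pages=2048, pages_per_block=64, spare_blocks=4,
                n_writes=20000, hot_fraction=1.0, seed=0):
    """Toy FTL with greedy GC: returns write amplification (physical page
    writes per host page write) for random overwrites that span
    `hot_fraction` of the logical space."""
    rng = random.Random(seed)
    n_blocks = logical_pages // pages_per_block + spare_blocks
    page_loc = {}                               # logical page -> erase block
    members = [set() for _ in range(n_blocks)]  # valid logical pages per block
    free_blocks = list(range(1, n_blocks))
    open_blk, fill = 0, 0
    phys = [0]                                  # physical page writes

    def place(lp):
        nonlocal open_blk, fill
        pending = [lp]
        while pending:
            p = pending.pop()
            if fill == pages_per_block:         # open erase block is full
                if not free_blocks:
                    # Greedy GC (step 4): erase the block with the fewest
                    # valid pages, relocating (rewriting) what is still valid.
                    victim = min(range(n_blocks),
                                 key=lambda b: len(members[b]))
                    for q in members[victim]:
                        del page_loc[q]
                    pending.extend(members[victim])
                    members[victim] = set()
                    free_blocks.append(victim)
                open_blk, fill = free_blocks.pop(), 0
            if p in page_loc:                   # overwrite marks the old
                members[page_loc[p]].discard(p) # copy obsolete (step 3)
            page_loc[p] = open_blk              # new data is grouped into the
            members[open_blk].add(p)            # current erase block (step 2)
            fill += 1
            phys[0] += 1

    for lp in range(logical_pages):             # pre-fill drive with cold data
        place(lp)
    phys[0] = 0                                 # count the overwrite phase only
    hot = max(1, int(logical_pages * hot_fraction))
    for _ in range(n_writes):
        place(rng.randrange(hot))
    return phys[0] / n_writes
```

Under this model, `simulate_wa(hot_fraction=1.0)` (random writes over the whole drive) yields noticeably higher amplification than `simulate_wa(hot_fraction=1/16)` (writes confined to a small region), because confined churn quickly produces fully obsolete erase blocks that GC can reclaim almost for free — which is the mechanism the argument above relies on.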
> GC of obsolete logical blocks is a generally hard problem for SSDs.
>
> I am not claiming every SSD behaves this way, but it is common enough
> to be worth providing an option.
>
>> > I think it might be useful to provide users an option to choose to
>> > write to the nonfull list first. The trade-off is being more
>> > friendly to SSD wear than preferring to write new blocks. If you
>> > keep swapping long enough, there will be no new free clusters
>> > anyway.
>>
>> It depends on workloads. Some workloads may demonstrate better
>> spatial locality.
>
> Yes, I agree that it may or may not happen depending on the workload.
> But a random distribution of swap entries is a common pattern we need
> to consider as well, and the odds are against us. As in the quoted
> email where I did the calculation, the odds of a whole cluster being
> free in the random model are very low, about 4.4e-15, even if only
> 1/16 of the swap entries in the swapfile are in use.

Do you have real workloads? For example, some trace?

--
Best Regards,
Huang, Ying
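[Chris's low-odds figure above can be checked with a one-liner, assuming the independence model his calculation implies and a 512-entry cluster — the PMD-order cluster size on x86-64 with 4 KB pages when THP swap is enabled; the exact SWAPFILE_CLUSTER value depends on the kernel configuration.]

```python
# Probability that all 512 entries of a cluster are simultaneously free,
# with each entry independently in use with probability 1/16:
p = (15 / 16) ** 512
print(f"{p:.1e}")  # on the order of 4e-15, matching the figure quoted above
```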